Chapter 25 — 🕹️ Reinforcement Learning

📖 All chapters | ← 24 · 🌈 Multimodal AI | 26 · 🛒 Recommender Systems →

📚 Jump to any chapter

🧮 Mathematical Foundations

🧭 The ML Workflow

🧩 Classical Machine Learning

🎲 Probabilistic Models

13 · 🕸️ Probabilistic Graphical Models

🧠 Deep Learning

🗣️ Applied AI: Vision, Language, Audio & Time

🕹️ Reinforcement Learning

25 · 🕹️ Reinforcement Learning

🛠️ Applied ML Systems & Industries

🚀 Production, Tooling & Infrastructure

📚 Classical & Symbolic AI

⚖️ Responsible AI & Frontier

🎓 Advanced & Specialized Topics

🎚️ Post-Training & Fine-Tuning

🚢 Model Serving & Deployment

47 · 🚢 Model Serving & Deployment in Production

Reinforcement learning (RL) is the branch of machine learning where an agent learns to make a sequence of decisions by interacting with an environment, guided only by a scalar reward signal — no labelled examples telling it the right answer, just feedback on how good its choices turned out. It sits apart from supervised and unsupervised learning because the data is generated by the agent’s own behaviour and the consequences of a choice can arrive many steps later. RL powers game-playing systems, robotics, recommendation, and the alignment step (RLHF) behind modern chat models.

🧭 In context: the third paradigm of ML (alongside supervised and unsupervised) · used for sequential decision-making under delayed reward · the one key idea: maximise expected cumulative reward, not immediate accuracy.

💡 Remember this: an RL agent learns by interaction to maximise long-run discounted reward, not immediate correctness — every method in the chapter is a different way to estimate that future value and act on it.

25.1 — Markov Decision Processes

Everything in RL is built on the Markov Decision Process (MDP), a formal model of an agent acting in a world over discrete time steps. At each step the agent observes a state $s$, picks an action $a$, receives a reward $r$, and lands in a new state $s'$. The “Markov” part is a promise: the next state depends only on the current state and action, not the full history. The present screen of the game contains everything you need; how you got there is irrelevant.

Formally an MDP is a tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$:

$\mathcal{S}$ — set of states.
$\mathcal{A}$ — set of actions.
$P(s' \mid s, a)$ — transition probability of reaching $s'$ from $s$ after $a$.
$R(s, a)$ — expected reward for that move.
$\gamma \in [0,1)$ — discount factor.

A policy $\pi(a \mid s)$ is the agent’s strategy: a mapping from states to a distribution over actions. The agent’s goal is to find a policy that maximises the return $G_t$ — the total discounted future reward from time $t$ onward:

\[G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\]

In words: the return is all the rewards still to come, but each step further into the future counts a little less than the one before it. Also written: recursively, $G_t = R_{t+1} + \gamma\,G_{t+1}$ — today’s return is the next reward plus a discounted copy of tomorrow’s return.

Discounting does two jobs. Mathematically, $\gamma < 1$ keeps the infinite sum finite. Behaviourally, it expresses preference for sooner rewards: with $\gamma = 0.9$ a reward 10 steps away is worth $0.9^{10} \approx 0.35$ of its face value. A $\gamma$ near 0 makes the agent myopic; near 1 makes it far-sighted.

The agent-environment loop is the heartbeat of RL — a continuous cycle where each action reshapes the world the agent next perceives:

flowchart LR
    A[Agent] -->|action a_t| E[Environment]
    E -->|reward r_t+1| A
    E -->|state s_t+1| A

Watch the signal circulate: the agent fires an action, the world answers with a reward and a new state, and round it goes — a single packet of information looping forever.

We can also picture an MDP as a small graph of states and transitions. Below, a three-state chain: the agent starts at $s_0$ and walks right toward the rewarding terminal $s_2$.

Worked example — a 3-state chain. States $s_0, s_1, s_2$; $s_2$ is terminal with reward 10. From $s_0$ and $s_1$, action right moves you one step forward (reward 0 until the goal). With $\gamma = 0.9$, the return from $s_0$ following right forever is $0 + 0.9 \cdot 0 + 0.9^2 \cdot 10 = 8.1$. From $s_1$ it is $0.9 \cdot 10 = 9.0$. Being closer to reward is worth more — exactly what discounting encodes.

Tip

The Markov property is a modelling choice, not a law of nature. If your state omits relevant history (e.g. velocity in a single frame of Pong), the problem stops being Markov. The fix is to engineer a richer state — stack frames, add memory — until the present is sufficient.

25.2 — Dynamic Programming

Picture planning a road trip when you already own the full map: every road, every distance, every toll. You don’t need to drive around blindly — you can work out the best route at the kitchen table by reasoning backward from the destination. Dynamic programming is that kitchen-table planning for RL.

When you know the MDP — every $P$ and $R$ — you do not need to learn by trial and error; you can compute the optimal policy directly. Dynamic programming (DP) does this using the Bellman equations, which express the value of a state in terms of the values of its successors.

The state-value function $V^\pi(s)$ is the expected return starting from $s$ and following $\pi$. The action-value function $Q^\pi(s,a)$ is the expected return after taking $a$ in $s$, then following $\pi$. The Bellman expectation equation for $V$ is recursive:

\[V^\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s,a)\,\big[R(s,a) + \gamma V^\pi(s')\big]\]

In words: the worth of a state is the average — over the actions your policy might pick and the places they might land you — of the reward you get now plus the discounted worth of wherever you end up. Also written: as an expectation, $V^\pi(s) = \mathbb{E}_{a\sim\pi,\,s'\sim P}\big[R(s,a) + \gamma V^\pi(s')\big]$.

The value of now equals the immediate reward plus the discounted value of next — a self-consistency condition. The Bellman optimality equation replaces the policy average with a max:

\[V^*(s) = \max_a \sum_{s'} P(s' \mid s,a)\,\big[R(s,a) + \gamma V^*(s')\big]\]

In words: the best you can do from a state is to pick the single action whose reward-plus-future-value is highest — no averaging over a policy, just take the best move. Also written: $V^*(s) = \max_a Q^*(s,a)$, where $Q^*(s,a) = \sum_{s'} P(s'\mid s,a)[R(s,a)+\gamma V^*(s')]$.

Two classic algorithms solve it:

Algorithm	How it works	Cost per iteration	Convergence
Value iteration	Repeatedly apply the Bellman optimality update to $V$ until it stops changing, then read off the greedy policy	One sweep over states	Fast, no explicit policy step
Policy iteration	Alternate: (1) evaluate the current policy by solving Bellman expectation, (2) improve it greedily; repeat	Evaluation can need many sweeps	Fewer outer iterations

import numpy as np
# Value iteration on the 3-state chain. States 0,1,2(terminal).
P = {0:1, 1:2}          # right: deterministic next state
R = {0:0, 1:10}         # reward on entering next state
gamma = 0.9
V = np.zeros(3)
for _ in range(50):
    Vnew = V.copy()
    for s in [0,1]:
        Vnew[s] = R[s] + gamma*V[P[s]]   # only one action -> no max needed
    V = Vnew
print(V)   # [8.1 9.  0. ]  -> matches the hand calculation

The loop converges because each sweep pulls every state’s estimate closer to its true value — a property called contraction. Picture the estimate as a ball: every sweep nudges it a fixed fraction of the way toward the true value, so the gap shrinks geometrically and the ball settles onto the dashed target line.

Value and policy iteration are the two faces of the same idea, sketched below.

flowchart TD
    subgraph VI[Value iteration]
        V1[Update V with max] --> V2{Stable?}
        V2 -->|no| V1
        V2 -->|yes| V3[Read greedy policy]
    end
    subgraph PI[Policy iteration]
        P1[Evaluate policy] --> P2[Improve greedily]
        P2 --> P3{Policy changed?}
        P3 -->|yes| P1
        P3 -->|no| P4[Optimal]
    end

Warning

DP needs the full model and scales with $|\mathcal{S}|\times|\mathcal{A}|$. For a game with $10^{40}$ states it is hopeless — this is exactly why the rest of the chapter turns to learning methods that only need samples, not the transition table.

25.3 — Monte Carlo & Temporal-Difference Learning

When the model is unknown, the agent must estimate values from experience. Two philosophies split here, and the contrast is the conceptual core of RL.

Monte Carlo (MC) learning waits for an episode to finish, then uses the actual observed return $G_t$ as the target. To estimate $V(s)$, average the returns that followed every visit to $s$:

\[V(s) \leftarrow V(s) + \alpha\,\big[G_t - V(s)\big]\]

In words: nudge your current guess for the value of $s$ a small step ($\alpha$) toward what actually happened ($G_t$), the real total reward you collected after visiting it. Also written: the running-average form, $V(s) \leftarrow (1-\alpha)V(s) + \alpha\,G_t$ — a weighted blend of the old estimate and the new sample.

It is unbiased (the return is the real thing) but high-variance and only works for episodic tasks — you must reach the end to learn anything.

Temporal-Difference (TD) learning does not wait. It bootstraps: it updates toward an estimate built from the next step’s value, not the final return. The simplest version, TD(0), uses one real reward plus the discounted estimate of where it landed:

\[V(s_t) \leftarrow V(s_t) + \alpha\,\underbrace{\big[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\big]}_{\text{TD error }\delta_t}\]

In words: don’t wait for the episode to end — after one step, compare what you expected ($V(s_t)$) with one real reward plus your guess for where you landed, and step toward that. Also written: $V(s_t) \leftarrow V(s_t) + \alpha\,\delta_t$ with $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$; the MC update is the same shape with the bootstrap term $r_{t+1}+\gamma V(s_{t+1})$ swapped for the full return $G_t$.

The TD error $\delta_t$ is the surprise: how much better or worse things turned out than expected. TD learns online, step by step, with lower variance but some bias.

TD(λ) interpolates between the two. Instead of looking one step ahead (TD) or all the way to the end (MC), it blends $n$-step returns with geometrically decaying weights controlled by $\lambda \in [0,1]$. $\lambda=0$ gives TD(0); $\lambda=1$ gives MC. Eligibility traces make this efficient online by keeping a fading memory of recently visited states, so a surprise now can update them all at once.

The whole MC↔︎TD spectrum is really one knob — how many real reward steps you trust before falling back on a bootstrap guess. The general $n$-step return makes the knob explicit:

\[G_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V(s_{t+n})\]

In words: take $n$ real rewards as they actually came in, then stitch on your value estimate for wherever you ended up after those $n$ steps. Also written: $n=1$ recovers the TD(0) target $r_{t+1}+\gamma V(s_{t+1})$; $n\to\infty$ (run to the end) recovers the Monte-Carlo return $G_t$. TD(λ) is just the geometric average $G_t^\lambda = (1-\lambda)\sum_{n\ge 1}\lambda^{n-1}G_t^{(n)}$ of all of these.

flowchart TD
    S[Update target for V s_t] --> M{Method}
    M -->|MC| MC[Wait for full return G_t<br/>unbiased, high variance]
    M -->|TD 0| TD[r + gamma V s_t+1<br/>bootstrap, low variance]
    M -->|TD lambda| L[blend of n-step returns]

The picture below contrasts how far each method looks before updating: TD(0) peeks one step ahead, MC waits for the whole episode, TD(λ) sits in between.

Worked example. Suppose $V(s_1)=5$, the agent takes a step, gets $r=0$, and lands in $s_2$ with $V(s_2)=9$. With $\alpha=0.1, \gamma=0.9$: TD error $\delta = 0 + 0.9\cdot 9 - 5 = 3.1$, so $V(s_1) \leftarrow 5 + 0.1\cdot 3.1 = 5.31$. The estimate nudges toward 8.1 (its true value) without ever finishing an episode.

Tip

Rule of thumb: TD usually learns faster and works in continuing (non-terminating) tasks, which is why nearly all modern RL is TD-based. MC shines when episodes are short and you distrust your own value estimates.

25.4 — Q-Learning & SARSA

Estimating $V$ alone is not enough to act without a model — you would need $P$ to look ahead. Estimating $Q(s,a)$ sidesteps this: the best action is simply $\arg\max_a Q(s,a)$. Two TD control algorithms learn $Q$ from experience, and they differ in one subtle but crucial way.

SARSA is on-policy: it learns the value of the policy it actually follows, including its exploration. The name is the tuple it uses — $(s, a, r, s', a')$:

\[Q(s,a) \leftarrow Q(s,a) + \alpha\,\big[r + \gamma\,Q(s',a') - Q(s,a)\big]\]

In words: update the value of the move you just made toward the reward you got plus the value of the next move you will really make — exploration included. Also written: $Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\,[\,r + \gamma\,Q(s',a')\,]$.

Here $a'$ is the action the agent actually takes next.

Q-learning is off-policy: it learns the value of the optimal policy regardless of what it does to explore, by using a max over next actions:

\[Q(s,a) \leftarrow Q(s,a) + \alpha\,\big[r + \gamma\,\max_{a'} Q(s',a') - Q(s,a)\big]\]

In words: update toward the reward plus the value of the best move available next, assuming you’ll act greedily from then on — even if your exploration actually does something dumber. Also written: since $V^*(s') = \max_{a'} Q(s',a')$, the target is simply $r + \gamma\,V^*(s')$ — a sampled Bellman optimality backup.

The single difference — $Q(s',a')$ vs $\max_{a'} Q(s',a')$ — gives them different personalities. SARSA accounts for the cost of exploration and tends to learn safer paths; Q-learning learns the optimal path assuming it will eventually act greedily, even if that path is risky during training. The classic illustration is cliff walking: a grid where a shortcut runs along the edge of a cliff (a big negative reward if you fall in). SARSA, feeling the occasional random plunge during exploration, learns a cautious route one row back from the edge; Q-learning, valuing the optimal greedy path, hugs the cliff.

import numpy as np, random
# Q-learning on the 3-state chain, model-free, epsilon-greedy.
Q = np.zeros((3,1))            # one action, so trivial argmax; shown for shape
alpha, gamma, eps = 0.1, 0.9, 0.1
def step(s):                  # env: 'right' from 0->1 (r0), 1->2 (r10)
    return (s+1, 10 if s==1 else 0)
for ep in range(200):
    s = 0
    while s != 2:
        a = 0
        s2, r = step(s)
        target = r + gamma*(0 if s2==2 else Q[s2].max())
        Q[s,a] += alpha*(target - Q[s,a])
        s = s2
print(Q.ravel())   # ~[8.1, 9.0, 0.0]

Warning

Q-learning’s max introduces maximisation bias: taking the max over noisy estimates systematically overestimates value. Double Q-learning (two separate $Q$ tables, one picks the action, the other scores it) is the standard fix, later carried into Double DQN.

25.5 — Deep Q-Networks

Tabular $Q$ needs one cell per state-action pair — impossible for raw pixels. A Deep Q-Network (DQN) replaces the table with a neural network $Q(s,a;\theta)$ that takes a state and outputs a $Q$-value per action. DeepMind’s 2015 Atari result — one network learning dozens of games from pixels and score alone — made this famous. But naively bolting a network onto Q-learning is unstable, for two reasons that DQN fixes with two tricks.

Experience replay. Consecutive transitions are highly correlated, and gradient descent hates correlated data. DQN stores transitions $(s,a,r,s')$ in a replay buffer and trains on random minibatches sampled from it. This breaks correlation and reuses each experience many times — far more sample-efficient.

Target network. The Q-learning target $r + \gamma \max_{a'} Q(s',a';\theta)$ uses the same weights $\theta$ we are updating, so the target moves the instant we take a step — like chasing your own shadow. DQN keeps a frozen copy $\theta^-$ for the target, synced to $\theta$ only every few thousand steps. The loss minimised is:

\[L(\theta) = \mathbb{E}_{(s,a,r,s')\sim \mathcal{D}}\Big[\big(r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta)\big)^2\Big]\]

In words: train the network so its predicted $Q$ for the action you took matches the reward-plus-best-next-value computed by a frozen copy of itself; minimise the squared gap, averaged over random past experiences. Also written: $L(\theta) = \mathbb{E}\big[\delta^2\big]$ with TD error $\delta = y - Q(s,a;\theta)$ and target $y = r + \gamma\max_{a'}Q(s',a';\theta^-)$ (often Huber-clipped instead of squared for stability).

flowchart LR
    E[Env] -->|s,a,r,s'| B[(Replay buffer)]
    B -->|random minibatch| Q[Online net theta]
    B -->|s'| T[Target net theta-]
    T -->|max Q| L[TD loss]
    Q --> L
    L -->|SGD| Q
    Q -.->|copy every N steps| T

A single DQN gradient step in PyTorch is exactly the loss above, with the target detached so gradients never flow through the frozen network:

import torch, torch.nn.functional as F
# online: Q-network theta; target: frozen copy theta-. batch tensors s,a,r,s2,done.
q      = online(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s,a; theta)
with torch.no_grad():
    next_q = target(s2).max(dim=1).values                     # max_a' Q(s',a'; theta-)
    y      = r + gamma * next_q * (1 - done)                   # bootstrap target
loss = F.smooth_l1_loss(q, y)                                 # Huber on the TD error
opt.zero_grad(); loss.backward(); opt.step()
# every N steps: target.load_state_dict(online.state_dict())  # sync theta- <- theta

In practice you rarely write this by hand. Stable-Baselines3 ships a tuned DQN (and PPO, A2C, SAC) you can train on any Gymnasium environment in a few lines:

import gymnasium as gym
from stable_baselines3 import DQN
env   = gym.make("CartPole-v1")
model = DQN("MlpPolicy", env, buffer_size=50_000, target_update_interval=500, verbose=0)
model.learn(total_timesteps=50_000)        # replay buffer + target net handled internally
obs, _ = env.reset()
action, _ = model.predict(obs, deterministic=True)

A family of refinements stacks on top of vanilla DQN, each fixing one weakness; Rainbow is the well-known combination that fuses all of them and beats every piece alone:

Extension	Problem it fixes	One-line idea
Double DQN	maximisation bias (overestimated $Q$)	online net picks the next action, target net scores it
Dueling DQN	many actions share a state’s value	split the head into a state-value $V(s)$ and an advantage $A(s,a)$
Prioritised replay	rare, high-error transitions seen too seldom	sample from the buffer in proportion to TD-error size
Noisy nets	$\varepsilon$-greedy explores blindly	learnable noise in the weights gives state-dependent exploration

Tip

The two tricks attack the same disease — instability — from different angles: replay decorrelates inputs, the target network stabilises outputs. Remove either and DQN often diverges. Most later improvements (Double DQN, Dueling, Prioritised Replay, Rainbow) keep both and refine the details.

25.6 — Policy Gradient & Actor-Critic

DQN learns a value, then acts greedily — it never represents a policy directly. Policy gradient methods do the opposite: they parameterise the policy $\pi_\theta(a \mid s)$ (e.g. a network outputting action probabilities) and adjust $\theta$ via gradient ascent to push up expected return. This handles continuous actions naturally and can learn genuinely stochastic policies.

The foundational result is the policy gradient theorem. Its simplest estimator, REINFORCE, scales the gradient of the log-probability of each action by the return that followed it:

\[\nabla_\theta J(\theta) = \mathbb{E}_\pi\big[\,\nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\,\big]\]

In words: push the knobs of the policy so that whatever action you took becomes more probable in proportion to how much reward followed it — good outcomes upvote their action, bad outcomes downvote it. Also written: as a sum over a trajectory, $\nabla_\theta J(\theta) \approx \sum_t \big(\nabla_\theta \log \pi_\theta(a_t\mid s_t)\big)\,G_t$ — the “log-likelihood times reward” estimator.

Intuition: make actions that led to high return more likely, low return less likely. But $G_t$ has huge variance. The cure is to subtract a baseline — typically a learned value estimate $V(s)$ — giving the advantage $A_t = G_t - V(s_t)$: “how much better than expected was this action?”

Worked example — why the baseline matters. In a state the policy gives two actions probability 0.5 each. Both happen to return a large positive number — say $G=100$ and $G=98$. Raw REINFORCE multiplies each log-prob gradient by $\approx 100$, so it shoves both actions up hard and learns almost nothing about which is better — the signal is drowned by the shared offset. Now subtract a baseline $V(s)=99$: the advantages become $A=+1$ and $A=-1$. The first action is gently up-voted, the second down-voted — same outcomes, but now the update reflects the relative quality, which is all that matters. The baseline does not change the gradient’s average direction; it only strips out the noisy common offset.

That pairing is actor-critic. The actor is the policy $\pi_\theta$; the critic is the value estimator $V_\phi$ that supplies the baseline and bootstraps the return. A2C (Advantage Actor-Critic) is the clean synchronous version. The two halves form a tight feedback loop:

flowchart LR
    S[State s] --> AC[Actor pi_theta]
    AC -->|action a| ENV[Environment]
    ENV -->|reward r, next state| CR[Critic V_phi]
    CR -->|advantage A| AC
    CR -->|TD error| CR2[Update critic]

PPO (Proximal Policy Optimization) is today’s workhorse — stable, simple to tune, and the algorithm behind RLHF (training chat models from human preference rewards). Its insight: a big policy update can be catastrophic, so PPO clips the update so the new policy never moves too far from the old one in a single step. Concretely, it limits the probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$ to a narrow band around 1, so one noisy batch cannot wreck a good policy. The clipped objective it maximises is:

\[L^{\text{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,A_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,A_t\big)\Big]\]

In words: improve the policy in the direction the advantage points, but if the new policy strays more than a fraction $\epsilon$ (say 20%) from the old one, stop giving credit for going further — the clip removes the incentive to take a reckless leap. Also written: with the unclipped term $u_t = r_t(\theta)A_t$ and clipped term $c_t = \operatorname{clip}(r_t,1-\epsilon,1+\epsilon)A_t$, the objective is just $\mathbb{E}_t[\min(u_t,c_t)]$ — a pessimistic (lower) bound on the improvement.

Think of the clip as a fence. The policy ratio is free to move, but the moment it tries to wander past $1\pm\epsilon$ the gradient flattens — there’s no more reward for straying, so the update gets pulled back inside the safe band.

Method	Learns	Variance	Notes
REINFORCE	policy only	high	simplest, no critic
A2C	policy + value	medium	advantage baseline
PPO	policy + value	low	clipped updates; default for RLHF

import numpy as np
# REINFORCE gradient for one trajectory (pseudo, terms only).
# grad = sum_t  grad_log_pi(a_t|s_t) * (G_t - V(s_t))
def reinforce_grad(grad_logps, returns, baselines):
    adv = np.array(returns) - np.array(baselines)
    return sum(g*a for g, a in zip(grad_logps, adv))   # advantage-weighted

Training PPO for real is a one-liner with Stable-Baselines3 — the clipping, advantage estimation (GAE), and multi-epoch reuse are all built in:

import gymnasium as gym
from stable_baselines3 import PPO
model = PPO("MlpPolicy", gym.make("LunarLander-v3"),
            clip_range=0.2, n_epochs=10, gae_lambda=0.95, verbose=0)
model.learn(total_timesteps=200_000)

For RLHF on language models, the same PPO loop lives inside Hugging Face TRL’s PPOTrainer, where the “environment” is text generation and the reward comes from a learned preference model rather than a game score.

Warning

Policy gradients are on-policy: each batch of data is valid only for the policy that generated it, so you cannot freely reuse old experience as DQN does. PPO stretches this with a few epochs of reuse per batch, but throughput, not sample count, is usually its bottleneck.

25.7 — Generalized Advantage Estimation

The advantage $A_t = G_t - V(s_t)$ from the last section hides one of the most consequential dials in modern policy-gradient RL — and the SB3 code above silently set it (gae_lambda=0.95). The question it answers: how do you estimate the advantage? It is the exact MC-vs-TD tension from Section 25.3, now wearing a different hat.

Intuition. You can measure “how much better than expected was this action” two ways. Use the full Monte-Carlo return and the estimate is honest (unbiased) but jittery (high variance) — one lucky reward twenty steps later swings it. Use a one-step TD bootstrap and it is steady (low variance) but leans on your possibly-wrong value estimate (biased). Generalized Advantage Estimation (GAE) dials smoothly between the two, just as TD(λ) did for value learning.

The building block is the one-step TD error of the critic, $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$. GAE sums these errors into the future with an exponential decay $\lambda$:

\[\hat{A}_t^{\text{GAE}(\gamma,\lambda)} = \sum_{l=0}^{\infty} (\gamma\lambda)^l\,\delta_{t+l}\]

In words: the advantage is a discounted running total of every “surprise” from now onward — recent surprises count fully, distant ones fade — with $\lambda$ controlling how far into the future the surprises still matter. Also written: recursively (the form actually coded), $\hat{A}_t = \delta_t + \gamma\lambda\,\hat{A}_{t+1}$ — today’s advantage is today’s TD error plus a decayed copy of tomorrow’s advantage.

The two extremes recover old friends: $\lambda = 0$ gives $\hat{A}_t = \delta_t$, the pure one-step (low-variance, biased) advantage; $\lambda = 1$ gives $\hat{A}_t = G_t - V(s_t)$, the full Monte-Carlo (unbiased, high-variance) advantage. Values like $\lambda = 0.95$ sit just shy of MC — keeping most of the unbiasedness while shaving off the worst of the variance — which is why nearly every PPO and A2C run uses it.

import numpy as np
# GAE-lambda from a rollout: rewards r, values V (len T+1, V[-1] bootstraps), masks.
def gae(r, V, gamma=0.99, lam=0.95, done=None):
    T = len(r); A = np.zeros(T); gae_t = 0.0
    done = np.zeros(T) if done is None else done
    for t in reversed(range(T)):
        nonterminal = 1.0 - done[t]
        delta = r[t] + gamma * V[t+1] * nonterminal - V[t]   # one-step TD error
        gae_t = delta + gamma * lam * nonterminal * gae_t     # recursive accumulation
        A[t]  = gae_t
    returns = A + V[:T]                                        # critic target = advantage + baseline
    return A, returns

Tip

GAE is the single most impactful “free” tuning knob in actor-critic RL. If PPO is learning but unstable, nudge $\lambda$ down (less variance, more bias); if it is stable but slow and biased, nudge it up. The $\gamma\lambda$ product, not $\lambda$ alone, sets the effective horizon of credit assignment.

25.8 — Exploration vs Exploitation & Bandits

Every RL agent faces the same dilemma: exploit the best-known action to earn reward now, or explore an uncertain action that might be better. Always exploiting locks in early mistakes; always exploring never cashes in. The cleanest setting to study this is the multi-armed bandit — a one-state MDP, just $k$ slot-machine arms with unknown payout distributions, where you maximise total reward over many pulls.

Three strategies dominate:

ε-greedy. With probability $\varepsilon$ pull a random arm, otherwise pull the current best. Dead simple; the flaw is that exploration is undirected — it wastes pulls on obviously-bad arms as often as on promising ones.

UCB (Upper Confidence Bound). Be optimistic under uncertainty: pick the arm with the highest upper confidence bound, rewarding both high estimated value and high uncertainty.

\[a_t = \arg\max_i \left[\hat{\mu}_i + c\sqrt{\frac{\ln t}{N_i}}\right]\]

In words: pick the arm with the best optimistic estimate — its average payoff so far plus a “could be better than it looks” bonus that is big when you’ve barely tried it. Also written: $a_t = \arg\max_i [\hat\mu_i + \text{bonus}_i]$ where $\text{bonus}_i = c\sqrt{\ln t / N_i}$ shrinks like $1/\sqrt{N_i}$ as the arm is sampled more.

Arms tried rarely ($N_i$ small) get a large bonus, so they get a fair trial; as $N_i$ grows the bonus shrinks and UCB settles on the true best.

Thompson sampling. Keep a posterior distribution over each arm’s value, sample one value per arm, and pull the arm with the highest sample. Exploration emerges naturally from posterior uncertainty — wide posteriors sometimes sample high. It is Bayesian, elegant, and empirically often the strongest.

import numpy as np
# UCB on a 3-arm bandit.
true = [0.2, 0.5, 0.75]; k = 3
N = np.zeros(k); mu = np.zeros(k)
for t in range(1, 2001):
    bonus = np.sqrt(2*np.log(t)/np.where(N==0, 1e-9, N))
    a = np.argmax(mu + bonus)                  # optimism
    r = 1.0 if np.random.rand() < true[a] else 0.0
    N[a] += 1; mu[a] += (r - mu[a])/N[a]       # running mean
print(N)   # most pulls land on arm 2 (the best)

Tip

Bandits are not just a toy: they run A/B testing, ad selection, and clinical trials. When there is genuinely one state (no consequences carrying forward), reach for a bandit algorithm before a full RL stack — it converges far faster.

25.9 — Multi-Agent RL

So far the environment was a fixed, indifferent world. In multi-agent RL (MARL) several learning agents share an environment, and each one’s reward depends on what the others do. This breaks a core assumption: from any single agent’s view the environment is non-stationary, because the other agents are also changing their policies as they learn. A strategy that was optimal yesterday can be terrible today — the target keeps moving.

MARL spans a spectrum of incentive structures:

Cooperative — agents share a reward and must coordinate (a team of warehouse robots).
Competitive — one’s gain is another’s loss, the zero-sum case (chess, poker).
Mixed — partly aligned, partly opposed (autonomous cars merging: nobody wants a crash, everybody wants to go first).

flowchart TD
    R[Multi-agent settings] --> C[Cooperative<br/>shared reward]
    R --> M[Mixed<br/>partial conflict]
    R --> Z[Competitive<br/>zero-sum]
    Z --> SP[Self-play]

The most powerful tool here is self-play: an agent improves by playing against copies of itself. As it gets stronger, so does its opponent, producing an automatic curriculum that scales from random flailing to superhuman without any external teacher. Self-play drove the breakthroughs in Go, chess, StarCraft, and Dota.

Warning

Naive independent learners (each running its own DQN, treating others as part of the environment) often fail to converge in MARL precisely because of non-stationarity. Techniques like centralised training with decentralised execution (CTDE) — share information while learning, act independently at test time — are the practical workaround.

25.10 — Landmark Systems: AlphaGo, AlphaZero, MuZero

DeepMind’s Alpha series is the showcase for how the pieces in this chapter combine, and it traces a clean arc of removing human input and removing the model.

AlphaGo (2016) beat world champion Lee Sedol at Go, a game long thought a decade away. It combined three ingredients: a policy network and value network (deep nets predicting good moves and who is winning), bootstrapped from human expert games, then refined by self-play, all steered at play-time by Monte Carlo Tree Search (MCTS) — a lookahead that uses the networks to focus the search on promising lines.

AlphaZero (2017) dropped the human games entirely. Starting from random play and knowing only the rules, it learned Go, chess, and shogi from pure self-play, a single network outputting both policy and value, trained by MCTS-guided self-play. It surpassed AlphaGo and the best classical chess engines — a striking demonstration that human knowledge was a crutch, not a requirement.

MuZero (2019) removed the last given: the rules themselves. It learns its own internal model of the environment’s dynamics — just enough to plan — without ever being told how the game works. The same algorithm masters board games and Atari from pixels, planning in a learned latent space.

flowchart LR
    AG[AlphaGo<br/>human games + self-play + MCTS] --> AZ[AlphaZero<br/>self-play only, knows rules]
    AZ --> MZ[MuZero<br/>learns the model too]

The unifying engine is MCTS guided by learned networks: the value network supplies the bootstrapped estimate (Section 25.3), the policy network biases exploration the way UCB does (Section 25.8), and self-play provides the non-stationary curriculum (Section 25.9). Every theme in the chapter shows up at once.

Tip

The throughline is removal: AlphaGo removed the handcrafted evaluation, AlphaZero removed human data, MuZero removed the known model. Each step generalises the system to more problems — the practical payoff of methods that learn from interaction rather than instruction.

25.11 — Imitation learning: learning to act by watching

The cleanest way to teach a skill is often to demonstrate it. A driving instructor does not hand you a reward function and tell you to maximize discounted future safety; they drive, you watch, you copy. Imitation learning formalizes that idea: instead of learning from a scalar reward signal, the agent learns from a dataset of expert trajectories — sequences of states paired with the actions the expert took there.

This sidesteps the single hardest part of reinforcement learning: designing a reward. For tasks like “drive like a careful human” or “fold a towel,” writing down a reward that captures what you actually want is brutally hard, but producing demonstrations is easy. The cost reappears elsewhere, as we will see, but for many problems the trade is worth it.

Behavioral cloning: imitation as supervised learning

The most direct approach treats imitation as plain supervised learning. Collect a dataset $\mathcal{D} = \{(s_i, a_i)\}$ from the expert, then fit a policy $\pi_\theta(a \mid s)$ to predict the expert’s action at each state — exactly as you would train a classifier (discrete actions) or a regressor (continuous actions). This is behavioral cloning (BC).

Tip

Behavioral cloning is the “just memorize what the expert did” baseline. No environment interaction, no reward, no rollouts during training — one pass of supervised learning over logged demonstrations. When it works, it is by far the cheapest option.

The objective is a standard maximum-likelihood (or regression) loss:

\[ \theta^\star = \arg\max_\theta \sum_{(s,a)\in\mathcal{D}} \log \pi_\theta(a \mid s). \]

The compounding-error problem

Behavioral cloning has a subtle, fatal flaw, and it comes from a broken assumption. Supervised learning assumes the training and test inputs are drawn from the same distribution. But a policy generates its own future states. The states the cloned policy visits depend on the actions it takes, and those actions are imperfect — so the moment the agent makes a small error, it nudges itself into a state slightly off the expert’s path. That new state was rare or absent in the demonstrations, so the policy is even less sure what to do, makes a larger error, drifts further off-distribution, and so on. Errors compound.

The classic analysis makes the damage precise. If the cloned policy disagrees with the expert with probability at most $\epsilon$ on states from the expert’s distribution, the expected extra cost over a horizon $T$ grows like

\[ \text{regret} = O(\epsilon\, T^2), \]

not the $O(\epsilon T)$ you would naively hope for. The extra factor of $T$ is the cost of leaving the distribution: one mistake puts you in a region where mistakes are more likely, and there is no correction signal pulling you back.

You can feel this in a car: clone a human driver who always stays centered in the lane, and the demonstrations contain no examples of recovering from the shoulder — because the expert never went there. The first time the cloned policy drifts toward the shoulder, it has no idea how to steer back, because that situation never appeared in training.

DAgger: let the expert label the states you actually visit

The fix follows directly from the diagnosis. The problem is a mismatch between the states the expert visited and the states the learner visits. So gather labels on the learner’s own state distribution. DAgger (Dataset Aggregation) does exactly this, iteratively:

flowchart LR
    A[Train policy π on dataset D] --> B[Roll out π in the environment]
    B --> C[Collect states π actually visits]
    C --> D[Ask expert for correct action at each]
    D --> E[Aggregate into D]
    E --> A

Train $\pi$ on the current dataset $\mathcal{D}$ by behavioral cloning.
Run $\pi$ in the environment to collect the states it actually visits — including the off-distribution ones.
Query the expert for the correct action at each of those states.
Add these new (state, expert-action) pairs to $\mathcal{D}$ and repeat.

After a few iterations the dataset covers the recovery situations the learner stumbles into, and the policy learns to climb back onto the expert’s path. DAgger provably reduces regret from $O(\epsilon T^2)$ back to $O(\epsilon T)$. The catch is the requirement of an interactive expert — a human (or oracle) who can be queried for the right action at arbitrary, possibly weird states. That is exactly the towel-half-folded-and-twisted state nobody wants to label, and it is the practical bottleneck of DAgger.

Inverse reinforcement learning: recover the reward, not the actions

Behavioral cloning copies what the expert did. Inverse reinforcement learning (IRL) asks the deeper question: why did the expert do it — what reward were they optimizing? If you can recover that reward $R$, you can then run ordinary RL against it and obtain a policy that generalizes to new states, new dynamics, even new goals, rather than parroting a fixed mapping.

The intuition: among all reward functions, find one under which the expert’s behavior looks optimal (or near-optimal). The trouble is this is badly under-determined — the all-zero reward makes every policy optimal, so the expert is trivially “optimal” under it. IRL needs a principle to break the tie.

Maximum-entropy IRL is the dominant answer. It models the expert as noisily-rational: the probability of a trajectory $\tau$ is proportional to the exponential of its total reward,

\[ p(\tau) \propto \exp\!\Big(\sum_{t} R_\theta(s_t, a_t)\Big), \]

so higher-reward trajectories are exponentially more likely, but suboptimal ones are not forbidden. Fitting $R_\theta$ by maximum likelihood picks the reward that explains the demonstrations while being maximally noncommittal (highest entropy) about everything else — the principled tie-breaker. Modern variants (e.g. adversarial IRL) learn $R_\theta$ with a neural network using a GAN-like setup, where a discriminator separating expert from policy trajectories implicitly defines the reward.

Warning

IRL is far more expensive than behavioral cloning: the classic formulation runs a full RL solve in an inner loop for every candidate reward, and the recovered reward is only identifiable up to shaping transformations. Reach for IRL when you need a reward that transfers to new situations; if you only need to mimic the expert in the same setting, BC or DAgger is much cheaper.

Method	Needs reward?	Needs env interaction?	Needs interactive expert?	Fixes compounding error?	Transfers to new settings?
Behavioral cloning	no	no	no	no	no
DAgger	no	yes	yes	yes	no
Inverse RL	learns it	yes	no	yes (via re-solve)	yes

25.12 — Bandits and exploration, made precise

Strip a reinforcement learning problem down until there is only one state, and what remains is a multi-armed bandit. You face $K$ slot machines (“arms”); pulling arm $a$ returns a reward drawn from an unknown distribution with mean $\mu_a$. Each round you pick one arm, see its reward, and want to maximize total reward over $n$ rounds. There is no state transition to reason about — only the pure tension between exploiting the arm that looks best so far and exploring arms you haven’t sampled enough to trust.

This stripped-down setting is where the explore–exploit trade-off can be analyzed exactly, and the algorithms that fall out (UCB, Thompson sampling) reappear, lightly modified, throughout full RL.

Regret: the right way to keep score

Raw cumulative reward is hard to interpret because it depends on the unknown means. The clean scoreboard is regret: how much worse you did than an oracle that knew the best arm $a^\star$ (with mean $\mu^\star = \max_a \mu_a$) and pulled it every round.

\[ \text{Regret}(n) = n\,\mu^\star - \mathbb{E}\!\left[\sum_{t=1}^{n} \mu_{a_t}\right] = \sum_{a} \Delta_a\, \mathbb{E}[N_a(n)], \]

In words: regret is the reward you left on the table — what a know-it-all who always pulled the best arm would have earned, minus what you actually earned. Also written: the rightmost form $\sum_a \Delta_a\,\mathbb{E}[N_a(n)]$ says total regret is “how bad each arm is” times “how often you pulled it,” summed over arms.

where $\Delta_a = \mu^\star - \mu_a$ is the gap of arm $a$ and $N_a(n)$ is how many times you pulled it. The second form is the key insight: regret is just the sum, over suboptimal arms, of how bad each arm is times how often you pulled it. Good algorithms keep $N_a(n)$ small for high-gap arms.

The gold standard is sublinear regret: $\text{Regret}(n)/n \to 0$, meaning your average reward converges to the optimum. The best achievable rate is logarithmic, $O(\log n)$ — you can never stop exploring entirely, but you can explore ever more rarely. Any algorithm with linear regret (a constant fraction of pulls wasted forever) has failed.

Three strategies

$\epsilon$-greedy is the blunt instrument. With probability $1-\epsilon$ pull the arm with the best estimated mean; with probability $\epsilon$ pull a uniformly random arm. Dead simple, and it works — but a fixed $\epsilon$ wastes a constant fraction of pulls on exploration forever, giving linear regret. You must decay $\epsilon_t \sim 1/t$ to recover the $O(\log n)$ rate, and even then it explores indiscriminately, wasting pulls on arms it should already know are bad.

UCB (Upper Confidence Bound) is smarter: be optimistic in the face of uncertainty. For each arm keep its empirical mean $\hat\mu_a$ plus a bonus that is large when the arm is under-sampled:

\[ \text{UCB}_a(t) = \hat\mu_a + \sqrt{\frac{2\ln t}{N_a(t)}}, \]

and pull the arm with the highest UCB. The bonus shrinks as $N_a$ grows, so a rarely-pulled arm gets the benefit of the doubt — you pull it because you are uncertain, not at random. An arm gets dropped only once enough samples confirm its mean really is low. UCB achieves the optimal $O(\log n)$ regret and explores in a targeted way.

Thompson sampling is the Bayesian approach, and often the best in practice. Keep a posterior distribution over each arm’s mean. Each round, sample one plausible mean from each arm’s posterior and pull the arm with the highest sampled value; then update that arm’s posterior with the observed reward. Arms with wide posteriors (little data) sometimes sample high and get explored; as data accumulates, posteriors sharpen and exploration fades automatically — “probability matching” rather than an explicit bonus.

flowchart TB
    subgraph eps["ε-greedy"]
        E1[best arm w.p. 1−ε] --- E2[random arm w.p. ε]
    end
    subgraph ucb["UCB"]
        U1[score = mean + uncertainty bonus] --- U2[pull highest score]
    end
    subgraph ts["Thompson"]
        T1[sample mean from each posterior] --- T2[pull highest sample]
    end

Worked example: three arms, by hand

Three coin-flip arms with true means $\mu = (0.2,\ 0.5,\ 0.8)$; arm 3 is best. Suppose after some play the counts and empirical means are:

Arm	$N_a$	$\hat\mu_a$	bonus $\sqrt{2\ln t / N_a}$ at $t=50$	UCB
1	20	0.25	$\sqrt{2\cdot3.91/20}=0.626$	0.876
2	20	0.45	0.626	1.076
3	10	0.70	$\sqrt{2\cdot3.91/10}=0.884$	1.584

UCB pulls arm 3: it has both the best mean and the largest bonus (fewest pulls), so optimism and exploitation agree. Now contrast with $\epsilon$-greedy at $\epsilon=0.1$: it pulls arm 3 (highest $\hat\mu$) 90% of the time but burns the other 10% split evenly between arms 1, 2, and 3 — including pulls of arm 1, which it already has strong evidence is terrible. That wasted exploration of a clearly-bad arm is exactly what UCB and Thompson avoid.

import numpy as np
rng = np.random.default_rng(0)
mu = np.array([0.2, 0.5, 0.8])           # true means, arm 3 best
K, n = 3, 2000
counts = np.zeros(K); sums = np.zeros(K)
regret = 0.0; mu_star = mu.max()
for t in range(1, n+1):
    if (counts == 0).any():              # pull each arm once first
        a = int(np.argmin(counts))
    else:
        ucb = sums/counts + np.sqrt(2*np.log(t)/counts)
        a = int(np.argmax(ucb))
    r = rng.random() < mu[a]             # Bernoulli reward
    counts[a] += 1; sums[a] += r
    regret += mu_star - mu[a]            # gap incurred this round
print(counts)        # ~ most pulls on arm 3
print(round(regret)) # grows like log(n), not linearly

Warning

A fixed exploration rate is the most common bandit mistake: constant $\epsilon$ never stops paying the exploration tax, so its regret grows linearly and average reward never reaches the optimum. Either decay $\epsilon$ over time or use an uncertainty-aware method (UCB / Thompson) whose exploration shrinks automatically as evidence accumulates.

25.13 — POMDPs: acting when you can’t see the whole state

Every method so far quietly assumed you can see the true state. Reality rarely cooperates. A robot’s camera is occluded, a medical test is noisy, a poker opponent’s cards are hidden. A partially observable Markov decision process (POMDP) is the honest model: the world still has a true state $s$ that evolves as a Markov process, but you never observe it directly. Instead, after each action you receive an observation $o$ drawn from $O(o \mid s', a)$ — a noisy, partial clue about where you ended up.

The defining difficulty: the optimal action can depend on the entire history of past actions and observations, not just the latest clue. A single sensor reading “beep” might mean “target is left” or “target is right” depending on everything that came before. You cannot act well by reacting to the last observation alone.

The belief state: a sufficient statistic for history

The key move that rescues tractability: you do not need to remember the raw history. You only need the belief state $b$ — a probability distribution over the true states, $b(s) = \Pr[\text{state} = s \mid \text{history}]$. The belief is a sufficient statistic: it compresses everything the history tells you about where you are. Two different histories that induce the same belief call for the same action.

Each step, the belief is updated by Bayes’ rule. After taking action $a$ and seeing observation $o$:

\[ b'(s') = \eta\; O(o \mid s', a) \sum_{s} T(s' \mid s, a)\, b(s), \]

In words: your new belief that you’re in state $s'$ = (how likely the dynamics would have carried you there) × (how likely that state would have produced the clue you just saw), then rescaled so all the beliefs add to 1. Also written: $b' \propto O_a \odot (T_a\, b)$ — element-wise multiply the predicted belief $T_a b$ by the observation likelihoods $O_a$, then normalize.

where $\eta$ normalizes. This is exactly the predict-then-correct loop of a Bayes filter: the sum predicts the new state from dynamics, and multiplying by $O$ corrects using the fresh observation.

Once you track beliefs, a POMDP becomes a fully-observed MDP — but over the continuous belief space, a simplex of distributions. That is the curse: you have traded a discrete state space for a continuous one of probability vectors.

flowchart LR
    B[belief b] -->|choose action a| ACT[act]
    ACT --> ENV[hidden world: true state transitions]
    ENV -->|observation o| UPD[Bayes update]
    B --> UPD
    UPD -->|new belief b'| B

Alpha vectors: why the value function is piecewise-linear

Here is the structural miracle that makes POMDPs solvable. The optimal value function over belief space is piecewise-linear and convex (PWLC). It can be written as the upper envelope of a finite set of linear functions, each represented by a vector $\alpha$ (an alpha vector) with one entry per state:

\[ V(b) = \max_{\alpha \in \Gamma}\; \sum_{s} \alpha(s)\, b(s) = \max_{\alpha \in \Gamma}\; \alpha \cdot b. \]

In words: each candidate plan scores your current belief by a simple weighted sum; the value of the belief is whichever plan scores highest. Also written: $V(b) = \max_{\alpha\in\Gamma} \alpha^\top b$ — the top edge (upper envelope) of a bundle of straight lines, one per plan.

Each $\alpha$ vector corresponds to a conditional plan — “do this, then react to the next observation thus” — and its value is linear in the belief because expected reward is linear in the state probabilities. At any belief $b$, the best plan is whichever $\alpha$ gives the highest dot product. Convexity has a satisfying meaning: value is highest at the corners of the simplex (you are certain of the state) and lowest in the middle (maximum confusion), because certainty is worth money.

Point-based value iteration (PBVI): plan only where you’ll actually be

Exact POMDP solving tracks every alpha vector, and their number explodes with the horizon — exact methods choke beyond toy problems. The breakthrough approximation is point-based value iteration. Instead of maintaining the value function over the whole simplex, sample a finite set of representative belief points $B = \{b_1, \dots, b_m\}$ — typically beliefs actually reachable from the start — and maintain just one alpha vector per point. Back up values only at those points.

The bet: the beliefs you actually encounter are a thin, low-dimensional sliver of the full simplex, so a value function pinned down at a few hundred reachable beliefs generalizes well to the rest (its PWLC structure interpolates between them). PBVI and its descendants (Perseus, SARSOP) turned POMDPs from textbook curiosities into something you can run on real robot navigation problems.

Online solvers: POMCP and DESPOT

Offline methods compute a policy for all beliefs in advance. Online solvers do the opposite — they plan from the current belief only, every step, by looking ahead with a search tree, then throw the plan away and re-plan after the next observation. This scales to enormous state spaces because you only reason about what is reachable from where you are right now.

POMCP (Partially Observable Monte Carlo Planning) is Monte Carlo Tree Search lifted to POMDPs. It runs simulations from the current belief through a tree whose nodes alternate action and observation branches, using random particles (sampled states) to represent the belief — no explicit transition or observation matrices needed, just a simulator. DESPOT tames the branching by evaluating the tree against a small fixed set of sampled “scenarios” (pre-drawn random outcomes), which sharply bounds the search width and comes with regret guarantees. Both are the practical workhorses when the state space is too large for PBVI.

QMDP: the cheap approximation that ignores future sensing

Sometimes you want something far simpler. QMDP makes one bold simplifying assumption: after the next step, the world becomes fully observable. Concretely, solve the underlying MDP (pretending you can see the state) to get state-action values $Q_{\text{MDP}}(s, a)$, then choose the action that is best on average under the current belief:

\[ a^\star = \arg\max_a \sum_s b(s)\, Q_{\text{MDP}}(s, a). \]

This is cheap — one MDP solve plus a belief-weighted average — and surprisingly effective. Its blind spot is structural: because it assumes future full observability, QMDP never values information-gathering. It will never take an action whose only purpose is to reduce uncertainty (peek around a corner, run a diagnostic test), because it believes uncertainty will vanish on its own next step. If your task requires deliberately sensing before acting, QMDP fails and you need a real POMDP solver.

Warning

QMDP’s pitfall is concrete: it cannot plan to gather information. A QMDP robot uncertain whether the goal is left or right will dither between the two rather than first moving to a vantage point that resolves the ambiguity — because it wrongly assumes the ambiguity resolves itself. Use it as a fast baseline, not when active sensing is the whole point.

Method	Offline / Online	Handles big state space?	Values information-gathering?	Cost
Exact (alpha vectors)	offline	no (toy only)	yes	extreme
PBVI / SARSOP	offline	moderate	yes	medium
POMCP / DESPOT	online	yes (with simulator)	yes	per-step
QMDP	offline	yes (MDP-sized)	no	cheap

25.14 — Optimal control and LQR: the continuous cousin of RL

Reinforcement learning grew up alongside an older, more mathematical sibling: optimal control. Both ask the same question — choose actions over time to optimize a long-run objective — but control theory studies the case where the dynamics are known and the state and action are continuous. Where RL says “state, action, reward, policy, value,” control says “state, control input, cost, controller, cost-to-go.” Same animal, different dialect; and one special case, the linear-quadratic regulator (LQR), is so clean it can be solved in closed form, making it the bedrock example that links the two fields.

The setup: linear dynamics, quadratic cost

LQR is the case where two things line up perfectly. The dynamics are linear — the next state is a matrix times the current state plus a matrix times the control:

\[ x_{t+1} = A x_t + B u_t, \]

and the cost to minimize is quadratic in state and control:

\[ J = \sum_{t=0}^{T} \big(x_t^\top Q\, x_t + u_t^\top R\, u_t\big). \]

In words: add up, over all time, a penalty for being away from the target ($x^\top Q x$) plus a penalty for shoving the controls hard ($u^\top R u$); the controller minimizes that total. Also written: for a scalar system this is just $J = \sum_t (q\,x_t^2 + r\,u_t^2)$ — a sum of squared error and squared effort.

Here $Q \succeq 0$ penalizes being away from the origin (the target state) and $R \succ 0$ penalizes large control effort. The tension is the whole story: $Q$ says “get to the goal fast,” $R$ says “but don’t slam the actuators.” A cruise controller lives this trade-off — $Q$ wants the exact set speed now, $R$ wants to avoid flooring the throttle.

The solution: a linear controller from the Riccati equation

The beautiful result: the optimal controller is linear in the state,

\[ u_t = -K_t\, x_t, \]

In words: the optimal control is just “look at where you are and push back proportionally toward the target” — a thermostat-style feedback law, with the gain matrix $K_t$ deciding how hard to push. Also written: component-wise for a scalar system, $u_t = -k\,x_t$ — bigger error, bigger correction.

a constant-feedback gain matrix $K_t$ that you compute offline, before ever running the system. The optimal cost-to-go is quadratic, $V_t(x) = x^\top P_t\, x$, mirroring the PWLC structure we saw for POMDPs — here the value function’s shape is exactly a paraboloid. The matrix $P_t$ is found by sweeping backward in time through the discrete-time Riccati equation:

\[ P_t = Q + A^\top P_{t+1} A - A^\top P_{t+1} B \big(R + B^\top P_{t+1} B\big)^{-1} B^\top P_{t+1} A, \]

starting from $P_T = Q$, and then the gain is read off as

\[ K_t = \big(R + B^\top P_{t+1} B\big)^{-1} B^\top P_{t+1} A. \]

Over an infinite horizon, $P_t$ converges to a fixed matrix $P$ (the algebraic Riccati equation) and the gain $K$ becomes constant — a single time-invariant feedback law. This is the same Bellman backup as in RL, only the recursion closes in matrix algebra instead of a table or a network.

flowchart LR
    A["set P_T = Q"] --> B["Riccati backup:<br/>P_t from P_t+1"]
    B --> C{"t = 0?"}
    C -->|no| B
    C -->|yes| D["gains K_t = (R+BᵀP B)⁻¹ BᵀP A"]
    D --> E["run forward:<br/>u_t = −K_t x_t"]

Worked example: settling a 1-D system

Take a scalar system: $x_{t+1} = x_t + u_t$ (so $A=B=1$), with $Q=1$, $R=1$, over an infinite horizon. The algebraic Riccati equation becomes $P = 1 + P - P^2/(1+P)$, i.e. $P^2/(1+P) = 1$, giving $P^2 - P - 1 = 0$, so $P = (1+\sqrt5)/2 \approx 1.618$ — the golden ratio. The gain is $K = P/(1+P) \approx 0.618$, and the closed-loop dynamics are $x_{t+1} = (1-K)x_t \approx 0.382\, x_t$: the state decays geometrically to zero, smoothly and stably, without us ever simulating a single rollout.

import numpy as np
A=B=np.array([[1.0]]); Q=R=np.array([[1.0]])
P = Q.copy()
for _ in range(200):                       # iterate Riccati to fixed point
    K = np.linalg.solve(R + B.T@P@B, B.T@P@A)
    P = Q + A.T@P@A - A.T@P@B@K
print("P =", P.ravel())   # ~1.618, the golden ratio
print("K =", K.ravel())   # ~0.618
x = 1.0                                     # roll out the optimal controller
for t in range(8):
    u = -(K.item())*x; x = x + u
    print(f"t={t} x={x:.4f}")               # decays ~0.382^t toward 0

Why this matters for RL

LQR is the Rosetta Stone between control and RL. It shows, in a case you can solve by hand, that the optimal policy is a feedback law and the value function has fixed structure — the very things deep RL tries to approximate when dynamics are unknown or nonlinear. The connection is load-bearing in practice:

iLQR / DDP linearize a nonlinear system around a trajectory and apply LQR repeatedly, the backbone of model-based control for robotics.
Model-based RL learns $A, B$ (or a nonlinear dynamics model) from data, then plans with control machinery.
The linear-quadratic-Gaussian (LQG) regulator bolts a Kalman filter onto LQR to handle noisy observations — LQR’s answer to partial observability, and a direct continuous-state echo of the belief-state idea from POMDPs.

Tip

The cleanest way to internalize the RL–control link: an MDP with linear dynamics and quadratic reward is LQR, and its “learned” optimal policy is exactly the Riccati feedback law. If your real problem is nearly linear near the operating point, do not reach for deep RL — linearize and solve the Riccati equation. It is faster, exact, and comes with stability guarantees that learned policies cannot promise.

25.15 — Model-based RL: learn the world, then plan in it

Every algorithm so far has been model-free: it learns a value or a policy straight from reward, never building an explicit picture of how the environment works. Model-based RL takes the other road. First learn a dynamics model $\hat{P}(s' \mid s, a)$ and reward model $\hat{R}(s,a)$ from experience, then use that learned model to plan — to imagine the consequences of actions without paying for them in the real world.

Intuition. Model-free RL is learning to drive purely by crashing until you stop crashing — every lesson costs a real collision. Model-based RL is building a mental simulator of the car and the road, then practicing thousands of turns in your head between real drives. The mental rehearsals are nearly free, so you need far fewer real ones. That is the whole appeal: sample efficiency. Model-free methods can need millions of environment steps; a good model lets you squeeze far more learning out of each real interaction — decisive when real steps are slow, costly, or dangerous (robots, healthcare, industrial control).

The price is model bias: the learned model is wrong, and a planner is ruthless about exploiting its errors. Plan a long rollout inside a slightly-off model and the imagined trajectory drifts into fantasy — the planner finds an “amazing” action that only works because the model is hallucinating. Managing this trade-off between cheap imagined data and untrustworthy imagined data is the central craft of the field.

flowchart LR
    ENV[Real environment] -->|real transitions| D[(Experience)]
    D -->|fit dynamics + reward| M[Learned model P̂, R̂]
    M -->|imagined rollouts| PLAN[Plan / improve policy]
    PLAN -->|action| ENV

Two ways to use a model

A learned model buys you two distinct things, and most systems pick one:

Background planning — generate fake data to train on. The classic is Dyna-Q: interleave real Q-learning updates with extra updates on transitions imagined by the model. Each real step is amplified into many cheap synthetic ones, so the value function converges in far fewer real steps. The update is the ordinary Q-learning backup, just fed model-generated $(s,a,r,s')$:

\[Q(s,a) \leftarrow Q(s,a) + \alpha\big[\hat{R}(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a)\big]\]

In words: run the exact same Q-learning update you already know, but on a transition the model dreamed up instead of one you actually lived — free practice between real moves. Also written: identical to the Section 25.4 backup with the real environment swapped for $(\hat{R}, \hat{P})$; Dyna simply runs $n$ of these planning updates per real step.

Decision-time planning — look ahead from the current state. Instead of pre-baking a policy, plan on demand: from where you are right now, use the model to simulate candidate futures and pick the best first action, then re-plan next step. MCTS (the engine inside AlphaZero, Section 25.10) is the discrete case; for continuous control, Model Predictive Control (MPC) with samplers like CEM or MPPI rolls out many action sequences through the model, keeps the best-scoring ones, and executes only the first action before re-planning.

The modern crop

System	What it models	Plans by	Domain
Dyna-Q	tabular $\hat{P},\hat{R}$	extra Q-updates on imagined data	small discrete
PETS	ensemble of probabilistic nets	MPC + CEM over model rollouts	continuous control
MuZero	latent dynamics (value-relevant only)	MCTS in latent space	games + Atari
Dreamer	latent “world model” (RSSM)	train actor-critic inside imagined latent rollouts	pixels, continuous

The frontier idea, shared by MuZero and Dreamer, is the world model: rather than predicting raw next pixels (hard and mostly irrelevant), learn a compact latent state and predict dynamics there — modelling only what matters for reward and value. Dreamer trains its entire actor-critic on rollouts hallucinated inside this latent space, touching the real environment only to refresh the model.

# Dyna-Q core: one real step, then n cheap planning steps from a learned model.
import random
def dyna_q(Q, model, env_step, s, alpha=0.1, gamma=0.9, n=10):
    a = epsilon_greedy(Q, s)
    s2, r = env_step(s, a)                       # ONE real interaction
    Q[s][a] += alpha*(r + gamma*max(Q[s2]) - Q[s][a])
    model[(s, a)] = (r, s2)                      # remember the transition
    for _ in range(n):                           # n FREE imagined updates
        (sp, ap), (rp, s2p) = random.choice(list(model.items()))
        Q[sp][ap] += alpha*(rp + gamma*max(Q[s2p]) - Q[sp][ap])
    return s2

Warning

Model-based RL trades sample efficiency for robustness: it learns from far fewer real steps but can be wrecked by model bias, and it is more complex to build and debug. Rule of thumb — reach for it when real environment steps are expensive (robotics, healthcare) and you can afford a careful model; stick with model-free PPO/SAC when steps are cheap (a fast simulator) and you just want something that reliably works.

25.16 — Influence diagrams, Dec-POMDPs, and Markov games

We close the decision-theoretic thread with the frame that sits above all the single-agent machinery and the multi-agent extensions that generalize it. Three ideas: a graphical language for one-shot decisions (influence diagrams), the value of gathering information before deciding, and what changes when multiple agents decide at once — cooperatively (Dec-POMDPs) or competitively (Markov games).

Influence diagrams: decisions as graphs

A Bayesian network models beliefs — random variables and their probabilistic dependencies. An influence diagram extends it with two new node types to model decisions:

Chance nodes (ovals) — random variables, as in a Bayes net.
Decision nodes (rectangles) — variables you control; arcs into a decision node show what you’ll know when you choose.
Utility nodes (diamonds) — the payoff, a function of its parents.

Solving an influence diagram means choosing the decision rule (a mapping from observed information to action) that maximizes expected utility — exactly a compact, graphical one-shot MDP. The arcs into decision nodes are the crucial part: they encode information availability, which sets up the next idea.

flowchart LR
    W([Weather]) --> U{{Payoff}}
    F([Forecast]) -.observed.-> D[Take umbrella?]
    W -.influences.-> F
    D --> U

Value of information: what is it worth to know first?

A decision-theoretic agent can reason about its own future knowledge. The value of information (VoI) of an observation is how much your expected utility rises if you get to see that observation before deciding, versus deciding blind:

\[ \text{VoI} = \mathbb{E}\big[\,\text{utility with the observation}\,\big] - \mathbb{E}\big[\,\text{utility without it}\,\big]. \]

In words: the worth of a piece of information is how much better your best decision becomes once you’re allowed to see it first. Also written: $\text{VoI} = \mathbb{E}_o[\max_a \mathbb{E}[U\mid o,a]] - \max_a \mathbb{E}[U\mid a]$ — average over what you might learn, of the best you can then do, minus the best you can do blind.

VoI is never negative — more information can’t hurt when you’re free to ignore it — and it is zero exactly when the information wouldn’t change your decision. This is the principle QMDP was missing back in the POMDP section: an agent that computes VoI will pay to run a diagnostic test or peek around a corner precisely when, and only when, the knowledge would alter its action.

Worked example. A drilling company decides whether to drill ($+\$200\text{k}$ if oil, $-\$100\text{k}$ if dry); oil has prior probability $0.5$. Blind, the expected value of drilling is $0.5(200) + 0.5(-100) = +\$50\text{k}$, so you drill and expect $\$50\text{k}$. Now a perfect seismic test reveals oil-or-dry for sure. With the test: if it says oil (prob $0.5$) you drill for $+\$200\text{k}$; if it says dry (prob $0.5$) you walk away for $\$0$. Expected utility with the test is $0.5(200) + 0.5(0) = \$100\text{k}$. So

\[ \text{VoI} = \$100\text{k} - \$50\text{k} = \$50\text{k}, \]

the most you should ever pay for that test. Notice why it has value: in the “dry” branch the test changes your decision from drill to don’t-drill, avoiding a $\$100\text{k}$ loss half the time.

Dec-POMDPs: cooperation under shared uncertainty

Now add agents. A decentralized POMDP (Dec-POMDP) is a team of agents sharing one reward and one world, but each acting on its own local observations with no free communication. Think of a fleet of warehouse robots that must coordinate to maximize joint throughput, each seeing only its own corner.

The brutal difficulty: each agent must reason not just about the hidden world but about what its teammates believe and will do — and they’re reasoning about it too, with different information. There is no shared belief state to collapse the problem onto. Dec-POMDPs are NEXP-complete, dramatically harder than single-agent POMDPs; optimal solutions exist only for tiny instances, and practice relies on approximations or centralized training with decentralized execution (CTDE), the dominant pattern in cooperative multi-agent RL.

Markov games: when agents have their own agendas

Drop the shared reward and you get a Markov game (stochastic game) — the multi-agent generalization of the MDP, and the natural home of competition. Each agent $i$ has its own reward $R_i$; the state transitions depend on everyone’s joint action.

The conceptual shift is that “optimal” no longer means a single best policy — it means an equilibrium. In a zero-sum two-player game (one’s gain is the other’s loss, like the board games covered earlier in this chapter) the solution is a minimax policy, and value iteration with a minimax backup converges. In general-sum games, the target is a Nash equilibrium: a joint policy where no single agent can improve by unilaterally deviating. Such equilibria exist but may not be unique, and finding them is hard — there is no single scalar “value” everyone agrees to optimize.

flowchart TB
    R{"reward structure?"} -->|"single shared reward,<br/>full observability"| M["MDP / Markov game (1 agent)"]
    R -->|"single shared reward,<br/>partial obs, decentralized"| DP["Dec-POMDP<br/>(cooperative, NEXP-hard)"]
    R -->|"per-agent rewards,<br/>opposed"| ZS["zero-sum Markov game<br/>(minimax)"]
    R -->|"per-agent rewards,<br/>mixed"| GS["general-sum Markov game<br/>(Nash equilibrium)"]

Framework	Agents	Reward	Observability	Solution concept	Hardness
MDP	1	single	full	optimal policy	P
POMDP	1	single	partial	optimal policy (belief)	PSPACE-hard
Dec-POMDP	many	shared	partial, decentralized	optimal joint policy	NEXP-complete
Zero-sum Markov game	2	opposed	full	minimax policy	tractable (LP)
General-sum Markov game	many	per-agent	full	Nash equilibrium	hard (PPAD)

Tip

The unifying view of this whole chapter: an MDP is the base case, and every richer framework relaxes one assumption. POMDPs drop full observability (track beliefs); LQR specializes to continuous linear-quadratic dynamics (solve Riccati); Dec-POMDPs and Markov games add agents (solve for joint policies or equilibria). The algorithms differ, but the Bellman backbone — value, lookahead, fixed-point iteration — runs through all of them.

25.17 — Reward shaping and the sparse-reward problem

Imagine teaching someone chess if the only feedback allowed is “you won” or “you lost” at the very end. They could play a hundred games before stumbling onto a win, with no clue which of the forty moves in each game helped. That is the sparse-reward problem, and it is the single most common reason RL fails in practice: when reward arrives rarely (a goal scored, a maze solved) the agent explores almost blindly, because nearly every action returns zero and there is nothing to climb toward.

The tempting fix is reward shaping: add small intermediate rewards that nudge the agent in the right direction — a little bonus for moving closer to the goal, a small penalty for wasting time. Done casually, this is dangerous. If you reward a cleaning robot for picking up trash, it may learn to knock the bin over and re-collect the same trash forever. The agent optimizes the reward you wrote, not the behavior you meant — a phenomenon called reward hacking.

Tip

Intuition: shaping is like leaving breadcrumbs to the goal. Helpful if they genuinely point the way — disastrous if the agent discovers it can eat breadcrumbs in a circle instead of reaching the door.

The principled cure is potential-based reward shaping (Ng, Harada & Russell, 1999). Define a potential $\Phi(s)$ on states — a rough “how good is this state” hint, like negative distance to the goal — and add only the change in potential as a shaping reward:

\[F(s, a, s') = \gamma\,\Phi(s') - \Phi(s).\]

In words: reward the agent for the improvement in your hint between where it was and where it landed, not for the hint’s absolute level — so moving toward the goal pays and moving away costs, but going in circles nets exactly zero. Also written: the agent optimizes $R'(s,a,s') = R(s,a,s') + \gamma\Phi(s') - \Phi(s)$; because the $\Phi$ terms telescope along any trajectory, the total added reward depends only on start and end.

The remarkable guarantee: a potential-based shaping term leaves the optimal policy unchanged. The telescoping sum means it cannot be farmed in a loop (a cycle returns to where it started, so the potential changes cancel to zero), which is exactly what prevents the trash-knocking exploit. You get faster learning for free, with no risk of changing what “optimal” means.

flowchart LR
    SP["sparse reward only<br/>(1 at goal, 0 elsewhere)"] -->|"blind exploration,<br/>slow"| A[agent]
    PB["+ potential shaping<br/>γΦ(s′) − Φ(s)"] -->|"dense gradient to climb,<br/>same optimum"| A

When even shaping is not enough, three other ideas attack sparsity:

Curiosity / intrinsic motivation — manufacture an internal reward for visiting surprising states (where a learned dynamics model predicts poorly). The agent explores out of “boredom-avoidance” until it stumbles onto real reward. This is what cracked notoriously sparse games like Montezuma’s Revenge.
Hindsight Experience Replay (HER) — after failing to reach goal $g$, pretend the state you actually ended in was the goal all along. Every failed episode becomes a successful demonstration of reaching some goal, so even a never-successful agent learns from every rollout. Indispensable for goal-conditioned robotics.
Curriculum learning — start with easy, nearby goals where reward is reachable, then gradually widen the task as competence grows (closely related to the self-play curriculum of Section 25.9).

# Hindsight Experience Replay: relabel a failed trajectory with an achieved goal.
# Original transition aimed at goal g but never reached it -> reward stays 0 forever.
# HER adds a copy whose "goal" is a state actually visited later, making reward fire.
for t, (s, a, s_next) in enumerate(trajectory):
    buffer.add(s, a, s_next, goal=g, reward=reward_fn(s_next, g))   # real attempt
    g_hindsight = trajectory[-1].achieved_state                    # pretend the end was the goal
    buffer.add(s, a, s_next, goal=g_hindsight,
               reward=reward_fn(s_next, g_hindsight))              # now sometimes = success

Warning

Before hand-crafting a dense reward, ask whether it admits a shortcut the agent will find and you won’t. The safest dense signals are potential-based (provably exploit-proof) or learned from human preference (the RLHF route). Arbitrary engineered bonuses are where reward hacking lives.

25.18 — Offline RL: learning from a fixed dataset

Every method so far assumes the agent can act in the environment to gather fresh data. Often it cannot. You have a hospital’s logged treatment histories, a fleet’s recorded driving, a recommender’s click logs — and letting a half-trained policy experiment on real patients or real cars is unthinkable. Offline RL (also “batch RL”) asks: can we learn the best possible policy from a fixed dataset of past interactions, with no further environment access at all?

Intuition: it is the difference between a student pilot in a real cockpit (online RL — risky but informative) and one who can only study a thick logbook of other pilots’ flights (offline RL — safe, but you can never test a maneuver nobody recorded).

The central obstacle is distributional shift combined with overestimation. Break it into three steps. (1) For an action the dataset barely covers, the value network is guessing — it has no real data to anchor $Q(s,a)$ there. (2) Some of those wild guesses come out too high. (3) Training always picks the action with the largest $Q$, so it homes straight in on the most over-inflated guess. The policy ends up confidently recommending actions it has never actually seen pay off. Online RL would catch this — try the action, get a low reward, fix the estimate — but offline RL has no such feedback loop, so the delusion never gets punctured.

The whole field is therefore organized around one principle: stay close to the data, or be honest about uncertainty. Three families implement it:

Approach	Core idea	Representative method
Policy constraint	Keep the learned policy near the behavior policy that generated the data	BCQ, BEAR, TD3+BC
Value penalization	Push $Q$ down on out-of-distribution actions so the max can’t chase phantoms	CQL (Conservative Q-Learning)
In-sample / implicit	Never query unseen actions at all; learn using only actions in the dataset	IQL (Implicit Q-Learning)

CQL’s idea is the cleanest to state: add a penalty that lowers $Q$ for actions the policy would pick but the data doesn’t support, while keeping $Q$ honest on the actions actually logged.

\[\min_\theta\ \alpha\,\mathbb{E}_{s\sim\mathcal{D}}\Big[\log\sum_a e^{Q_\theta(s,a)} - \mathbb{E}_{a\sim\mathcal{D}}\,Q_\theta(s,a)\Big] + L_{\text{TD}}(\theta)\]

In words: on top of the usual TD loss, push down the $Q$-values of all actions the network is tempted to inflate, and pull up only the actions that really appear in the data — so unseen actions can never look better than seen ones. Also written: the bracket is “average over-optimistic value minus average logged value”; minimizing it yields a conservative (lower-bound) $Q$ that the policy can safely maximize.

A simple TD3+BC recipe shows how light the touch can be — take a standard off-policy actor-critic and add one behavior-cloning term to the actor so it never drifts far from logged actions:

# TD3+BC actor loss: maximize Q, but stay close to the dataset actions.
import torch.nn.functional as F
pi_a   = actor(s)                                  # action the policy wants
q      = critic(s, pi_a)
lmbda  = 2.5 / q.abs().mean().detach()             # auto-scale the RL term
actor_loss = -lmbda * q.mean() + F.mse_loss(pi_a, a_data)   # +BC anchor to logged action a_data

Mature implementations live in libraries like d3rlpy (CQL, IQL, TD3PlusBC), which train directly from a logged dataset:

import d3rlpy
dataset, env = d3rlpy.datasets.get_d4rl("hopper-medium-v2")   # a standard offline benchmark
cql = d3rlpy.algos.CQLConfig().create(device="cpu")
cql.fit(dataset, n_steps=100_000)                            # no environment interaction during training

Tip

Offline RL is what makes RL usable in high-stakes domains — healthcare, finance, recommendation — where exploration is expensive, dangerous, or illegal. The mental model: online RL asks the world questions; offline RL must answer using only the questions someone already asked, and its whole craft is resisting the urge to trust answers it can’t check.

25.19 — Quick reference

Term / formula	Meaning	When / why it matters
MDP $(\mathcal{S},\mathcal{A},P,R,\gamma)$	Formal model of an agent acting under the Markov property	The universal frame; every method assumes (or relaxes) it
Return $G_t=\sum_k \gamma^k R_{t+k+1}$	Total discounted future reward from time $t$	The quantity RL actually maximises; $\gamma$ sets the horizon
Bellman optimality $V^(s)=\max_a \sum_{s'}P[R+\gamma V^(s')]$	Value of a state = best action’s reward-plus-future-value	Fixed point that DP, Q-learning, and DQN all chase
Value / policy iteration	Exact DP backups when $P,R$ are known	Use only when the model is known and states are few
MC vs TD($n$-step)	Wait for real return vs bootstrap from next estimate	TD: lower variance, online, continuing tasks → modern default
SARSA (on-policy)	Update toward the next action actually taken	Learns safer policies that account for exploration cost
Q-learning (off-policy)	Update toward $\max_{a'}Q(s',a')$	Learns the optimal greedy policy; needs the double-Q fix for bias
DQN (replay + target net)	Neural $Q$ with decorrelated inputs and frozen targets	Scales Q-learning to pixels; both tricks needed for stability
REINFORCE / policy gradient	$\nabla_\theta J=\mathbb{E}[\nabla\log\pi_\theta\,G_t]$	Direct policy optimisation; handles continuous/stochastic actions
Advantage $A_t=G_t-V(s_t)$ + GAE($\gamma,\lambda$)	Baseline-subtracted credit, bias–variance dialled by $\lambda$	$\lambda\approx0.95$ in almost every PPO/A2C run
PPO clipped objective	Limit policy ratio to $1\pm\epsilon$ per update	Stable workhorse; the algorithm behind RLHF
Bandit / regret	One-state MDP; $\text{Regret}=\sum_a\Delta_a\mathbb{E}[N_a]$	Explore–exploit in isolation; aim for $O(\log n)$ regret
UCB / Thompson sampling	Optimism-under-uncertainty / posterior sampling	Targeted exploration; beats fixed-$\varepsilon$’s linear regret
Imitation: BC / DAgger / IRL	Copy actions / relabel on-policy / recover the reward	Skip reward design; DAgger fixes BC’s $O(\epsilon T^2)$ drift
POMDP + belief state $b(s)$	Track a distribution over hidden states via Bayes filter	When you can’t observe the true state; value is PWLC in $b$
LQR (Riccati, $u=-Kx$)	Closed-form optimal control for linear-quadratic systems	Near-linear problems: linearize and solve, don’t reach for deep RL
Model-based RL (Dyna, MuZero, Dreamer)	Learn dynamics, then plan/imagine inside the model	Sample efficiency when real steps are costly; beware model bias
Potential-based shaping $\gamma\Phi(s')-\Phi(s)$	Dense reward that provably preserves the optimum	Speed up sparse-reward learning without reward hacking
Offline RL (CQL, IQL, TD3+BC)	Learn from a fixed dataset, no environment access	High-stakes domains; fight overestimation by staying near the data

25.20 — Key takeaways

RL maximises expected cumulative discounted reward through interaction, not labelled answers; the MDP $(\mathcal{S},\mathcal{A},P,R,\gamma)$ is its universal model.
Bellman equations tie a state’s value to its successors’; dynamic programming solves them exactly when the model is known but does not scale.
Model-free learning splits into Monte Carlo (wait for the real return, unbiased) and TD (bootstrap each step, lower variance) — the $n$-step return is the dial between them, and TD underpins most modern RL.
SARSA (on-policy) learns the policy it follows; Q-learning (off-policy) learns the optimal one via a max. DQN scales Q-learning with experience replay + a target network, and Double/Dueling/Prioritised/Rainbow refine it.
Policy-gradient / actor-critic methods (REINFORCE, A2C, PPO) optimise a policy directly; GAE dials the advantage’s bias–variance trade-off, and PPO’s clipped update is the backbone of RLHF.
The exploration–exploitation tradeoff, studied cleanly in bandits (ε-greedy, UCB, Thompson) and scored by regret, recurs everywhere in RL.
Multi-agent RL adds non-stationarity; self-play turns it into an automatic curriculum, powering AlphaGo → AlphaZero → MuZero.
Imitation learning (behavioral cloning, DAgger, inverse RL) skips reward design by learning from expert demonstrations; POMDPs track a belief state when the world is only partly observable; LQR solves the continuous linear-quadratic case in closed form via the Riccati equation.
Model-based RL (Dyna, PETS, MuZero, Dreamer) learns a dynamics model and plans inside it for sample efficiency, at the cost of model bias.
Sparse rewards stall learning; potential-based shaping speeds it up without changing the optimum, while curiosity, HER, and curricula attack sparsity from other angles. Offline RL (CQL, IQL, TD3+BC) learns from a fixed logged dataset by staying close to the data and resisting overestimation.

25.21 — See also

Optimization — gradient ascent, SGD, and the loss-landscape view behind policy gradients.
Probability & Statistics — expectations, variance, and the Bayesian posteriors underlying Thompson sampling.
Neural Networks (Core) — the function approximators inside DQN and actor-critic.
Large Language Models — where PPO-based RLHF aligns model behaviour to human preferences.

↪ The thread continues → Chapter 26 · 🛒 Recommender Systems

We’ve covered how models learn; the next chapters are about what they’re for in the wild, starting with the quiet giant of applied ML — recommender systems.

📖 All chapters | ← 24 · 🌈 Multimodal AI | 26 · 🛒 Recommender Systems →