Chapter 16 — 🔁 Recurrent & Sequence Models

📖 All chapters | ← 15 · 🖼️ Convolutional Neural Networks | 17 · ⚡ Attention & Transformers →

📚 Jump to any chapter

🧮 Mathematical Foundations

🧭 The ML Workflow

🧩 Classical Machine Learning

🎲 Probabilistic Models

13 · 🕸️ Probabilistic Graphical Models

🧠 Deep Learning

🗣️ Applied AI: Vision, Language, Audio & Time

🕹️ Reinforcement Learning

25 · 🕹️ Reinforcement Learning

🛠️ Applied ML Systems & Industries

🚀 Production, Tooling & Infrastructure

📚 Classical & Symbolic AI

⚖️ Responsible AI & Frontier

🎓 Advanced & Specialized Topics

🎚️ Post-Training & Fine-Tuning

🚢 Model Serving & Deployment

47 · 🚢 Model Serving & Deployment in Production

Most data in the world arrives as a sequence: words in a sentence, samples of audio, daily sensor readings, frames of video. A plain feedforward network sees a fixed-size vector and forgets everything between examples — it has no notion of “before” and “after.” Recurrent and sequence models fix this by giving the network a memory that carries information forward as it reads a sequence one step at a time. They are the bridge between classical neural nets and the attention-based Transformers that now dominate language and vision, and understanding them is the cleanest way to understand why attention was invented.

🧭 In context: Deep Learning · models for ordered/variable-length data (text, audio, time series) · the one key idea: a recurring hidden state that summarizes the past so the same weights can be applied at every time step.

💡 Remember this: a recurrent net is a for loop with shared weights — one hidden state carries the past forward — and every advance (gating, attention, Transformers) exists to fix how that single thread of memory fails over long sequences.

16.1 — Recurrent Neural Networks

A recurrent neural network (RNN) processes a sequence \(x_1, x_2, \dots, x_T\) one element at a time, maintaining a hidden state \(h_t\) — a vector that acts as the network’s running memory of everything it has seen so far. At each step it combines the new input with the previous memory to produce a new memory:

\[h_t = \tanh(W_{xh}\,x_t + W_{hh}\,h_{t-1} + b_h), \qquad y_t = W_{hy}\,h_t + b_y\]

In words: the new memory is a squashed mix of this step’s input and last step’s memory; the output is a linear read-out of that memory. Also written: \(h_t = \tanh\!\big(W_{xh} x_t + W_{hh} h_{t-1} + b_h\big)\) is often compressed to \(h_t = \tanh\!\big(W[h_{t-1}, x_t] + b\big)\), where \(W = [\,W_{hh}\;\;W_{xh}\,]\) is the two matrices stacked side by side and \([h_{t-1}, x_t]\) is the two vectors concatenated.

The crucial detail is that the same weight matrices \(W_{xh}, W_{hh}, W_{hy}\) are reused at every time step. This weight sharing is what lets an RNN handle sequences of any length with a fixed number of parameters — exactly the way a convolutional filter (Chapter 15) shares weights across space, an RNN shares weights across time.

Think of reading a sentence. Your understanding after word 5 (\(h_5\)) depends on word 5 plus your understanding after word 4 (\(h_4\)). You apply the same “reading process” to every word; the running interpretation is the hidden state.

Unrolling through time. A single recurrent cell with a self-loop is hard to reason about. We unroll it: draw one copy of the cell per time step, with the hidden state threaded from each copy to the next. The unrolled picture is just a very deep feedforward network whose layers happen to share weights.

flowchart LR
    x1["x₁"] --> A1((h₁))
    x2["x₂"] --> A2((h₂))
    x3["x₃"] --> A3((h₃))
    h0["h₀"] --> A1
    A1 -->|W_hh| A2
    A2 -->|W_hh| A3
    A1 --> y1["y₁"]
    A2 --> y2["y₂"]
    A3 --> y3["y₃"]

Worked example — counting with a tiny RNN. Let the hidden state be a single scalar, \(W_{hh}=1\), \(W_{xh}=1\), \(b_h=0\), and drop the nonlinearity. Feed the inputs \(x = [2, 3, 5]\) starting from \(h_0 = 0\):

\(h_1 = 1\cdot 2 + 1\cdot 0 = 2\)
\(h_2 = 1\cdot 3 + 1\cdot 2 = 5\)
\(h_3 = 1\cdot 5 + 1\cdot 5 = 10\)

The hidden state has learned to accumulate a running sum — a memory of the whole prefix in one number. Different weights would teach it to remember different things.

import numpy as np
# tiny RNN cell, one forward pass over a sequence
def rnn_forward(xs, Wxh, Whh, Why, bh, by):
    h = np.zeros(Whh.shape[0])      # h0 = zeros
    ys = []
    for x in xs:                    # one step per element
        h = np.tanh(Wxh @ x + Whh @ h + bh)   # update memory
        ys.append(Why @ h + by)     # emit output
    return np.array(ys), h

The same cell exists ready-made in every framework. In PyTorch the whole loop above is one module — you hand it the full sequence and it unrolls internally:

import torch, torch.nn as nn
rnn = nn.RNN(input_size=8, hidden_size=16, num_layers=1, batch_first=True)
x = torch.randn(4, 10, 8)           # (batch, time, features)
outputs, h_n = rnn(x)               # outputs: (4, 10, 16); h_n: last hidden state

Backpropagation through time (BPTT). To train, we unroll the network across the whole sequence and run ordinary backpropagation on that deep graph. Because \(W_{hh}\) is shared, its gradient is the sum of the contributions from every time step:

\[\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W_{hh}}\]

In words: the recurrent weight is used at every time step, so its total gradient is the pile-up of one gradient per step. Also written: \(\nabla_{W_{hh}} L = \sum_{t=1}^{T} \nabla_{W_{hh}} L_t\) — a sum over the shared parameter’s many uses, the same accumulation rule any tied/weight-shared parameter obeys.

This is BPTT, and it has one fatal weakness. To learn that step 1 matters for the loss at step \(T\), the gradient has to travel all the way back from \(T\) to \(1\). Every hop backward multiplies it by the same recurrent factor. Multiply one number by itself \(T\) times and only two things can happen:

if that number is less than 1, the product shrinks toward zero — the gradient vanishes, and the RNN simply cannot feel that early inputs mattered;
if it is greater than 1, the product blows up — the gradient explodes, and training destabilizes.

Concretely, in the one-dimensional case (hidden state is a scalar with recurrent weight \(w\)), the factor is \(w\cdot\tanh'\), and the gradient reaching step 1 is roughly that factor raised to the \(T\)-th power — exactly the runaway-or-die behavior above. In the full vector case the same story holds, governed by the largest singular value of \(W_{hh}\) instead of a single number. This one problem is the entire reason gated architectures (next sections) exist.

The two failure modes are easy to picture: trace one number back through the unrolled chain and watch it shrink to nothing or blow up.

Tip

Intuition: an RNN is a for loop with learnable weights. Picture the unrolled graph, not the self-loop — every confusing thing about RNNs (BPTT, vanishing gradients, truncation) is obvious once you see it as a very deep, weight-shared network.

Warning

Common mistake: treating long-range dependencies as “just a bit harder” for a vanilla RNN. In practice a plain \(\tanh\) RNN struggles to connect events more than ~10–20 steps apart. If your task needs longer memory, reach for an LSTM/GRU or attention rather than a deeper vanilla RNN.

16.2 — Sequence Task Shapes & Truncated BPTT

Before the gated cells, it helps to name what you are asking a recurrent net to do and how you actually train it on long streams — two practical pieces every textbook covers and that decide how you wire the model.

The four task shapes. RNNs are flexible because input and output lengths can differ independently. Karpathy’s classic picture sorts almost every sequence task into one of four patterns:

Shape	Input → Output	Example	How to read the RNN
one-to-many	one vector → sequence	image → caption	feed input once, generate until a stop token
many-to-one	sequence → one vector	review → sentiment	run to the end, use the last hidden state
many-to-many (aligned)	sequence → same-length sequence	tag each word (POS)	emit one output per step
many-to-many (seq2seq)	sequence → different-length sequence	translate a sentence	encoder then decoder (§16.6)

flowchart LR
    subgraph one_to_many["one-to-many"]
      a1["x"] --> a2["■■■■"]
    end
    subgraph many_to_one["many-to-one"]
      b1["■■■■"] --> b2["y"]
    end
    subgraph aligned["many-to-many aligned"]
      c1["■■■■"] --> c2["■■■■"]
    end

The lesson is purely about plumbing: the cell is the same RNN/LSTM/GRU in every case — what changes is which hidden states you read out and feed back. A sentiment classifier ignores all outputs but the last; a tagger keeps every output; a caption generator feeds each emitted token back as the next input.

Truncated BPTT. A document or audio stream can be tens of thousands of steps long. Unrolling that whole thing to compute gradients would blow up memory and run impossibly slowly. Truncated BPTT chops the stream into chunks of, say, 100 steps: you backpropagate gradients only within a chunk, but you carry the hidden state forward (detached from the graph) into the next chunk. So the forward memory still spans the whole sequence, while the backward gradient sees only one chunk.

The picture: the forward pass (top) flows unbroken across the whole stream, but each chunk’s backward pass (bottom) stops at the chunk boundary — gradients never cross the dashed “detach” lines.

\[\frac{\partial L}{\partial W_{hh}} \;\approx\; \sum_{t=T-k}^{T} \frac{\partial L_t}{\partial W_{hh}}\]

In words: instead of summing gradient contributions over all time steps, only sum over the last \(k\) — the recent window you actually unrolled. Also written: “TBPTT(\(k_1, k_2\))” — process \(k_1\) new steps then backpropagate \(k_2\) steps back; the simple version above takes \(k_1 = k_2 = k\).

# truncated BPTT: carry h across chunks, but detach so grads stop at the chunk boundary
h = None
for chunk in stream_chunks(data, length=100):     # each chunk is (batch, 100, feat)
    if h is not None:
        h = h.detach()                            # cut the graph; keep the values
    out, h = rnn(chunk, h)
    loss = criterion(head(out), targets_for(chunk))
    loss.backward()                               # only unrolls within this chunk
    optimizer.step(); optimizer.zero_grad()

Gradient clipping is the standard partner fix for the exploding side of §16.1. Whenever the global gradient norm exceeds a threshold \(\tau\), rescale the whole gradient down so its norm is exactly \(\tau\) — direction preserved, magnitude capped:

\[g \leftarrow g \cdot \frac{\tau}{\max(\|g\|, \tau)}\]

In words: if the gradient is too long, shrink it back to length \(\tau\) without changing where it points; otherwise leave it alone. Also written: \(g \leftarrow \min\!\big(1, \tfrac{\tau}{\|g\|}\big)\, g\) — equivalent, and in PyTorch literally torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=tau).

Worked example — clipping in numbers. Say the threshold is \(\tau = 5\) and one backward pass produces a gradient with norm \(\|g\| = 20\). The scale factor is \(\tfrac{5}{\max(20,5)} = \tfrac{5}{20} = 0.25\), so every component is multiplied by \(0.25\) — the gradient now has norm exactly \(5\) and points in the same direction. A different step with \(\|g\| = 3\) gets factor \(\tfrac{5}{\max(3,5)} = 1\), so it passes through untouched. Clipping only ever shrinks the occasional spike; it never grows or rotates a healthy gradient.

Tip

Rule of thumb: vanishing gradients are an architecture problem (use a gated cell); exploding gradients are a training problem (clip the norm). Reach for the matching tool — clipping a vanilla RNN will not give it long memory.

16.3 — Long Short-Term Memory (LSTM)

The LSTM (Hochreiter & Schmidhuber, 1997) solves vanishing gradients with one elegant idea: alongside the hidden state, carry a cell state \(c_t\) — a memory “conveyor belt” that runs straight down the sequence with only minor, additive interactions. Because information moves along \(c_t\) mostly by addition rather than repeated matrix multiplication, gradients can flow across hundreds of steps without vanishing.

Three gates — small sigmoid networks producing values in \([0,1]\) — decide what happens to the cell state. A gate output near 0 means “block,” near 1 means “let through.”

Forget gate \(f_t\): what to erase from the old cell state.
Input gate \(i_t\): what new information to write.
Output gate \(o_t\): what part of the cell to expose as the hidden state.

\[ \begin{aligned} f_t &= \sigma(W_f[h_{t-1}, x_t] + b_f) \\ i_t &= \sigma(W_i[h_{t-1}, x_t] + b_i) \\ \tilde c_t &= \tanh(W_c[h_{t-1}, x_t] + b_c) \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde c_t \\ o_t &= \sigma(W_o[h_{t-1}, x_t] + b_o) \\ h_t &= o_t \odot \tanh(c_t) \end{aligned} \]

In words: decide how much old memory to keep (\(f_t\)) and how much new candidate to add (\(i_t\)); the cell becomes kept old + added new; the hidden state is a gated (\(o_t\)) window onto that cell. Also written: the central update \(c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t\) is a leaky accumulator — equivalently \(c_t = c_{t-1} + \big(i_t \odot \tilde c_t - (1-f_t)\odot c_{t-1}\big)\), i.e. old value plus a correction, which is why the gradient path along \(c\) is near-identity when \(f_t \approx 1\).

Here \(\odot\) is element-wise multiplication and \([h_{t-1}, x_t]\) is the two vectors concatenated. The line \(c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t\) is the heart of it: keep a fraction of the old memory, add a fraction of the new candidate.

flowchart LR
    cprev["c₍t-1₎"] --> MUL1["× (forget)"]
    f["f_t σ"] --> MUL1
    MUL1 --> ADD["+"]
    i["i_t σ"] --> MUL2["×"]
    cand["c̃_t tanh"] --> MUL2
    MUL2 --> ADD
    ADD --> cnext["c_t"]
    cnext --> TANH["tanh"]
    o["o_t σ"] --> MUL3["×"]
    TANH --> MUL3
    MUL3 --> hnext["h_t"]

Worked example — the gate that makes memory survive. Take a single cell dimension holding \(c_{t-1} = 0.8\) (say, the fact “subject is plural”). Suppose at this step the gates compute \(f_t = 1\), \(i_t = 0\), and the candidate is \(\tilde c_t = -0.5\). The cell update is

\[c_t = f_t\cdot c_{t-1} + i_t\cdot\tilde c_t = 1\cdot 0.8 + 0\cdot(-0.5) = 0.8.\]

The memory is copied verbatim — the incoming candidate is multiplied by zero and ignored, so nothing overwrites the stored fact. Run the same gates for ten more steps and \(c\) stays exactly \(0.8\) the whole way; the gradient along that path is multiplied by \(f_t = 1\) each time, so it neither vanishes nor explodes. Now flip the gates to \(f_t = 0\), \(i_t = 1\): then \(c_t = 0\cdot 0.8 + 1\cdot(-0.5) = -0.5\), a clean overwrite. Same machinery, opposite behavior — that selectable mix of “keep” and “write” is what a vanilla RNN cannot do.

To make the additive path concrete, here is the full cell as code:

import numpy as np
def sigmoid(z): return 1/(1+np.exp(-z))

def lstm_step(x, h, c, P):           # P holds all weight matrices/biases
    z = np.concatenate([h, x])       # [h_{t-1}, x_t]
    f = sigmoid(P['Wf'] @ z + P['bf'])   # forget
    i = sigmoid(P['Wi'] @ z + P['bi'])   # input
    o = sigmoid(P['Wo'] @ z + P['bo'])   # output
    g = np.tanh (P['Wc'] @ z + P['bc'])  # candidate
    c = f*c + i*g                    # additive memory update
    h = o*np.tanh(c)
    return h, c

In production you never hand-roll this — a one-line nn.LSTM runs the optimized, cuDNN-backed version, and a tiny head turns the sequence into a prediction:

import torch, torch.nn as nn

class Tagger(nn.Module):              # many-to-one: classify a whole sequence
    def __init__(self, vocab, emb=64, hid=128, classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm  = nn.LSTM(emb, hid, batch_first=True)
        self.head  = nn.Linear(hid, classes)
    def forward(self, x):             # x: (batch, time) token ids
        e = self.embed(x)
        out, (h_n, c_n) = self.lstm(e)
        return self.head(h_n[-1])     # use the final hidden state

In words, the worked sentence “The keys that I left on the table by the door … are on the floor” is exactly this trick at scale: the LSTM writes “plural” into a cell dimension via the input gate, holds the forget gate near 1 so it survives every intervening word untouched, then opens the output gate when the verb is needed. A vanilla RNN would have washed that signal out long before “are.”

A practical detail — the forget-gate bias. A standard trick is to initialize \(b_f\) to a small positive number (often \(1\) or \(2\)). That starts every forget gate near \(\sigma(1)\approx 0.73\), so the cell defaults to remembering and only learns to forget where forgetting helps. Without it an LSTM can spend the early epochs erasing its own memory before it learns to keep anything — a one-line change that noticeably speeds convergence on long-dependency tasks.

Tip

Intuition: the cell state is a notebook. The forget gate erases lines, the input gate writes new ones, the output gate decides which lines you read aloud right now. Addition (not multiplication) along that notebook is the trick that keeps gradients alive.

16.4 — Gated Recurrent Unit (GRU)

The GRU (Cho et al., 2014) is a streamlined LSTM: it merges the cell and hidden state into one vector and uses just two gates instead of three, giving fewer parameters and faster training with usually comparable accuracy.

Update gate \(z_t\): interpolates between keeping the old state and adopting a new candidate (it fuses the LSTM’s forget and input gates into one knob).
Reset gate \(r_t\): how much past state to use when forming the candidate.

\[ \begin{aligned} z_t &= \sigma(W_z[h_{t-1}, x_t]) \\ r_t &= \sigma(W_r[h_{t-1}, x_t]) \\ \tilde h_t &= \tanh(W_h[\,r_t \odot h_{t-1},\, x_t\,]) \\ h_t &= (1 - z_t)\odot h_{t-1} + z_t \odot \tilde h_t \end{aligned} \]

In words: one gate (\(z_t\)) decides how much of the new candidate to mix into the old state; another (\(r_t\)) decides how much past to use while building that candidate. Also written: the blend \(h_t = (1-z_t)\odot h_{t-1} + z_t\odot\tilde h_t\) is an exponential-moving-average update with a learned, per-dimension rate \(z_t\) — the same convex-combination form as \(\text{new} = (1-\alpha)\,\text{old} + \alpha\,\text{fresh}\).

The final line is a convex blend: when \(z_t \approx 0\) the unit copies the previous state verbatim (perfect long-term memory, gradient passes through untouched); when \(z_t \approx 1\) it overwrites with the fresh candidate. One gate, both behaviors.

flowchart LR
    hprev["h₍t-1₎"] --> KEEP["× (1 − z_t)"]
    z["z_t σ"] --> KEEP
    z --> WRITE["× z_t"]
    cand["h̃_t  (uses r_t)"] --> WRITE
    KEEP --> ADD["+"]
    WRITE --> ADD
    ADD --> hnext["h_t"]

Worked example — the one knob, both extremes. Take a single dimension with old state \(h_{t-1} = 0.9\) and a fresh candidate \(\tilde h_t = 0.1\). The update gate alone decides the outcome:

\(z_t \approx 0\) (gate shut, remember): \(h_t = (1-0)\cdot 0.9 + 0\cdot 0.1 = 0.9\). The old value is carried forward untouched — long-term memory, gradient multiplied by \((1-z_t)=1\).
\(z_t \approx 1\) (gate open, overwrite): \(h_t = (1-1)\cdot 0.9 + 1\cdot 0.1 = 0.1\). The past is discarded and the candidate takes over.
\(z_t = 0.5\) (blend): \(h_t = 0.5\cdot 0.9 + 0.5\cdot 0.1 = 0.5\) — a smooth interpolation between the two.

This is the same keep-or-write choice the LSTM splits across its forget and input gates, here collapsed into a single number. The reset gate \(r_t\) plays a different role: it only affects how the candidate is built — set \(r_t \approx 0\) and the candidate \(\tilde h_t\) ignores the past entirely, letting the unit start a fresh representation when a sequence boundary arrives.

Swapping LSTM for GRU in a framework is a one-word change — same call shape, fewer parameters under the hood:

import torch.nn as nn
gru = nn.GRU(input_size=64, hidden_size=128, num_layers=2,
             batch_first=True, dropout=0.2)   # drop-in replacement for nn.LSTM

Property	Vanilla RNN	LSTM	GRU
Gates	0	3 (f, i, o)	2 (z, r)
Separate cell state	no	yes (\(c_t\))	no
Params per unit	fewest	most	middle
Long-range memory	poor	excellent	very good
Typical use	toy / very short	long dependencies, ample data	strong default, smaller data

The honest practical takeaway: GRU and LSTM perform similarly on most tasks. Start with a GRU (cheaper); switch to an LSTM if you need the extra capacity and have the data to train it.

Warning

Common mistake: assuming more gates always means more accuracy. On small or moderate datasets the GRU often matches or beats the LSTM precisely because it has fewer parameters to overfit. Let validation performance, not gate count, decide.

16.5 — Bidirectional & Deep RNNs

A standard RNN reads strictly left-to-right, so \(h_t\) knows the past but not the future. Yet in many tasks the future disambiguates the present: in “I left my bank …”, the right word — riverbank or financial — depends on what comes after. A bidirectional RNN (BiRNN) runs two independent RNNs, one forward and one backward, then concatenates their hidden states at each position so every output sees the whole sequence:

\[h_t = [\,\overrightarrow{h_t}\,;\,\overleftarrow{h_t}\,]\]

In words: glue together what a forward reader knows after position \(t\) and what a backward reader knows up to position \(t\), so the combined state has seen the entire sequence. Also written: \(h_t = \overrightarrow{h_t} \,\Vert\, \overleftarrow{h_t}\), where \(\Vert\) denotes concatenation; if each direction has width \(d\), the combined \(h_t\) has width \(2d\).

flowchart LR
    subgraph Forward
      f1((→h₁)) --> f2((→h₂)) --> f3((→h₃))
    end
    subgraph Backward
      b3((←h₃)) --> b2((←h₂)) --> b1((←h₁))
    end
    f1 & b1 --> o1["y₁"]
    f2 & b2 --> o2["y₂"]
    f3 & b3 --> o3["y₃"]

The catch: a BiRNN needs the entire sequence before it can emit any output, so it cannot be used for real-time/streaming tasks (live transcription, next-token generation). It shines in offline labeling — named-entity recognition, part-of-speech tagging — where the full input is available.

Deep (stacked) RNNs add representational depth by stacking layers: the hidden-state sequence from layer 1 becomes the input sequence to layer 2, and so on. Lower layers tend to capture short, local patterns; higher layers capture longer, more abstract structure — the temporal analog of stacking convolutions for spatial hierarchy. Two or three layers is typical; beyond that, gains shrink and training gets fragile.

Bidirectionality and depth combine freely, and that combination — the stacked BiLSTM — was the workhorse of pre-Transformer NLP. The picture below draws it directly: at each time step two LSTM cells (forward ▶ and backward ◀) read the input, their outputs are concatenated, and that concatenation feeds a second bidirectional layer above. Read horizontally to follow each direction’s recurrence; read vertically to follow depth.

A stacked BiLSTM is, again, one constructor call — bidirectional=True builds both directions and concatenates them for you:

import torch.nn as nn
bilstm = nn.LSTM(input_size=100, hidden_size=128, num_layers=2,
                 batch_first=True, bidirectional=True)
# output width = 2 * hidden_size = 256 (forward ‖ backward), per time step

Tip

Rule of thumb: use bidirectional when you have the whole sequence up front and need maximum accuracy; use unidirectional when you must produce outputs as data streams in. Stack 2–3 layers before reaching for anything fancier.

16.6 — Sequence-to-Sequence / Encoder–Decoder

Many tasks map an input sequence to an output sequence of a different length: translation (English → French), summarization (long → short), speech-to-text. The sequence-to-sequence (seq2seq) architecture (Sutskever et al., 2014) handles this with two RNNs.

The encoder reads the entire input and compresses it into a single fixed-length vector — the context vector \(c\), usually its final hidden state. The decoder is a second RNN initialized from \(c\) that generates the output one token at a time, each step feeding its own previous output back in as the next input, until it emits a special end-of-sequence token.

flowchart LR
    x1["Le"] --> e1
    x2["chat"] --> e2
    x3["dort"] --> e3
    subgraph Encoder
      e1((h₁)) --> e2((h₂)) --> e3((h₃))
    end
    subgraph Decoder
      d1((s₁)) --> d2((s₂)) --> d3((s₃))
    end
    e3 -->|context c| d1
    d1 --> y1["The"]
    d2 --> y2["cat"]
    d3 --> y3["sleeps"]

Teacher forcing. A practical wrinkle: at training time, what does the decoder feed back as the “previous output”? If you feed back the decoder’s own prediction and it is wrong early on, the error compounds and training barely moves. Teacher forcing instead feeds back the ground-truth previous token during training, so the decoder always conditions on a correct prefix and learns fast. The mismatch — train on gold prefixes, test on self-generated ones — is called exposure bias, and is usually softened by mixing the two (sometimes feeding the model’s own prediction with some probability).

# one decoder step with teacher forcing (training)
dec_input = target[:, t-1]          # ground-truth previous token, not the prediction
emb = embed(dec_input)
out, hidden = decoder(emb, hidden)
logits_t = out_head(out)            # compared against target[:, t] in the loss

Worked intuition — and the bottleneck. To translate “Le chat dort” → “The cat sleeps,” the encoder squeezes all three French words into one vector \(c\); the decoder unpacks it into three English words. This works for short sentences. But notice the structural flaw: every input sentence — three words or fifty — must be crushed into the same fixed-size \(c\). This is the fixed-context bottleneck.

As sentences grow, this single vector becomes a crippling information funnel — the decoder, especially when producing early output words, can no longer “see” the relevant parts of a long input, and translation quality collapses on long sentences. The fix was to let the decoder look back at all encoder hidden states and learn which to focus on at each step. That mechanism is attention, and removing the recurrence entirely while keeping attention gives the Transformer — the subject of the next chapter, and the architecture that made RNNs largely optional.

Warning

Common mistake: scaling a vanilla seq2seq to long documents and blaming the RNN cell. The limiter is usually the fixed-size context vector, not the LSTM/GRU inside it. The cure is attention (Chapter 17), not a bigger hidden state.

16.7 — Decoding: Turning Probabilities into Sequences

A trained decoder does not hand you a sentence — at each step it hands you a probability distribution over the vocabulary. Turning that stream of distributions into one output sequence is decoding, and the strategy you choose matters as much as the model. Picture standing at the root of a branching tree: every node offers thousands of next-word choices, and you must walk to a leaf. You cannot try every path — a 20-word sentence over a 50,000-word vocabulary has \(50000^{20}\) of them.

Greedy decoding takes the single most probable token at each step and never looks back. It is fast and simple, but myopic: a high-probability first word can lock you into a low-probability sentence overall, the way grabbing the nearest exit can leave you on the slowest road home.

Beam search keeps the \(k\) best partial sequences (the “beam”) at every step instead of just one. At each step it expands all \(k\) candidates by every possible next token, scores the new partial sequences by their cumulative log-probability, and keeps only the top \(k\). It is a middle ground between greedy (\(k=1\)) and exhaustive search.

\[\text{score}(y_{1:t}) = \sum_{i=1}^{t} \log p(y_i \mid y_{<i}, x)\]

In words: a candidate sequence’s score is the sum of the log-probabilities of its tokens — how likely the model thinks the whole prefix is, added up step by step. Also written: \(\text{score} = \log \prod_i p(y_i\mid y_{<i}, x) = \sum_i \log p(y_i\mid y_{<i}, x)\) — we add logs instead of multiplying probabilities to avoid numeric underflow, and often divide by sequence length to stop the search from unfairly preferring short outputs.

flowchart LR
    s["start"] --> a["the (−0.2)"]
    s --> b["a (−1.1)"]
    a --> a1["cat (−0.5)"]
    a --> a2["dog (−0.9)"]
    b --> b1["car (−1.4)"]
    a1 --> keep1["beam keeps top-2"]
    a2 --> keep1
    b1 --> keep1

Sampling is the alternative when you want variety rather than the single most likely answer — chatbots, story generation, brainstorming. Instead of always taking the top token, you draw from the distribution. Two knobs tame it:

Temperature \(T\) rescales the logits before the softmax: \(T<1\) sharpens toward the top token (safer, more repetitive), \(T>1\) flattens the distribution (more surprising, more error-prone).
Top-k / nucleus (top-p) sampling restrict the draw to only the most probable tokens — top-k to a fixed count, top-p to the smallest set whose probabilities sum to \(p\) — so you keep variety without ever sampling an absurd word from the long tail.

\[p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}\]

In words: divide every logit by the temperature before softmax; small \(T\) makes the biggest logit dominate, large \(T\) evens the field. Also written: at \(T\to 0\) this collapses to \(\arg\max_i z_i\) (greedy); at \(T=1\) it is the model’s raw distribution.

The animation below shows the same three logits under a cooling and then warming temperature: low \(T\) spikes the favorite, high \(T\) levels the field.

import torch, torch.nn.functional as F
def sample_next(logits, temperature=1.0, top_p=0.9):
    logits = logits / temperature
    probs = F.softmax(logits, dim=-1)
    sorted_probs, idx = torch.sort(probs, descending=True)
    keep = torch.cumsum(sorted_probs, dim=-1) <= top_p   # nucleus: smallest set ≥ top_p
    keep[..., 0] = True                                  # always keep the top token
    sorted_probs[~keep] = 0
    choice = torch.multinomial(sorted_probs, 1)          # draw from the truncated set
    return idx.gather(-1, choice)

Tip

Rule of thumb: for tasks with one right answer (translation, transcription) use beam search; for open-ended generation (dialogue, stories) use temperature + nucleus sampling. Greedy is the fast baseline you reach for first and rarely keep.

16.8 — Regularizing Recurrent Nets

Recurrent nets overfit eagerly — they have many parameters reused over many steps, so a small training set is quickly memorized. The fixes are mostly the same tools as feedforward nets, but with one twist that trips people up: where and how you apply them on the time axis changes whether they help or hurt.

Dropout, applied correctly. Plain dropout drops a different random set of units at every time step. On a recurrent connection that is destructive — it injects fresh noise into the memory at each step, so the signal you are trying to carry across 50 steps is corrupted 50 times. Variational (or “locked”) dropout fixes this by sampling one dropout mask per sequence and reusing it at every time step, so the same units are dropped throughout — noise that the network can actually learn around. The contrast is easy to see: standard dropout zaps a different scatter of units at each step (left), locked dropout fixes the pattern for the whole sequence (right).

In practice, apply ordinary dropout freely between stacked layers (the vertical connections) and use the locked variant on the recurrent (horizontal) connection. PyTorch’s dropout= argument on nn.LSTM applies between layers only — exactly the safe place — and applies nothing if there is just one layer.

import torch.nn as nn
# dropout=0.3 acts BETWEEN the two layers; the recurrent path is left clean.
lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2,
               batch_first=True, dropout=0.3)

Layer normalization, not batch norm. Batch normalization (Chapter 15) struggles inside RNNs: statistics computed across the batch are unstable when sequences have different lengths and when the same cell is reused at every step. Layer normalization sidesteps the problem by normalizing across the features of a single example at a single step — no dependence on batch size or position — which is why it became the default normalizer for recurrent and, later, Transformer models.

\[\hat h = \frac{h - \mu}{\sqrt{\sigma^2 + \epsilon}}\,\gamma + \beta, \qquad \mu = \frac{1}{d}\sum_{j=1}^{d} h_j\]

In words: for each token’s hidden vector, subtract its own mean and divide by its own standard deviation, then re-scale and re-shift with learned \(\gamma, \beta\). Also written: the statistics \(\mu, \sigma^2\) are taken over the \(d\) feature dimensions of one vector — contrast batch norm, which takes them over the batch dimension for one feature.

Other staples. Weight decay (L2) reins in parameter magnitudes; early stopping on validation loss is cheap and effective given that RNNs often overfit before they finish converging; and zoneout is a recurrence-specific trick that randomly forces some hidden units to keep their previous value instead of updating — dropout’s cousin that, fittingly, preserves rather than zeros memory.

Warning

Common mistake: slapping standard dropout on the recurrent connection and wondering why long-range memory got worse. Reach for variational/locked dropout there, or keep dropout to the between-layer connections only.

16.9 — Convolutional & Attention Alternatives for Sequences

Recurrence is not the only way to handle order. Its defining weakness is that it is inherently sequential — step \(t\) cannot be computed until step \(t-1\) is done — so it cannot exploit the parallelism of modern GPUs and is slow to train on long sequences. Two families sidestep this, and together they explain why pure RNNs faded.

Temporal (1D) convolutions. A 1D convolution slides a small filter along the time axis, just as a 2D convolution slides over an image (Chapter 15). It reads a fixed window of recent steps in parallel — no hidden state threaded through time — so the whole sequence is processed at once. The intuition: instead of one reader passing notes forward, imagine many readers, each assigned a short window, all working simultaneously.

To reach far back without an enormous filter, dilated convolutions insert gaps between the filter taps, so each layer skips ahead and the receptive field grows exponentially with depth. Stacking dilations \(1, 2, 4, 8, \dots\) lets a modest stack of layers see thousands of steps back — the core idea behind WaveNet for raw-audio generation and Temporal Convolutional Networks (TCNs) for forecasting.

\[\text{receptive field} = 1 + (k - 1)\sum_{\ell} d_\ell\]

In words: the span a stacked dilated conv can “see” grows with the filter size \(k\) and the sum of the per-layer dilation factors \(d_\ell\) — double the dilation each layer and the reach doubles too. Also written: with \(L\) layers, filter size \(k\), and dilations \(d_\ell = 2^{\ell-1}\), the field is \(1 + (k-1)(2^{L}-1)\) — exponential in depth.

Worked example — reach in numbers. With filter size \(k=2\) and three stacked layers at dilations \(1, 2, 4\), the receptive field is \(1 + (2-1)(1+2+4) = 1 + 7 = 8\) steps. Add one more layer at dilation \(8\) and it jumps to \(1 + (1)(1+2+4+8) = 16\). Each extra layer doubles the reach — so just 10 layers already see past \(1000\) steps, something a vanilla RNN could never hold in memory.

import torch.nn as nn
# causal-ish dilated 1D conv stack (TCN flavour); pad to keep length, dilate to reach back
conv = nn.Sequential(
    nn.Conv1d(64, 64, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
    nn.Conv1d(64, 64, kernel_size=3, padding=4, dilation=4), nn.ReLU(),
)
x = torch.randn(8, 64, 200)   # (batch, channels, time) — note channels-first for Conv1d

Self-attention and the Transformer. The other escape is to drop recurrence entirely and let every position look directly at every other position in one parallel operation — self-attention. This solves the long-range problem (any two positions are one hop apart, not \(T\) steps apart) and the parallelism problem (all positions computed at once). The cost is quadratic in sequence length and a loss of the built-in notion of order, which is restored with positional encodings. This is the Transformer, and it is the direct sequel to the fixed-context bottleneck of §16.6 — covered in full in Chapter 17.

Approach	Parallel over time?	Long-range reach	Built-in order?
RNN / LSTM / GRU	no (sequential)	good (gated)	yes (recurrence)
Dilated 1D conv (TCN)	yes	grows with depth	yes (local windows)
Self-attention (Transformer)	yes	direct, any-to-any	no — added via positional encoding

Tip

When to still use an RNN: truly streaming/online settings (decide at each step with low latency), very long sequences where attention’s quadratic cost bites, and small-data regimes where a compact GRU generalizes better than a data-hungry Transformer. Recurrence is not obsolete — it is a specialist.

16.10 — Worked End-to-End: A Character-Level Language Model

To tie the chapter together, here is the smallest complete recurrent model that does something real: predict the next character of text. It is the canonical “hello world” of RNNs, and every piece below maps to a section above — embedding, a gated cell (§16.3), a many-to-many readout (§16.2), and sampling for generation (§16.7).

The task: given the characters seen so far, output a distribution over the next character. Train it on any corpus and it learns spelling, then words, then crude grammar — purely from the running hidden state.

import torch, torch.nn as nn, torch.nn.functional as F

class CharLM(nn.Module):
    def __init__(self, vocab, emb=64, hid=256, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm  = nn.LSTM(emb, hid, num_layers=layers,
                             batch_first=True, dropout=0.2)   # §16.8: between-layer dropout
        self.head  = nn.Linear(hid, vocab)                    # §16.2: one output per step
    def forward(self, x, state=None):
        e = self.embed(x)                                     # (B, T) -> (B, T, emb)
        out, state = self.lstm(e, state)                      # (B, T, hid)
        return self.head(out), state                          # logits over vocab, per step

# --- one training step (teacher forcing is implicit: targets are inputs shifted by one) ---
model = CharLM(vocab=96)
opt = torch.optim.Adam(model.parameters(), lr=3e-3)
x, y = batch[:, :-1], batch[:, 1:]            # predict char t+1 from chars up to t
logits, _ = model(x)
loss = F.cross_entropy(logits.reshape(-1, 96), y.reshape(-1))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)   # §16.2: tame exploding grads
opt.step(); opt.zero_grad()

# --- generation: feed back the model's own samples, carrying the state forward ---
@torch.no_grad()
def generate(model, start_ids, steps=200, temperature=0.8):
    ids, state = start_ids, None
    out = list(start_ids[0].tolist())
    for _ in range(steps):
        logits, state = model(ids, state)                 # carry hidden state across steps
        probs = F.softmax(logits[:, -1] / temperature, -1)
        nxt = torch.multinomial(probs, 1)                 # §16.7: temperature sampling
        out.append(nxt.item()); ids = nxt
    return out

Three things to notice, each a callback to the chapter. Training uses teacher forcing for free — the target sequence is just the input shifted by one, so the loss at step \(t\) always conditions on the true prefix. Generation, by contrast, feeds the model’s own samples back in (the exposure-bias gap of §16.6), which is exactly why temperature matters there and not in training. And the hidden state is threaded out of generate and back in on every call — the same running memory of §16.1, now used at inference time to keep the model coherent across hundreds of characters.

Note

Why this still teaches the field: scale this exact recipe up — bigger hidden size, more data, and (the one architectural swap) self-attention in place of the LSTM — and you have the skeleton of a modern language model. The training loop, the shifted targets, the sampling-with-temperature at generation time are unchanged. RNNs are where every one of those ideas was first made to work.

16.11 — Quick reference

Term / formula	Meaning in one line	When / why it matters
Hidden state \(h_t = \tanh(W_{xh}x_t + W_{hh}h_{t-1} + b)\)	running memory updated by shared weights each step	the core RNN recurrence; same weights at every \(t\)
BPTT, \(\nabla_{W_{hh}}L = \sum_t \nabla_{W_{hh}}L_t\)	unroll, then sum the shared weight’s gradient over all steps	how RNNs train; source of vanishing/exploding gradients
Vanishing / exploding gradient	factor \(\approx w^T\) shrinks to 0 or blows up over \(T\) steps	why plain RNNs forget past ~10–20 steps
Truncated BPTT(\(k\))	carry \(h\) forward (detached), backprop only \(k\) steps	train on long streams without exploding memory
Gradient clipping \(g \leftarrow g\cdot\tfrac{\tau}{\max(\\|g\\|,\tau)}\)	cap the gradient norm at \(\tau\), keep direction	the standard fix for exploding gradients
LSTM cell \(c_t = f_t\odot c_{t-1} + i_t\odot\tilde c_t\)	additive memory with forget/input/output gates	long-range memory; \(f_t{=}1,i_t{=}0\) copies a fact verbatim
Forget-gate bias \(b_f \approx 1\)–\(2\)	start gates near “remember”	faster convergence on long-dependency tasks
GRU \(h_t = (1-z_t)h_{t-1} + z_t\tilde h_t\)	one update gate \(z_t\) toggles keep vs. overwrite	lighter LSTM; strong default, fewer params
Bidirectional \(h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]\)	concatenate forward + backward readers	offline tasks (NER, POS) where future helps
Stacked (deep) RNN	layer \(\ell\)’s output sequence feeds layer \(\ell{+}1\)	builds temporal hierarchy; 2–3 layers typical
Seq2seq + context vector \(c\)	encoder compresses input → decoder unpacks output	variable-length mapping (translation); \(c\) is the bottleneck
Teacher forcing	feed ground-truth previous token while training	fast training; mind the exposure-bias gap at test
Beam search, \(\sum_i \log p(y_i\mid y_{<i},x)\)	keep top-\(k\) partial sequences by log-prob	one-right-answer decoding (translation, ASR)
Temperature / nucleus, \(p_i = \tfrac{e^{z_i/T}}{\sum_j e^{z_j/T}}\)	reshape distribution before sampling	open-ended generation (dialogue, stories)
Variational (locked) dropout	one dropout mask reused across all steps	safe regularizer for the recurrent path
Layer norm	normalize across features of one step	RNN/Transformer default; batch-size independent
Dilated 1D conv, field \(=1+(k-1)\sum_\ell d_\ell\)	gaps in the filter grow reach exponentially	parallel, long-reach alternative (WaveNet, TCN)
Self-attention	every position attends to every other in parallel	removes the bottleneck; the Transformer (Ch. 17)

16.12 — Key takeaways

An RNN carries a hidden state updated by the same weights at every step; unrolled, it is a deep weight-shared network trained by backpropagation through time.
Vanilla RNNs suffer vanishing/exploding gradients — in the scalar case the recurrent factor raised to the sequence length — so they forget dependencies beyond ~10–20 steps. Vanishing is fixed by architecture (gated cells); exploding is fixed by training (gradient clipping).
Sequence tasks come in four shapes (one-to-many, many-to-one, many-to-many aligned, seq2seq); the cell is identical — only which states you read out and feed back changes. Long streams are trained with truncated BPTT, carrying the hidden state forward but backpropagating only within a chunk.
The LSTM adds a cell state with forget/input/output gates; with \(f_t=1, i_t=0\) a memory is copied verbatim, and that additive path lets gradients survive across hundreds of steps. A small positive forget-gate bias makes it default to remembering.
The GRU is a lighter LSTM with two gates and no separate cell state; its update gate \(z_t\) alone toggles between keep (\(z_t\!\approx\!0\)) and overwrite (\(z_t\!\approx\!1\)) — a strong, cheaper default. Let validation, not gate count, choose between them.
Bidirectional RNNs read both directions (offline only); deep/stacked RNNs build temporal hierarchy. A stacked BiLSTM was the pre-Transformer NLP workhorse.
Seq2seq encoder–decoder maps variable-length input to variable-length output (trained with teacher forcing), but its single fixed-context vector is a bottleneck on long inputs — the problem that motivated attention.
Decoding turns per-step distributions into sequences: greedy and beam search for one-right-answer tasks, temperature + nucleus sampling for open-ended generation.
Regularize recurrent nets with variational/locked dropout on the recurrent path (or plain dropout between stacked layers) and layer norm rather than batch norm.
Recurrence is sequential and slow; dilated 1D convolutions (WaveNet, TCN) and self-attention (Transformers) parallelize over time and reach further, leaving RNNs a specialist tool for streaming, very long, or small-data settings.

16.13 — See also

Neural Networks (Core) — backpropagation, activations, and the feedforward machinery RNNs unroll into.
Convolutional Neural Networks — the spatial counterpart to weight sharing across time, and the source of the 1D/dilated convolutions used for sequences.
Attention & Transformers — how attention removes the fixed-context bottleneck and supersedes recurrence.
Natural Language Processing — tagging, translation, and the tasks RNNs and seq2seq were built for.
Large Language Models — where the Transformer successors to these models now live.
Time Series & Forecasting — applying recurrent and sequence models to numeric temporal data.
Speech & Audio Processing — streaming vs. offline recognition, a classic BiRNN/seq2seq domain.

↪ The thread continues → Chapter 17 · ⚡ Attention & Transformers

Recurrent nets pass information one step at a time and forget. The fix — letting every position look directly at every other — is attention, and the Transformer it powers is the spine of all that follows.

📖 All chapters | ← 15 · 🖼️ Convolutional Neural Networks | 17 · ⚡ Attention & Transformers →