flowchart LR
x1["x₁"] --> A1((h₁))
x2["x₂"] --> A2((h₂))
x3["x₃"] --> A3((h₃))
h0["h₀"] --> A1
A1 -->|W_hh| A2
A2 -->|W_hh| A3
A1 --> y1["y₁"]
A2 --> y2["y₂"]
A3 --> y3["y₃"]
Chapter 16 — 🔁 Recurrent & Sequence Models
📖 All chapters | ← 15 · 🖼️ Convolutional Neural Networks | 17 · ⚡ Attention & Transformers →
📚 Jump to any chapter
🧮 Mathematical Foundations
- 01 · 🧮 Linear Algebra
- 02 · ∂ Calculus & Differentiation
- 03 · 📉 Optimization
- 04 · 🎲 Probability & Statistics
🧭 The ML Workflow
🧩 Classical Machine Learning
- 08 · 📈 Regression
- 09 · 📐 Classification Algorithms
- 10 · 🌳 Ensemble Methods
- 11 · 🔮 Clustering & Unsupervised Learning
- 12 · 🎯 Model Evaluation & Tuning
🎲 Probabilistic Models
🧠 Deep Learning
- 14 · 🧠 Neural Networks (Core)
- 15 · 🖼️ Convolutional Neural Networks
- 16 · 🔁 Recurrent & Sequence Models
- 17 · ⚡ Attention & Transformers
- 18 · 🎨 Generative Models
🗣️ Applied AI: Vision, Language, Audio & Time
- 19 · 👁️ Computer Vision
- 20 · 💬 Natural Language Processing
- 21 · 🔊 Speech & Audio Processing
- 22 · ⏳ Time Series & Forecasting
- 23 · 📚 Large Language Models
- 24 · 🌈 Multimodal AI
🕹️ Reinforcement Learning
🛠️ Applied ML Systems & Industries
🚀 Production, Tooling & Infrastructure
📚 Classical & Symbolic AI
- 32 · 🧭 Search & Problem Solving
- 33 · 📖 Knowledge Representation & Reasoning
- 34 · 🗺️ Planning, Constraint Satisfaction & Game Playing
- 35 · 🧬 Evolutionary Computation & Metaheuristics
⚖️ Responsible AI & Frontier
- 36 · 🔍 Explainable AI & Interpretability
- 37 · 🧷 Causal Inference
- 38 · ⚖️ AI Ethics, Fairness & Safety
- 39 · 🌠 Frontier & Emerging Directions
🎓 Advanced & Specialized Topics
- 40 · 🔗 Graph Machine Learning
- 41 · 🤖 Robotics & Autonomy
- 42 · 📐 Learning Theory
- 43 · 🔎 Information Retrieval & Data Mining
- 44 · 🏗️ LLM Systems: Building LLMs from Scratch
🎚️ Post-Training & Fine-Tuning
- 45 · 🎚️ Post-Training I — Transfer, Fine-Tuning & PEFT
- 46 · 🏅 Post-Training II — Alignment & Evaluation
🚢 Model Serving & Deployment
Most data in the world arrives as a sequence: words in a sentence, samples of audio, daily sensor readings, frames of video. A plain feedforward network sees a fixed-size vector and forgets everything between examples — it has no notion of “before” and “after.” Recurrent and sequence models fix this by giving the network a memory that carries information forward as it reads a sequence one step at a time. They are the bridge between classical neural nets and the attention-based Transformers that now dominate language and vision, and understanding them is the cleanest way to understand why attention was invented.
🧭 In context: Deep Learning · models for ordered/variable-length data (text, audio, time series) · the one key idea: a recurring hidden state that summarizes the past so the same weights can be applied at every time step.
💡 Remember this: a recurrent net is a
forloop with shared weights — one hidden state carries the past forward — and every advance (gating, attention, Transformers) exists to fix how that single thread of memory fails over long sequences.
16.1 — Recurrent Neural Networks
A recurrent neural network (RNN) processes a sequence \(x_1, x_2, \dots, x_T\) one element at a time, maintaining a hidden state \(h_t\) — a vector that acts as the network’s running memory of everything it has seen so far. At each step it combines the new input with the previous memory to produce a new memory:
\[h_t = \tanh(W_{xh}\,x_t + W_{hh}\,h_{t-1} + b_h), \qquad y_t = W_{hy}\,h_t + b_y\]
In words: the new memory is a squashed mix of this step’s input and last step’s memory; the output is a linear read-out of that memory. Also written: \(h_t = \tanh\!\big(W_{xh} x_t + W_{hh} h_{t-1} + b_h\big)\) is often compressed to \(h_t = \tanh\!\big(W[h_{t-1}, x_t] + b\big)\), where \(W = [\,W_{hh}\;\;W_{xh}\,]\) is the two matrices stacked side by side and \([h_{t-1}, x_t]\) is the two vectors concatenated.
The crucial detail is that the same weight matrices \(W_{xh}, W_{hh}, W_{hy}\) are reused at every time step. This weight sharing is what lets an RNN handle sequences of any length with a fixed number of parameters — exactly the way a convolutional filter (Chapter 15) shares weights across space, an RNN shares weights across time.
Think of reading a sentence. Your understanding after word 5 (\(h_5\)) depends on word 5 plus your understanding after word 4 (\(h_4\)). You apply the same “reading process” to every word; the running interpretation is the hidden state.
Unrolling through time. A single recurrent cell with a self-loop is hard to reason about. We unroll it: draw one copy of the cell per time step, with the hidden state threaded from each copy to the next. The unrolled picture is just a very deep feedforward network whose layers happen to share weights.
Worked example — counting with a tiny RNN. Let the hidden state be a single scalar, \(W_{hh}=1\), \(W_{xh}=1\), \(b_h=0\), and drop the nonlinearity. Feed the inputs \(x = [2, 3, 5]\) starting from \(h_0 = 0\):
- \(h_1 = 1\cdot 2 + 1\cdot 0 = 2\)
- \(h_2 = 1\cdot 3 + 1\cdot 2 = 5\)
- \(h_3 = 1\cdot 5 + 1\cdot 5 = 10\)
The hidden state has learned to accumulate a running sum — a memory of the whole prefix in one number. Different weights would teach it to remember different things.
import numpy as np
# tiny RNN cell, one forward pass over a sequence
def rnn_forward(xs, Wxh, Whh, Why, bh, by):
h = np.zeros(Whh.shape[0]) # h0 = zeros
ys = []
for x in xs: # one step per element
h = np.tanh(Wxh @ x + Whh @ h + bh) # update memory
ys.append(Why @ h + by) # emit output
return np.array(ys), hThe same cell exists ready-made in every framework. In PyTorch the whole loop above is one module — you hand it the full sequence and it unrolls internally:
import torch, torch.nn as nn
rnn = nn.RNN(input_size=8, hidden_size=16, num_layers=1, batch_first=True)
x = torch.randn(4, 10, 8) # (batch, time, features)
outputs, h_n = rnn(x) # outputs: (4, 10, 16); h_n: last hidden stateBackpropagation through time (BPTT). To train, we unroll the network across the whole sequence and run ordinary backpropagation on that deep graph. Because \(W_{hh}\) is shared, its gradient is the sum of the contributions from every time step:
\[\frac{\partial L}{\partial W_{hh}} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W_{hh}}\]
In words: the recurrent weight is used at every time step, so its total gradient is the pile-up of one gradient per step. Also written: \(\nabla_{W_{hh}} L = \sum_{t=1}^{T} \nabla_{W_{hh}} L_t\) — a sum over the shared parameter’s many uses, the same accumulation rule any tied/weight-shared parameter obeys.
This is BPTT, and it has one fatal weakness. To learn that step 1 matters for the loss at step \(T\), the gradient has to travel all the way back from \(T\) to \(1\). Every hop backward multiplies it by the same recurrent factor. Multiply one number by itself \(T\) times and only two things can happen:
- if that number is less than 1, the product shrinks toward zero — the gradient vanishes, and the RNN simply cannot feel that early inputs mattered;
- if it is greater than 1, the product blows up — the gradient explodes, and training destabilizes.
Concretely, in the one-dimensional case (hidden state is a scalar with recurrent weight \(w\)), the factor is \(w\cdot\tanh'\), and the gradient reaching step 1 is roughly that factor raised to the \(T\)-th power — exactly the runaway-or-die behavior above. In the full vector case the same story holds, governed by the largest singular value of \(W_{hh}\) instead of a single number. This one problem is the entire reason gated architectures (next sections) exist.
The two failure modes are easy to picture: trace one number back through the unrolled chain and watch it shrink to nothing or blow up.
Intuition: an RNN is a for loop with learnable weights. Picture the unrolled graph, not the self-loop — every confusing thing about RNNs (BPTT, vanishing gradients, truncation) is obvious once you see it as a very deep, weight-shared network.
Common mistake: treating long-range dependencies as “just a bit harder” for a vanilla RNN. In practice a plain \(\tanh\) RNN struggles to connect events more than ~10–20 steps apart. If your task needs longer memory, reach for an LSTM/GRU or attention rather than a deeper vanilla RNN.
16.2 — Sequence Task Shapes & Truncated BPTT
Before the gated cells, it helps to name what you are asking a recurrent net to do and how you actually train it on long streams — two practical pieces every textbook covers and that decide how you wire the model.
The four task shapes. RNNs are flexible because input and output lengths can differ independently. Karpathy’s classic picture sorts almost every sequence task into one of four patterns:
| Shape | Input → Output | Example | How to read the RNN |
|---|---|---|---|
| one-to-many | one vector → sequence | image → caption | feed input once, generate until a stop token |
| many-to-one | sequence → one vector | review → sentiment | run to the end, use the last hidden state |
| many-to-many (aligned) | sequence → same-length sequence | tag each word (POS) | emit one output per step |
| many-to-many (seq2seq) | sequence → different-length sequence | translate a sentence | encoder then decoder (§16.6) |
flowchart LR
subgraph one_to_many["one-to-many"]
a1["x"] --> a2["■■■■"]
end
subgraph many_to_one["many-to-one"]
b1["■■■■"] --> b2["y"]
end
subgraph aligned["many-to-many aligned"]
c1["■■■■"] --> c2["■■■■"]
end
The lesson is purely about plumbing: the cell is the same RNN/LSTM/GRU in every case — what changes is which hidden states you read out and feed back. A sentiment classifier ignores all outputs but the last; a tagger keeps every output; a caption generator feeds each emitted token back as the next input.
Truncated BPTT. A document or audio stream can be tens of thousands of steps long. Unrolling that whole thing to compute gradients would blow up memory and run impossibly slowly. Truncated BPTT chops the stream into chunks of, say, 100 steps: you backpropagate gradients only within a chunk, but you carry the hidden state forward (detached from the graph) into the next chunk. So the forward memory still spans the whole sequence, while the backward gradient sees only one chunk.
The picture: the forward pass (top) flows unbroken across the whole stream, but each chunk’s backward pass (bottom) stops at the chunk boundary — gradients never cross the dashed “detach” lines.
\[\frac{\partial L}{\partial W_{hh}} \;\approx\; \sum_{t=T-k}^{T} \frac{\partial L_t}{\partial W_{hh}}\]
In words: instead of summing gradient contributions over all time steps, only sum over the last \(k\) — the recent window you actually unrolled. Also written: “TBPTT(\(k_1, k_2\))” — process \(k_1\) new steps then backpropagate \(k_2\) steps back; the simple version above takes \(k_1 = k_2 = k\).
# truncated BPTT: carry h across chunks, but detach so grads stop at the chunk boundary
h = None
for chunk in stream_chunks(data, length=100): # each chunk is (batch, 100, feat)
if h is not None:
h = h.detach() # cut the graph; keep the values
out, h = rnn(chunk, h)
loss = criterion(head(out), targets_for(chunk))
loss.backward() # only unrolls within this chunk
optimizer.step(); optimizer.zero_grad()Gradient clipping is the standard partner fix for the exploding side of §16.1. Whenever the global gradient norm exceeds a threshold \(\tau\), rescale the whole gradient down so its norm is exactly \(\tau\) — direction preserved, magnitude capped:
\[g \leftarrow g \cdot \frac{\tau}{\max(\|g\|, \tau)}\]
In words: if the gradient is too long, shrink it back to length \(\tau\) without changing where it points; otherwise leave it alone. Also written: \(g \leftarrow \min\!\big(1, \tfrac{\tau}{\|g\|}\big)\, g\) — equivalent, and in PyTorch literally torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=tau).
Worked example — clipping in numbers. Say the threshold is \(\tau = 5\) and one backward pass produces a gradient with norm \(\|g\| = 20\). The scale factor is \(\tfrac{5}{\max(20,5)} = \tfrac{5}{20} = 0.25\), so every component is multiplied by \(0.25\) — the gradient now has norm exactly \(5\) and points in the same direction. A different step with \(\|g\| = 3\) gets factor \(\tfrac{5}{\max(3,5)} = 1\), so it passes through untouched. Clipping only ever shrinks the occasional spike; it never grows or rotates a healthy gradient.
Rule of thumb: vanishing gradients are an architecture problem (use a gated cell); exploding gradients are a training problem (clip the norm). Reach for the matching tool — clipping a vanilla RNN will not give it long memory.
16.3 — Long Short-Term Memory (LSTM)
The LSTM (Hochreiter & Schmidhuber, 1997) solves vanishing gradients with one elegant idea: alongside the hidden state, carry a cell state \(c_t\) — a memory “conveyor belt” that runs straight down the sequence with only minor, additive interactions. Because information moves along \(c_t\) mostly by addition rather than repeated matrix multiplication, gradients can flow across hundreds of steps without vanishing.
Three gates — small sigmoid networks producing values in \([0,1]\) — decide what happens to the cell state. A gate output near 0 means “block,” near 1 means “let through.”
- Forget gate \(f_t\): what to erase from the old cell state.
- Input gate \(i_t\): what new information to write.
- Output gate \(o_t\): what part of the cell to expose as the hidden state.
\[ \begin{aligned} f_t &= \sigma(W_f[h_{t-1}, x_t] + b_f) \\ i_t &= \sigma(W_i[h_{t-1}, x_t] + b_i) \\ \tilde c_t &= \tanh(W_c[h_{t-1}, x_t] + b_c) \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde c_t \\ o_t &= \sigma(W_o[h_{t-1}, x_t] + b_o) \\ h_t &= o_t \odot \tanh(c_t) \end{aligned} \]
In words: decide how much old memory to keep (\(f_t\)) and how much new candidate to add (\(i_t\)); the cell becomes kept old + added new; the hidden state is a gated (\(o_t\)) window onto that cell. Also written: the central update \(c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t\) is a leaky accumulator — equivalently \(c_t = c_{t-1} + \big(i_t \odot \tilde c_t - (1-f_t)\odot c_{t-1}\big)\), i.e. old value plus a correction, which is why the gradient path along \(c\) is near-identity when \(f_t \approx 1\).
Here \(\odot\) is element-wise multiplication and \([h_{t-1}, x_t]\) is the two vectors concatenated. The line \(c_t = f_t \odot c_{t-1} + i_t \odot \tilde c_t\) is the heart of it: keep a fraction of the old memory, add a fraction of the new candidate.
flowchart LR
cprev["c₍t-1₎"] --> MUL1["× (forget)"]
f["f_t σ"] --> MUL1
MUL1 --> ADD["+"]
i["i_t σ"] --> MUL2["×"]
cand["c̃_t tanh"] --> MUL2
MUL2 --> ADD
ADD --> cnext["c_t"]
cnext --> TANH["tanh"]
o["o_t σ"] --> MUL3["×"]
TANH --> MUL3
MUL3 --> hnext["h_t"]
Worked example — the gate that makes memory survive. Take a single cell dimension holding \(c_{t-1} = 0.8\) (say, the fact “subject is plural”). Suppose at this step the gates compute \(f_t = 1\), \(i_t = 0\), and the candidate is \(\tilde c_t = -0.5\). The cell update is
\[c_t = f_t\cdot c_{t-1} + i_t\cdot\tilde c_t = 1\cdot 0.8 + 0\cdot(-0.5) = 0.8.\]
The memory is copied verbatim — the incoming candidate is multiplied by zero and ignored, so nothing overwrites the stored fact. Run the same gates for ten more steps and \(c\) stays exactly \(0.8\) the whole way; the gradient along that path is multiplied by \(f_t = 1\) each time, so it neither vanishes nor explodes. Now flip the gates to \(f_t = 0\), \(i_t = 1\): then \(c_t = 0\cdot 0.8 + 1\cdot(-0.5) = -0.5\), a clean overwrite. Same machinery, opposite behavior — that selectable mix of “keep” and “write” is what a vanilla RNN cannot do.
To make the additive path concrete, here is the full cell as code:
import numpy as np
def sigmoid(z): return 1/(1+np.exp(-z))
def lstm_step(x, h, c, P): # P holds all weight matrices/biases
z = np.concatenate([h, x]) # [h_{t-1}, x_t]
f = sigmoid(P['Wf'] @ z + P['bf']) # forget
i = sigmoid(P['Wi'] @ z + P['bi']) # input
o = sigmoid(P['Wo'] @ z + P['bo']) # output
g = np.tanh (P['Wc'] @ z + P['bc']) # candidate
c = f*c + i*g # additive memory update
h = o*np.tanh(c)
return h, cIn production you never hand-roll this — a one-line nn.LSTM runs the optimized, cuDNN-backed version, and a tiny head turns the sequence into a prediction:
import torch, torch.nn as nn
class Tagger(nn.Module): # many-to-one: classify a whole sequence
def __init__(self, vocab, emb=64, hid=128, classes=2):
super().__init__()
self.embed = nn.Embedding(vocab, emb)
self.lstm = nn.LSTM(emb, hid, batch_first=True)
self.head = nn.Linear(hid, classes)
def forward(self, x): # x: (batch, time) token ids
e = self.embed(x)
out, (h_n, c_n) = self.lstm(e)
return self.head(h_n[-1]) # use the final hidden stateIn words, the worked sentence “The keys that I left on the table by the door … are on the floor” is exactly this trick at scale: the LSTM writes “plural” into a cell dimension via the input gate, holds the forget gate near 1 so it survives every intervening word untouched, then opens the output gate when the verb is needed. A vanilla RNN would have washed that signal out long before “are.”
A practical detail — the forget-gate bias. A standard trick is to initialize \(b_f\) to a small positive number (often \(1\) or \(2\)). That starts every forget gate near \(\sigma(1)\approx 0.73\), so the cell defaults to remembering and only learns to forget where forgetting helps. Without it an LSTM can spend the early epochs erasing its own memory before it learns to keep anything — a one-line change that noticeably speeds convergence on long-dependency tasks.
Intuition: the cell state is a notebook. The forget gate erases lines, the input gate writes new ones, the output gate decides which lines you read aloud right now. Addition (not multiplication) along that notebook is the trick that keeps gradients alive.
16.4 — Gated Recurrent Unit (GRU)
The GRU (Cho et al., 2014) is a streamlined LSTM: it merges the cell and hidden state into one vector and uses just two gates instead of three, giving fewer parameters and faster training with usually comparable accuracy.
- Update gate \(z_t\): interpolates between keeping the old state and adopting a new candidate (it fuses the LSTM’s forget and input gates into one knob).
- Reset gate \(r_t\): how much past state to use when forming the candidate.
\[ \begin{aligned} z_t &= \sigma(W_z[h_{t-1}, x_t]) \\ r_t &= \sigma(W_r[h_{t-1}, x_t]) \\ \tilde h_t &= \tanh(W_h[\,r_t \odot h_{t-1},\, x_t\,]) \\ h_t &= (1 - z_t)\odot h_{t-1} + z_t \odot \tilde h_t \end{aligned} \]
In words: one gate (\(z_t\)) decides how much of the new candidate to mix into the old state; another (\(r_t\)) decides how much past to use while building that candidate. Also written: the blend \(h_t = (1-z_t)\odot h_{t-1} + z_t\odot\tilde h_t\) is an exponential-moving-average update with a learned, per-dimension rate \(z_t\) — the same convex-combination form as \(\text{new} = (1-\alpha)\,\text{old} + \alpha\,\text{fresh}\).
The final line is a convex blend: when \(z_t \approx 0\) the unit copies the previous state verbatim (perfect long-term memory, gradient passes through untouched); when \(z_t \approx 1\) it overwrites with the fresh candidate. One gate, both behaviors.
flowchart LR
hprev["h₍t-1₎"] --> KEEP["× (1 − z_t)"]
z["z_t σ"] --> KEEP
z --> WRITE["× z_t"]
cand["h̃_t (uses r_t)"] --> WRITE
KEEP --> ADD["+"]
WRITE --> ADD
ADD --> hnext["h_t"]
Worked example — the one knob, both extremes. Take a single dimension with old state \(h_{t-1} = 0.9\) and a fresh candidate \(\tilde h_t = 0.1\). The update gate alone decides the outcome:
- \(z_t \approx 0\) (gate shut, remember): \(h_t = (1-0)\cdot 0.9 + 0\cdot 0.1 = 0.9\). The old value is carried forward untouched — long-term memory, gradient multiplied by \((1-z_t)=1\).
- \(z_t \approx 1\) (gate open, overwrite): \(h_t = (1-1)\cdot 0.9 + 1\cdot 0.1 = 0.1\). The past is discarded and the candidate takes over.
- \(z_t = 0.5\) (blend): \(h_t = 0.5\cdot 0.9 + 0.5\cdot 0.1 = 0.5\) — a smooth interpolation between the two.
This is the same keep-or-write choice the LSTM splits across its forget and input gates, here collapsed into a single number. The reset gate \(r_t\) plays a different role: it only affects how the candidate is built — set \(r_t \approx 0\) and the candidate \(\tilde h_t\) ignores the past entirely, letting the unit start a fresh representation when a sequence boundary arrives.
Swapping LSTM for GRU in a framework is a one-word change — same call shape, fewer parameters under the hood:
import torch.nn as nn
gru = nn.GRU(input_size=64, hidden_size=128, num_layers=2,
batch_first=True, dropout=0.2) # drop-in replacement for nn.LSTM| Property | Vanilla RNN | LSTM | GRU |
|---|---|---|---|
| Gates | 0 | 3 (f, i, o) | 2 (z, r) |
| Separate cell state | no | yes (\(c_t\)) | no |
| Params per unit | fewest | most | middle |
| Long-range memory | poor | excellent | very good |
| Typical use | toy / very short | long dependencies, ample data | strong default, smaller data |
The honest practical takeaway: GRU and LSTM perform similarly on most tasks. Start with a GRU (cheaper); switch to an LSTM if you need the extra capacity and have the data to train it.
Common mistake: assuming more gates always means more accuracy. On small or moderate datasets the GRU often matches or beats the LSTM precisely because it has fewer parameters to overfit. Let validation performance, not gate count, decide.
16.5 — Bidirectional & Deep RNNs
A standard RNN reads strictly left-to-right, so \(h_t\) knows the past but not the future. Yet in many tasks the future disambiguates the present: in “I left my bank …”, the right word — riverbank or financial — depends on what comes after. A bidirectional RNN (BiRNN) runs two independent RNNs, one forward and one backward, then concatenates their hidden states at each position so every output sees the whole sequence:
\[h_t = [\,\overrightarrow{h_t}\,;\,\overleftarrow{h_t}\,]\]
In words: glue together what a forward reader knows after position \(t\) and what a backward reader knows up to position \(t\), so the combined state has seen the entire sequence. Also written: \(h_t = \overrightarrow{h_t} \,\Vert\, \overleftarrow{h_t}\), where \(\Vert\) denotes concatenation; if each direction has width \(d\), the combined \(h_t\) has width \(2d\).
flowchart LR
subgraph Forward
f1((→h₁)) --> f2((→h₂)) --> f3((→h₃))
end
subgraph Backward
b3((←h₃)) --> b2((←h₂)) --> b1((←h₁))
end
f1 & b1 --> o1["y₁"]
f2 & b2 --> o2["y₂"]
f3 & b3 --> o3["y₃"]
The catch: a BiRNN needs the entire sequence before it can emit any output, so it cannot be used for real-time/streaming tasks (live transcription, next-token generation). It shines in offline labeling — named-entity recognition, part-of-speech tagging — where the full input is available.
Deep (stacked) RNNs add representational depth by stacking layers: the hidden-state sequence from layer 1 becomes the input sequence to layer 2, and so on. Lower layers tend to capture short, local patterns; higher layers capture longer, more abstract structure — the temporal analog of stacking convolutions for spatial hierarchy. Two or three layers is typical; beyond that, gains shrink and training gets fragile.
Bidirectionality and depth combine freely, and that combination — the stacked BiLSTM — was the workhorse of pre-Transformer NLP. The picture below draws it directly: at each time step two LSTM cells (forward ▶ and backward ◀) read the input, their outputs are concatenated, and that concatenation feeds a second bidirectional layer above. Read horizontally to follow each direction’s recurrence; read vertically to follow depth.
A stacked BiLSTM is, again, one constructor call — bidirectional=True builds both directions and concatenates them for you:
import torch.nn as nn
bilstm = nn.LSTM(input_size=100, hidden_size=128, num_layers=2,
batch_first=True, bidirectional=True)
# output width = 2 * hidden_size = 256 (forward ‖ backward), per time stepRule of thumb: use bidirectional when you have the whole sequence up front and need maximum accuracy; use unidirectional when you must produce outputs as data streams in. Stack 2–3 layers before reaching for anything fancier.
16.6 — Sequence-to-Sequence / Encoder–Decoder
Many tasks map an input sequence to an output sequence of a different length: translation (English → French), summarization (long → short), speech-to-text. The sequence-to-sequence (seq2seq) architecture (Sutskever et al., 2014) handles this with two RNNs.
The encoder reads the entire input and compresses it into a single fixed-length vector — the context vector \(c\), usually its final hidden state. The decoder is a second RNN initialized from \(c\) that generates the output one token at a time, each step feeding its own previous output back in as the next input, until it emits a special end-of-sequence token.
flowchart LR
x1["Le"] --> e1
x2["chat"] --> e2
x3["dort"] --> e3
subgraph Encoder
e1((h₁)) --> e2((h₂)) --> e3((h₃))
end
subgraph Decoder
d1((s₁)) --> d2((s₂)) --> d3((s₃))
end
e3 -->|context c| d1
d1 --> y1["The"]
d2 --> y2["cat"]
d3 --> y3["sleeps"]
Teacher forcing. A practical wrinkle: at training time, what does the decoder feed back as the “previous output”? If you feed back the decoder’s own prediction and it is wrong early on, the error compounds and training barely moves. Teacher forcing instead feeds back the ground-truth previous token during training, so the decoder always conditions on a correct prefix and learns fast. The mismatch — train on gold prefixes, test on self-generated ones — is called exposure bias, and is usually softened by mixing the two (sometimes feeding the model’s own prediction with some probability).
# one decoder step with teacher forcing (training)
dec_input = target[:, t-1] # ground-truth previous token, not the prediction
emb = embed(dec_input)
out, hidden = decoder(emb, hidden)
logits_t = out_head(out) # compared against target[:, t] in the lossWorked intuition — and the bottleneck. To translate “Le chat dort” → “The cat sleeps,” the encoder squeezes all three French words into one vector \(c\); the decoder unpacks it into three English words. This works for short sentences. But notice the structural flaw: every input sentence — three words or fifty — must be crushed into the same fixed-size \(c\). This is the fixed-context bottleneck.
As sentences grow, this single vector becomes a crippling information funnel — the decoder, especially when producing early output words, can no longer “see” the relevant parts of a long input, and translation quality collapses on long sentences. The fix was to let the decoder look back at all encoder hidden states and learn which to focus on at each step. That mechanism is attention, and removing the recurrence entirely while keeping attention gives the Transformer — the subject of the next chapter, and the architecture that made RNNs largely optional.
Common mistake: scaling a vanilla seq2seq to long documents and blaming the RNN cell. The limiter is usually the fixed-size context vector, not the LSTM/GRU inside it. The cure is attention (Chapter 17), not a bigger hidden state.
16.7 — Decoding: Turning Probabilities into Sequences
A trained decoder does not hand you a sentence — at each step it hands you a probability distribution over the vocabulary. Turning that stream of distributions into one output sequence is decoding, and the strategy you choose matters as much as the model. Picture standing at the root of a branching tree: every node offers thousands of next-word choices, and you must walk to a leaf. You cannot try every path — a 20-word sentence over a 50,000-word vocabulary has \(50000^{20}\) of them.
Greedy decoding takes the single most probable token at each step and never looks back. It is fast and simple, but myopic: a high-probability first word can lock you into a low-probability sentence overall, the way grabbing the nearest exit can leave you on the slowest road home.
Beam search keeps the \(k\) best partial sequences (the “beam”) at every step instead of just one. At each step it expands all \(k\) candidates by every possible next token, scores the new partial sequences by their cumulative log-probability, and keeps only the top \(k\). It is a middle ground between greedy (\(k=1\)) and exhaustive search.
\[\text{score}(y_{1:t}) = \sum_{i=1}^{t} \log p(y_i \mid y_{<i}, x)\]
In words: a candidate sequence’s score is the sum of the log-probabilities of its tokens — how likely the model thinks the whole prefix is, added up step by step. Also written: \(\text{score} = \log \prod_i p(y_i\mid y_{<i}, x) = \sum_i \log p(y_i\mid y_{<i}, x)\) — we add logs instead of multiplying probabilities to avoid numeric underflow, and often divide by sequence length to stop the search from unfairly preferring short outputs.
flowchart LR
s["start"] --> a["the (−0.2)"]
s --> b["a (−1.1)"]
a --> a1["cat (−0.5)"]
a --> a2["dog (−0.9)"]
b --> b1["car (−1.4)"]
a1 --> keep1["beam keeps top-2"]
a2 --> keep1
b1 --> keep1
Sampling is the alternative when you want variety rather than the single most likely answer — chatbots, story generation, brainstorming. Instead of always taking the top token, you draw from the distribution. Two knobs tame it:
- Temperature \(T\) rescales the logits before the softmax: \(T<1\) sharpens toward the top token (safer, more repetitive), \(T>1\) flattens the distribution (more surprising, more error-prone).
- Top-k / nucleus (top-p) sampling restrict the draw to only the most probable tokens — top-k to a fixed count, top-p to the smallest set whose probabilities sum to \(p\) — so you keep variety without ever sampling an absurd word from the long tail.
\[p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}\]
In words: divide every logit by the temperature before softmax; small \(T\) makes the biggest logit dominate, large \(T\) evens the field. Also written: at \(T\to 0\) this collapses to \(\arg\max_i z_i\) (greedy); at \(T=1\) it is the model’s raw distribution.
The animation below shows the same three logits under a cooling and then warming temperature: low \(T\) spikes the favorite, high \(T\) levels the field.
import torch, torch.nn.functional as F
def sample_next(logits, temperature=1.0, top_p=0.9):
logits = logits / temperature
probs = F.softmax(logits, dim=-1)
sorted_probs, idx = torch.sort(probs, descending=True)
keep = torch.cumsum(sorted_probs, dim=-1) <= top_p # nucleus: smallest set ≥ top_p
keep[..., 0] = True # always keep the top token
sorted_probs[~keep] = 0
choice = torch.multinomial(sorted_probs, 1) # draw from the truncated set
return idx.gather(-1, choice)Rule of thumb: for tasks with one right answer (translation, transcription) use beam search; for open-ended generation (dialogue, stories) use temperature + nucleus sampling. Greedy is the fast baseline you reach for first and rarely keep.
16.8 — Regularizing Recurrent Nets
Recurrent nets overfit eagerly — they have many parameters reused over many steps, so a small training set is quickly memorized. The fixes are mostly the same tools as feedforward nets, but with one twist that trips people up: where and how you apply them on the time axis changes whether they help or hurt.
Dropout, applied correctly. Plain dropout drops a different random set of units at every time step. On a recurrent connection that is destructive — it injects fresh noise into the memory at each step, so the signal you are trying to carry across 50 steps is corrupted 50 times. Variational (or “locked”) dropout fixes this by sampling one dropout mask per sequence and reusing it at every time step, so the same units are dropped throughout — noise that the network can actually learn around. The contrast is easy to see: standard dropout zaps a different scatter of units at each step (left), locked dropout fixes the pattern for the whole sequence (right).
In practice, apply ordinary dropout freely between stacked layers (the vertical connections) and use the locked variant on the recurrent (horizontal) connection. PyTorch’s dropout= argument on nn.LSTM applies between layers only — exactly the safe place — and applies nothing if there is just one layer.
import torch.nn as nn
# dropout=0.3 acts BETWEEN the two layers; the recurrent path is left clean.
lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2,
batch_first=True, dropout=0.3)Layer normalization, not batch norm. Batch normalization (Chapter 15) struggles inside RNNs: statistics computed across the batch are unstable when sequences have different lengths and when the same cell is reused at every step. Layer normalization sidesteps the problem by normalizing across the features of a single example at a single step — no dependence on batch size or position — which is why it became the default normalizer for recurrent and, later, Transformer models.
\[\hat h = \frac{h - \mu}{\sqrt{\sigma^2 + \epsilon}}\,\gamma + \beta, \qquad \mu = \frac{1}{d}\sum_{j=1}^{d} h_j\]
In words: for each token’s hidden vector, subtract its own mean and divide by its own standard deviation, then re-scale and re-shift with learned \(\gamma, \beta\). Also written: the statistics \(\mu, \sigma^2\) are taken over the \(d\) feature dimensions of one vector — contrast batch norm, which takes them over the batch dimension for one feature.
Other staples. Weight decay (L2) reins in parameter magnitudes; early stopping on validation loss is cheap and effective given that RNNs often overfit before they finish converging; and zoneout is a recurrence-specific trick that randomly forces some hidden units to keep their previous value instead of updating — dropout’s cousin that, fittingly, preserves rather than zeros memory.
Common mistake: slapping standard dropout on the recurrent connection and wondering why long-range memory got worse. Reach for variational/locked dropout there, or keep dropout to the between-layer connections only.
16.9 — Convolutional & Attention Alternatives for Sequences
Recurrence is not the only way to handle order. Its defining weakness is that it is inherently sequential — step \(t\) cannot be computed until step \(t-1\) is done — so it cannot exploit the parallelism of modern GPUs and is slow to train on long sequences. Two families sidestep this, and together they explain why pure RNNs faded.
Temporal (1D) convolutions. A 1D convolution slides a small filter along the time axis, just as a 2D convolution slides over an image (Chapter 15). It reads a fixed window of recent steps in parallel — no hidden state threaded through time — so the whole sequence is processed at once. The intuition: instead of one reader passing notes forward, imagine many readers, each assigned a short window, all working simultaneously.
To reach far back without an enormous filter, dilated convolutions insert gaps between the filter taps, so each layer skips ahead and the receptive field grows exponentially with depth. Stacking dilations \(1, 2, 4, 8, \dots\) lets a modest stack of layers see thousands of steps back — the core idea behind WaveNet for raw-audio generation and Temporal Convolutional Networks (TCNs) for forecasting.
\[\text{receptive field} = 1 + (k - 1)\sum_{\ell} d_\ell\]
In words: the span a stacked dilated conv can “see” grows with the filter size \(k\) and the sum of the per-layer dilation factors \(d_\ell\) — double the dilation each layer and the reach doubles too. Also written: with \(L\) layers, filter size \(k\), and dilations \(d_\ell = 2^{\ell-1}\), the field is \(1 + (k-1)(2^{L}-1)\) — exponential in depth.
Worked example — reach in numbers. With filter size \(k=2\) and three stacked layers at dilations \(1, 2, 4\), the receptive field is \(1 + (2-1)(1+2+4) = 1 + 7 = 8\) steps. Add one more layer at dilation \(8\) and it jumps to \(1 + (1)(1+2+4+8) = 16\). Each extra layer doubles the reach — so just 10 layers already see past \(1000\) steps, something a vanilla RNN could never hold in memory.
import torch.nn as nn
# causal-ish dilated 1D conv stack (TCN flavour); pad to keep length, dilate to reach back
conv = nn.Sequential(
nn.Conv1d(64, 64, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
nn.Conv1d(64, 64, kernel_size=3, padding=4, dilation=4), nn.ReLU(),
)
x = torch.randn(8, 64, 200) # (batch, channels, time) — note channels-first for Conv1dSelf-attention and the Transformer. The other escape is to drop recurrence entirely and let every position look directly at every other position in one parallel operation — self-attention. This solves the long-range problem (any two positions are one hop apart, not \(T\) steps apart) and the parallelism problem (all positions computed at once). The cost is quadratic in sequence length and a loss of the built-in notion of order, which is restored with positional encodings. This is the Transformer, and it is the direct sequel to the fixed-context bottleneck of §16.6 — covered in full in Chapter 17.
| Approach | Parallel over time? | Long-range reach | Built-in order? |
|---|---|---|---|
| RNN / LSTM / GRU | no (sequential) | good (gated) | yes (recurrence) |
| Dilated 1D conv (TCN) | yes | grows with depth | yes (local windows) |
| Self-attention (Transformer) | yes | direct, any-to-any | no — added via positional encoding |
When to still use an RNN: truly streaming/online settings (decide at each step with low latency), very long sequences where attention’s quadratic cost bites, and small-data regimes where a compact GRU generalizes better than a data-hungry Transformer. Recurrence is not obsolete — it is a specialist.
16.10 — Worked End-to-End: A Character-Level Language Model
To tie the chapter together, here is the smallest complete recurrent model that does something real: predict the next character of text. It is the canonical “hello world” of RNNs, and every piece below maps to a section above — embedding, a gated cell (§16.3), a many-to-many readout (§16.2), and sampling for generation (§16.7).
The task: given the characters seen so far, output a distribution over the next character. Train it on any corpus and it learns spelling, then words, then crude grammar — purely from the running hidden state.
import torch, torch.nn as nn, torch.nn.functional as F
class CharLM(nn.Module):
def __init__(self, vocab, emb=64, hid=256, layers=2):
super().__init__()
self.embed = nn.Embedding(vocab, emb)
self.lstm = nn.LSTM(emb, hid, num_layers=layers,
batch_first=True, dropout=0.2) # §16.8: between-layer dropout
self.head = nn.Linear(hid, vocab) # §16.2: one output per step
def forward(self, x, state=None):
e = self.embed(x) # (B, T) -> (B, T, emb)
out, state = self.lstm(e, state) # (B, T, hid)
return self.head(out), state # logits over vocab, per step
# --- one training step (teacher forcing is implicit: targets are inputs shifted by one) ---
model = CharLM(vocab=96)
opt = torch.optim.Adam(model.parameters(), lr=3e-3)
x, y = batch[:, :-1], batch[:, 1:] # predict char t+1 from chars up to t
logits, _ = model(x)
loss = F.cross_entropy(logits.reshape(-1, 96), y.reshape(-1))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0) # §16.2: tame exploding grads
opt.step(); opt.zero_grad()
# --- generation: feed back the model's own samples, carrying the state forward ---
@torch.no_grad()
def generate(model, start_ids, steps=200, temperature=0.8):
ids, state = start_ids, None
out = list(start_ids[0].tolist())
for _ in range(steps):
logits, state = model(ids, state) # carry hidden state across steps
probs = F.softmax(logits[:, -1] / temperature, -1)
nxt = torch.multinomial(probs, 1) # §16.7: temperature sampling
out.append(nxt.item()); ids = nxt
return outThree things to notice, each a callback to the chapter. Training uses teacher forcing for free — the target sequence is just the input shifted by one, so the loss at step \(t\) always conditions on the true prefix. Generation, by contrast, feeds the model’s own samples back in (the exposure-bias gap of §16.6), which is exactly why temperature matters there and not in training. And the hidden state is threaded out of generate and back in on every call — the same running memory of §16.1, now used at inference time to keep the model coherent across hundreds of characters.
Why this still teaches the field: scale this exact recipe up — bigger hidden size, more data, and (the one architectural swap) self-attention in place of the LSTM — and you have the skeleton of a modern language model. The training loop, the shifted targets, the sampling-with-temperature at generation time are unchanged. RNNs are where every one of those ideas was first made to work.
16.11 — Quick reference
| Term / formula | Meaning in one line | When / why it matters |
|---|---|---|
| Hidden state \(h_t = \tanh(W_{xh}x_t + W_{hh}h_{t-1} + b)\) | running memory updated by shared weights each step | the core RNN recurrence; same weights at every \(t\) |
| BPTT, \(\nabla_{W_{hh}}L = \sum_t \nabla_{W_{hh}}L_t\) | unroll, then sum the shared weight’s gradient over all steps | how RNNs train; source of vanishing/exploding gradients |
| Vanishing / exploding gradient | factor \(\approx w^T\) shrinks to 0 or blows up over \(T\) steps | why plain RNNs forget past ~10–20 steps |
| Truncated BPTT(\(k\)) | carry \(h\) forward (detached), backprop only \(k\) steps | train on long streams without exploding memory |
| Gradient clipping \(g \leftarrow g\cdot\tfrac{\tau}{\max(\|g\|,\tau)}\) | cap the gradient norm at \(\tau\), keep direction | the standard fix for exploding gradients |
| LSTM cell \(c_t = f_t\odot c_{t-1} + i_t\odot\tilde c_t\) | additive memory with forget/input/output gates | long-range memory; \(f_t{=}1,i_t{=}0\) copies a fact verbatim |
| Forget-gate bias \(b_f \approx 1\)–\(2\) | start gates near “remember” | faster convergence on long-dependency tasks |
| GRU \(h_t = (1-z_t)h_{t-1} + z_t\tilde h_t\) | one update gate \(z_t\) toggles keep vs. overwrite | lighter LSTM; strong default, fewer params |
| Bidirectional \(h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]\) | concatenate forward + backward readers | offline tasks (NER, POS) where future helps |
| Stacked (deep) RNN | layer \(\ell\)’s output sequence feeds layer \(\ell{+}1\) | builds temporal hierarchy; 2–3 layers typical |
| Seq2seq + context vector \(c\) | encoder compresses input → decoder unpacks output | variable-length mapping (translation); \(c\) is the bottleneck |
| Teacher forcing | feed ground-truth previous token while training | fast training; mind the exposure-bias gap at test |
| Beam search, \(\sum_i \log p(y_i\mid y_{<i},x)\) | keep top-\(k\) partial sequences by log-prob | one-right-answer decoding (translation, ASR) |
| Temperature / nucleus, \(p_i = \tfrac{e^{z_i/T}}{\sum_j e^{z_j/T}}\) | reshape distribution before sampling | open-ended generation (dialogue, stories) |
| Variational (locked) dropout | one dropout mask reused across all steps | safe regularizer for the recurrent path |
| Layer norm | normalize across features of one step | RNN/Transformer default; batch-size independent |
| Dilated 1D conv, field \(=1+(k-1)\sum_\ell d_\ell\) | gaps in the filter grow reach exponentially | parallel, long-reach alternative (WaveNet, TCN) |
| Self-attention | every position attends to every other in parallel | removes the bottleneck; the Transformer (Ch. 17) |
16.12 — Key takeaways
- An RNN carries a hidden state updated by the same weights at every step; unrolled, it is a deep weight-shared network trained by backpropagation through time.
- Vanilla RNNs suffer vanishing/exploding gradients — in the scalar case the recurrent factor raised to the sequence length — so they forget dependencies beyond ~10–20 steps. Vanishing is fixed by architecture (gated cells); exploding is fixed by training (gradient clipping).
- Sequence tasks come in four shapes (one-to-many, many-to-one, many-to-many aligned, seq2seq); the cell is identical — only which states you read out and feed back changes. Long streams are trained with truncated BPTT, carrying the hidden state forward but backpropagating only within a chunk.
- The LSTM adds a cell state with forget/input/output gates; with \(f_t=1, i_t=0\) a memory is copied verbatim, and that additive path lets gradients survive across hundreds of steps. A small positive forget-gate bias makes it default to remembering.
- The GRU is a lighter LSTM with two gates and no separate cell state; its update gate \(z_t\) alone toggles between keep (\(z_t\!\approx\!0\)) and overwrite (\(z_t\!\approx\!1\)) — a strong, cheaper default. Let validation, not gate count, choose between them.
- Bidirectional RNNs read both directions (offline only); deep/stacked RNNs build temporal hierarchy. A stacked BiLSTM was the pre-Transformer NLP workhorse.
- Seq2seq encoder–decoder maps variable-length input to variable-length output (trained with teacher forcing), but its single fixed-context vector is a bottleneck on long inputs — the problem that motivated attention.
- Decoding turns per-step distributions into sequences: greedy and beam search for one-right-answer tasks, temperature + nucleus sampling for open-ended generation.
- Regularize recurrent nets with variational/locked dropout on the recurrent path (or plain dropout between stacked layers) and layer norm rather than batch norm.
- Recurrence is sequential and slow; dilated 1D convolutions (WaveNet, TCN) and self-attention (Transformers) parallelize over time and reach further, leaving RNNs a specialist tool for streaming, very long, or small-data settings.
16.13 — See also
- Neural Networks (Core) — backpropagation, activations, and the feedforward machinery RNNs unroll into.
- Convolutional Neural Networks — the spatial counterpart to weight sharing across time, and the source of the 1D/dilated convolutions used for sequences.
- Attention & Transformers — how attention removes the fixed-context bottleneck and supersedes recurrence.
- Natural Language Processing — tagging, translation, and the tasks RNNs and seq2seq were built for.
- Large Language Models — where the Transformer successors to these models now live.
- Time Series & Forecasting — applying recurrent and sequence models to numeric temporal data.
- Speech & Audio Processing — streaming vs. offline recognition, a classic BiRNN/seq2seq domain.
↪ The thread continues → Chapter 17 · ⚡ Attention & Transformers
Recurrent nets pass information one step at a time and forget. The fix — letting every position look directly at every other — is attention, and the Transformer it powers is the spine of all that follows.
📖 All chapters | ← 15 · 🖼️ Convolutional Neural Networks | 17 · ⚡ Attention & Transformers →