Chapter 14 — 🧠 Neural Networks (Core)

Neural networks are the workhorse of modern AI: layered functions that learn, from examples, how to turn raw inputs into useful outputs. They take the linear models of classical ML and stack them with nonlinearities, so a single architecture can learn to recognize digits, translate text, or steer a car. This chapter is the foundation for everything in the Deep Learning part — convolutions, recurrence, and attention are all special wirings of the ideas here.

🧭 In context: Deep Learning · how to build and train a function that learns arbitrary input→output mappings from data · the one key idea is compose linear layers with nonlinear activations, then fit the weights by gradient descent on a loss.

💡 Remember this: A neural network is just linear layers stacked with nonlinear activations, trained by computing the gradient of a loss with backpropagation and stepping the weights downhill.

14.1 — Multilayer perceptron / feedforward networks

A multilayer perceptron (MLP), also called a feedforward network, is the simplest deep network: data flows in one direction, input → hidden layers → output, with no loops. Think of it as a stack of logistic-regression-like units where each layer’s outputs become the next layer’s inputs.

Each layer is a set of neurons (units). A neuron computes a weighted sum of its inputs plus a bias, then passes that through a nonlinear activation function. The reason we need many layers is representation: early layers learn simple features (edges, word fragments), later layers combine them into abstract concepts (faces, sentiment). Without a nonlinearity, one layer is just a linear model, and so is any stack of them; stacking with nonlinearity is what gives the network its expressive power. The universal approximation theorem makes this precise: a single hidden layer with enough units can approximate any continuous function on a bounded input region to arbitrary accuracy. But “enough” can mean astronomically many units, which is why in practice depth is often far more efficient than width. Deep networks reuse features across layers; a shallow one would have to enumerate every combination.

A single neuron, opened up, is just a dot product followed by a squashing function — the same shape as logistic regression. Picture it as a little voting machine: each input arrives weighted, the votes are summed, a bias tips the scale, and the activation decides how loudly the neuron fires. The doodle below pulses to show a neuron gathering its inputs and “firing”:

A layer is fully described by a weight matrix \(W\) and bias vector \(b\). If layer \(\ell\) has \(n_\ell\) neurons and the previous layer has \(n_{\ell-1}\) outputs, then \(W^{[\ell]}\) has shape \(n_\ell \times n_{\ell-1}\) and \(b^{[\ell]}\) has length \(n_\ell\).

A single neuron computes \(a = g\!\left(\sum_{j} w_j x_j + b\right)\).

In words: multiply each input by its own weight, add them all up, nudge by a bias, then squash the result through the activation. Also written: \(a = g(\mathbf{w}^\top \mathbf{x} + b)\) — the sum is a dot product of the weight vector and the input vector.
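As a minimal sketch of the formula above, the snippet below computes one neuron's output in NumPy. The helper name `neuron` and the input, weight, and bias values are illustrative assumptions, chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

def neuron(x, w, b, g):
    """One neuron: weighted sum of inputs plus bias, then activation g."""
    z = np.dot(w, x) + b      # pre-activation: w·x + b
    return g(z)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Illustrative values, not taken from the text
x = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.25, 0.0])
b = 0.0

a = neuron(x, w, b, sigmoid)   # z = 0.5 - 0.5 + 0 = 0, so a = sigmoid(0) = 0.5
print(a)
```

Swapping `sigmoid` for another activation changes only the last step; the weighted sum is the same for every neuron type in this chapter.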

Here is a tiny MLP with 3 inputs, one hidden layer of 4 units, and 2 outputs:

The dataflow through such a network is a simple chain of layers, each transforming the representation a little more:

flowchart LR
  X["x (3 features)"] --> H["hidden layer<br/>4 ReLU units"]
  H --> O["output layer<br/>2 units"]
  O --> Y["ŷ"]

Tip

Count parameters as you design: a layer mapping \(m\to n\) has \(n\times m\) weights plus \(n\) biases. The MLP above has \((4\cdot3+4) + (2\cdot4+2) = 16 + 10 = 26\) trainable numbers. Knowing this keeps your model from silently exploding in size.

In a real framework you rarely wire neurons by hand; you declare layers and let the library size the matrices for you:

import torch.nn as nn
mlp = nn.Sequential(
    nn.Linear(3, 4),   # input 3 → hidden 4  (12 weights + 4 biases)
    nn.ReLU(),
    nn.Linear(4, 2),   # hidden 4 → output 2 (8 weights + 2 biases)
)
print(sum(p.numel() for p in mlp.parameters()))   # -> 26, matches the count above

14.2 — Forward propagation

Forward propagation (the forward pass) is the act of feeding an input through the network to produce an output. For each layer you do two things: a linear step and a nonlinear step.

The linear pre-activation is \(z^{[\ell]} = W^{[\ell]} a^{[\ell-1]} + b^{[\ell]}\), and the activation is \(a^{[\ell]} = g(z^{[\ell]})\), where \(g\) is the activation function and \(a^{[0]} = x\) is the input. You repeat this layer by layer until you reach the output. That’s the whole forward pass — a chain of matrix multiplies interleaved with elementwise nonlinearities.

In words: to get a layer’s pre-activation, multiply the previous layer’s outputs by this layer’s weights and add the bias; then squash elementwise to get the activation. Also written: \(a^{[\ell]} = g\!\big(W^{[\ell]} a^{[\ell-1]} + b^{[\ell]}\big)\) — substituting \(z\) back in so the whole layer is one composed function.

Worked example. Take one input \(x = [1, 2]\), a hidden layer with

\[W^{[1]} = \begin{bmatrix} 0.1 & 0.3 \\ -0.2 & 0.4 \end{bmatrix}, \quad b^{[1]} = \begin{bmatrix} 0 \\ 0.1 \end{bmatrix}, \quad g = \text{ReLU}.\]

Then \(z^{[1]} = W^{[1]}x + b^{[1]} = [0.1\cdot1 + 0.3\cdot2,\; -0.2\cdot1 + 0.4\cdot2 + 0.1] = [0.7,\ 0.7]\), and \(a^{[1]} = \text{ReLU}([0.7, 0.7]) = [0.7, 0.7]\). That vector feeds the next layer, and so on.

import numpy as np
def forward(x, params):                 # params: list of (W, b, g)
    a = x
    for W, b, g in params:
        z = W @ a + b                    # linear: weighted sum + bias
        a = g(z)                         # nonlinear activation
    return a
relu = lambda z: np.maximum(0, z)
W1, b1 = np.array([[.1,.3],[-.2,.4]]), np.array([0,.1])
print(forward(np.array([1,2.]), [(W1,b1,relu)]))   # -> [0.7 0.7]

flowchart LR
  X[input x] --> Z1["z¹ = W¹x + b¹"] --> A1["a¹ = g(z¹)"]
  A1 --> Z2["z² = W²a¹ + b²"] --> A2["a² = ŷ"]

Tip

In code, run a whole batch at once by stacking examples as rows of a matrix \(X\) and computing \(Z = XW^\top + b\). The same math vectorizes; you just gain a batch dimension. This is why GPUs, which excel at big matrix multiplies, make neural nets fast.
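A small sketch of the batched form, assuming the weights from the worked example and two extra made-up input rows. It checks that \(Z = XW^\top + b\) gives the same answer as looping over the examples one at a time.

```python
import numpy as np

relu = lambda z: np.maximum(0, z)

W1 = np.array([[0.1, 0.3], [-0.2, 0.4]])   # shape (n_out=2, n_in=2), from the worked example
b1 = np.array([0.0, 0.1])

# Three examples stacked as rows; the first is the worked example's x = [1, 2],
# the other two are illustrative
X = np.array([[1.0, 2.0],
              [0.0, 0.0],
              [-1.0, 1.0]])

Z = X @ W1.T + b1          # (3, 2) @ (2, 2) + (2,) -> (3, 2); b1 broadcasts over rows
A = relu(Z)

# Same result as processing each example separately
A_loop = np.stack([relu(W1 @ x + b1) for x in X])
print(A)
print(np.allclose(A, A_loop))
```

The transpose appears because each row of \(X\) is one example, while each row of \(W\) is one neuron's weights.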

The same forward pass in PyTorch is a single call — the framework runs it on CPU or GPU identically:

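import torch
import torch.nn as nn
model = nn.Sequential(nn.Linear(2, 2), nn.ReLU())   # 2 inputs → 2 ReLU units, matching the worked example
with torch.no_grad():                               # inference only; on a real model call model.eval() first
    model[0].weight.copy_(torch.tensor([[.1, .3], [-.2, .4]]))   # load W¹ from the worked example
    model[0].bias.copy_(torch.tensor([0., .1]))                  # load b¹
    x = torch.tensor([[1., 2.]])                    # shape (batch=1, features=2)
    yhat = model(x)                                 # values [[0.7, 0.7]], matching the NumPy result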

14.3 — Backpropagation

Training means adjusting the weights to reduce the loss. To do that we need the gradient of the loss with respect to every weight. Backpropagation is the algorithm that computes those gradients efficiently by applying the chain rule of calculus, working backward from the output to the input. The intuition: each weight’s influence on the loss flows through the layers above it, so we compute the error at the output, then pass it back layer by layer, multiplying by local derivatives.

A useful mental picture: think of blame flowing backward. The output made an error; each layer asks “how much of that error is my fault?” and answers by multiplying the blame coming from above by its own local sensitivity. Backprop is just bookkeeping that reuses the blame already computed at higher layers instead of recomputing it for every weight from scratch.

Give the blame a name: \(\delta^{[\ell]} = \partial L / \partial z^{[\ell]}\) is “how much the loss cares about layer \(\ell\)’s pre-activation.” The whole algorithm is then three short rules:

Start the blame at the output. \(\delta^{[L]} = \dfrac{\partial L}{\partial a^{[L]}} \odot g'(z^{[L]})\) — how wrong the final answer was, scaled by how steep the output activation was.
Pass the blame down a layer. \(\delta^{[\ell]} = (W^{[\ell+1]\top}\delta^{[\ell+1]}) \odot g'(z^{[\ell]})\) — pull the blame above back through the weights, then scale by this layer’s slope.
Read off the weight gradients. \(\dfrac{\partial L}{\partial W^{[\ell]}} = \delta^{[\ell]}\,a^{[\ell-1]\top}\) and \(\dfrac{\partial L}{\partial b^{[\ell]}} = \delta^{[\ell]}\) — a weight’s gradient is its incoming blame times the input that fed it.

The animation shows the blame flowing right-to-left, brightening each layer in turn:

In words: the error at a layer is the error from the layer above, pulled back through that layer’s weights and then scaled by how steep this layer’s activation was; a weight’s gradient is its incoming error times the input that fed it. Also written: \(\delta^{[\ell]} = \big(W^{[\ell+1]\top}\delta^{[\ell+1]}\big)\odot g'\!\big(z^{[\ell]}\big)\) and \(\nabla_{W^{[\ell]}} L = \delta^{[\ell]}\,(a^{[\ell-1]})^\top\) (outer product form).
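The three rules can be written out directly in NumPy. This is a sketch under stated assumptions: a made-up two-layer network (tanh hidden layer, identity output, squared-error loss) with random weights, plus a finite-difference check on one weight to confirm that the analytic gradient agrees.

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=2)          # one illustrative example
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)          # layer 1: 3 -> 4, tanh
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)          # layer 2: 4 -> 2, identity

def loss_fn(W1, b1, W2, b2):
    a1 = np.tanh(W1 @ x + b1)
    a2 = W2 @ a1 + b2
    return 0.5 * np.sum((a2 - y) ** 2)

# Forward pass, caching what the backward pass needs
z1 = W1 @ x + b1
a1 = np.tanh(z1)
a2 = W2 @ a1 + b2

# Rule 1: start the blame at the output (identity activation, so g' = 1)
delta2 = (a2 - y) * 1.0
# Rule 2: pass the blame down a layer (tanh' = 1 - tanh^2)
delta1 = (W2.T @ delta2) * (1.0 - a1 ** 2)
# Rule 3: read off the gradients; weights get an outer product, biases get delta
dW2, db2 = np.outer(delta2, a1), delta2
dW1, db1 = np.outer(delta1, x), delta1

# Finite-difference check on one entry of W1
eps = 1e-6
Wp, Wm = W1.copy(), W1.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (loss_fn(Wp, b1, W2, b2) - loss_fn(Wm, b1, W2, b2)) / (2 * eps)
print(abs(numeric - dW1[0, 0]))   # should be tiny
```

Note that `delta2` is computed once and reused for both `dW2` and `delta1`, which is the reuse the text describes.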

Tiny worked example. A 1-input, 1-hidden, 1-output network, all scalars, identity activations: \(z_1 = w_1 x\), \(a_1 = z_1\), \(\hat y = w_2 a_1\), loss \(L = \tfrac12(\hat y - y)^2\). Let \(x=2,\ w_1=0.5,\ w_2=1.5,\ y=3\). Forward: \(a_1 = 1.0\), \(\hat y = 1.5\). The error is \(\hat y - y = -1.5\).

By the chain rule, \(\partial L/\partial w_2 = (\hat y - y)\cdot a_1 = -1.5 \cdot 1.0 = -1.5\), and \(\partial L/\partial w_1 = (\hat y - y)\cdot w_2 \cdot x = -1.5 \cdot 1.5 \cdot 2 = -4.5\). Notice how the same error \(-1.5\) is reused and multiplied by local terms as it flows back — that reuse is exactly why backprop is fast (one backward pass, not one per weight).

We can verify that \(\partial L/\partial w_2 = -1.5\) numerically by nudging \(w_2\) and watching the loss:

import numpy as np
x, w1, w2, y = 2., .5, 1.5, 3.
def loss(w1, w2):
    a1 = w1 * x                 # forward pass
    yhat = w2 * a1
    return .5 * (yhat - y)**2
eps = 1e-6                      # finite-difference gradient check
num = (loss(w1, w2+eps) - loss(w1, w2-eps)) / (2*eps)
print(round(num, 4))           # -> -1.5  (matches the analytic value)

flowchart RL
  L["∂L/∂ŷ = ŷ - y"] --> D2["δ² → ∂L/∂W²"]
  D2 --> D1["δ¹ = Wᵀδ² ⊙ g'(z¹)"]
  D1 --> DW1["∂L/∂W¹"]

In a framework you never write these rules out; automatic differentiation records the forward operations and replays them backward. The same tiny example, computed by PyTorch’s autograd:

import torch
x = torch.tensor(2.)
w1 = torch.tensor(.5, requires_grad=True)
w2 = torch.tensor(1.5, requires_grad=True)
y  = torch.tensor(3.)
loss = .5 * (w2 * (w1 * x) - y)**2
loss.backward()                 # autograd runs backprop for you
print(w1.grad.item(), w2.grad.item())   # -> -4.5 -1.5

Warning

Backprop needs the activations \(a^{[\ell]}\) from the forward pass to compute gradients, so a framework caches them. This is why training uses far more memory than inference — every intermediate tensor is held until the backward pass consumes it.

14.4 — Computational graphs & automatic differentiation

Backprop felt magical in §14.3 — how does a framework “know” the chain rule for an arbitrary network you typed out? The answer is the computational graph: every operation you write (a multiply, an add, a ReLU) becomes a node, and the tensors flowing between them become edges. Forward propagation walks the graph left-to-right computing values; backprop walks it right-to-left, and at each node it only needs to know one thing — that node’s local derivative. The chain rule then stitches the local pieces into the full gradient automatically. You never derive anything by hand; you just build the graph by writing normal code, and the framework differentiates it.

The intuition: think of the graph as a factory assembly line. Going forward, each station adds a part (computes its output). Going backward, each station is handed “how much the final product cares about my output” and multiplies in “how much my output cares about my inputs” — that product is what it passes upstream. Every station only needs the rule for its own little operation; nobody needs the blueprint for the whole line.

Here is the graph for the tiny network \(L = \tfrac12(w_2\,(w_1 x) - y)^2\) from §14.3 — the same example, drawn as nodes:

Frameworks come in two flavors of graph. A static graph (older TensorFlow 1.x) is defined once, then executed many times — fast, but awkward to debug because the code that builds the graph runs separately from the code that runs it. A dynamic graph (PyTorch, JAX, TF 2.x eager) is built on the fly each forward pass, so the graph can change every iteration (handy for variable-length inputs) and you can inspect intermediate values with an ordinary print. This mode is also called define-by-run.

The mode of differentiation backprop uses is reverse-mode autodiff: start from the single scalar loss and sweep backward to all parameters at once. That direction is the whole reason training is affordable: one backward sweep yields the gradient for millions of weights at a cost that is a small constant multiple of one forward pass. (The opposite, forward-mode, would need one sweep per parameter. That is fine for a few inputs and many outputs, useless when you have one loss and a billion weights.)

In words: record what operations produced the loss; to differentiate, replay them in reverse, multiplying each node’s local derivative into the running gradient. Also written: for a chain \(L = f_n(\dots f_2(f_1(w)))\), reverse-mode computes \(\frac{\partial L}{\partial w} = \prod_{k} f_k'\) by accumulating the product from \(f_n\) down to \(f_1\) — exactly the \(\delta\) recursion of §14.3.
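To make "record, then replay in reverse" concrete, here is a toy reverse-mode engine in plain Python. The `Value` class is a hypothetical helper written for this illustration (it supports only multiply and subtract) and is not how any particular framework is implemented. It is run on the tiny network from §14.3.

```python
class Value:
    """A scalar that remembers how it was computed (toy illustration)."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # input nodes of the op that produced this value
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def __sub__(self, other):
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def backward(self):
        # Topologically order the recorded graph, then sweep it in reverse
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v.parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0                 # dL/dL
        for v in reversed(order):
            for p, g in zip(v.parents, v.local_grads):
                p.grad += v.grad * g    # chain rule: upstream gradient times local derivative

x, w1, w2, y, half = Value(2.0), Value(0.5), Value(1.5), Value(3.0), Value(0.5)
err = w2 * (w1 * x) - y
loss = half * (err * err)
loss.backward()
print(w1.grad, w2.grad)   # -4.5 -1.5
```

Each operation stores only its own local derivatives; the backward sweep multiplies them together, which is the product in the formula above.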

You can watch the graph being built and traversed. PyTorch tags each result tensor with the function that created it (grad_fn), and backward() walks those tags:

import torch
x = torch.tensor(2.0)
w1 = torch.tensor(0.5, requires_grad=True)
w2 = torch.tensor(1.5, requires_grad=True)
a1 = x * w1            # node: MulBackward
yhat = a1 * w2         # node: MulBackward
loss = 0.5 * (yhat - 3.0)**2
print(loss.grad_fn)    # <MulBackward0 ...> — the graph remembers how loss was made
loss.backward()        # reverse-mode sweep over the recorded graph
print(w1.grad, w2.grad)         # tensor(-4.5000) tensor(-1.5000)

Tip

Two everyday consequences of the graph: call optimizer.zero_grad() each step or gradients accumulate into the graph’s leaves (sometimes you want that — e.g. simulating a big batch); and wrap pure inference in with torch.no_grad(): so the framework skips building the graph and saves memory.

14.5 — Activation functions

The activation function \(g\) is the nonlinearity applied after each weighted sum. Without it, stacking layers collapses into a single linear map: \(W_2(W_1 x) = (W_2 W_1)x\) is still linear, so the network could never learn a curved decision boundary. Nonlinearity is what makes depth meaningful.

The common choices, with their shapes:

Function	Formula	Range	Notes
Sigmoid	\(1/(1+e^{-z})\)	(0,1)	smooth, but saturates → vanishing gradients
Tanh	\((e^z-e^{-z})/(e^z+e^{-z})\)	(−1,1)	zero-centered, still saturates
ReLU	\(\max(0,z)\)	[0,∞)	cheap, default; can “die” at 0
Leaky ReLU	\(\max(\alpha z, z)\)	(−∞,∞)	small slope \(\alpha\) keeps dead units alive
GELU	\(z\,\Phi(z)\)	≈(−0.17,∞)	smooth ReLU, standard in Transformers

Reading the curves: sigmoid and tanh both saturate. Their tails flatten, so the derivative there is nearly zero and gradients stall. ReLU is a simple hinge: it passes positives unchanged and zeroes negatives, which is cheap and keeps a gradient of exactly 1 for active units. Leaky ReLU and GELU soften ReLU’s hard zero, so a unit with a negative input can still receive some gradient and is much less likely to die.

import numpy as np
sigmoid = lambda z: 1/(1+np.exp(-z))
tanh    = np.tanh
relu    = lambda z: np.maximum(0, z)
leaky   = lambda z, a=0.01: np.where(z>0, z, a*z)
gelu    = lambda z: 0.5*z*(1+np.tanh(np.sqrt(2/np.pi)*(z+0.044715*z**3)))

Tip

Default to ReLU for hidden layers in plain MLPs and CNNs; reach for GELU in Transformers. Save sigmoid for a single output that must be a probability, not for hidden layers.

14.5.1 — Why ReLU “dies,” and the modern menu

The everyday way to picture a dead ReLU: a light switch stuck in the off position. If a unit’s weighted sum lands in the negative zone for every training example, ReLU outputs 0 and, crucially, its gradient is also 0, so gradient descent has no signal with which to push it back on. A bad batch or an overly aggressive learning rate can flip a chunk of units off permanently, and they contribute nothing for the rest of training.
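A small sketch of a dead unit, under made-up assumptions: non-negative inputs, and weights that have all been pushed negative so the pre-activation is below zero for every example. It compares the weight gradient under ReLU with the gradient under a leaky variant (slope 0.01).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 3))   # illustrative inputs, all non-negative
w = np.array([-1.0, -2.0, -0.5])           # weights pushed negative, e.g. by one bad update
b = -0.1

z = X @ w + b                              # pre-activation is negative for every example
upstream = np.ones(100)                    # pretend dL/da = 1 for each example

# ReLU: local derivative is 1 where z > 0, else 0
relu_grad_w = ((z > 0) * upstream) @ X
# Leaky ReLU: slope alpha on the negative side instead of 0
alpha = 0.01
leaky_grad_w = (np.where(z > 0, 1.0, alpha) * upstream) @ X

print(z.max() < 0)        # True: the unit is off for all inputs
print(relu_grad_w)        # all zeros: no gradient, so no update can revive it
print(leaky_grad_w)       # small but nonzero
```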

The cures are exactly the soft-bottom variants. Leaky ReLU lets a trickle of gradient through on the negative side (\(\alpha z\) with \(\alpha\approx0.01\)). PReLU learns that slope per channel instead of fixing it. ELU and SiLU/Swish (\(z\cdot\sigma(z)\)) curve smoothly through zero, and GELU — the de-facto choice in Transformers — weights the input by the probability a standard normal is below it. The practical menu:

Activation	One-line intuition	Use it when
ReLU	hard hinge, gradient 0 or 1	default for MLP/CNN hidden layers
Leaky / PReLU	ReLU with a leaky bottom	many units are dying
GELU	“soft, probabilistic ReLU”	Transformers, modern NLP
SiLU / Swish	\(z\cdot\sigma(z)\), smooth self-gate	EfficientNet, some vision nets
Sigmoid / Tanh	squash to (0,1)/(−1,1)	output gates, final probability

import torch
import torch.nn.functional as F
z = torch.linspace(-3, 3, 7)     # example pre-activations to feed each function
# framework activations: fused, numerically careful implementations
F.relu(z); F.leaky_relu(z, 0.01); F.gelu(z); F.silu(z)   # pick per the table

14.6 — Softmax

When the output is a choice among \(K\) classes, the network produces \(K\) raw scores called logits. Softmax turns those logits into a probability distribution — all positive, summing to 1 — so the largest logit becomes the most likely class while keeping the others in proportion.

Think of softmax as a “soft” version of just picking the maximum: instead of awarding all the probability to the single biggest logit, it hands out probability in proportion to how big each logit is after exponentiating, so a close runner-up still gets a meaningful share.

The formula for class \(i\) is

\[\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}.\]

In words: exponentiate every logit to make it positive, then divide each by the total so they sum to 1. Also written: \(\text{softmax}(z)_i = \dfrac{\exp(z_i)}{\mathbf{1}^\top \exp(z)}\) — the denominator is the sum of the elementwise-exponentiated vector.

Worked example. Logits \(z = [2.0,\ 1.0,\ 0.1]\). Exponentiate: \([7.39,\ 2.72,\ 1.11]\), sum \(= 11.22\). Divide: \([0.659,\ 0.242,\ 0.099]\) — a valid distribution, peaked on class 0.

Where it shows up: every time a language model picks the next word, a softmax over the whole vocabulary turns the model’s raw scores into “probability of each possible next token” — and the temperature knob you see in chat APIs simply divides the logits before this softmax, flattening or sharpening that distribution to make outputs more random or more focused.

import numpy as np
def softmax(z):
    z = z - np.max(z)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()
print(np.round(softmax(np.array([2.0, 1.0, 0.1])), 3))   # [0.659 0.242 0.099]

Warning

\(e^{z}\) overflows for large logits. Always subtract \(\max(z)\) first — it shifts every term equally and leaves the ratios unchanged, but keeps the exponentials in a safe range. Every production softmax does this.
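The temperature knob mentioned earlier in this section can be sketched in a few lines. The function name `softmax_t` and the temperature values are illustrative assumptions; the logits are the ones from the worked example. The last lines also exercise the max-subtraction trick on logits large enough to overflow a naive exponential.

```python
import numpy as np

def softmax_t(z, temperature=1.0):
    """Softmax of z / temperature, with the max subtracted for stability."""
    z = np.asarray(z, dtype=float) / temperature
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
for t in (0.5, 1.0, 2.0):                 # low temperature sharpens, high flattens
    print(t, np.round(softmax_t(logits, t), 3))

# A naive np.exp(1000.0) overflows to inf; the shifted version stays finite
big = softmax_t(np.array([1000.0, 999.0, 998.0]))
print(big)
```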

14.7 — Weight initialization

Before training, weights need starting values. The naive choice — all zeros — fails catastrophically: every neuron in a layer then computes the identical output and receives the identical gradient, so they update in lockstep and stay identical forever. This is the symmetry problem; the cure is small random weights to break it.

But “small random” needs the right scale. Too large and activations blow up; too small and signals vanish as they pass through layers. The fix is to scale the variance by the layer’s width (fan-in \(n_{in}\), the number of inputs). Xavier/Glorot initialization suits sigmoid/tanh and uses variance \(1/n_{in}\) (or \(2/(n_{in}+n_{out})\)); He initialization suits ReLU and uses variance \(2/n_{in}\), compensating for ReLU zeroing half its inputs.

\[W \sim \mathcal{N}\!\left(0,\ \frac{2}{n_{in}}\right) \quad \text{(He, for ReLU)}\]

In words: draw each weight from a bell curve centered at zero whose spread shrinks as the layer gets wider, so the total signal entering a neuron stays a steady size. Also written: \(W_{ij} \sim \mathcal{N}(0,\sigma^2)\) with \(\sigma = \sqrt{2/n_{in}}\) — the standard-deviation form of the same rule.

Worked example. Push a random signal through 50 layers and watch the activation scale. With He init the standard deviation stays near 1; halve the recommended scale and it decays toward zero (vanishing), double it and it explodes:

import numpy as np
x = np.random.randn(256)
for scale in [1.0, 0.5, 2.0]:               # multiplier on the He std
    a = x.copy()
    for _ in range(50):                     # 50 ReLU layers
        W = np.random.randn(256, 256) * np.sqrt(2/256) * scale
        a = np.maximum(0, W @ a)
    print(scale, round(float(a.std()), 4))  # 1.0→~O(1), 0.5→~0, 2.0→huge

Tip

Rule of thumb: ReLU networks → He, tanh/sigmoid → Xavier. Biases start at zero (no symmetry issue there). Modern frameworks pick a sensible default, but a wrong init can stall training entirely.

In PyTorch you set the scheme explicitly when the default doesn’t match your activation:

import torch.nn as nn
layer = nn.Linear(256, 256)
nn.init.kaiming_normal_(layer.weight, nonlinearity="relu")  # He init
nn.init.zeros_(layer.bias)                                  # biases at 0

14.8 — Loss / cost functions

The loss function measures how wrong a single prediction is; the cost function is the loss averaged over the dataset. Training minimizes it. The right loss depends on the task.

For regression (predicting a number), use mean squared error (MSE): \(L = \tfrac1N\sum (\hat y - y)^2\). It penalizes large errors quadratically and pairs naturally with a linear output.

In words: average the squared gap between prediction and truth, so being off by 2 hurts four times as much as being off by 1. Also written: \(L = \tfrac1N\,\lVert \hat{\mathbf y} - \mathbf y \rVert_2^2\) — the squared Euclidean distance between the prediction and target vectors, divided by \(N\).

For classification, use cross-entropy, which measures the distance between the predicted probability distribution and the true one. For \(K\) classes with one-hot target \(y\): \(L = -\sum_{i} y_i \log \hat y_i\). Because the truth is one-hot, this reduces to \(-\log \hat y_{\text{correct}}\) — the loss is just the negative log-probability the model assigned to the right answer.

In words: the loss is how surprised the model is by the true label — the lower the probability it gave the correct class, the bigger the penalty. Also written: \(L = -\log \hat y_{c}\) where \(c\) is the true class index — the sum collapses because only the correct class has \(y_i = 1\).

Worked example. True class is 0, model predicts \(\hat y = [0.659, 0.242, 0.099]\) (from §14.6). Cross-entropy \(= -\log(0.659) = 0.417\). If the model had been confident-and-correct (\(\hat y_0 = 0.99\)), loss \(= 0.01\); confident-and-wrong (\(\hat y_0 = 0.01\)) gives \(4.6\) — cross-entropy punishes confident mistakes hard.
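The three numbers in the worked example can be reproduced with the standard library alone. The helper name `cross_entropy` is an assumption, and the probabilities assigned to the wrong classes in the second and third calls are made up, since only \(\hat y_0\) affects the loss.

```python
import math

def cross_entropy(probs, true_class):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[true_class])

print(round(cross_entropy([0.659, 0.242, 0.099], 0), 3))   # 0.417
print(round(cross_entropy([0.99, 0.005, 0.005], 0), 3))    # 0.01  (confident and correct)
print(round(cross_entropy([0.01, 0.5, 0.49], 0), 3))       # 4.605 (confident and wrong)
```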

Task	Output activation	Loss
Regression	linear	MSE
Binary classification	sigmoid	binary cross-entropy
Multi-class	softmax	categorical cross-entropy

import torch.nn as nn
mse  = nn.MSELoss()                 # regression
bce  = nn.BCEWithLogitsLoss()       # binary: fuses sigmoid + cross-entropy
ce   = nn.CrossEntropyLoss()        # multi-class: fuses softmax + cross-entropy
# Note: pass RAW logits to CrossEntropyLoss / BCEWithLogitsLoss — they apply
# softmax/sigmoid internally for numerical stability. Don't softmax twice.

Warning

Don’t pair MSE with softmax for classification: the gradients are weak when the model is badly wrong, because the softmax saturates. Softmax + cross-entropy gives a clean gradient of \(\hat y - y\) with respect to the logits, which is why it’s the standard default for classifiers.
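The \(\hat y - y\) gradient can be checked numerically. This sketch assumes the logits from the softmax worked example with class 0 as the true label, and compares the closed form against central finite differences taken with respect to the logits.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_loss(z, c):
    return -np.log(softmax(z)[c])

z = np.array([2.0, 1.0, 0.1])     # logits from the softmax example
c = 0                             # true class
y = np.eye(3)[c]                  # one-hot target

analytic = softmax(z) - y         # claimed gradient with respect to the logits

eps = 1e-6
numeric = np.zeros(3)
for i in range(3):
    step = np.zeros(3)
    step[i] = eps
    numeric[i] = (ce_loss(z + step, c) - ce_loss(z - step, c)) / (2 * eps)

print(np.round(analytic, 3))
print(np.round(numeric, 3))       # should agree with the line above
```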

14.9 — Gradient descent: the update step

Backprop hands you the gradient; gradient descent is what uses it to actually move the weights. The gradient \(\nabla_w L\) points in the direction of steepest increase of the loss, so to shrink the loss you step the opposite way. That’s the whole training engine in one line.

The intuition everyone reaches for: you’re a hiker in fog trying to reach the valley floor. You can’t see the whole landscape, but you can feel the slope under your feet (the gradient). So you take a small step straight downhill, feel the slope again, step again — repeat until the ground is flat. The learning rate \(\eta\) is your stride length: too short and you crawl, too long and you bound clear over the valley and end up higher than you started.

Here is that hiker as a ball easing down the loss bowl, settling at the minimum:

\[w \leftarrow w - \eta\,\nabla_w L\]

In words: nudge each weight a little in the downhill direction; how far is set by the learning rate. Also written: \(w_{t+1} = w_t - \eta\,\dfrac{\partial L}{\partial w}\Big|_{w_t}\) — the same step, indexed by iteration \(t\).

The learning rate is the single most consequential dial in all of training. The picture below shows the three regimes: too small (slow crawl, may never arrive), just right (smooth descent to the minimum), too large (overshoot, bounce, possibly diverge).

Worked example. One weight \(w = 4.0\), learning rate \(\eta = 0.1\), and a gradient of \(\nabla_w L = 6.0\) at this point. One step: \(w \leftarrow 4.0 - 0.1\cdot6.0 = 3.4\). If the loss were the simple bowl \(L = \tfrac12 w^2\) (so \(\nabla_w L = w\)), repeated steps give \(w \leftarrow (1-\eta)w\) — a geometric decay \(4.0 \to 3.6 \to 3.24 \to \dots\) gliding toward the minimum at \(0\). Bump \(\eta\) past \(2\) and the same rule grows instead: divergence.

import numpy as np
w, eta = 4.0, 0.1
grad = lambda w: w                       # gradient of L = ½w²  (minimum at 0)
for step in range(6):
    w -= eta * grad(w)                   # the update rule, by hand
    print(round(w, 4))                   # 3.6 → 3.24 → 2.916 → ... → toward 0
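The three learning-rate regimes can be reproduced on the same bowl \(L = \tfrac12 w^2\). The helper name `run_gd` and the three \(\eta\) values are illustrative choices: one far too small, one reasonable, and one past the divergence threshold of 2 for this loss.

```python
def run_gd(eta, w0=4.0, steps=20):
    """Gradient descent on L = 0.5 * w**2, whose gradient is w."""
    w = w0
    for _ in range(steps):
        w -= eta * w              # each step multiplies w by (1 - eta)
    return w

for eta in (0.001, 0.5, 2.5):     # too small, reasonable, too large
    print(eta, run_gd(eta))
```

With \(\eta = 0.001\) the weight has barely moved after 20 steps, with \(\eta = 0.5\) it is essentially at the minimum, and with \(\eta = 2.5\) its magnitude grows on every step.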

In a framework the optimizer owns this update; you only supply gradients (via backward()) and call step(). Plain SGD is the rule above; momentum and Adam (covered in Optimization) refine which direction and how far using the history of gradients, but the skeleton is identical:

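import torch
import torch.nn as nn
model = nn.Linear(2, 1)                                   # stand-in model, for illustration only
x, y = torch.randn(8, 2), torch.randn(8, 1)               # stand-in batch of 8 examples
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # plain w ← w - η∇L
for step in range(3):                                     # the training loop
    optimizer.zero_grad()                                 # clear gradients left from the previous step
    loss = nn.functional.mse_loss(model(x), y)            # forward pass and loss
    loss.backward()                                       # backprop fills .grad on every weight
    optimizer.step()                                      # applies w ← w - η·grad to all params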

Tip

If a network won’t train, change the learning rate first, by factors of 10 (1e-2, 1e-3, 1e-4). A badly chosen learning rate is one of the most common causes of training failure, and it is cheaper to rule out than architecture, init, or optimizer choice. The full menu of momentum, Adam, and learning-rate schedules lives in the Optimization chapter.

14.10 — Batch & layer normalization

As training proceeds, the distribution of inputs to a layer keeps shifting (its previous layer’s weights are changing). The original batch normalization paper named this internal covariate shift and argued that it slows learning. Normalization layers counter it by re-centering and re-scaling activations to a stable distribution, then letting the network learn its own preferred scale and shift via two trainable parameters \(\gamma\) (scale) and \(\beta\) (shift).

The mechanism: compute mean \(\mu\) and variance \(\sigma^2\) of the activations, normalize to \(\hat x = (x-\mu)/\sqrt{\sigma^2+\epsilon}\), then output \(\gamma \hat x + \beta\). The difference between the two main variants is what set you average over.

In words: subtract the average and divide by the spread so the numbers sit at zero-mean, unit-scale; then let the network rescale and re-shift them however it likes. Also written: \(y = \gamma\,\dfrac{x-\mu}{\sqrt{\sigma^2+\epsilon}} + \beta\) — one elementwise affine map applied to the standardized activations.

Batch normalization computes \(\mu,\sigma^2\) across the batch dimension (per feature, over all examples in the mini-batch). Great for CNNs, but it depends on batch size and behaves differently at train vs. test time (it uses running averages at inference).
Layer normalization computes \(\mu,\sigma^2\) across the feature dimension (per example, over its own activations). It’s independent of batch size and identical at train and test time, which is why Transformers and RNNs use it.

The shaded band shows what each method averages over: BatchNorm pools one feature down the whole batch (a column), LayerNorm pools all features of one example (a row).

Worked example. One example’s activations \(x = [1, 2, 3, 4]\) under LayerNorm. Mean \(\mu = 2.5\), variance \(\sigma^2 = 1.25\), \(\sqrt{\sigma^2} = 1.118\). Normalized: \(\hat x = [-1.34, -0.45, 0.45, 1.34]\) — zero-centered, unit-scaled. With \(\gamma=1, \beta=0\) that is the output; the network can later learn other \(\gamma,\beta\) if a different scale serves it better.

import numpy as np
def layernorm(x, gamma=1.0, beta=0.0, eps=1e-5):
    mu  = x.mean()                        # over this example's features
    var = x.var()                         # population variance (divide by n)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta
print(np.round(layernorm(np.array([1., 2, 3, 4])), 2))    # ~[-1.34 -0.45 0.45 1.34]
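To contrast the two averaging directions, the sketch below standardizes the same small matrix along each axis. The helper name `normalize` and the batch values are illustrative assumptions; \(\gamma\), \(\beta\), and BatchNorm's running averages are left out to keep the axis difference in view.

```python
import numpy as np

def normalize(x, axis, eps=1e-5):
    """Standardize x along the given axis (gamma = 1, beta = 0)."""
    mu = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

# Illustrative batch: 3 examples (rows) by 4 features (columns)
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 4.0, 6.0, 8.0],
              [0.0, 1.0, 0.0, 1.0]])

bn = normalize(X, axis=0)   # batch-norm style: per feature, down each column
ln = normalize(X, axis=1)   # layer-norm style: per example, along each row

print(np.round(bn.mean(axis=0), 6))   # each column has mean ~0
print(np.round(ln.mean(axis=1), 6))   # each row has mean ~0
print(np.round(ln[0], 2))             # first row matches the worked example
```

Changing the batch (adding or removing rows) changes `bn` for every example but leaves each row of `ln` untouched, which is the batch-size independence described above.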

The framework versions are drop-in layers — note BatchNorm needs .eval() at test time to switch to its running statistics, while LayerNorm behaves the same in both modes:

import torch.nn as nn
bn = nn.BatchNorm1d(64)   # MLPs: normalizes each of 64 features over the batch (image CNNs use nn.BatchNorm2d)
ln = nn.LayerNorm(64)     # Transformers/RNNs: normalizes each example's 64 features
# model.eval() makes bn use running mean/var; ln is unaffected by train/eval mode

Beyond fixing covariate shift, normalization smooths the loss landscape, which lets you use higher learning rates and makes training far more robust.

14.11 — Vanishing / exploding gradients

In a deep network, backprop multiplies many derivatives together as it travels back through layers. If those factors are consistently less than 1, the product shrinks toward zero — the vanishing gradient problem, where early layers learn agonizingly slowly. If they’re consistently greater than 1, the product blows up — the exploding gradient problem, where updates overshoot and the loss diverges to NaN.

Intuition by analogy: it’s like a rumor passed down a long line of people. If each person softens it a little (factor < 1), by the end nothing of the original is left — the early layers never hear the correction. If each person exaggerates it (factor > 1), the message snowballs into nonsense — the loss explodes to NaN.

The classic cause of vanishing gradients is saturating activations: sigmoid’s derivative maxes out at \(0.25\), so backprop through a 10-layer sigmoid network multiplies ten activation-derivative factors each \(\le 0.25\). Unless the weights are large enough to compensate, the gradient shrinks by a factor near \(0.25^{10}\approx10^{-6}\). The signal simply doesn’t reach the early layers.
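The arithmetic is quick to confirm. This sketch looks only at the activation-derivative factors (weights are ignored, which is an simplifying assumption) and compares the best case, every unit sitting at sigmoid's steepest point, with a mildly saturated case at an illustrative pre-activation of 3.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
dsigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))   # peaks at 0.25 when z = 0

best_case = dsigmoid(0.0) ** 10      # ten layers, every unit at its steepest point
saturated = dsigmoid(3.0) ** 10      # ten layers, every unit somewhat saturated

print(dsigmoid(0.0))     # 0.25
print(best_case)         # about 9.5e-07
print(saturated)         # many orders of magnitude smaller still
```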

The fixes are exactly the ingredients in this chapter, working together:

flowchart TD
  P[deep net trains badly] --> V{gradient too small?}
  V -->|yes| F1[non-saturating activation: ReLU/GELU]
  V -->|yes| F2[He/Xavier init]
  V -->|yes| F3[normalization layers]
  V -->|yes| F4[residual/skip connections]
  P --> E{gradient too large?}
  E -->|yes| G1[gradient clipping]
  E -->|yes| G2[lower learning rate]

Gradient clipping caps the gradient’s norm at a threshold (e.g. rescale so \(\|g\| \le 5\)) — the standard cure for exploding gradients, especially in RNNs. Residual connections (covered with CNNs and Transformers) add a shortcut \(x + f(x)\) so gradients have an unobstructed path back, which is what makes 100-layer networks trainable at all.
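What norm clipping does can be sketched in NumPy. The function name `clip_by_global_norm` and the gradient values are assumptions for illustration; the idea is to measure the combined L2 norm of all gradients and, if it exceeds the threshold, scale every gradient by the same factor so the direction is preserved.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))   # gradients already below the cap are untouched
    return [g * scale for g in grads], total

grads = [np.array([30.0, 40.0]), np.array([0.0])]   # illustrative gradients, combined norm 50
clipped, before = clip_by_global_norm(grads, max_norm=5.0)
after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(before, after)     # 50.0, then approximately 5.0
print(clipped[0])        # same direction as [30, 40], one tenth the length
```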

In practice, gradient clipping is one line, inserted between the backward pass and the optimizer step:

loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)  # cap ‖g‖
optimizer.step()
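To see what norm clipping does to the numbers, here is a simplified pure-Python sketch of the idea. The function name `clip_by_global_norm` and the toy gradients are assumptions for illustration; the library routine differs in details such as numerical safeguards.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient vectors so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total_norm <= max_norm:
        return [list(vec) for vec in grads], total_norm
    scale = max_norm / total_norm          # one shared factor keeps the direction
    return [[g * scale for g in vec] for vec in grads], total_norm

grads = [[3.0, 4.0], [12.0]]               # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

print(f"{norm_before:.1f} -> {norm_after:.1f}")  # 13.0 -> 5.0
```

Every component is multiplied by the same factor, so the update keeps its direction and only its length is capped.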

Tip

First diagnostic when a deep net won’t learn: print the gradient norm per layer. Near-zero norms at early layers → vanishing (switch to ReLU/He/norm). NaN loss → exploding (clip gradients, drop the learning rate).
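The per-layer diagnostic can be illustrated on a toy model. The sketch below assumes a chain of ten scalar sigmoid "layers" with every weight fixed at 1.0 (both choices are illustrative, not a realistic network) and backpropagates by hand to report the gradient magnitude at each layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_gradients(x, weights):
    """Backprop through the chain a_l = sigmoid(w_l * a_{l-1}).

    Returns |d a_L / d w_l| for every layer l, first layer first.
    """
    inputs, outputs = [], []
    a = x
    for w in weights:                      # forward pass, caching activations
        inputs.append(a)
        a = sigmoid(w * a)
        outputs.append(a)

    grads = [0.0] * len(weights)
    upstream = 1.0                         # d a_L / d a_L
    for l in reversed(range(len(weights))):
        local = outputs[l] * (1.0 - outputs[l])   # sigmoid'(z_l)
        grads[l] = abs(upstream * local * inputs[l])
        upstream *= local * weights[l]            # pass the error one layer back
    return grads

grads = layer_gradients(x=1.0, weights=[1.0] * 10)
for i, g in enumerate(grads, start=1):
    print(f"layer {i:2d}: |grad| = {g:.2e}")
```

The printed magnitudes shrink steadily from the last layer to the first, which is the signature of vanishing gradients. In a PyTorch model the same check amounts to reading `p.grad.norm()` for each entry of `model.named_parameters()` after `loss.backward()`.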

14.12 — Regularization: dropout & weight decay

A network with thousands of free weights can simply memorize the training set — nailing those examples while flopping on new ones. That gap is overfitting, and regularization is any technique that fights it by discouraging the model from leaning too hard on any one detail. Two are so common they belong here next to the core training machinery.

Dropout is the “don’t rely on any single teammate” trick. During training, each neuron is randomly switched off with probability \(p\) (a fresh random mask every forward pass). The network can no longer depend on one specific unit always being present, so it spreads its bets and learns redundant, robust features. At test time dropout is turned off and every unit is kept; because the surviving activations were already scaled up by \(1/(1-p)\) during training, the expected output matches without any further rescaling. Intuitively, it trains a huge ensemble of thinned sub-networks that share weights.

\[a_i^{\text{train}} = \frac{m_i}{1-p}\,a_i, \qquad m_i \sim \text{Bernoulli}(1-p)\]

In words: with probability \(p\) zero out each unit; keep the survivors but scale them up by \(1/(1-p)\) so the layer’s average strength is unchanged. Also written: \(\mathbf a^{\text{train}} = \dfrac{1}{1-p}\,(\mathbf m \odot \mathbf a)\) with \(\mathbf m\) a random 0/1 mask — the “inverted dropout” form used in practice.
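The inverted-dropout formula can be checked numerically. This is a minimal pure-Python sketch; the function name, the fixed seed, and the all-ones activations are assumptions chosen to make the effect easy to see.

```python
import random

def inverted_dropout(activations, p, rng):
    """Zero each unit with probability p; scale the survivors by 1 / (1 - p)."""
    keep = 1.0 - p
    # rng.random() < keep is the Bernoulli(1 - p) mask m_i
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)                     # fixed seed for reproducibility
acts = [1.0] * 100_000
dropped = inverted_dropout(acts, p=0.5, rng=rng)

mean_train = sum(dropped) / len(dropped)   # stays close to the original mean of 1.0
survivors = {v for v in dropped if v != 0.0}

print(f"mean after dropout: {mean_train:.2f}")
print(f"surviving values: {survivors}")    # every survivor was doubled
```

Roughly half the units are zeroed and the rest are doubled, so the layer's average activation is preserved and nothing needs rescaling at inference.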

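Weight decay is the “keep weights small unless the data insists” trick. In its L2-regularization form it adds a penalty \(\tfrac{\lambda}{2}\lVert W\rVert_2^2\) to the loss, which on each step nudges every weight a little toward zero. Under plain SGD the two views are exactly equivalent; with adaptive optimizers such as Adam they differ, which is why AdamW applies the decay directly to the weights instead of through the gradient. Small weights mean a smoother, simpler function that tends to generalize better.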

\[L_{\text{total}} = L_{\text{data}} + \frac{\lambda}{2}\sum_w w^2\]

In words: add a tax proportional to the squared size of the weights, so the optimizer only grows a weight when the data-fit improvement outweighs the tax. Also written: the gradient picks up a term \(\lambda w\), so the update becomes \(w \leftarrow (1-\eta\lambda)\,w - \eta\,\nabla_w L_{\text{data}}\) — every step shrinks \(w\) by a factor before the data gradient is applied.

Worked example. A layer with weight \(w=4.0\), learning rate \(\eta=0.1\), decay \(\lambda=0.01\), and zero data gradient for the moment. The decay term alone updates \(w \leftarrow (1 - 0.1\cdot0.01)\cdot4.0 = 0.999\cdot4.0 = 3.996\) — a tiny, steady pull toward zero on every step that only large, useful weights can resist.
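The worked example can be reproduced in a few lines. This sketch assumes the plain-SGD update rule stated above; `decay_step` is an illustrative helper, not a library function.

```python
def decay_step(w, lr, weight_decay, data_grad=0.0):
    """One SGD step with weight decay: w <- (1 - lr * wd) * w - lr * data_grad."""
    return (1.0 - lr * weight_decay) * w - lr * data_grad

w_one = decay_step(4.0, lr=0.1, weight_decay=0.01)
print(f"after 1 step:    {w_one:.3f}")     # 3.996

w_hundred = 4.0
for _ in range(100):                       # decay compounds: 4.0 * 0.999 ** 100
    w_hundred = decay_step(w_hundred, lr=0.1, weight_decay=0.01)
print(f"after 100 steps: {w_hundred:.2f}")  # 3.62
```

Each step is tiny, but the shrinkage compounds geometrically, so a weight the data never pushes back on loses about a tenth of its size over 100 steps.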

import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Dropout(p=0.5),                 # drop half the units each forward pass
    nn.Linear(64, 10),
)
# weight decay is an optimizer argument, not a layer:
opt = torch.optim.AdamW(net.parameters(), lr=1e-3, weight_decay=0.01)
# net.train() enables dropout; net.eval() disables it for inference

Tip

Dropout and weight decay stack with the other generalization tools — early stopping (§14.13), data augmentation, and smaller models. Turn dropout off at inference (model.eval()), or your predictions will be needlessly noisy.

14.13 — Epochs, batches & the training loop

Training repeats one core cycle: forward pass → compute loss → backward pass → update weights. The vocabulary describes how we slice the data through this cycle. A batch (or mini-batch) is a small group of examples processed together before one weight update. An iteration is one such update. An epoch is one full pass over the entire training set.

Why batches? Computing the gradient on the whole dataset every step (batch gradient descent) is accurate but slow and memory-hungry; using one example at a time (stochastic gradient descent) is noisy. Mini-batch training (typically 32–512 examples) is the practical middle ground — efficient on GPUs and noisy enough to escape bad minima. So with 10,000 examples and batch size 100, one epoch = 100 iterations.
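The epoch arithmetic gets slightly less tidy when the batch size does not divide the dataset evenly. The helper below is a small sketch; its name is hypothetical, and the `drop_last` flag is modeled on the PyTorch `DataLoader` option of the same name.

```python
import math

def iterations_per_epoch(n_examples, batch_size, drop_last=False):
    """Number of weight updates in one full pass over the data."""
    if drop_last:
        return n_examples // batch_size        # discard the final partial batch
    return math.ceil(n_examples / batch_size)  # keep it as a smaller batch

print(iterations_per_epoch(10_000, 100))                  # 100
print(iterations_per_epoch(10_000, 64))                   # 157
print(iterations_per_epoch(10_000, 64, drop_last=True))   # 156
```

With batch size 64, the 157th batch holds only the 16 leftover examples; dropping it trades a little data per epoch for uniform batch sizes.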

flowchart LR
  D[training data] --> S[shuffle] --> B[next mini-batch]
  B --> F[forward: predict] --> L[loss] --> G[backward: gradients]
  G --> U[update weights] --> B
  B -.epoch done.-> S

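import numpy as np

# Schematic loop: batches, forward, cross_entropy and backward stand in for
# helpers you would write yourself; params is a list of NumPy arrays.
for epoch in range(n_epochs):
    np.random.shuffle(data)                   # reshuffle each epoch
    for batch in batches(data, size=64):
        preds = forward(batch.X, params)      # 1. forward
        loss  = cross_entropy(preds, batch.y) # 2. loss
        grads = backward(loss, params)        # 3. backprop
        for p, g in zip(params, grads):       # 4. gradient-descent update
            p -= learning_rate * g            # in-place, so params itself changes
    # validate, log loss, maybe early-stop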

The same loop in idiomatic PyTorch — the five lines inside the batch loop are the canonical training step every practitioner has memorized:

import torch
model.train()
for epoch in range(n_epochs):
    for xb, yb in train_loader:              # DataLoader shuffles & batches
        optimizer.zero_grad()                # 1. clear old gradients
        preds = model(xb)                    # 2. forward
        loss = criterion(preds, yb)          # 3. loss (raw logits → CrossEntropyLoss)
        loss.backward()                      # 4. backprop (autograd)
        optimizer.step()                     # 5. update weights
    # model.eval(); validate; early-stop on validation loss

The learning rate scales each update; it’s the single most important hyperparameter (see the Optimization chapter for SGD, momentum, and Adam). Shuffling each epoch keeps batches from arriving in the same order every pass, so the model cannot latch onto ordering artifacts.

Early stopping is the standard way to decide when to stop: watch the validation loss each epoch, and once it stops improving for a set number of epochs (the “patience”), halt and keep the best checkpoint. It’s a cheap, effective regularizer — it stops training right before the model starts memorizing.
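The patience rule is simple enough to sketch in plain Python. The function name and the validation losses below are made up for illustration; a real loop would also save the model weights whenever a new best epoch is found.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training halts once `patience` epochs have passed without a new best
    validation loss; if that never happens, it runs to the final epoch.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # new best: checkpoint here
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch              # out of patience: stop
    return len(val_losses) - 1, best_epoch

losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.63, 0.65]
stop, best = early_stopping_epoch(losses, patience=3)
print(stop, best)  # 5 2
```

Here the best loss occurs at epoch 2; epochs 3, 4 and 5 fail to beat it, so training stops at epoch 5 and the epoch-2 checkpoint is the one to keep.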

Warning

A batch too large can hurt generalization (with very little gradient noise, training tends to settle into sharp minima that generalize worse) and a learning rate too high makes loss diverge. When in doubt, lower the learning rate first; it fixes more training failures than any other single knob.

14.14 — A complete worked network (Keras)

Pulling the whole chapter together, here is an end-to-end classifier — layers, activations, softmax output, cross-entropy loss, an optimizer, and the training loop — in a few lines of Keras. Every piece maps to a section above.

import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu"),   # §14.1 layer + §14.5 ReLU
    keras.layers.BatchNormalization(),            # §14.10 normalization
    keras.layers.Dropout(0.3),                    # §14.12 regularization
    keras.layers.Dense(10, activation="softmax"), # §14.6 softmax output
])
model.compile(
    optimizer=keras.optimizers.Adam(1e-3),        # §14.9 update rule
    loss="sparse_categorical_crossentropy",       # §14.8 loss
    metrics=["accuracy"],
)
model.fit(                                        # §14.13 epochs/batches loop
    X_train, y_train, epochs=10, batch_size=64,
    validation_split=0.1,
    callbacks=[keras.callbacks.EarlyStopping(patience=3,    # §14.13 early stop
                                             restore_best_weights=True)],
)

That single fit call runs forward propagation (§14.2), backpropagation (§14.3) over the computational graph (§14.4), and mini-batch gradient descent (§14.9, §14.13) under the hood — the entire chapter, executed on every batch.

14.15 — Quick reference

Term / formula	Meaning	When / why
Neuron \(a = g(\mathbf{w}^\top\mathbf{x}+b)\)	weighted sum, bias, then nonlinearity	the atomic unit of every layer
MLP / feedforward net	layers of neurons, data flows one way	baseline deep model; no loops
Forward pass \(a^{[\ell]}=g(W^{[\ell]}a^{[\ell-1]}+b^{[\ell]})\)	chain of matmul + activation	turns input into prediction
Backprop \(\delta^{[\ell]}=(W^{[\ell+1]\top}\delta^{[\ell+1]})\odot g'(z^{[\ell]})\)	chain rule, error flows backward	all gradients in one backward sweep
Computational graph	ops as nodes, tensors as edges	how autodiff knows the chain rule
Reverse-mode autodiff	one backward sweep from scalar loss	cheap gradients for millions of weights
ReLU \(\max(0,z)\)	cheap hinge nonlinearity	default hidden activation (MLP/CNN)
GELU \(z\,\Phi(z)\)	smooth, probabilistic ReLU	default in Transformers / modern NLP
Softmax \(e^{z_i}/\sum_j e^{z_j}\)	logits → probability distribution	multi-class output layer
Cross-entropy \(-\log\hat y_c\)	negative log-prob of true class	classification loss (with softmax)
MSE \(\tfrac1N\sum(\hat y-y)^2\)	mean squared gap	regression loss (with linear output)
He init \(\sigma=\sqrt{2/n_{in}}\)	scale weights by fan-in	keeps ReLU signal steady; Xavier for tanh
Gradient descent \(w\leftarrow w-\eta\nabla_w L\)	step weights downhill	the core update; \(\eta\) is the key dial
BatchNorm / LayerNorm \(\gamma\hat x+\beta\)	standardize then re-scale activations	stabilize & speed training (LN for Transformers)
Dropout \(\tfrac{1}{1-p}(\mathbf m\odot\mathbf a)\)	randomly zero units in training	regularizer; off at inference (`eval()`)
Weight decay \(\tfrac{\lambda}{2}\lVert W\rVert^2\)	penalize large weights	L2 regularizer; smoother function
Gradient clipping \(\lVert g\rVert\le\tau\)	cap gradient norm	cure for exploding gradients (RNNs)
Epoch / batch / iteration	full pass / group / one update	mini-batch (32–512) is the practical default
Early stopping	halt when val loss plateaus	cheap regularizer; keeps best checkpoint

14.16 — Key takeaways

An MLP stacks layers of (weighted sum + bias + nonlinear activation); depth + nonlinearity is what gives expressive power.
Forward propagation is repeated \(z=Wa+b\), \(a=g(z)\); backpropagation reuses the chain rule to get all gradients in one backward pass.
A network is a computational graph; frameworks build it as you write code and run reverse-mode autodiff over it to get every gradient at once.
Activations must be nonlinear — ReLU by default, GELU in Transformers, sigmoid only for probability outputs; soft-bottom variants cure dead units.
Softmax turns logits into probabilities; pair it with cross-entropy loss (and use MSE for regression).
Initialize weights small and random (never zero) with He (ReLU) or Xavier (tanh) scaling.
Gradient descent updates \(w \leftarrow w - \eta\nabla_w L\); the learning rate \(\eta\) is the dial that most often makes or breaks training.
Normalization (batch vs. layer) stabilizes activations and speeds training; layer norm dominates in Transformers.
Regularize with dropout and weight decay (and early stopping) to close the train–test gap.
Vanishing/exploding gradients come from multiplying many derivatives; fix with non-saturating activations, good init, normalization, residuals, and gradient clipping.
Train in mini-batches over multiple epochs, reshuffling each epoch, and watch validation loss with early stopping to decide when to halt.

14.17 — See also

Calculus & Differentiation — the chain rule that backpropagation is built on.
Optimization — SGD, momentum, Adam, and learning-rate schedules that drive the update step.
Convolutional Neural Networks — feedforward networks specialized for images, where BatchNorm shines.
Recurrent & Sequence Models — where vanishing gradients and gradient clipping matter most.
Attention & Transformers — built from these layers plus LayerNorm, GELU, and residual connections.
Model Evaluation & Tuning — how to choose batch size, learning rate, and stopping via validation.

↪ The thread continues → Chapter 15 · 🖼️ Convolutional Neural Networks

A plain network treats every input the same. Images have spatial structure, and exploiting it — local patterns, shared filters — gives the convolutional network that ignited modern AI.
