flowchart LR
x[x] --> g["g = wx + b"]
g --> a["a = ReLU(g)"]
a --> L["L = (a - y)²"]
L -. "dL/da" .-> a
a -. "da/dg" .-> g
g -. "dg/dw" .-> x
Chapter 02 — ∂ Calculus & Differentiation
📖 All chapters | ← 01 · 🧮 Linear Algebra | 03 · 📉 Optimization →
Calculus is the mathematics of change, and machine learning is the art of changing model parameters until predictions get better. That is why this subfield sits at the very heart of training: every gradient-descent step, every backpropagation pass, every “the loss went down” moment is calculus running underneath. This chapter builds the machinery — slopes, partial derivatives, gradients, the chain rule, Taylor approximation, a little integration, and the numerical tricks that keep it all from blowing up.
🧭 In context: Mathematical Foundations · used to measure how a loss changes when you nudge a parameter, so optimizers know which way to step · the one key idea: the gradient points uphill, so step against it.
💡 Remember this: every “the loss went down” moment in machine learning is the chain rule computing a gradient and an optimizer stepping in the opposite direction.
Here is the whole chapter in one picture: a ball, placed on a loss curve, rolls downhill — each frame it reads the local slope (the derivative) and steps against it until it settles in the valley. Everything below is the machinery that makes this little roll precise.
2.1 — Derivatives & differentiation (slope, rules)
A derivative answers one question: if I nudge the input a tiny bit, how much does the output move, and in which direction? It is the slope of a function at a point — the steepness of the tangent line that just kisses the curve there.
The plain-language picture: imagine driving and watching your odometer. Position is the function; the derivative is your speedometer — the instantaneous rate of change of position. Over a tiny instant \(h\), you moved \(f(x+h) - f(x)\); divide by the time \(h\) and shrink \(h\) toward zero:
\[f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}\]
In words: the derivative is what the average slope between two points settles down to as you slide those two points infinitely close together.
Also written: \(\dfrac{df}{dx} = \displaystyle\lim_{\Delta x \to 0} \dfrac{\Delta f}{\Delta x}\) — the same limit using Leibniz’s \(\Delta\) (“change in”) notation; \(f'(x)\) (Lagrange) and \(\frac{df}{dx}\) (Leibniz) name the identical object.
That limit is the formal definition of the derivative. Geometrically, the fraction is the slope of a secant line through two nearby points; as the points merge, the secant becomes the tangent.
In practice nobody recomputes that limit by hand. We use differentiation rules — a small algebra that turns one function into its derivative. The power rule handles polynomials, the sum rule lets you differentiate term by term, and the product and quotient rules handle multiplied and divided functions:
| Rule | Function | Derivative |
|---|---|---|
| Power | \(x^n\) | \(n x^{n-1}\) |
| Constant | \(c\) | \(0\) |
| Sum | \(f + g\) | \(f' + g'\) |
| Product | \(f \cdot g\) | \(f'g + fg'\) |
| Quotient | \(f/g\) | \((f'g - fg')/g^2\) |
| Exponential | \(e^x\) | \(e^x\) |
| Log | \(\ln x\) | \(1/x\) |
Worked example. Take \(f(x) = 3x^2 + 2x\). By the power and sum rules, \(f'(x) = 6x + 2\). At \(x = 1\) the slope is \(8\). Let us sanity-check with the limit definition using a small \(h = 0.001\): \(f(1.001) = 3(1.002001) + 2.002 = 5.008003\), and \(f(1) = 5\), so \((5.008003 - 5)/0.001 = 8.003\) — essentially \(8\), as the rule promised. The tiny leftover (\(0.003\)) is the approximation error from \(h\) not being exactly zero.
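The sanity check above can be repeated for several step sizes to watch the secant slope settle toward the tangent slope. This is a minimal sketch in plain Python; the helper name `secant_slope` and the particular values of \(h\) are illustrative choices, not part of the chapter.

```python
# Secant slopes of f(x) = 3x^2 + 2x at x = 1 for shrinking h.
def f(x):
    return 3 * x**2 + 2 * x

def secant_slope(func, x, h):
    """Average slope between x and x + h (the fraction inside the limit)."""
    return (func(x + h) - func(x)) / h

slopes = {h: secant_slope(f, 1.0, h) for h in (0.1, 0.01, 0.001)}
for h, s in slopes.items():
    print(h, s)   # the slopes approach the exact derivative f'(1) = 8
```

For this particular polynomial the secant slope works out to \(8 + 3h\), so the leftover error shrinks in direct proportion to \(h\).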
The derivative has a sign and a size. The sign says which direction is uphill; the size says how steep. A derivative of zero means flat — a peak, valley, or plateau. Optimizers live and die by these two facts.
The most common derivatives in ML are not polynomials but the activation-function derivatives. They are worth knowing by heart because they decide how strongly a gradient survives as it passes back through a neuron:
| Activation | \(f(x)\) | \(f'(x)\) | Why it matters |
|---|---|---|---|
| Sigmoid | \(\sigma(x) = \frac{1}{1+e^{-x}}\) | \(\sigma(x)(1-\sigma(x))\) | maxes at \(0.25\) → gradients shrink (vanishing) |
| Tanh | \(\tanh(x)\) | \(1 - \tanh^2(x)\) | maxes at \(1\), still saturates at the ends |
| ReLU | \(\max(0,x)\) | \(1\) if \(x>0\) else \(0\) | gradient is \(0\) or \(1\): cheap, no shrink, but a unit “dies” when its input stays \(\le 0\) |
Notice the sigmoid derivative is written entirely in terms of \(\sigma(x)\) itself — so once the forward pass has computed the activation, the backward pass gets the derivative almost for free. That reuse is the seed of backpropagation in section 2.2.
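The derivatives in the table can be written directly in NumPy and checked against the quoted maxima. This is a sketch under a few assumptions: the function names are hypothetical, the grid of test points is arbitrary, and the ReLU derivative at exactly \(x = 0\) is taken as \(0\) by convention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)                 # reuse the forward value
    return s * (1.0 - s)

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    return (x > 0).astype(float)   # convention: derivative taken as 0 at x = 0

xs = np.linspace(-6.0, 6.0, 1201)  # grid that includes a point at (or next to) 0
print(sigmoid_grad(xs).max())      # peaks near 0.25, at x = 0
print(tanh_grad(xs).max())         # peaks near 1, at x = 0
print(relu_grad(np.array([-2.0, 0.0, 3.0])))
```

Note how `sigmoid_grad` needs only the forward value `s`, which is the reuse the paragraph above describes.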
# verify a hand-derivative with SymPy (symbolic) before trusting it in code
import sympy as sp
x = sp.symbols('x')
f = 3*x**2 + 2*x
print(sp.diff(f, x)) # 6*x + 2
print(sp.diff(1/(1+sp.exp(-x)), x)) # the sigmoid derivative, expanded
2.2 — Partial derivatives & the chain rule (engine of backprop)
Real models have not one input but millions. A partial derivative is the derivative with respect to one variable while holding all the others fixed — written \(\partial f / \partial x\). The curly \(\partial\) (“partial”) just signals “there are other variables, but I am freezing them.”
Worked example. Let \(f(x, y) = x^2 y + 3y\). Holding \(y\) fixed, \(\partial f/\partial x = 2xy\). Holding \(x\) fixed, \(\partial f/\partial y = x^2 + 3\). At \((x, y) = (2, 1)\): \(\partial f/\partial x = 4\) and \(\partial f/\partial y = 7\). So near that point, nudging \(x\) moves the output about 4× the nudge, while nudging \(y\) moves it about 7×.
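Freezing the other variables translates directly into code: perturb one coordinate, leave the rest alone. The sketch below checks the two partials of this worked example numerically; the helper `partial` and the step size `h` are illustrative assumptions.

```python
def f(x, y):
    return x**2 * y + 3 * y

def partial(func, point, index, h=1e-6):
    """Central-difference estimate of d func / d point[index], other inputs frozen."""
    up, down = list(point), list(point)
    up[index] += h
    down[index] -= h
    return (func(*up) - func(*down)) / (2 * h)

df_dx = partial(f, (2.0, 1.0), 0)   # nudge x only
df_dy = partial(f, (2.0, 1.0), 1)   # nudge y only
print(df_dx, df_dy)                 # close to the hand-computed 4 and 7
```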
The chain rule is the real workhorse. It tells you how to differentiate compositions — a function of a function of a function. If \(z = f(g(x))\), then
\[\frac{dz}{dx} = \frac{dz}{dg} \cdot \frac{dg}{dx}\]
In words: to find how fast the output moves when you wiggle the input, multiply how fast the outer function reacts to its input by how fast the inner function reacts to yours — rates of change multiply down the chain.
Also written: \((f \circ g)'(x) = f'(g(x)) \, g'(x)\) — Lagrange’s prime form of the same rule; the Leibniz fractions \(\frac{dz}{dg}\frac{dg}{dx}\) “cancel” symbolically to leave \(\frac{dz}{dx}\).
Intuition: rates of change multiply. If revenue grows 2× as fast as users, and users grow 3× as fast as ad spend, then revenue grows \(2 \times 3 = 6\)× as fast as ad spend. Each link in the chain contributes a multiplier.
The animation below shows a gradient signal flowing backward through three links: it enters as \(1\) at the loss and gets multiplied by each link’s local derivative as it travels left, arriving at the input as the product of all three — that product is \(dz/dx\).
This is exactly how backpropagation works. A neural network is a long composition: input → linear layer → activation → linear layer → … → loss. To learn, we need \(\partial \text{loss}/\partial w\) for every weight \(w\) buried deep inside. The chain rule lets us compute these by starting at the loss and multiplying derivatives backward through the layers, reusing intermediate results.
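As a concrete sketch of that backward multiplication, here is the single neuron from the chapter's opening diagram (\(g = wx + b\), \(a = \mathrm{ReLU}(g)\), \(L = (a - y)^2\)) differentiated by hand. The numeric values for `w`, `b`, `x`, and `y` are arbitrary assumptions chosen to keep the arithmetic easy.

```python
def forward_backward(w, b, x, y):
    # forward pass: keep every intermediate value
    g = w * x + b
    a = max(0.0, g)          # ReLU
    L = (a - y) ** 2
    # backward pass: local derivatives, multiplied from the loss back
    dL_da = 2.0 * (a - y)
    da_dg = 1.0 if g > 0 else 0.0
    dL_dg = dL_da * da_dg    # shared by both parameters, computed once
    dL_dw = dL_dg * x        # dg/dw = x
    dL_db = dL_dg * 1.0      # dg/db = 1
    return L, dL_dw, dL_db

L, dL_dw, dL_db = forward_backward(w=2.0, b=-1.0, x=3.0, y=2.0)
print(L, dL_dw, dL_db)   # 9.0 18.0 6.0
```

The intermediate `dL_dg` is computed once and reused for both `w` and `b`, which is the "reusing intermediate results" part of the description above.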
When variables branch and recombine (one input feeding several paths), you also need the multivariate chain rule: sum the contributions over every path. For \(z = f(u, v)\) with \(u = u(x)\), \(v = v(x)\),
\[\frac{dz}{dx} = \frac{\partial z}{\partial u}\frac{du}{dx} + \frac{\partial z}{\partial v}\frac{dv}{dx}\]
In words: when the input reaches the output through more than one route, add up the rate along each route.
Also written: \(\frac{dz}{dx} = \sum_{k} \frac{\partial z}{\partial u_k}\frac{du_k}{dx}\) — the general “sum over all intermediate variables \(u_k\)” form, which is precisely what a framework does when a tensor is used in several places.
Worked example (branching). Let \(u = x^2\) and \(v = 3x\), and \(z = u + v\). The input \(x\) reaches \(z\) by two routes. Route through \(u\): \(\frac{\partial z}{\partial u}\frac{du}{dx} = (1)(2x)\). Route through \(v\): \(\frac{\partial z}{\partial v}\frac{dv}{dx} = (1)(3)\). Add them: \(\frac{dz}{dx} = 2x + 3\). Check directly: \(z = x^2 + 3x\), so \(\frac{dz}{dx} = 2x + 3\). The two routes summed give the right answer — and when a network reuses the same activation in several places, the framework adds up exactly these per-route contributions.
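To see the per-route sum do real work, the sketch below uses a slight variation on the example: the same \(u = x^2\) and \(v = 3x\), but combined as \(z = u \cdot v\) (an assumption made here so that the partials are not both \(1\)). Directly, \(z = 3x^3\), so \(dz/dx = 9x^2\).

```python
def dz_dx_by_routes(x):
    u, v = x**2, 3.0 * x          # forward: the two intermediate values
    dz_du, dz_dv = v, u           # partials of z = u * v
    du_dx, dv_dx = 2.0 * x, 3.0   # local derivative along each route
    return dz_du * du_dx + dz_dv * dv_dx   # add the two routes

def dz_dx_direct(x):
    return 9.0 * x**2             # z = 3x^3 differentiated directly

for x in (-1.0, 0.5, 2.0):
    print(x, dz_dx_by_routes(x), dz_dx_direct(x))   # the two columns agree
```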
Worked example (one neuron). Let \(g = wx\), \(a = g^2\), \(L = a\). Then by the chain rule \(\frac{dL}{dw} = \frac{dL}{da}\frac{da}{dg}\frac{dg}{dw} = (1)(2g)(x) = 2wx^2\). With \(w = 3, x = 2\): \(g = 6\), and \(dL/dw = 2 \cdot 3 \cdot 4 = 24\). Check directly: \(L = (wx)^2 = w^2 x^2 = 4w^2\), so \(dL/dw = 8w = 24\). The chain rule and the direct computation agree.
# the same gradient, the backprop way: forward, then multiply backward
w, x = 3.0, 2.0
g = w * x # forward
a = g**2
L = a
# backward: start with dL/dL = 1, chain through
dL_da = 1.0
da_dg = 2 * g # d(g^2)/dg
dg_dw = x # d(wx)/dw
dL_dw = dL_da * da_dg * dg_dw
print(dL_dw) # 24.0
A real framework does this multiplication automatically. The same neuron in PyTorch, where .backward() walks the chain rule for us:
import torch
w = torch.tensor(3.0, requires_grad=True)
x = torch.tensor(2.0)
L = (w * x) ** 2 # build the graph: g = wx, a = g^2, L = a
L.backward() # apply the chain rule backward
print(w.grad) # tensor(24.), matching our by-hand result
Backprop reuses the forward-pass values (\(g\), \(a\), …). If you overwrite them or forget to store them, the backward pass computes garbage. This is why frameworks keep a “computation graph” of intermediate activations in memory, and why long sequences run out of GPU RAM.
2.3 — Gradient, Jacobian & Hessian
When a function has many inputs, its derivatives organize into structured objects. Which object you get depends on how many inputs and outputs you have.
The gradient \(\nabla f\) collects all the first-order partials of a scalar function into a vector:
\[\nabla f = \left[\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots, \frac{\partial f}{\partial x_n}\right]\]
In words: the gradient is just a tidy list — one slope per input direction — bundled into a single vector that points the way the function rises fastest.
Also written: \(\nabla f = \left(\frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n}\right)^{\!\top}\), sometimes denoted \(\operatorname{grad} f\) or \(\partial f / \partial \mathbf{x}\) — the same object as a column vector.
Its defining property is the one that makes optimization work: the gradient points in the direction of steepest ascent, and its negative points downhill. That is the whole basis of gradient descent — to minimize, step against the gradient.
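A bare-bones loop makes "step against the gradient" concrete. This sketch assumes a simple bowl, \(f(x, y) = x^2 + 3y^2\), with a hand-written gradient; the starting point, step size `lr`, and iteration count are arbitrary illustrative choices, not recommendations.

```python
import numpy as np

def grad_f(p):
    x, y = p
    return np.array([2.0 * x, 6.0 * y])   # gradient of f(x, y) = x^2 + 3y^2

p = np.array([1.0, 2.0])   # starting point (arbitrary choice)
lr = 0.1                   # step size (arbitrary choice)
for step in range(100):
    p = p - lr * grad_f(p) # step against the gradient

print(p)   # both coordinates end up very close to the minimum at (0, 0)
```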
The Jacobian generalizes this to vector-valued functions — many inputs and many outputs. If \(\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m\), the Jacobian \(J\) is the \(m \times n\) matrix whose entry \(J_{ij} = \partial f_i / \partial x_j\). Each row is the gradient of one output. Jacobians show up whenever you differentiate a layer that maps a vector to a vector (e.g. a softmax).
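For the softmax case mentioned above, the Jacobian has a well-known closed form, \(J_{ij} = s_i(\delta_{ij} - s_j)\) with \(s = \mathrm{softmax}(z)\). The sketch below builds it in NumPy and compares it against a finite-difference Jacobian; the helper names and the test vector `z` are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift by the max for numerical safety
    return e / e.sum()

def softmax_jacobian(z):
    s = softmax(z)
    return np.diag(s) - np.outer(s, s)   # J[i, j] = s_i * (delta_ij - s_j)

def numeric_jacobian(func, z, h=1e-6):
    J = np.zeros((func(z).size, z.size))
    for j in range(z.size):              # one column per input
        step = np.zeros_like(z)
        step[j] = h
        J[:, j] = (func(z + step) - func(z - step)) / (2 * h)
    return J

z = np.array([1.0, 2.0, 0.5])
J = softmax_jacobian(z)
print(J.shape)                                                   # (3, 3)
print(np.allclose(J, numeric_jacobian(softmax, z), atol=1e-8))   # True
```

Each column of this Jacobian sums to zero, reflecting that the softmax outputs always sum to \(1\): nudging any input can only move probability between outputs.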
The Hessian \(H\) holds the second derivatives of a scalar function — the curvature. It is the \(n \times n\) matrix \(H_{ij} = \partial^2 f / \partial x_i \partial x_j\). Where the gradient tells you which way is downhill, the Hessian tells you whether the ground is curving up like a bowl (minimum), down like a dome (maximum), or twisting like a saddle.
| Object | Input → output | Shape | Captures |
|---|---|---|---|
| Gradient \(\nabla f\) | \(\mathbb{R}^n \to \mathbb{R}\) | vector (\(n\)) | slope in every direction |
| Jacobian \(J\) | \(\mathbb{R}^n \to \mathbb{R}^m\) | matrix (\(m \times n\)) | how each output reacts to each input |
| Hessian \(H\) | \(\mathbb{R}^n \to \mathbb{R}\) | matrix (\(n \times n\)) | curvature (second order) |
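A useful way to see the family: every one of these objects is a derivative of something. The gradient is the derivative of a scalar function; the Jacobian is the derivative of a vector-valued function (its rows are the gradients of the individual outputs); the Hessian is the Jacobian of the gradient (the derivative of a derivative).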
Worked example. For \(f(x, y) = x^2 + 3y^2\): the gradient is \(\nabla f = [2x, 6y]\), and the Hessian is the constant matrix \(H = \begin{bmatrix} 2 & 0 \\ 0 & 6 \end{bmatrix}\). Both diagonal entries are positive, so the bowl curves upward in every direction — the point \((0,0)\) is a genuine minimum. The unequal entries (\(2\) vs \(6\)) mean the bowl is steeper along \(y\) than \(x\), which is exactly the kind of “stretched bowl” that makes plain gradient descent zig-zag.
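The bowl/dome/saddle reading of the Hessian can be automated by looking at the signs of its eigenvalues (for a diagonal Hessian like the one above, the eigenvalues are just the diagonal entries). This is a minimal sketch; the function name and the second test matrix are hypothetical.

```python
import numpy as np

def classify_critical_point(H):
    """Classify a critical point from the eigenvalues of a symmetric Hessian."""
    eig = np.linalg.eigvalsh(H)
    if np.all(eig > 0):
        return "minimum"
    if np.all(eig < 0):
        return "maximum"
    if np.any(eig > 0) and np.any(eig < 0):
        return "saddle"
    return "inconclusive"   # a zero eigenvalue: the second-order test says nothing

print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 6.0]])))    # minimum
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, -6.0]])))   # saddle
```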
# autodiff gives gradient, Jacobian, and Hessian in a few lines (PyTorch)
import torch
def f(v): # f(x,y) = x^2 + 3y^2
    x, y = v
    return x**2 + 3*y**2
p = torch.tensor([1.0, 2.0])
g = torch.autograd.functional.jacobian(f, p) # gradient: tensor([2., 12.])
H = torch.autograd.functional.hessian(f, p) # [[2,0],[0,6]]
print(g, H, sep="\n")
Rule of thumb on cost: the gradient has \(n\) numbers, the Hessian has \(n^2\). For a model with a billion parameters, the gradient is feasible but the full Hessian is unthinkable. This is why almost all deep-learning optimizers are first-order (gradient only) and merely approximate curvature.
2.4 — Directional derivatives & the steepest-descent direction
We have been speaking of “the slope,” but in many dimensions there is a slope in every direction you could walk. The directional derivative asks: standing at a point, if I step along a chosen unit direction \(\mathbf{u}\), how fast does the function change? The answer is a dot product with the gradient:
\[D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}\]
In words: the rate of change in any direction is the gradient projected onto that direction — how much your chosen heading lines up with the uphill arrow.
Also written: \(D_{\mathbf{u}} f = \|\nabla f\|\,\|\mathbf{u}\|\cos\theta = \|\nabla f\|\cos\theta\) (since \(\|\mathbf{u}\|=1\)), where \(\theta\) is the angle between your direction and the gradient.
That second form is the intuition machine learning runs on. The slope is largest when \(\cos\theta = 1\) — that is, when \(\mathbf{u}\) points exactly along the gradient (steepest ascent). It is most negative when \(\cos\theta = -1\), when you walk opposite the gradient (steepest descent). And when \(\mathbf{u}\) is perpendicular to the gradient (\(\cos\theta = 0\)) the function is momentarily flat — you are walking along a contour line.
Worked example. Take \(f(x,y) = x^2 + y^2\) at the point \((3, 4)\), so \(\nabla f = [6, 8]\). Walk in the direction \(\mathbf{u} = [1, 0]\) (pure \(+x\)): the directional derivative is \([6,8]\cdot[1,0] = 6\). Walk along the gradient’s own direction, the unit vector \([6,8]/10 = [0.6, 0.8]\): the rate is \([6,8]\cdot[0.6,0.8] = 3.6 + 6.4 = 10 = \|\nabla f\|\) — the maximum possible. No direction climbs faster than straight along the gradient, which is why gradient descent steps along \(-\nabla f\).
This is the rigorous reason behind the chapter’s one-line summary. “Step against the gradient” is not a heuristic: among all unit directions, the one pointing along \(-\nabla f\) (that is, \(-\nabla f / \|\nabla f\|\)) is provably the one that decreases \(f\) fastest, because the dot product is minimized when the angle is \(180°\).
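The claim can be checked by brute force: evaluate \(\nabla f \cdot \mathbf{u}\) over a fine fan of unit directions and see which one wins. The sketch reuses the gradient \([6, 8]\) from the worked example; the number of sampled angles is an arbitrary choice.

```python
import numpy as np

grad = np.array([6.0, 8.0])                     # gradient of x^2 + y^2 at (3, 4)
angles = np.linspace(0.0, 2.0 * np.pi, 3600, endpoint=False)
units = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # unit directions
rates = units @ grad                            # D_u f = grad . u for each u

best = units[np.argmax(rates)]
worst = units[np.argmin(rates)]
print(rates.max(), best)    # close to 10 and to [0.6, 0.8]
print(rates.min(), worst)   # close to -10 and to [-0.6, -0.8]
```

Because the directions are sampled on a grid, the best one found is only approximately the normalized gradient; a finer grid tightens the match.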
2.5 — Taylor series & local approximation
A Taylor series rebuilds a complicated function near a point out of its derivatives there. The idea: if you know a function’s value, slope, curvature, and so on at one spot, you can predict its values nearby — better and better as you add terms.
Around a point \(a\):
\[f(x) \approx f(a) + f'(a)(x-a) + \tfrac{1}{2}f''(a)(x-a)^2 + \dots\]
In words: start at the known height \(f(a)\), tilt by the slope, bend by half the curvature, and keep adding finer corrections — each term fixes the error left by the previous ones.
Also written: \(f(x) = \displaystyle\sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!}(x-a)^k\) — the compact summation form, where \(f^{(k)}\) is the \(k\)-th derivative and \(0! = 1\).
The first two terms are just the tangent line — a linear approximation. Add the third (curvature) term and you get a parabola that hugs the curve more tightly. Each extra derivative buys you accuracy over a wider neighborhood.
Why ML cares: optimization methods are local approximations. Gradient descent treats the loss surface as locally linear (a first-order Taylor model) and steps along the slope. Newton’s method goes one order further — it fits a quadratic using the Hessian and jumps straight to that parabola’s minimum, which is why it converges in fewer steps when you can afford the curvature.
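The contrast between the two local models shows up in a single step. The sketch below assumes the quadratic bowl \(f(x, y) = x^2 + 3y^2\) with its gradient and Hessian written by hand; the starting point and the gradient-descent step size of \(0.1\) are illustrative assumptions.

```python
import numpy as np

def grad(p):
    return np.array([2.0 * p[0], 6.0 * p[1]])   # gradient of x^2 + 3y^2

H = np.array([[2.0, 0.0], [0.0, 6.0]])          # its (constant) Hessian

p0 = np.array([1.0, 2.0])
gd_step = p0 - 0.1 * grad(p0)                    # first-order: follow the slope
newton_step = p0 - np.linalg.solve(H, grad(p0))  # second-order: jump to the parabola's minimum

print(gd_step)       # approximately [0.8, 0.8]: closer, but not there yet
print(newton_step)   # approximately [0, 0]: the exact minimum in one step
```

Newton's method lands on the minimum in one step here only because the function is exactly quadratic, so its second-order Taylor model is the function itself. On a general loss the jump is only as good as the local quadratic fit.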
Worked example. Approximate \(e^x\) near \(a = 0\). The derivatives of \(e^x\) are all \(e^x\), equal to \(1\) at \(0\), so \(e^x \approx 1 + x + \tfrac{1}{2}x^2\). At \(x = 0.1\): the approximation gives \(1 + 0.1 + 0.005 = 1.105\), while the true value is \(1.10517\ldots\), so three terms are already accurate to three decimal places (error about \(0.00017\)). Push to \(x = 1\) and the same three terms give \(2.5\) versus the true \(2.718\): the right ballpark, but the error (about \(0.22\)) has grown as you move away from \(a\), exactly as expected.
# watch accuracy grow with each Taylor term for e^x near 0
import numpy as np
from math import factorial
x = 1.0
for n_terms in range(1, 6):
    approx = sum(x**k / factorial(k) for k in range(n_terms))
    print(n_terms, round(approx, 5)) # ... converges toward 2.71828
print("true", round(np.exp(x), 5))
A Taylor approximation is only trustworthy near the expansion point. An optimizer that takes an overly large step is trusting its local model (linear for gradient descent, quadratic for Newton’s method) far outside the neighborhood where that model is valid, which is precisely why too large a learning rate makes training diverge. “Trust region” methods build on this idea by explicitly capping each step to a region where the model is trusted.
2.6 — Integration (and where it appears in ML)
If differentiation is about rates, integration is about accumulation: adding up infinitely many infinitesimal pieces. The definite integral \(\int_a^b f(x)\,dx\) is the signed area under the curve \(f\) between \(a\) and \(b\). The two operations are inverses: the Fundamental Theorem of Calculus says that integrating a derivative recovers the change in the original function, \(\int_a^b f'(x)\,dx = f(b) - f(a)\).
In ML, integration shows up most often as expectation. The expected value of a function \(g\) under a probability density \(p\) is an integral:
\[\mathbb{E}_{x \sim p}[g(x)] = \int g(x)\, p(x)\, dx\]
In words: the expectation is a weighted average of \(g\) — every possible value of \(x\) contributes \(g(x)\), weighted by how likely that \(x\) is under \(p\).
Also written: for a discrete distribution this becomes a sum, \(\mathbb{E}[g(x)] = \sum_x g(x)\,p(x)\) — integration is just the continuous limit of that weighted sum.
This single formula hides under enormous amounts of ML: the expected loss over the data distribution, the mean of a Gaussian, the normalizing constant that makes a probability density integrate to \(1\), the area under the ROC curve, and the marginal likelihood in Bayesian models. Probabilities themselves are integrals of densities over a region.
The catch: most of these integrals have no closed form. You cannot solve them with algebra. So ML almost always approximates them — most commonly by Monte Carlo: replace the integral with an average over random samples, since by the law of large numbers \(\frac{1}{N}\sum_i g(x_i) \to \mathbb{E}[g(x)]\) as samples accumulate.
The picture below is Monte Carlo in miniature: dots rain down at random, and the fraction that land under the curve estimates its area. More dots, tighter estimate.
Worked example (Monte Carlo). Estimate \(\int_0^1 x^2\,dx\), whose exact value is \(1/3\).
import numpy as np
rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, size=100_000) # sample x uniformly on [0,1]
est = np.mean(xs**2) # average of g(x)=x^2 = E[x^2]
print(est) # close to 1/3
Because \(p(x)\) is uniform on \([0,1]\) (density \(=1\)), the integral equals \(\mathbb{E}[x^2]\), and the sample mean estimates it. With 100k samples the standard error is about \(0.001\), so the estimate typically lands within a thousandth or two of the true \(0.3333\ldots\). Monte Carlo is the engine behind everything from dropout estimates to the expectations inside reinforcement learning and variational inference.
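How quickly the estimate tightens can be measured rather than guessed: the standard error of a sample mean falls like \(1/\sqrt{N}\). The sketch below repeats the same integral at three sample sizes and reports an estimated standard error for each. The seed and the sample sizes are arbitrary choices, and the printed values depend on the random draw.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed, chosen arbitrarily
exact = 1.0 / 3.0
results = []
for n in (100, 10_000, 1_000_000):
    xs = rng.uniform(0.0, 1.0, size=n)
    est = np.mean(xs**2)                            # Monte Carlo estimate of the integral
    std_err = np.std(xs**2, ddof=1) / np.sqrt(n)    # estimated standard error
    results.append((n, est, std_err))
    print(n, est, abs(est - exact), std_err)
```

Each 100-fold increase in samples shrinks the standard error by roughly a factor of ten, which is why Monte Carlo is simple but slow to reach high precision.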
Whenever you see “expected loss,” “marginal likelihood,” “average reward,” or “area under the curve,” translate it mentally as an integral we will probably approximate by sampling. That reframing demystifies most of probabilistic ML.
2.7 — Numerical methods & stability
Calculus on paper assumes infinite precision. Computers do not have it — they store numbers in floating point, with a fixed number of bits. Knowing how derivatives and integrals are actually computed, and how they fail, is the difference between a model that trains and one that silently fills with NaN.
Finite differences. The crudest way to get a derivative numerically is to use the limit definition with a small but nonzero \(h\): \(f'(x) \approx (f(x+h) - f(x))/h\). The central difference \((f(x+h) - f(x-h))/(2h)\) is more accurate: its truncation error shrinks like \(h^2\) rather than \(h\). This is mostly used for gradient checking, that is, verifying that hand-written or autodiff gradients are correct.
There is a tension in picking \(h\): too large and the approximation is inaccurate (you are not close to the limit); too small and catastrophic cancellation wrecks you — subtracting two nearly equal floating-point numbers loses most of the significant digits.
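The tension is easy to observe by sweeping \(h\) over many orders of magnitude. The sketch below differentiates \(e^x\) at \(x = 1\) (where the exact derivative is \(e\)) with both the forward and the central difference; the test function and the list of step sizes are illustrative assumptions.

```python
import numpy as np

def forward_diff(f, x, h):
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h):
    return (f(x + h) - f(x - h)) / (2 * h)

x0, exact = 1.0, np.exp(1.0)        # d/dx e^x = e^x
errors = {}
for h in (1e-1, 1e-3, 1e-5, 1e-8, 1e-12, 1e-15):
    errors[h] = (abs(forward_diff(np.exp, x0, h) - exact),
                 abs(central_diff(np.exp, x0, h) - exact))
    print(f"h={h:.0e}  forward={errors[h][0]:.2e}  central={errors[h][1]:.2e}")
```

The error first falls as \(h\) shrinks (less truncation error), then rises again once cancellation dominates, so the best \(h\) sits somewhere in the middle rather than at the smallest value you can type.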
flowchart TD
A["Need a derivative numerically"] --> B{How?}
B -->|"finite differences"| C["easy, but pick h carefully<br/>used for gradient checking"]
B -->|"symbolic"| D["exact formula, but<br/>expression explodes in size"]
B -->|"automatic diff (autodiff)"| E["exact to machine precision,<br/>cheap — powers all DL frameworks"]
Autodiff vs symbolic. There are three ways to get derivatives in a program, and they trade off accuracy against cost:
| Method | What it does | Cost / problem |
|---|---|---|
| Symbolic | Manipulates formulas algebraically (like by hand) | Expressions blow up (“expression swell”) |
| Numerical (finite diff) | Plugs small \(h\) into the limit | Approximate; cancellation error |
| Automatic (autodiff) | Applies the chain rule to elementary ops as the code runs | Exact, cheap — the foundation of PyTorch/JAX |
Automatic differentiation is the winner for deep learning. It records the elementary operations of the forward pass and applies the chain rule mechanically, giving derivatives correct to machine precision at roughly the cost of one extra pass. It is neither a symbolic formula nor an approximation — it is exact arithmetic on the actual computational graph (this is the backprop of section 2.2, generalized).
Forward vs reverse mode. Autodiff comes in two flavors. The simplest way to see the difference: the chain rule is a chain of multiplications, and you can do them starting from the input end or the output end. Each sweep handles one item at the end you start from (one input in forward mode, one output in reverse mode), so you want to start from whichever end has fewer things.
- Forward mode starts at an input and carries its derivative forward through every step, one input at a time. One sweep tells you how all outputs react to that one input. Cheap when there are few inputs and many outputs.
- Reverse mode does one normal forward pass to record the steps, then sweeps backward from the output. One sweep tells you how that one output reacts to all inputs. Cheap when there are many inputs and one output.
Deep learning is the second case to the extreme: millions of parameters in, one scalar loss out. So you run reverse mode — one backward sweep hands you the gradient for every parameter at once. Reverse mode is backpropagation.
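Forward mode is small enough to build by hand, which makes the "one sweep per input" cost visible. The sketch below uses dual numbers (a value carried together with its derivative); the `Dual` class is a hypothetical minimal implementation supporting only `+` and `*`, not how any particular framework does it.

```python
class Dual:
    """A value paired with its derivative with respect to one chosen input."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)      # sum rule
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.dot * other.val + self.val * other.dot)     # product rule
    __rmul__ = __mul__

def f(x, y):
    return x * x * y + 3 * y    # f(x, y) = x^2 y + 3y from section 2.2

# one forward sweep per input: seed dot = 1 on the input being differentiated
df_dx = f(Dual(2.0, 1.0), Dual(1.0, 0.0)).dot
df_dy = f(Dual(2.0, 0.0), Dual(1.0, 1.0)).dot
print(df_dx, df_dy)   # 4.0 7.0
```

Two inputs needed two sweeps. With millions of parameters that would mean millions of sweeps, which is the practical argument for reverse mode when there is a single scalar loss.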
# reverse-mode autodiff: one scalar loss, gradients for all inputs at once
import torch
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (x ** 2).sum() # one forward pass builds the graph
loss.backward() # one backward sweep fills every gradient
print(x.grad) # tensor([2., 4., 6.]) = d(sum x^2)/dx
Floating-point hazards: overflow and underflow. A 32-bit float maxes out around \(3.4 \times 10^{38}\). Compute \(e^{1000}\) and you get inf (overflow); compute \(e^{-1000}\) and you get \(0\) (underflow). Both are disasters when they feed into a division or a logarithm.
The log-sum-exp trick. This hazard bites constantly in softmax and cross-entropy, which compute \(\log \sum_i e^{x_i}\). If any \(x_i\) is large, \(e^{x_i}\) overflows. The fix exploits a clean algebraic identity: subtract the max \(m = \max_i x_i\) before exponentiating, then add it back:
\[\log \sum_i e^{x_i} = m + \log \sum_i e^{x_i - m}\]
In words: pull the biggest exponent out front so the largest thing you ever exponentiate is \(e^0 = 1\); the subtraction inside and the addition outside cancel, so the answer does not change — only the overflow does.
Also written: \(\operatorname{LSE}(x_1,\dots,x_n) = m + \log\!\big(\sum_i e^{x_i - m}\big)\) with \(m = \max_i x_i\) — the “shift-by-max” form used inside every numerically stable softmax.
Now the largest exponent is \(e^{0} = 1\) — no overflow — and the result is mathematically identical.
import numpy as np
def logsumexp(x):
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m))) # shift by max → stable
x = np.array([1000.0, 1001.0, 1002.0])
# np.log(np.sum(np.exp(x))) -> inf (overflow!)
print(logsumexp(x)) # 1002.407... correct and finite
In practice you reach for the library version rather than rolling your own: scipy.special.logsumexp, or torch.logsumexp / F.log_softmax, all of which apply this shift internally:
from scipy.special import logsumexp as lse
import numpy as np
print(lse(np.array([1000.0, 1001.0, 1002.0]))) # 1002.407..., no overflow
Worked example. Naively, \(e^{1000}\) is already inf, so the naive log-sum-exp returns inf. The stabilized version subtracts \(m = 1002\), exponentiates to get \(\{e^{-2}, e^{-1}, e^{0}\} = \{0.135, 0.368, 1\}\), sums to \(1.503\), takes the log (\(0.407\)), and adds \(1002\) back, giving \(1002.407\): finite and correct. Every serious softmax implementation does this internally.
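The same shift gives a softmax that survives large logits. The sketch below is a minimal NumPy version for a 1-D input, written for illustration rather than as a replacement for the library routines; the function names mirror common usage but are defined locally here.

```python
import numpy as np

def log_softmax(x):
    shifted = x - np.max(x)                            # largest entry becomes 0
    return shifted - np.log(np.sum(np.exp(shifted)))   # x_i - logsumexp(x)

def softmax(x):
    return np.exp(log_softmax(x))

x = np.array([1000.0, 1001.0, 1002.0])
p = softmax(x)
print(p)          # finite probabilities, no inf or nan
print(p.sum())    # sums to 1 up to rounding
```

Shifting every logit by the same constant leaves the probabilities unchanged, so `softmax([1000, 1001, 1002])` matches `softmax([0, 1, 2])`.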
NaN in your loss is almost always a numerical-stability bug, not a math error: a \(\log(0)\), a \(0/0\), an \(e^{\text{big}}\), or a \(\sqrt{\text{negative}}\). Reach for the stable primitive (logsumexp, log1p, expm1, adding a small \(\epsilon\) inside logs and denominators) before you suspect your model.
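A few of those stable primitives in action. The sketch compares the naive and stable forms for a value too small to survive being added to \(1\), and shows the clamp-before-log pattern; the helper `safe_log` and its `eps` value are illustrative assumptions, not a recommended default.

```python
import numpy as np

tiny = 1e-18
print(np.log(1.0 + tiny), np.log1p(tiny))     # naive result is 0.0: 1 + tiny rounds to 1
print(np.exp(tiny) - 1.0, np.expm1(tiny))     # same story for e^x - 1

def safe_log(p, eps=1e-12):
    """Clamp away from zero before the log (eps is an illustrative choice)."""
    return np.log(np.maximum(p, eps))

print(safe_log(np.array([0.0, 0.5, 1.0])))    # finite everywhere, no -inf
```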
2.8 — Quick reference
| Term / formula | Meaning in one line | When / why it matters |
|---|---|---|
| Derivative \(f'(x)\) | Slope: rate of change of output per unit input | Sign = uphill direction, size = steepness; drives every optimizer step |
| Power rule \(x^n \to nx^{n-1}\) | Differentiate polynomials term by term | The everyday algebra of hand-derivatives |
| Chain rule \(\frac{dz}{dx}=\frac{dz}{dg}\frac{dg}{dx}\) | Rates multiply down a composition | This is backpropagation through layers |
| Partial \(\partial f/\partial x\) | Derivative w.r.t. one variable, others frozen | How a single weight affects the loss |
| Multivariate chain rule | Sum contributions over every path | When a tensor feeds several places in the graph |
| Gradient \(\nabla f\) | Vector of all first-order partials | Points uphill; step against it to minimize |
| Jacobian \(J\) (\(m\times n\)) | Each output’s gradient stacked | Differentiating vector→vector layers (softmax) |
| Hessian \(H\) (\(n\times n\)) | Second derivatives = curvature | Bowl vs saddle; \(n^2\) cost keeps DL first-order |
| Directional deriv. \(\nabla f\cdot\mathbf{u}\) | Rate along a chosen heading | Proves \(-\nabla f\) is steepest descent |
| Taylor series | Rebuild \(f\) near \(a\) from its derivatives | Gradient descent = 1st order, Newton = 2nd order |
| Integral \(\int f\,dx\) | Signed area / accumulation | Becomes an expectation all over ML |
| Expectation \(\mathbb{E}[g]=\int g\,p\,dx\) | Probability-weighted average | Expected loss, marginal likelihood, mean reward |
| Monte Carlo | Average over random samples ≈ integral | Estimates intractable integrals (RL, VI) |
| Autodiff (reverse mode) | Chain rule on elementary ops, output→input | Exact, cheap gradients for all params = backprop |
| Finite difference \((f(x+h)-f(x-h))/(2h)\) | Numerical derivative with small \(h\) | Gradient checking; beware cancellation error |
| Log-sum-exp \(m+\log\sum e^{x_i-m}\) | Shift by max before exponentiating | Stable softmax/cross-entropy; dodges overflow |
2.9 — Key takeaways
- A derivative is a slope: the signed rate of change of output with respect to input. Sign = direction uphill, magnitude = steepness.
- Partial derivatives freeze all but one variable; the chain rule multiplies rates through a composition — this is backpropagation.
- The gradient is the vector of partials and points uphill; the Jacobian handles vector→vector maps; the Hessian captures curvature but costs \(n^2\), so deep learning stays first-order.
- The directional derivative \(\nabla f \cdot \mathbf{u}\) shows that among all directions, \(-\nabla f\) is provably the steepest descent — the rigorous reason gradient descent works.
- Taylor series approximate a function locally from its derivatives; gradient descent is a first-order model, Newton’s method a second-order one — and both are untrustworthy far from the expansion point.
- Integration in ML almost always means an expectation, usually intractable and estimated by Monte Carlo sampling.
- Real computation lives in floating point: prefer autodiff (reverse mode = backprop) over finite differences or symbolic math, and use the log-sum-exp trick and friends to dodge overflow/underflow and NaN.
2.10 — See also
- Chapter 01 — Linear Algebra — the vectors and matrices that gradients, Jacobians, and Hessians are built from.
- Chapter 03 — Optimization — gradient descent, Newton’s method, and learning rates that put these derivatives to work.
- Chapter 04 — Probability & Statistics — expectations, densities, and the distributions the integrals here describe.
- Chapter 14 — Neural Networks (Core) — backpropagation as the chain rule applied across layers.
- Chapter 18 — Generative Models — Monte Carlo expectations and variational inference in full.
↪ The thread continues → Chapter 03 · 📉 Optimization
Calculus tells you which way is downhill; Optimization is the disciplined art of actually walking down — the algorithms (SGD, Adam, and their kin) that every model on earth uses to learn.