Kader Mohideen
  • About
  • Blog
  • Projects
  • Health
  • AI & ML Encyclopedia
  • Extra
    • Interview Guide
    • AI Interview Prep
    • Book References
    • Quest for AGI
    • AI Papers
    • Lupus

In this chapter

  • 3.1 — Gradient descent
  • 3.2 — Stochastic gradient descent (SGD)
  • 3.3 — Momentum & Nesterov accelerated gradient
  • 3.4 — AdaGrad / RMSProp / AdaDelta
  • 3.5 — Adam / Adamax / Nadam
  • 3.6 — Learning rate & schedules (warmup, decay, cyclical)
  • 3.7 — Newton & quasi-Newton methods (BFGS / L-BFGS)
  • 3.8 — Conjugate gradient & line search
  • 3.9 — Convex optimization & duality
  • 3.10 — Lagrange multipliers & the KKT conditions
  • 3.11 — Coordinate descent
  • 3.12 — Constrained optimization & linear programming
  • 3.13 — Evolutionary / genetic optimization
  • 3.14 — Subgradient & proximal methods
  • 3.15 — Gradient checking & numerical gradients
  • 3.16 — Quick reference
  • 3.17 — Key takeaways
  • 3.18 — See also

Chapter 03 — 📉 Optimization

📖 All chapters  |  ← 02 · ∂ Calculus & Differentiation  |  04 · 🎲 Probability & Statistics →

📚 Jump to any chapter

🧮 Mathematical Foundations

  • 01 · 🧮 Linear Algebra
  • 02 · ∂ Calculus & Differentiation
  • 03 · 📉 Optimization
  • 04 · 🎲 Probability & Statistics

🧭 The ML Workflow

  • 05 · 🌐 AI, ML & the Learning Process
  • 06 · 🧹 Data Preprocessing
  • 07 · 🗜️ Dimensionality Reduction

🧩 Classical Machine Learning

  • 08 · 📈 Regression
  • 09 · 📐 Classification Algorithms
  • 10 · 🌳 Ensemble Methods
  • 11 · 🔮 Clustering & Unsupervised Learning
  • 12 · 🎯 Model Evaluation & Tuning

🎲 Probabilistic Models

  • 13 · 🕸️ Probabilistic Graphical Models

🧠 Deep Learning

  • 14 · 🧠 Neural Networks (Core)
  • 15 · 🖼️ Convolutional Neural Networks
  • 16 · 🔁 Recurrent & Sequence Models
  • 17 · ⚡ Attention & Transformers
  • 18 · 🎨 Generative Models

🗣️ Applied AI: Vision, Language, Audio & Time

  • 19 · 👁️ Computer Vision
  • 20 · 💬 Natural Language Processing
  • 21 · 🔊 Speech & Audio Processing
  • 22 · ⏳ Time Series & Forecasting
  • 23 · 📚 Large Language Models
  • 24 · 🌈 Multimodal AI

🕹️ Reinforcement Learning

  • 25 · 🕹️ Reinforcement Learning

🛠️ Applied ML Systems & Industries

  • 26 · 🛒 Recommender Systems
  • 27 · 🚨 Anomaly & Fraud Detection
  • 28 · 🏦 ML Across Industries

🚀 Production, Tooling & Infrastructure

  • 29 · 🔧 MLOps & Deployment
  • 30 · 🚀 AI Infrastructure & Efficient Inference
  • 31 · 🧰 Tools & Frameworks

📚 Classical & Symbolic AI

  • 32 · 🧭 Search & Problem Solving
  • 33 · 📖 Knowledge Representation & Reasoning
  • 34 · 🗺️ Planning, Constraint Satisfaction & Game Playing
  • 35 · 🧬 Evolutionary Computation & Metaheuristics

⚖️ Responsible AI & Frontier

  • 36 · 🔍 Explainable AI & Interpretability
  • 37 · 🧷 Causal Inference
  • 38 · ⚖️ AI Ethics, Fairness & Safety
  • 39 · 🌠 Frontier & Emerging Directions

🎓 Advanced & Specialized Topics

  • 40 · 🔗 Graph Machine Learning
  • 41 · 🤖 Robotics & Autonomy
  • 42 · 📐 Learning Theory
  • 43 · 🔎 Information Retrieval & Data Mining
  • 44 · 🏗️ LLM Systems: Building LLMs from Scratch

🎚️ Post-Training & Fine-Tuning

  • 45 · 🎚️ Post-Training I — Transfer, Fine-Tuning & PEFT
  • 46 · 🏅 Post-Training II — Alignment & Evaluation

🚢 Model Serving & Deployment

  • 47 · 🚢 Model Serving & Deployment in Production

Optimization is the engine room of machine learning. A model “learns” by searching for the parameters that make a loss function as small as possible, and every algorithm in this book — from linear regression to a trillion-parameter transformer — is, underneath, a procedure for minimizing some function. This chapter is about how that search is actually carried out: the update rules, the tricks that make them fast, and the classical theory that tells us when we can trust the answer.

🧭 In context: Mathematical Foundations · how models actually learn their parameters · the one key idea — follow the negative gradient downhill, and add cleverness (momentum, adaptivity, schedules) to get there faster and more reliably.

💡 Remember this: Almost all of machine learning is one move repeated — take the negative gradient of a loss and step against it; everything else (momentum, adaptivity, schedules, curvature, constraints) is a refinement of how big and in which direction that step should be.

3.1 — Gradient descent

Imagine you are standing on a foggy hillside and want to reach the lowest point in the valley. You cannot see far, but you can feel the slope under your feet. The greedy strategy is simple: take a step in the steepest downhill direction, then feel the slope again and repeat. That is gradient descent (GD), the foundational optimization algorithm of machine learning.

Formally, we want to minimize a loss function \(L(\theta)\) over parameters \(\theta\). The gradient \(\nabla L(\theta)\) is the vector of partial derivatives; it points in the direction of steepest ascent. So to go downhill we step in the opposite direction:

\[\theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t)\]

In words: the next parameter value is the current one nudged a small amount \(\eta\) against the slope, because the gradient points uphill and we want to go down.

Also written: \(\theta \leftarrow \theta - \eta\, g\) where \(g = \partial L/\partial \theta\) (assignment form), or componentwise \(\theta_{t+1}^{(j)} = \theta_t^{(j)} - \eta\,\frac{\partial L}{\partial \theta^{(j)}}\).

The scalar \(\eta > 0\) is the learning rate (or step size): how big a step we take. Too small and we crawl; too large and we overshoot and diverge.

Worked example. Minimize \(L(\theta) = \theta^2\), whose gradient is \(L'(\theta) = 2\theta\) with a minimum at \(\theta^\* = 0\). Start at \(\theta_0 = 4\) with \(\eta = 0.1\):

\[\theta_1 = 4 - 0.1\cdot 8 = 3.2,\quad \theta_2 = 3.2 - 0.1\cdot 6.4 = 2.56,\quad \theta_3 = 2.048,\ \dots\]

Each step multiplies \(\theta\) by \((1 - 2\eta) = 0.8\), so we converge geometrically to \(0\). Notice what happens if \(\eta = 1.1\): the multiplier is \(1 - 2.2 = -1.2\), and \(|\theta|\) grows every step — divergence. The same simple loss is either tamed or blown up purely by the step size.

The loss surface below shows the path: each step lands lower on the bowl, with steps shrinking as the slope flattens near the bottom. Watch the ball ease downhill and slow as the slope flattens — exactly what the shrinking \(0.8\times\) multiplier does.

θ₀ (start) θ* L(θ)
import numpy as np
def gd(grad, theta, lr=0.1, steps=20):
    for _ in range(steps):
        theta = theta - lr * grad(theta)   # one downhill step
    return theta
print(gd(lambda t: 2*t, 4.0))   # -> ~0.046, heading to 0

For a convex bowl GD is guaranteed to reach the global minimum given a small enough \(\eta\). For the bumpy, non-convex surfaces of deep learning it finds a local minimum — which, empirically, is usually good enough.

A useful theoretical anchor: if \(L\) is convex and \(\beta\)-smooth (its gradient does not change faster than \(\beta\)), then any fixed step size \(\eta \le 1/\beta\) guarantees the loss never increases and converges. This is why the maximum safe learning rate is tied to the curvature of the loss — the steeper the curvature, the smaller the safe step.

Tip

Rule of thumb: if your loss oscillates or explodes, your learning rate is too high — cut it by 3–10×. If it decreases painfully slowly in a smooth line, it is too low.

# PyTorch: the same downhill step, but autograd computes the gradient for you
import torch
theta = torch.tensor([4.0], requires_grad=True)
opt = torch.optim.SGD([theta], lr=0.1)
for _ in range(20):
    opt.zero_grad()
    loss = (theta**2).sum()   # L(theta) = theta^2
    loss.backward()           # fills theta.grad with dL/dtheta = 2*theta
    opt.step()                # theta <- theta - lr * theta.grad
print(theta.item())           # -> heading to 0

🎮 Try it — Objective Functions

🎮 Try it — Loss Functions

🎮 Try it — Batch Gradient Descent

3.2 — Stochastic gradient descent (SGD)

Plain gradient descent computes the gradient over the entire dataset before taking a single step. With a million training examples, that is a million gradient evaluations per step — agonizingly slow. Stochastic gradient descent (SGD) makes a bargain: instead of the exact gradient, use a cheap, noisy estimate from a small random sample, and take many more steps.

If the loss is an average over examples, \(L(\theta) = \frac{1}{N}\sum_{i=1}^N \ell_i(\theta)\), then a single random example \(i\) gives an unbiased estimate of the full gradient: \(\mathbb{E}_i[\nabla \ell_i(\theta)] = \nabla L(\theta)\). The update is

\[\theta_{t+1} = \theta_t - \eta \, \nabla \ell_{i_t}(\theta_t)\]

In words: instead of averaging the slope over all examples, look at just one randomly-picked example, trust its slope, and step. On average those single-example slopes point the same way as the true one.

Also written: \(\theta \leftarrow \theta - \eta\, g_{i}\) with \(g_i = \nabla \ell_i(\theta)\) and \(i \sim \text{Uniform}(1,\dots,N)\).

In practice we use a mini-batch of \(B\) examples (say 32–512) — a middle ground that smooths the noise and exploits vectorized hardware:

\[\theta_{t+1} = \theta_t - \eta \,\frac{1}{B}\sum_{i\in \mathcal{B}_t} \nabla \ell_i(\theta_t)\]

In words: average the slope over a small handful of examples rather than one or all; the average is steadier than a single example but far cheaper than the whole dataset.

Also written: \(\theta \leftarrow \theta - \eta\,\bar g_{\mathcal{B}}\) where \(\bar g_{\mathcal{B}} = \frac{1}{B}\sum_{i\in\mathcal{B}} g_i\) is the batch-mean gradient.

Why the noise helps. The randomness is not just a cost — it is a feature. The jitter lets the optimizer escape shallow local minima and saddle points that would trap exact GD, and it acts as a mild regularizer. The picture below contrasts the smooth, deliberate GD path with the jittery SGD path; both reach the basin, but SGD does it with far cheaper steps.

GD (smooth) SGD (noisy)

A mini-batch is just a small handful of examples drawn from the dataset, averaged into one cheap gradient, then fed to the update. The pipeline below shows examples streaming in, getting bundled into a batch, and producing one step:

dataset batch of B avg gradient θ←
def sgd_epoch(grad_i, theta, X, y, lr=0.01, B=32):
    idx = np.random.permutation(len(X))      # shuffle each epoch
    for s in range(0, len(X), B):
        b = idx[s:s+B]                        # a mini-batch
        g = grad_i(theta, X[b], y[b]).mean(0) # avg gradient over batch
        theta = theta - lr * g
    return theta

The trade-off is variance: pure SGD (\(B=1\)) is cheap but jumpy, requiring a decaying learning rate to settle; full-batch GD is stable but slow. Mini-batch SGD is the workhorse default of all of deep learning.

Warning

Common mistake: forgetting to shuffle the data between epochs. If batches always contain the same examples in the same order, the gradient noise becomes correlated and the model can learn the ordering rather than the signal.

# scikit-learn: SGDClassifier is mini-batch SGD under the hood for linear models
from sklearn.linear_model import SGDClassifier
clf = SGDClassifier(loss="log_loss", learning_rate="optimal", max_iter=50)
clf.fit(X_train, y_train)   # logistic regression trained by SGD

🎮 Try it — SGD

🎮 Try it — Mini-Batch SGD

3.3 — Momentum & Nesterov accelerated gradient

SGD in a long, narrow ravine is frustrating: the gradient points mostly across the valley (steep walls) rather than along it (gentle floor), so the optimizer zig-zags slowly toward the minimum. Momentum fixes this by giving the optimizer inertia, like a heavy ball rolling downhill: it accumulates velocity in directions of consistent gradient and cancels out the oscillating components.

We keep a velocity vector \(v\) that is an exponentially decaying running average of past gradients:

\[v_{t+1} = \beta v_t + \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta\, v_{t+1}\]

In words: keep a running “memory” of which way you have been heading (the velocity), refresh it with the new slope, and step along that memory rather than along the raw slope. Steady directions build up speed; jittery ones cancel.

Also written: as a single line, \(\theta_{t+1} = \theta_t - \eta(\beta v_t + g_t)\); some libraries fold \(\eta\) into the velocity, \(v_{t+1} = \beta v_t + \eta g_t\), then \(\theta_{t+1} = \theta_t - v_{t+1}\).

The momentum coefficient \(\beta\) (typically \(0.9\)) controls how much past velocity is retained. Consistent gradients add up, so along the valley floor the effective step grows; oscillating gradients across the walls cancel, so the zig-zag dampens. A rough intuition: with \(\beta=0.9\) the velocity averages roughly the last \(1/(1-\beta)=10\) gradients, so the effective step along a steady direction is up to ~10× larger than a single gradient step.

Nesterov accelerated gradient (NAG) is a sharper variant. Instead of measuring the gradient where we currently stand, it first takes the momentum step to a lookahead point \(\theta_t - \eta\beta v_t\) and measures the gradient there. This “look before you leap” correction lets it respond to upcoming curvature and brakes earlier when overshooting:

\[v_{t+1} = \beta v_t + \nabla L(\theta_t - \eta\beta v_t), \qquad \theta_{t+1} = \theta_t - \eta v_{t+1}\]

In words: peek ahead to where momentum is about to carry you, measure the slope there instead of where you are now, and use that more up-to-date slope — so you can hit the brakes before overshooting.

Also written: equivalently, with the substitution \(\tilde\theta_t = \theta_t - \eta\beta v_t\), NAG is often implemented as a “look-ahead” gradient evaluated at \(\tilde\theta_t\).

plain SGD: zig-zags +momentum: glides
def momentum(grad, theta, lr=0.01, beta=0.9, steps=100):
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta*v + grad(theta)     # accumulate velocity
        theta = theta - lr*v
    return theta

The cost is one extra parameter and a velocity buffer the size of \(\theta\). The benefit is often a 5–10× speedup in ill-conditioned valleys, which is why momentum (and its descendants) underlie nearly every modern optimizer.

# PyTorch: momentum and Nesterov are flags on the standard SGD optimizer
import torch
opt   = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
opt_n = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

🎮 Try it — Momentum

🎮 Try it — Nesterov Momentum

3.4 — AdaGrad / RMSProp / AdaDelta

So far every parameter shares one global learning rate. But features differ wildly in scale and frequency: a rare feature’s weight might need big steps, a common one’s small steps. Adaptive learning-rate methods give each parameter its own effective step size, shrinking it for parameters that have already received large gradients.

AdaGrad accumulates the sum of squared gradients per parameter and divides the step by its square root:

\[G_t = G_{t-1} + g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t}+\epsilon}\, g_t\]

In words: keep a per-parameter running total of how big its gradients have been; the bigger that total, the more you shrink that parameter’s step. Busy parameters slow down, quiet ones keep moving.

Also written: elementwise, \(\theta^{(j)}_{t+1} = \theta^{(j)}_t - \eta\,g^{(j)}_t \big/ \big(\sqrt{\textstyle\sum_{s\le t}(g^{(j)}_s)^2}+\epsilon\big)\).

(all operations elementwise; \(\epsilon\approx 10^{-8}\) avoids division by zero). Frequently-updated parameters accumulate a large \(G\) and slow down; rarely-seen ones keep large steps. The fatal flaw: \(G_t\) only ever grows, so the effective learning rate decays monotonically to zero and learning stalls.

RMSProp fixes this by replacing the running sum with an exponentially-decaying average, so old gradients are forgotten:

\[E[g^2]_t = \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t}+\epsilon}\, g_t\]

In words: track a recent average of squared gradients instead of an all-time sum, so the brake forgets stale history and never grinds to a halt.

Also written: \(\theta \leftarrow \theta - \eta\, g / (\text{RMS}(g)+\epsilon)\), where \(\text{RMS}(g)=\sqrt{E[g^2]}\) is the root-mean-square of recent gradients — hence the name.

with decay \(\rho\approx 0.9\). The denominator now tracks recent gradient magnitude, so the step size adapts without dying. AdaDelta goes further and removes the global \(\eta\) entirely by also tracking a running average of squared updates, making the method scale-free.

Worked example (AdaGrad shrinks the step). Take one parameter with gradients \(g_1=1,\ g_2=1,\ g_3=1\) and \(\eta=0.1\). Then \(G_1=1\) so the step is \(0.1/\sqrt{1}=0.1\); \(G_2=2\) gives \(0.1/\sqrt{2}\approx 0.071\); \(G_3=3\) gives \(0.1/\sqrt{3}\approx 0.058\). Even with identical gradients the effective step keeps shrinking — helpful at first, fatal later as it crawls to zero. RMSProp with \(\rho=0.9\) instead settles \(E[g^2]\to 1\), holding the step near \(0.1\).

Method Accumulator Dies out? Needs global \(\eta\)?
AdaGrad sum of \(g^2\) (grows forever) yes yes
RMSProp EMA of \(g^2\) no yes
AdaDelta EMA of \(g^2\) and of \(\Delta\theta^2\) no no
def rmsprop(grad, theta, lr=0.01, rho=0.9, eps=1e-8, steps=100):
    Eg = np.zeros_like(theta)
    for _ in range(steps):
        g = grad(theta)
        Eg = rho*Eg + (1-rho)*g**2          # forgetful running avg
        theta = theta - lr*g/(np.sqrt(Eg)+eps)
    return theta
Tip

Intuition: the denominator \(\sqrt{E[g^2]}\) is a per-parameter “volatility” estimate. Noisy, large-gradient directions get damped; quiet directions get amplified — an automatic, dimension-by-dimension learning-rate tuner.

# PyTorch built-ins for the same family
import torch
opt_ada  = torch.optim.Adagrad(model.parameters(), lr=0.01)
opt_rms  = torch.optim.RMSprop(model.parameters(), lr=0.01, alpha=0.9)  # alpha == rho
opt_adel = torch.optim.Adadelta(model.parameters())                     # no lr needed

🎮 Try it — AdaGrad

🎮 Try it — RMSProp

3.5 — Adam / Adamax / Nadam

Adam (Adaptive Moment Estimation) is the default optimizer for most deep learning today. Its idea is simply to combine the two best tricks: the momentum of Section 3.3 (a running average of gradients, the first moment) and the per-parameter adaptivity of RMSProp (a running average of squared gradients, the second moment).

It maintains two EMAs:

\[m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2\]

In words: keep two running averages — one of the gradient itself (which way am I heading?) and one of its square (how big and noisy is it?). The first gives momentum; the second gives a per-parameter brake.

Also written: \(m_t = \text{EMA}_{\beta_1}(g)\) and \(v_t = \text{EMA}_{\beta_2}(g^2)\), the exponentially-weighted means of \(g\) and \(g^2\).

Because \(m\) and \(v\) start at zero, they are biased toward zero early on. Adam applies a bias correction to undo this:

\[\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat v_t}+\epsilon}\,\hat m_t\]

In words: early on the averages start near zero and underestimate the truth, so divide each by how much “warm-up” has happened (\(1-\beta^t\)) to rescale it back up, then step like momentum but with a per-parameter brake.

Also written: the effective per-step rule is \(\theta \leftarrow \theta - \eta\,\hat m_t / (\sqrt{\hat v_t}+\epsilon)\), i.e. a momentum step divided by a recent-volatility estimate.

Typical defaults — \(\beta_1=0.9\), \(\beta_2=0.999\), \(\eta=10^{-3}\) — work remarkably often, which is why Adam is the go-to first try.

Worked example (bias correction). At step \(t=1\) with \(\beta_1=0.9\) and a first gradient \(g_1=0.2\): \(m_1 = 0.9\cdot 0 + 0.1\cdot 0.2 = 0.02\). That badly underestimates the true gradient. Dividing by \(1-\beta_1^1 = 0.1\) gives \(\hat m_1 = 0.02/0.1 = 0.2\) — the correct value restored.

Variants. Adamax replaces the \(L^2\) norm of past gradients (the \(\sqrt{v}\) term) with an \(L^\infty\) norm (a running max), which can be more stable. Nadam folds Nesterov’s lookahead into Adam’s first moment, getting NAG’s earlier braking on top of Adam’s adaptivity.

def adam(grad, theta, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, steps=200):
    m = np.zeros_like(theta); v = np.zeros_like(theta)
    for t in range(1, steps+1):
        g = grad(theta)
        m = b1*m + (1-b1)*g            # 1st moment (momentum)
        v = b2*v + (1-b2)*g**2         # 2nd moment (adaptivity)
        mh = m/(1-b1**t); vh = v/(1-b2**t)   # bias correction
        theta = theta - lr*mh/(np.sqrt(vh)+eps)
    return theta
Warning

Common mistake: assuming Adam always beats SGD. For many vision tasks, well-tuned SGD-with-momentum generalizes better than Adam, even if Adam trains faster. A frequent fix is AdamW, which decouples weight decay from the adaptive step and often closes the gap.

# PyTorch: Adam and its sibling AdamW (decoupled weight decay) are one-liners
import torch
opt  = torch.optim.Adam(model.parameters(),  lr=1e-3, betas=(0.9, 0.999))
optw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)  # transformer default

🎮 Try it — Adam

🎮 Try it — AdamW

Which optimizer should I use? A quick chooser across the gradient-based methods of Sections 3.1–3.5:

flowchart TD
    A[Pick an optimizer] --> B{Smooth, deterministic,<br/>full-batch problem?}
    B -- yes --> C[L-BFGS / Newton<br/>see 3.7]
    B -- no --> D{Training a<br/>transformer / LLM?}
    D -- yes --> E[AdamW + warmup<br/>then cosine decay]
    D -- no --> F{Vision / CNN where<br/>you can tune carefully?}
    F -- yes --> G[SGD + momentum<br/>often best generalization]
    F -- no --> H[Adam<br/>safe default first try]

3.6 — Learning rate & schedules (warmup, decay, cyclical)

The learning rate is the single most important hyperparameter, and it rarely should stay constant. Early in training, large steps make fast progress; late in training, small steps are needed to settle precisely into a minimum. A learning-rate schedule \(\eta_t\) varies the step size over time to get the best of both.

Decay schedules shrink \(\eta\) as training proceeds. Common forms: step decay (multiply by \(0.1\) every \(K\) epochs), exponential decay (\(\eta_t = \eta_0 e^{-\lambda t}\)), and the popular cosine annealing, which eases the rate smoothly down to near zero along a cosine curve:

\[\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_0-\eta_{\min})\left(1+\cos\frac{\pi t}{T}\right)\]

In words: glide the learning rate from its starting value down to almost nothing along the smooth shape of a cosine curve — fast at first, gentle near the end, with no abrupt drops.

Also written: with \(\eta_{\min}=0\) this simplifies to \(\eta_t = \tfrac{\eta_0}{2}\left(1+\cos\frac{\pi t}{T}\right)\), a half-cosine from \(\eta_0\) down to \(0\) over \(T\) steps.

Warmup does the opposite at the very start: it ramps \(\eta\) up linearly from near zero over the first few hundred steps. Why? Early gradients (especially in transformers, where Adam’s second-moment estimate is noisy at step 1) can be wild; a gentle warmup prevents an early blow-up. Warmup-then-cosine-decay is the standard recipe for training large models.

Cyclical learning rates oscillate \(\eta\) up and down repeatedly between bounds. The periodic increases help the optimizer hop out of sharp minima and explore, often improving generalization, and enable “snapshot ensembles” — saving the model at each cycle’s low point.

η step warmup → cosine decay cyclical

The dot below rides the warmup-then-cosine-decay curve: it climbs fast during warmup, then glides smoothly down toward zero — the shape every large-model recipe uses.

η step warmup cosine decay →
import math
def cosine_warmup(step, base=1e-3, warmup=500, total=10000):
    if step < warmup:
        return base * step / warmup                       # linear ramp up
    p = (step - warmup) / (total - warmup)                # 0..1
    return 0.5 * base * (1 + math.cos(math.pi * p))       # cosine to ~0
Tip

Rule of thumb (LR range test): train for a few hundred steps while increasing \(\eta\) exponentially and plot loss vs \(\eta\). Pick a base rate a notch below where the loss starts to climb — a fast, reliable way to find a good starting point.

# PyTorch: schedulers wrap any optimizer; step() them once per iteration/epoch
import torch
opt   = torch.optim.AdamW(model.parameters(), lr=1e-3)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=10000)
for step in range(10000):
    opt.zero_grad(); loss = compute_loss(); loss.backward()
    opt.step(); sched.step()      # decay the LR along a cosine curve

🎮 Try it — Learning Rate Scheduling

🎮 Try it — Early Stopping

🎮 Try it — Weight Decay

3.7 — Newton & quasi-Newton methods (BFGS / L-BFGS)

Gradient descent uses only the slope (first derivative). Newton’s method also uses the curvature (second derivative) to take a much smarter step. The intuition: approximate the loss near \(\theta\) by a parabola, then jump straight to that parabola’s minimum in one shot.

For a multivariate loss, the curvature is the Hessian \(H = \nabla^2 L\), the matrix of all second partial derivatives. Newton’s update is

\[\theta_{t+1} = \theta_t - H^{-1}\, \nabla L(\theta_t)\]

In words: instead of just stepping against the slope, first ask how the slope itself is bending (the curvature), and use that to rescale the step — big steps where the bowl is flat, small steps where it is sharply curved.

Also written: \(\theta \leftarrow \theta - H^{-1} g\); in 1-D this is the familiar Newton root-finding step \(\theta \leftarrow \theta - L'(\theta)/L''(\theta)\).

Multiplying by \(H^{-1}\) rescales and rotates the gradient to account for the curvature — so it sails straight down a stretched valley that would make plain GD zig-zag. Worked example: for \(L(\theta)=\tfrac12\theta^\top A\theta\) with \(A\) positive-definite, \(\nabla L = A\theta\) and \(H=A\), so \(\theta_1 = \theta_0 - A^{-1}(A\theta_0) = 0\) — Newton reaches the exact minimum of any quadratic in a single step.

The catch: the Hessian is \(n\times n\) for \(n\) parameters. For a model with a million weights that is \(10^{12}\) entries to form and invert — utterly infeasible. Quasi-Newton methods sidestep this by building up an approximation of \(H^{-1}\) from the sequence of gradients observed, never computing the true Hessian.

BFGS maintains a dense inverse-Hessian estimate, updated each step from the change in position and gradient. L-BFGS (limited-memory BFGS) stores only the last \(m\) (say 10–20) update vectors instead of the full matrix, making it practical for large problems. It is the optimizer of choice for smooth, deterministic, medium-scale problems — logistic regression, CRFs, many scientific fits — though its reliance on accurate full-batch gradients makes it a poor fit for noisy mini-batch deep learning.

Method Uses curvature? Memory Best for
Gradient descent no \(O(n)\) huge, noisy problems
Newton exact Hessian \(O(n^2)\) small, smooth problems
BFGS approx Hessian \(O(n^2)\) medium smooth problems
L-BFGS approx (last \(m\) steps) \(O(mn)\) large smooth, full-batch
# scipy: L-BFGS-B is the standard smooth-objective solver; pass loss and gradient
from scipy.optimize import minimize
res = minimize(loss_fn, x0, jac=grad_fn, method="L-BFGS-B")
print(res.x, res.fun)   # optimum and its loss value
Warning

Common mistake: reaching for L-BFGS on a deep network trained with mini-batches. Its curvature estimate assumes consistent gradients; mini-batch noise corrupts it, and it usually underperforms plain Adam there.

3.8 — Conjugate gradient & line search

Two classical ideas refine how we choose a direction and how far to step.

Line search answers “how far?”. Given a descent direction \(d_t\), instead of a fixed \(\eta\) we search along the ray \(\theta_t + \alpha d_t\) for a step length \(\alpha\) that decreases the loss enough. We rarely solve this exactly; instead we accept any \(\alpha\) satisfying the Wolfe conditions — the Armijo (sufficient decrease) condition ensures the loss drops proportionally to the step, and the curvature condition rejects steps that are too short. A common implementation is backtracking: start with a big \(\alpha\) and halve it until sufficient decrease holds.

The Armijo sufficient-decrease condition, the workhorse of practical line search, reads

\[L(\theta_t + \alpha d_t)\ \le\ L(\theta_t) + c_1\,\alpha\, \nabla L(\theta_t)^\top d_t,\qquad 0 < c_1 < 1\]

In words: accept a step only if it actually lowers the loss by at least a small fraction \(c_1\) of what the slope predicted it would — this rejects steps that are too greedy and barely help.

Also written: with \(\phi(\alpha)=L(\theta_t+\alpha d_t)\) and slope \(\phi'(0)=\nabla L^\top d_t<0\), the condition is \(\phi(\alpha)\le\phi(0)+c_1\alpha\,\phi'(0)\).

Worked example (backtracking). Minimize \(L=\theta^2\) at \(\theta_t=2\), so \(L=4\) and the descent direction is \(d_t=-L'=-4\); take \(c_1=0.1\). The predicted-decrease slope is \(\phi'(0)=L'(\theta_t)\,d_t = 4\cdot(-4) = -16\). Try \(\alpha=1\): the new point is \(2 + 1\cdot(-4) = -2\) with \(L=4\) — no decrease at all, so reject (the Armijo target was \(4 + 0.1\cdot1\cdot(-16)=2.4\), and \(4 > 2.4\)). Halve to \(\alpha=0.5\): new point \(2 - 2 = 0\), \(L=0\), target \(4 + 0.1\cdot0.5\cdot(-16)=3.2\), and \(0 \le 3.2\) — accept. Backtracking found the good step in two tries, no derivative of the step size required.

Conjugate gradient (CG) answers “which direction?”. The problem with steepest descent on an elongated quadratic is that consecutive gradient directions undo each other’s progress. CG instead picks directions that are conjugate with respect to the Hessian \(A\) — meaning \(d_i^\top A\, d_j = 0\) for \(i\neq j\) — so that optimizing along a new direction never spoils the gains from previous ones. The remarkable result: CG minimizes an \(n\)-dimensional quadratic in at most \(n\) steps, using only gradients and vector operations (never forming \(A^{-1}\)). Each new direction blends the current gradient with the previous direction:

\[d_{t+1} = -g_{t+1} + \beta_t d_t, \qquad \beta_t = \frac{g_{t+1}^\top g_{t+1}}{g_t^\top g_t}\ \text{(Fletcher–Reeves)}\]

In words: the next search direction is “go downhill from here, but bend it a little toward the way you just came,” with \(\beta_t\) measuring how much the gradient shrank since the last step.

Also written: the Polak–Ribière variant uses \(\beta_t = \frac{g_{t+1}^\top (g_{t+1}-g_t)}{g_t^\top g_t}\), often more robust on non-quadratic losses.

steepest descent: many zig-zags CG: 2 conjugate steps

CG is the standard solver for large sparse linear systems \(Ax=b\) (equivalently, minimizing a quadratic) and powers many scientific computing workloads. Nonlinear CG extends the idea to general losses, where it pairs naturally with line search.

# scipy: conjugate gradient for a linear system, and nonlinear CG for general losses
from scipy.sparse.linalg import cg
x, info = cg(A, b)                                   # solve Ax = b
from scipy.optimize import minimize
res = minimize(loss_fn, x0, jac=grad_fn, method="CG")  # nonlinear CG

3.9 — Convex optimization & duality

A problem is convex if its objective is a convex function over a convex feasible set. A convex function is one that lies below any chord connecting two of its points — formally \(f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda)f(y)\) — a bowl with no bumps. Convexity is the dividing line between “easy” and “hard” optimization for one reason: every local minimum of a convex function is a global minimum. There are no traps; any downhill method that stops has found the answer.

In words: pick any two points on the curve and draw the straight line between them; for a convex function the curve never pokes above that line. That “no bumps” shape is exactly what guarantees a single bottom.

Also written: twice-differentiable functions are convex iff their Hessian is positive semidefinite everywhere, \(\nabla^2 f(x)\succeq 0\) for all \(x\).

convex: chord above curve non-convex: many minima

Many cornerstone ML models are convex in their parameters — linear and ridge regression, logistic regression, SVMs (in the dual), LASSO — which is exactly why they are reliable and well-understood: solvers find the unique optimum every time.

Duality is the idea that every minimization problem (the primal) comes with a paired maximization problem (the dual). Think of it as attacking the same target from two sides: the primal pushes a ceiling down, the dual pushes a floor up. The dual’s answer can never exceed the primal’s — it is always a valid lower bound — so the two close in on the true optimum like a vise. That guarantee is weak duality, and it always holds. For convex problems (with a mild technical caveat called Slater’s condition) the floor and ceiling actually meet: the two answers are equal and the gap shrinks to zero. This is strong duality. Why care? The lower bound certifies how good a solution is, the dual is sometimes far easier to solve than the primal, and this exact machinery is what makes the “kernel trick” in SVMs work.

Tip

Why ML practitioners care: if you can cast your problem as convex, you get guarantees — a unique global optimum, reproducible results, and a duality gap that certifies optimality. Deep learning gives those up in exchange for expressive power.

# cvxpy: state a convex problem declaratively and a solver finds the global optimum
import cvxpy as cp
x = cp.Variable(2)
prob = cp.Problem(cp.Minimize(cp.sum_squares(A @ x - b)),  # convex least squares
                  [x >= 0, cp.sum(x) == 1])                # convex constraints
prob.solve()
print(x.value, prob.value)

🎮 Try it — Convex Optimization

🎮 Try it — Non-Convex Optimization

3.10 — Lagrange multipliers & the KKT conditions

How do you minimize a function when the answer must satisfy constraints — stay on a surface, or inside a region? Lagrange multipliers are the classic tool for equality constraints. The geometric insight: at a constrained optimum, you cannot improve the objective without violating the constraint, which happens exactly when the objective’s gradient is parallel to the constraint’s gradient.

To minimize \(f(x)\) subject to \(g(x)=0\), form the Lagrangian by attaching the constraint with a multiplier \(\lambda\):

\[\mathcal{L}(x,\lambda) = f(x) + \lambda\, g(x)\]

In words: glue the constraint onto the objective with a “price” \(\lambda\) for breaking it, turning a constrained problem into an ordinary unconstrained one whose stationary points obey both the objective and the rule.

Also written: for several constraints, \(\mathcal{L}(x,\boldsymbol\lambda) = f(x) + \sum_k \lambda_k\, g_k(x)\).

Setting \(\nabla_x \mathcal{L}=0\) gives \(\nabla f = -\lambda \nabla g\) (gradients parallel), and \(\nabla_\lambda \mathcal{L}=0\) recovers the constraint \(g(x)=0\). Solve the system for the constrained optimum.

Worked example. Maximize \(f=xy\) subject to \(x+y=10\). The Lagrangian is \(xy + \lambda(x+y-10)\). Partials: \(y+\lambda=0\), \(x+\lambda=0\), so \(x=y\); the constraint gives \(x=y=5\), with maximum \(xy=25\). (The familiar result: a fixed-perimeter rectangle has maximum area when it is a square.)

For inequality constraints \(g(x)\le 0\), the generalization is the Karush–Kuhn–Tucker (KKT) conditions — the cornerstone of constrained optimization. A point is optimal (for a convex problem) when all four hold:

KKT condition Meaning
Stationarity: \(\nabla f + \sum_i \mu_i \nabla g_i = 0\) gradients balance
Primal feasibility: \(g_i(x)\le 0\) constraints satisfied
Dual feasibility: \(\mu_i \ge 0\) multipliers nonnegative
Complementary slackness: \(\mu_i\, g_i(x)=0\) a constraint is either active (\(g_i=0\)) or its multiplier is zero

Complementary slackness is the subtle, powerful one: an inactive constraint exerts no force (\(\mu_i=0\)), while an active one “pushes back” with \(\mu_i>0\). This is precisely what defines support vectors in an SVM — the few training points with nonzero multipliers, sitting exactly on the margin, that alone determine the boundary.

3.11 — Coordinate descent

Most methods move all parameters at once along the gradient. Coordinate descent takes the opposite, almost stubbornly simple approach: fix every coordinate but one, minimize the loss along that single direction (often in closed form), then cycle to the next coordinate and repeat.

The appeal is that a one-dimensional minimization is frequently trivial even when the full problem is messy. The classic example is LASSO (L1-regularized regression). The L1 penalty \(\lambda\sum|\theta_j|\) is non-differentiable at zero, which complicates gradient methods, but along a single coordinate the optimum has a clean closed form via the soft-thresholding operator:

\[\theta_j \leftarrow \operatorname{sign}(z_j)\,\max(|z_j| - \lambda,\ 0)\]

In words: take the unconstrained best value for one coordinate, then pull it toward zero by \(\lambda\) — and if that pull would cross zero, snap it exactly to zero. That snapping is what makes LASSO drop features.

Also written: the same operator is \(S_\lambda(z) = \operatorname{sign}(z)\,(|z|-\lambda)_+\), where \((\cdot)_+ = \max(\cdot,0)\) is the positive part.

where \(z_j\) is the least-squares solution for coordinate \(j\) holding the rest fixed. This shrinks small coefficients exactly to zero — automatically producing the sparse models LASSO is famous for. Sweeping over coordinates until convergence is the basis of glmnet, one of the fastest LASSO solvers in existence.

def coord_descent_lasso(X, y, lam, iters=100):
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(iters):
        for j in range(d):
            r = y - X @ theta + X[:, j]*theta[j]   # residual w/o feature j
            z = X[:, j] @ r / (X[:, j] @ X[:, j])  # 1-D least squares
            theta[j] = np.sign(z)*max(abs(z) - lam, 0)  # soft-threshold
    return theta
axis-aligned steps

Coordinate descent shines when per-coordinate updates are cheap and closed-form (LASSO, some SVMs). Its weakness: on problems where coordinates are strongly coupled, axis-aligned moves stall, and a gradient method that moves diagonally wins.

# scikit-learn: Lasso is solved by coordinate descent internally
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1).fit(X, y)   # alpha plays the role of lambda
print((model.coef_ != 0).sum(), "nonzero features")   # sparsity

3.12 — Constrained optimization & linear programming

When both the objective and all constraints are linear, the problem is a linear program (LP) — one of the most studied and widely deployed optimization models, running quietly behind logistics, scheduling, portfolio allocation, and network flow. The standard form is

\[\min_x\ c^\top x \quad \text{subject to}\quad Ax \le b,\ \ x \ge 0\]

In words: find the cheapest mix \(x\) (cost-per-unit \(c\)) that respects a set of “no more than” limits and uses no negative quantities.

Also written: maximization is the same problem with \(-c\); the constraints \(Ax\le b\) stack one inequality \(a_i^\top x\le b_i\) per row \(i\).

The geometry is elegant. Each linear inequality carves out a half-space; their intersection is a convex polytope — a many-faceted gem — called the feasible region. The objective \(c^\top x\) is a flat plane being pushed in the direction \(-c\), and the key theorem is that an optimum always sits at a vertex (corner) of the polytope. So the search reduces to examining corners.

Worked example. Maximize \(3x+2y\) subject to \(x+y\le 4\), \(x\le 3\), \(x,y\ge 0\). The corners of the feasible region are \((0,0)\), \((3,0)\), \((3,1)\), \((0,4)\). Evaluating the objective: \(0\), \(9\), \(11\), \(8\) — the optimum is \(11\) at the vertex \((3,1)\), where two constraints meet. No interior point can beat a corner, which is exactly why corner-walking works.

optimum at a vertex feasible polytope

The simplex method (Dantzig, 1947) does exactly this: it walks from vertex to adjacent vertex, always improving the objective, until no neighbor is better. It is exponential in the worst case but blazingly fast in practice. Interior-point methods take a different route, cutting through the polytope’s interior toward the optimum, with provable polynomial-time guarantees. When variables must be whole numbers (you cannot ship half a truck), it becomes an integer program — far harder (NP-hard), solved by branch-and-bound. LPs connect directly to the duality and KKT theory of the previous sections: every LP has a dual LP, and their optima coincide.

# scipy: linprog solves LPs in standard min form (it minimizes c^T x)
from scipy.optimize import linprog
# maximize 3x+2y  ->  minimize -3x-2y, with x+y<=4 and x<=3
res = linprog(c=[-3, -2], A_ub=[[1, 1], [1, 0]], b_ub=[4, 3], bounds=[(0, None)]*2)
print(res.x, -res.fun)   # -> [3, 1]  optimum 11

3.13 — Evolutionary / genetic optimization

Every method so far needs a gradient or a tidy structure. But what if the objective is a black box — non-differentiable, noisy, riddled with discrete choices, or simply too gnarly to differentiate? Evolutionary algorithms optimize by mimicking natural selection, needing nothing but the ability to evaluate a candidate’s quality.

A genetic algorithm (GA) maintains a population of candidate solutions, each encoded as a “chromosome.” Each generation runs a loop borrowed from biology: score every candidate with a fitness function, select the fitter ones to reproduce, crossover pairs of parents to mix their traits into children, and mutate children with small random changes to inject novelty. Over many generations the population drifts toward high-fitness regions.

flowchart LR
    A[Initialize random population] --> B[Evaluate fitness]
    B --> C{Stop?}
    C -- no --> D[Select parents]
    D --> E[Crossover]
    E --> F[Mutate]
    F --> B
    C -- yes --> G[Return best individual]

Worked example (one-max). Suppose a chromosome is 8 bits and fitness is the number of 1s (optimum = 11111111). Take parents 11010010 and 00110111. A single-point crossover after bit 4 yields children 1101|0111 and 0011|0010. The first child, 11010111, has fitness 6 — already fitter than either parent (4 and 5). A mutation might flip one of its 0s to a 1, nudging it toward the optimum. Repeated selection of such lucky recombinations drives the population to all-ones.

def mutate(bits, rate=0.05):
    return [1-b if np.random.rand() < rate else b for b in bits]  # flip w/ prob
def crossover(a, b):
    k = np.random.randint(1, len(a))      # cut point
    return a[:k]+b[k:], b[:k]+a[k:]        # swap tails

Relatives include evolution strategies (mutating real-valued vectors, used for some RL policies), genetic programming (evolving programs or expression trees), and differential evolution. Their strength is generality — no gradient, robust to rugged landscapes, easily parallelized. Their weakness is sample-hungriness: many thousands of evaluations, with weak convergence guarantees. They earn their keep on hyperparameter search, neural architecture search, and black-box engineering design. (Chapter 35 covers the broader family of metaheuristics.)

Warning

Common mistake: reaching for a genetic algorithm when a gradient is available. If you can differentiate the objective, gradient methods are orders of magnitude more sample-efficient — save evolution for the truly black-box cases.

# scipy: differential evolution optimizes a black-box objective with no gradient
from scipy.optimize import differential_evolution
res = differential_evolution(lambda v: (v[0]-2)**2 + (v[1]+1)**2, bounds=[(-5, 5)]*2)
print(res.x, res.fun)   # -> ~[2, -1], near 0

3.14 — Subgradient & proximal methods

Several of the most important ML losses are not differentiable everywhere: the L1 penalty \(|\theta|\) has a corner at zero, the hinge loss of an SVM has a kink, ReLU networks have non-smooth points. Plain gradient descent stalls or oscillates at these corners because the gradient is undefined. Two ideas rescue us.

Subgradient descent generalizes the gradient to a set of valid slopes at a kink. The intuition: at a corner of \(|\theta|\) you can lay any line that stays below the function — every such slope is a legal “subgradient,” and stepping against any of them still goes downhill. Formally a vector \(g\) is a subgradient of convex \(f\) at \(x\) if

\[f(y)\ \ge\ f(x) + g^\top(y-x)\quad\text{for all } y\]

In words: \(g\) is a subgradient if the straight line through \(x\) with slope \(g\) never rises above the function — a supporting line from below. At a smooth point there is exactly one such line (the gradient); at a corner there are many.

Also written: the set of all such \(g\) is the subdifferential \(\partial f(x)\); for \(f(\theta)=|\theta|\), \(\partial f(0) = [-1, 1]\). The update is just \(\theta_{t+1} = \theta_t - \eta_t\, g_t\) with \(g_t \in \partial f(\theta_t)\).

Subgradient descent works but is slow — it needs a decaying step size and has no real notion of “stop when the gradient is zero.” Proximal methods are sharper: they split the loss into a smooth part \(f\) and a simple non-smooth part \(h\) (e.g. the L1 penalty), take an ordinary gradient step on \(f\), then apply \(h\) exactly through its proximal operator:

\[\theta_{t+1} = \operatorname{prox}_{\eta h}\!\big(\theta_t - \eta\,\nabla f(\theta_t)\big),\qquad \operatorname{prox}_{\eta h}(z) = \arg\min_\theta\Big(h(\theta) + \tfrac{1}{2\eta}\|\theta - z\|^2\Big)\]

In words: take a normal downhill step on the smooth part, then “clean up” with the messy part by finding the point that balances obeying \(h\) against not moving too far from where you landed.

Also written: this is proximal gradient descent (a.k.a. ISTA for LASSO); when \(h=\lambda\|\theta\|_1\) the prox operator is the soft-thresholding \(S_{\eta\lambda}\) of Section 3.11, which is why coordinate descent and proximal methods agree on LASSO.

kink: many supporting lines f(θ) = |θ|
# Proximal gradient (ISTA) for LASSO: smooth step on the fit, soft-threshold the L1
def ista_lasso(X, y, lam, lr=0.01, steps=500):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ theta - y)               # gradient of smooth part
        z = theta - lr*grad                        # plain gradient step
        theta = np.sign(z)*np.maximum(np.abs(z) - lr*lam, 0)  # prox = soft-threshold
    return theta

Proximal methods are the backbone of modern sparse-learning and signal-processing solvers (LASSO, group LASSO, total-variation denoising); accelerated versions (FISTA) add Nesterov momentum on top for a further speedup.

3.15 — Gradient checking & numerical gradients

Every optimizer in this chapter is only as good as the gradient it is fed. A subtle bug in a hand-derived or hand-coded gradient produces an optimizer that runs and reduces some loss but converges to the wrong place — one of the most maddening failures in ML, because nothing crashes. Gradient checking is the cheap, decisive sanity test.

The intuition: the derivative is the limit of “rise over run,” so we can estimate it by literally nudging a parameter a tiny bit and measuring the change in loss. The centered finite-difference estimate is far more accurate than a one-sided one:

\[\frac{\partial L}{\partial \theta_j}\ \approx\ \frac{L(\theta + \varepsilon e_j) - L(\theta - \varepsilon e_j)}{2\varepsilon}\]

In words: to estimate the slope along one parameter, bump it up by a hair and down by a hair, see how much the loss moved, and divide by the total nudge — symmetric bumping cancels the leading error.

Also written: with \(e_j\) the unit vector for coordinate \(j\), this is the second-order-accurate stencil; the one-sided version \(\big(L(\theta+\varepsilon e_j)-L(\theta)\big)/\varepsilon\) is simpler but less accurate (\(O(\varepsilon)\) vs \(O(\varepsilon^2)\) error).

Compute the analytic gradient \(g_{\text{analytic}}\) and the numerical estimate \(g_{\text{num}}\), then compare with a relative error (absolute error is meaningless across scales):

\[\text{rel.err} = \frac{\|g_{\text{analytic}} - g_{\text{num}}\|}{\|g_{\text{analytic}}\| + \|g_{\text{num}}\| + \delta}\]

A relative error below \(\sim 10^{-7}\) means the gradient is almost certainly correct; above \(\sim 10^{-4}\) signals a bug. Use \(\varepsilon\approx 10^{-5}\): too large and the approximation is crude, too small and floating-point round-off dominates.

def grad_check(loss, grad, theta, eps=1e-5):
    g_analytic = grad(theta)
    g_num = np.zeros_like(theta)
    for j in range(len(theta)):
        e = np.zeros_like(theta); e[j] = eps
        g_num[j] = (loss(theta + e) - loss(theta - e)) / (2*eps)  # centered diff
    rel = np.linalg.norm(g_analytic - g_num) / (
          np.linalg.norm(g_analytic) + np.linalg.norm(g_num) + 1e-12)
    return rel    # expect << 1e-5 if the analytic gradient is correct

# PyTorch ships a rigorous checker for autograd-defined functions:
# torch.autograd.gradcheck(fn, inputs, eps=1e-6, atol=1e-4)
Tip

Practical workflow: gradient-check on a tiny random input with double precision before any real training run. It is slow (\(O(n)\) extra loss evaluations) so you do it once, not every step — but it catches the bug that would otherwise cost you a week of confusing experiments.

Warning

Common mistake: gradient-checking through non-smooth points (ReLU at 0, L1 at 0) and panicking at the mismatch. Finite differences straddle the kink and disagree with the subgradient there — test at smooth points, or nudge inputs away from exact zeros.

3.16 — Quick reference

Method / term One-line meaning When / why to use it
Gradient descent (GD) \(\theta \leftarrow \theta - \eta\nabla L\), step against the slope the base case; small, smooth, full-batch problems
Learning rate \(\eta\) step size; safe if \(\eta\le 1/\beta\) (smoothness) the make-or-break knob — cut 3–10× if loss diverges
SGD / mini-batch step on a noisy gradient from \(B\) examples the deep-learning default; cheap steps + regularizing noise
Momentum / NAG accumulate velocity \(v=\beta v+g\), glide through ravines ill-conditioned valleys; \(\beta=0.9\), NAG brakes earlier
AdaGrad divide step by \(\sqrt{\sum g^2}\) (sum grows forever) sparse features; warning — learning rate decays to zero
RMSProp AdaGrad with an EMA of \(g^2\) (forgets old gradients) non-stationary / online; step size adapts without dying
Adam / AdamW momentum + RMSProp + bias correction safe first choice; AdamW (decoupled decay) for transformers
LR schedule vary \(\eta_t\): warmup, cosine decay, cyclical warmup-then-decay is standard for large models
Newton \(\theta \leftarrow \theta - H^{-1}g\), uses curvature tiny smooth problems; exact in one step on a quadratic
L-BFGS quasi-Newton from last \(m\) gradients, \(O(mn)\) memory large, smooth, full-batch (logistic regression, CRFs)
Conjugate gradient \(A\)-conjugate directions, no zig-zag large sparse linear systems \(Ax=b\); quadratics in \(\le n\) steps
Line search (Armijo/Wolfe) pick step length \(\alpha\) that decreases loss enough when a good fixed \(\eta\) is unknown; pairs with CG / L-BFGS
Convexity / duality every local min is global; dual bounds the primal guarantees + optimality certificates (SVMs, LP)
KKT conditions stationarity + feasibility + complementary slackness tests for constrained optima; defines SVM support vectors
Coordinate descent optimize one coordinate at a time (soft-threshold) LASSO and separable problems; closed-form per-coordinate
Linear programming linear objective + constraints; optimum at a vertex scheduling, routing, allocation; simplex / interior-point
Subgradient / proximal handle kinks via subgradients or a prox operator non-smooth losses (L1, hinge); prox of L1 = soft-threshold
Evolutionary / GA population + select/crossover/mutate, gradient-free black-box, non-differentiable objectives; sample-hungry
Gradient check centered finite difference vs analytic gradient one-time sanity test; rel.err \(<10^{-7}\) means correct

3.17 — Key takeaways

  • Gradient descent is the master algorithm: step in the negative-gradient direction; the learning rate is the make-or-break knob.
  • SGD with mini-batches trades exact gradients for cheap, noisy ones — the noise is both a speedup and a regularizer, and it is the default for deep learning.
  • Momentum/NAG add inertia to glide through ravines; AdaGrad/RMSProp adapt a per-parameter learning rate; Adam fuses both and is the standard first choice.
  • Learning-rate schedules (warmup, cosine decay, cyclical) matter as much as the optimizer; large models almost always use warmup-then-decay.
  • Newton/quasi-Newton (L-BFGS) exploit curvature for fast convergence on smooth, full-batch problems but falter on noisy mini-batch training.
  • Convexity guarantees a global optimum; duality and the KKT conditions certify optimality and underpin SVMs and linear programming.
  • Coordinate descent wins when per-coordinate updates are closed-form (LASSO); linear programming solves linear-objective, linear-constraint problems at polytope vertices.
  • Subgradient and proximal methods handle non-smooth losses (L1, hinge); the prox operator for L1 is exactly soft-thresholding, tying together LASSO solvers.
  • Gradient checking with centered finite differences is the cheap insurance against a silently wrong gradient — verify once before trusting any optimizer.
  • Evolutionary algorithms are the fallback for black-box, non-differentiable objectives — general but sample-hungry.

3.18 — See also

  • Calculus & Differentiation — gradients, Hessians, and the chain rule that make these methods computable.
  • Linear Algebra — vectors, matrices, and the conditioning that determines how hard a loss surface is to optimize.
  • Probability & Statistics — the expectation view of SGD and the noise that drives stochastic methods.
  • Neural Networks (Core) — backpropagation, the gradient source these optimizers consume.
  • Regression — convex least-squares and LASSO, direct applications of coordinate descent and convex optimization.
  • Model Evaluation & Tuning — learning-rate search and hyperparameter optimization, where evolutionary methods reappear.
  • Evolutionary Computation & Metaheuristics — the full treatment of genetic algorithms and their relatives.
  • Reinforcement Learning — policy-gradient and evolution-strategy optimization in sequential decision problems.

↪ The thread continues → Chapter 04 · 🎲 Probability & Statistics

Optimization finds the best parameters for a fixed objective — but where does that objective come from, and how do we reason when data is noisy? For that we need the language of uncertainty.


📖 All chapters  |  ← 02 · ∂ Calculus & Differentiation  |  04 · 🎲 Probability & Statistics →

 

© Kader Mohideen