Chapter 42 — 📐 Learning Theory

📖 All chapters | ← 41 · 🤖 Robotics & Autonomy | 43 · 🔎 Information Retrieval & Data Mining →

📚 Jump to any chapter

🧮 Mathematical Foundations

🧭 The ML Workflow

🧩 Classical Machine Learning

🎲 Probabilistic Models

13 · 🕸️ Probabilistic Graphical Models

🧠 Deep Learning

🗣️ Applied AI: Vision, Language, Audio & Time

🕹️ Reinforcement Learning

25 · 🕹️ Reinforcement Learning

🛠️ Applied ML Systems & Industries

🚀 Production, Tooling & Infrastructure

📚 Classical & Symbolic AI

⚖️ Responsible AI & Frontier

🎓 Advanced & Specialized Topics

🎚️ Post-Training & Fine-Tuning

🚢 Model Serving & Deployment

47 · 🚢 Model Serving & Deployment in Production

Empirical curves tell you a model did well on the data you happened to test. Learning theory asks a harder question: will it do well on data it has never seen, and can we prove it before deployment? This chapter builds the mathematical machinery — concentration inequalities, complexity measures, and convergence guarantees — that turns “it worked on the validation set” into “it will generalize, with high probability, by this much.” We end at the frontier, where classical theory strains to explain why enormous neural networks generalize at all.

🧭 In context: statistical learning theory · use it to bound generalization error and reason about why learning is possible · key idea: with enough data, empirical performance provably tracks true performance across an entire hypothesis class at once.

💡 Remember this: generalization is provable — with enough data, the gap between training error and true error shrinks at a rate set by the effective complexity of your model class, not its raw parameter count.

42.1 Why theory: from empirical curves to provable guarantees

Suppose you train a classifier, measure 3% error on a held-out test set, and ship it. Two questions should nag you. First, how close is that 3% to the true error on the full distribution of inputs the model will face? Second, you tried fifty model configurations and picked the best — did you find a genuinely good model, or did you just get lucky on this particular test set? Learning theory answers both, and the answers are quantitative.

The intuition first. Think of a chef tasting one spoonful from a giant pot of soup. The spoonful (your test set) tells you something about the whole pot (the true distribution) — but only if the soup is stirred (i.i.d. sampling) and the spoon is big enough (large \(n\)). Learning theory is the recipe for how big the spoon must be, and how confident you can be that the spoonful matches the pot, especially when you taste many pots and report the best one you found.

The setup throughout this chapter: data points \((x, y)\) are drawn independently from some fixed but unknown distribution \(\mathcal{D}\). A hypothesis \(h\) maps inputs to predictions, and a loss \(\ell(h(x), y)\) scores how wrong it is. We care about two quantities:

\[ \underbrace{R(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(h(x), y)]}_{\text{true / population risk}} \qquad \underbrace{\hat R(h) = \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i), y_i)}_{\text{empirical risk}} \]

In words: the true risk is the average mistake the model would make over all possible data; the empirical risk is the average mistake on the finite sample you actually have. Also written: \(R(h) = \int \ell(h(x),y)\,d\mathcal{D}(x,y)\) and \(\hat R(h) = \frac{1}{n}\langle \mathbf{1}, \boldsymbol{\ell}\rangle\) where \(\boldsymbol{\ell}=(\ell(h(x_1),y_1),\dots,\ell(h(x_n),y_n))\) is the vector of per-example losses.

We can only measure \(\hat R\); we care about \(R\). The entire field is the study of the gap \(R(h) - \hat R(h)\), the generalization gap. Theory gives us statements of the form: with probability at least \(1-\delta\) over the random training sample, this gap is at most some bound that shrinks with \(n\) and grows with the “richness” of the models we allowed ourselves to consider.

Tip

Intuition. \(\hat R\) is a coin-flip estimate of \(R\) from \(n\) flips. For one fixed hypothesis, the law of large numbers already says \(\hat R \to R\). The subtlety — and the reason this chapter exists — is that we choose \(h\) after seeing the data, so \(h\) depends on the sample. We need the estimate to be good simultaneously for every \(h\) we might pick. That is uniform convergence, and it costs extra.

42.2 The PAC framework

Probably Approximately Correct (PAC) learning, introduced by Valiant, formalizes “learnable.” We never demand a perfect classifier (impossible from finite noisy data) nor certainty (a freak unrepresentative sample is always possible). We settle for approximately correct (error \(\le \varepsilon\)) and probably so (with confidence \(\ge 1-\delta\)).

Plain-language version. PAC says: “Give me enough examples and I promise to hand you a model that is usually pretty good.” “Usually” is the probably (the \(1-\delta\) confidence — you might draw a freak sample). “Pretty good” is the approximately (the \(\varepsilon\) error — you can’t expect perfection from finite data). Both knobs are tunable; you just pay in data.

The ingredients:

A hypothesis class \(\mathcal{H}\) — the set of functions the learner is allowed to output (all linear classifiers, all depth-\(d\) decision trees, a fixed neural architecture’s weight space).
A sample complexity \(m(\varepsilon, \delta)\) — how many examples suffice to guarantee error \(\le\varepsilon\) with probability \(\ge 1-\delta\).

A class is PAC-learnable if there is an algorithm that, for any \(\varepsilon, \delta > 0\), returns such an \(h\) using only \(m(\varepsilon,\delta)\) samples (polynomial in \(1/\varepsilon\) and \(1/\delta\)).

Two regimes matter:

	Realizable	Agnostic
Assumption	some \(h^{\star}\in\mathcal{H}\) has zero true error	no perfect \(h\) exists; we compete with the best in \(\mathcal{H}\)
Goal	\(R(\hat h) \le \varepsilon\)	\(R(\hat h) \le \min_{h\in\mathcal{H}} R(h) + \varepsilon\)
Sample complexity (finite \(\mathcal{H}\))	\(\dfrac{1}{\varepsilon}\left(\ln\lvert\mathcal{H}\rvert + \ln\frac{1}{\delta}\right)\)	\(\dfrac{1}{2\varepsilon^2}\left(\ln\lvert\mathcal{H}\rvert + \ln\frac{1}{\delta}\right)\)

Notice the agnostic case scales as \(1/\varepsilon^2\) rather than \(1/\varepsilon\) — handling noise is fundamentally more expensive. We will derive both. The key structural insight: sample complexity grows only logarithmically in the number of hypotheses, which is what makes learning over huge classes feasible.

For the realizable sample complexity formula:

In words: to halve the error you tolerate, you need twice the data; to add another order of magnitude of confidence (\(\delta \to \delta/10\)) you pay only an additive \(\ln 10\); and a thousand-fold bigger hypothesis class costs only \(\ln 1000 \approx 6.9\) more examples per unit of \(1/\varepsilon\). Also written: \(m \ge \varepsilon^{-1}\big(\log_e\lvert\mathcal H\rvert + \log_e \delta^{-1}\big)\) — equivalently the achievable error is \(\varepsilon \le \tfrac{1}{m}(\ln\lvert\mathcal H\rvert + \ln\tfrac1\delta)\) after collecting \(m\) samples.

Worked example (realizable, finite class). Say \(\lvert\mathcal{H}\rvert = 10^6\) candidate rules, and we want error \(\le 1\%\) with 99% confidence. Then \[ m \ge \frac{1}{0.01}\left(\ln 10^6 + \ln 100\right) = 100\,(13.8 + 4.6) \approx 1{,}840 \text{ examples.} \] A million hypotheses, fewer than two thousand examples. The logarithm is doing heroic work.

# scikit-learn: empirically watch the generalization gap shrink as n grows,
# the practical face of a PAC sample-complexity curve.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import learning_curve
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=4000, n_features=20, n_informative=8, random_state=0)
sizes, train_sc, val_sc = learning_curve(
    DecisionTreeClassifier(max_depth=5), X, y,
    train_sizes=np.linspace(0.05, 1.0, 8), cv=5)
gap = train_sc.mean(1) - val_sc.mean(1)        # train acc - val acc ≈ generalization gap
for n, g in zip(sizes, gap):
    print(f"n={n:5.0f}  gap={g:.3f}")          # gap shrinks roughly like 1/sqrt(n)

42.3 Concentration inequalities: the toolkit

Every generalization bound rests on a concentration inequality — a statement that a random average stays close to its mean. They form a ladder: each rung assumes more and pays off with a tighter bound.

Everyday picture. These inequalities all answer one question: “How surprised should I be if an average wanders far from where it should be?” The more you know about the random thing (just its mean? its variance? that it’s bounded?), the more confidently you can say “that far is essentially impossible.” Knowing more buys you tighter guarantees — that is the whole ladder.

graph LR
  A["Markov<br/>needs: X ≥ 0, mean"] --> B["Chebyshev<br/>needs: variance"]
  B --> C["Hoeffding<br/>needs: bounded range"]
  C --> D["Bernstein<br/>needs: variance + bound"]
  C --> E["Sub-Gaussian<br/>needs: light tails"]
  style A fill:#ec4899,stroke:#a37
  style C fill:#22c55e,stroke:#3a7

Markov. For a non-negative random variable, \(\Pr[X \ge a] \le \mathbb{E}[X]/a\). Crude — only uses the mean — but it is the seed from which the others grow.

In words: a positive quantity can exceed several times its average only a small fraction of the time (it can’t be twice its mean more than half the time, ten times more than a tenth of the time, etc.). Also written: equivalently \(\Pr[X \ge k\,\mathbb{E}[X]] \le 1/k\) for \(k>0\).

Chebyshev. Apply Markov to \((X-\mu)^2\): \(\Pr[|X-\mu|\ge a] \le \sigma^2/a^2\). Now tail probability falls as \(1/a^2\).

In words: the chance of landing more than \(a\) away from the mean is capped by the variance divided by \(a^2\) — spread out by \(k\) standard deviations and the odds are at most \(1/k^2\). Also written: \(\Pr[|X-\mu| \ge k\sigma] \le 1/k^2\) (substitute \(a = k\sigma\)).

Hoeffding. The workhorse. For independent variables \(X_i \in [a_i, b_i]\) with mean \(\bar X = \frac1n\sum X_i\), \[ \Pr\!\left[\,|\bar X - \mathbb{E}\bar X| \ge t\,\right] \le 2\exp\!\left(\frac{-2n^2 t^2}{\sum_i (b_i-a_i)^2}\right). \]

In words: the probability that an average of bounded independent numbers drifts by more than \(t\) from its expected value dies off exponentially fast as you collect more samples. Also written: for the common case \(X_i\in[0,1]\) this is \(\Pr[\,|\bar X - \mathbb{E}\bar X|\ge t\,] \le 2e^{-2nt^2}\); rearranged into a confidence statement, w.p. \(\ge 1-\delta\), \(|\bar X - \mathbb{E}\bar X| \le \sqrt{\ln(2/\delta)/(2n)}\).

The tail is exponential in \(n\) — vastly stronger than Chebyshev’s polynomial decay. This exponential is exactly what lets a union bound over many hypotheses survive (Section 42.4).

Bernstein / sub-Gaussian. When the variance \(\sigma^2\) is much smaller than the range, Bernstein’s inequality replaces the range with the variance in the dominant term, giving the fast rate \(\sim \sigma^2/n\) instead of \(1/\sqrt{n}\). This is the formal reason low-noise problems learn faster (the realizable \(1/\varepsilon\) vs. agnostic \(1/\varepsilon^2\) split above).

Tip

Worked Hoeffding bound. A coin has unknown bias \(p\). Flip it \(n=1000\) times, observe frequency \(\hat p\). Each \(X_i\in[0,1]\) so \(b_i-a_i=1\). How far can \(\hat p\) stray from \(p\)? \[ \Pr[\,|\hat p - p| \ge t\,] \le 2e^{-2nt^2}. \] Set the right side to \(\delta=0.05\): \(2e^{-2000t^2}=0.05 \Rightarrow t = \sqrt{\frac{\ln(2/0.05)}{2000}} = \sqrt{\frac{3.69}{2000}} \approx 0.043\). So with 95% confidence, \(|\hat p - p| \le 4.3\%\) — and the bound holds for any \(p\), no distributional assumption beyond independence.

import numpy as np
# empirical check of the Hoeffding bound: how often does |p_hat - p| exceed t?
p, n, t, trials = 0.3, 1000, 0.043, 200_000
flips = (np.random.rand(trials, n) < p)
phat  = flips.mean(axis=1)
viol  = np.mean(np.abs(phat - p) >= t)   # observed violation rate
bound = 2*np.exp(-2*n*t**2)              # Hoeffding's promise
print(round(viol,4), round(bound,4))     # ~0.04  0.05  -> violations stay under the bound

42.4 Uniform convergence and the union bound

Hoeffding bounds one fixed hypothesis. But the learner inspects the data and picks \(\hat h = \arg\min_{h} \hat R(h)\), so \(\hat h\) is correlated with the sample — a single-hypothesis bound does not apply to it. The fix is to bound all hypotheses simultaneously, then the chosen one is automatically covered.

Why one bound isn’t enough — a lottery analogy. Buy one lottery ticket and your odds of winning are tiny. Buy a million tickets and someone’s ticket probably wins. Same here: any single hypothesis is unlikely to have a misleadingly low empirical error, but if you search over a million of them and pick the luckiest-looking one, you’ve effectively bought a million tickets. The union bound charges you for all those tickets up front so the winner you keep is trustworthy.

For a finite class, the union bound does this. If each hypothesis individually fails its bound with probability \(\le\delta'\), then the probability that any of the \(\lvert\mathcal{H}\rvert\) hypotheses fails is at most \(\lvert\mathcal{H}\rvert\cdot\delta'\). Set \(\delta' = \delta/\lvert\mathcal{H}\rvert\) and invert Hoeffding:

\[ \text{w.p. } \ge 1-\delta,\quad \forall h\in\mathcal{H}:\quad R(h) \le \hat R(h) + \sqrt{\frac{\ln\lvert\mathcal{H}\rvert + \ln(2/\delta)}{2n}}. \]

In words: simultaneously for every hypothesis in the class, the true error is no more than the measured error plus a cushion that grows with the log-size of the class and shrinks like \(1/\sqrt{n}\). Also written: \(R(h) - \hat R(h) \le \sqrt{\tfrac{1}{2n}\big(\ln\lvert\mathcal H\rvert + \ln 2 - \ln\delta\big)}\) for all \(h\in\mathcal H\).

graph TD
  H1["Hoeffding: P(fail) ≤ δ' for one h"] --> U["Union bound: P(any of |H| fail) ≤ |H|·δ'"]
  U --> S["Set δ' = δ/|H| → invert Hoeffding"]
  S --> R["Uniform bound holds for the chosen ĥ for free"]

That \(\sqrt{\ln\lvert\mathcal{H}\rvert/n}\) is the generalization gap. Rearranging to get the gap below \(\varepsilon\) recovers the agnostic sample complexity \(m \sim \frac{1}{\varepsilon^2}(\ln\lvert\mathcal{H}\rvert + \ln\frac1\delta)\) from Section 42.2. The \(\ln\lvert\mathcal{H}\rvert\) is the price of searching the class; the union bound is how we pay it.

The animation below is the lottery analogy in motion: each hypothesis is a “ticket” whose empirical error wobbles around its true error. Search enough of them and one will look great by chance — the union bound widens everyone’s safety cushion so the luckiest-looking ticket you keep is still trustworthy.

The obvious problem: most interesting classes are infinite — a linear classifier has continuous weights, so \(\lvert\mathcal{H}\rvert=\infty\) and the bound is vacuous. We need a complexity measure that captures the effective size of an infinite class. That is VC dimension.

42.5 VC dimension and shattering

Two hypothesis classes can both be infinite yet differ wildly in expressive power. Shattering measures expressiveness by what labelings a class can realize, ignoring the raw count of hypotheses.

Intuition. Forget counting hypotheses — there are infinitely many. Ask instead: how many points can this class label in every conceivable way? If you hand the class a few points and it can produce all \(\pm\) patterns over them, the class is genuinely flexible on that many points. The largest number of points it can fully “color any way you like” is its true expressive capacity. A class that can color any pattern over 1000 points is far more dangerous (likely to overfit) than one that maxes out at 3 points.

A set of \(k\) points is shattered by \(\mathcal{H}\) if, for every one of the \(2^k\) ways to label them \(\pm\), some \(h\in\mathcal{H}\) realizes that labeling. The VC dimension \(d_{VC}(\mathcal{H})\) is the size of the largest set that \(\mathcal{H}\) can shatter.

Worked example: thresholds on a line. \(\mathcal{H} = \{h_\theta(x) = \mathbb{1}[x > \theta]\}\).

One point: both labelings (+ and −) achievable by placing \(\theta\) on either side. ✓ shatters 1 point.
Two points \(x_1 < x_2\): can we get \(x_1{=}{+}, x_2{=}{-}\)? That needs \(x_1 > \theta\) and \(x_2 \le \theta\), i.e. \(\theta < x_1\) and \(\theta \ge x_2\) — impossible since \(x_1 < x_2\). ✗

So \(d_{VC} = 1\) for thresholds. For intervals \(\mathbb{1}[a \le x \le b]\), the answer is 2; for linear classifiers in \(\mathbb{R}^d\), it is \(d+1\).

The payoff is the VC generalization bound (Vapnik–Chervonenkis): with probability \(\ge 1-\delta\), for all \(h\in\mathcal{H}\),

\[ R(h) \le \hat R(h) + O\!\left(\sqrt{\frac{d_{VC}\,\ln(n/d_{VC}) + \ln(1/\delta)}{n}}\right). \]

In words: true error stays within a \(1/\sqrt{n}\)-shrinking cushion of training error, where the cushion is set by the VC dimension (the class’s effective capacity) rather than by counting hypotheses. Also written: the gap obeys \(R(h)-\hat R(h) = O\!\big(\sqrt{d_{VC}/n}\big)\) up to log factors — i.e. it scales as \(\sqrt{\text{capacity} / \text{data}}\).

Here \(d_{VC}\) plays the role \(\ln\lvert\mathcal{H}\rvert\) played for finite classes — it is the effective log-cardinality. The Sauer–Shelah lemma is the engine: a class of VC dimension \(d\) can produce at most \(O(n^d)\) distinct labelings of \(n\) points, polynomially many rather than \(2^n\), so the union bound becomes tractable even over infinitely many hypotheses.

Where this shows up. The “\(d_{VC}=d+1\) for linear classifiers in \(\mathbb{R}^d\)” rule is the back-of-envelope every practitioner uses without naming it: a logistic-regression model with 50 features has \(d_{VC}\approx 51\), so the \(\sqrt{d_{VC}/n}\) heuristic says you want \(n\) comfortably in the thousands before the training accuracy is believable. Add a polynomial feature expansion and \(d_{VC}\) balloons — which is exactly why expanded-feature models overfit on small datasets. VC dimension is the formalization of the folk rule “you need several times more rows than columns.”

Warning

Pitfall. VC bounds are distribution-free and worst-case, so they are notoriously loose — often predicting you need millions of samples where thousands suffice. They are a tool for understanding what makes learning possible and the shape of dependencies (gap \(\sim\sqrt{d_{VC}/n}\)), not for tight numerical predictions. And for modern nets, \(d_{VC}\) can exceed \(n\), making the bound vacuous (Section 42.9).

42.6 Rademacher complexity and margin bounds

VC dimension is combinatorial and distribution-free — it ignores both the data distribution and the loss’s continuous structure. Rademacher complexity is data-dependent and tighter, and it extends naturally to real-valued predictors.

The idea: a rich class can fit random noise. Generate random sign labels \(\sigma_i \in \{\pm1\}\) (fair coin flips) and ask how well \(\mathcal{H}\) can correlate with them on your actual sample \(S=\{x_1,\dots,x_n\}\):

\[ \hat{\mathfrak{R}}_S(\mathcal{H}) = \mathbb{E}_{\sigma}\left[\sup_{h\in\mathcal{H}} \frac{1}{n}\sum_{i=1}^n \sigma_i\, h(x_i)\right]. \]

In words: flip a coin to label every point at random, then measure how well the best hypothesis in the class can chase that meaningless noise — averaged over many coin-flip patterns. A class that can chase noise can also memorize, hence overfit. Also written: \(\hat{\mathfrak{R}}_S(\mathcal{H}) = \mathbb{E}_{\sigma}\big[\sup_{h\in\mathcal H} \tfrac1n\,\boldsymbol{\sigma}^{\top} h(S)\big]\), the expected (normalized) inner product between a random sign vector \(\boldsymbol{\sigma}\) and the class’s prediction vector \(h(S)=(h(x_1),\dots,h(x_n))\).

If \(\mathcal{H}\) can match pure noise (high \(\hat{\mathfrak{R}}\)), it will overfit; if it cannot, it must generalize. This yields the bound: w.p. \(\ge 1-\delta\), for all \(h\), \[ R(h) \le \hat R(h) + 2\,\hat{\mathfrak{R}}_S(\mathcal{H}) + 3\sqrt{\frac{\ln(2/\delta)}{2n}}. \]

In words: true error is at most training error, plus twice how much the class can fit noise, plus a small confidence cushion. Also written: \(R(h) - \hat R(h) \le 2\hat{\mathfrak R}_S(\mathcal H) + 3\sqrt{\ln(2/\delta)/(2n)}\) — the data-dependent analogue of the VC gap, with \(\hat{\mathfrak R}_S\) replacing \(\sqrt{d_{VC}/n}\).

The margin connection (SVMs). For linear classifiers, the relevant quantity is not the dimension but the margin \(\gamma\) — the distance from the decision boundary to the nearest point. The Rademacher complexity of the class of unit-norm linear separators scales like \(\frac{R}{\gamma\sqrt n}\), where \(R\) bounds the data norm. Crucially this is independent of the ambient dimension. A wide-margin classifier generalizes well even in infinite-dimensional feature spaces (kernels!), because margin, not coordinate count, controls effective complexity. This is the theoretical heart of why maximizing the margin (SVMs are covered under Classification) is the right objective.

# Estimate empirical Rademacher complexity of linear models by literally
# fitting random sign labels: a rich class chases noise, a constrained one can't.
import numpy as np
from sklearn.linear_model import LogisticRegression
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
def rademacher(C, reps=20):
    vals = []
    for _ in range(reps):
        sigma = rng.choice([-1, 1], size=len(X))          # random ±1 labels
        clf = LogisticRegression(C=C, max_iter=500).fit(X, sigma)
        vals.append(np.mean(sigma * clf.decision_function(X)))  # correlation with noise
    return np.mean(vals)
print("loose (C=100):", round(rademacher(100), 3))   # higher -> fits noise well
print("tight (C=0.01):", round(rademacher(0.01), 3)) # strong regularization -> can't

42.7 The approximation–estimation decomposition

Where does error come from? Split the excess risk of our learned \(\hat h\) relative to the best possible function (the Bayes-optimal \(h^{\star}\)) into two competing pieces:

\[ \underbrace{R(\hat h) - R(h^{\star})}_{\text{excess risk}} = \underbrace{\Big(R(\hat h) - \inf_{h\in\mathcal{H}} R(h)\Big)}_{\text{estimation error}} + \underbrace{\Big(\inf_{h\in\mathcal{H}} R(h) - R(h^{\star})\Big)}_{\text{approximation error}}. \]

In words: your total shortfall versus the ideal predictor is “how far your fitted model is from the best the class can do” (estimation) plus “how far the best the class can do is from the true ideal” (approximation). Also written: with \(h^{\star}_{\mathcal H}=\arg\min_{h\in\mathcal H}R(h)\), excess risk \(= \big[R(\hat h)-R(h^{\star}_{\mathcal H})\big] + \big[R(h^{\star}_{\mathcal H})-R(h^{\star})\big]\) = variance-like term + bias-like term.

Approximation error — how good is the best hypothesis in \(\mathcal{H}\)? Shrinks as \(\mathcal{H}\) grows richer. This is bias.
Estimation error — how far is our \(\hat h\) from that best one, given only \(n\) samples? Grows as \(\mathcal{H}\) grows richer (more to search, bigger generalization gap \(\sim\sqrt{d_{VC}/n}\)). This is variance.

Everyday analogy. Approximation error is “is the right answer even on the menu I’m choosing from?” — a wider menu (richer class) always contains better dishes. Estimation error is “with only a few tastes, will I actually pick the best dish on the menu?” — a longer menu makes that harder. The best total experience comes from a menu that’s rich enough to contain something great but short enough that you can reliably find it.

This is the bias–complexity tradeoff in its sharpest form: enlarging \(\mathcal{H}\) lowers approximation error but raises estimation error. The classical U-shaped test-error curve is their sum.

42.8 Stability, regularization, and the bias–variance bound

There is a second, complementary lens on generalization that does not look at the hypothesis class at all — it looks at the algorithm (the optimization procedure itself). The intuition is wonderfully simple: a trustworthy learner shouldn’t change its mind much if you swap out one training example. If deleting or replacing a single data point barely moves the learned model, then the model can’t be memorizing individual points, so its training error must track its true error.

This is algorithmic stability. Formally, an algorithm \(A\) is uniform-stability \(\beta\) if for any two training sets \(S, S'\) differing in one example, the loss on any point changes by at most \(\beta\): \[ \sup_{z}\,\big|\,\ell(A_S, z) - \ell(A_{S'}, z)\,\big| \le \beta. \]

In words: retrain on a dataset with one point swapped, and no individual prediction’s loss can wobble by more than \(\beta\). Also written: for all neighboring \(S\simeq S'\) and all \(z\), \(\ell(A_S,z)\le \ell(A_{S'},z)+\beta\) — a Lipschitz-in-the-dataset condition on the trained predictor.

The payoff: a \(\beta\)-stable algorithm has expected generalization gap bounded by \(\beta\) (typically \(O(1/n)\) for regularized methods). This is exactly why regularization generalizes — adding a \(\frac{\lambda}{2}\|w\|^2\) penalty makes the minimizer move less when data changes (larger \(\lambda\) → more stable → smaller gap, at the cost of higher bias). Ridge regression, SVMs, and weight decay all earn their generalization guarantees through stability, not VC counting. Stochastic gradient descent run for few enough steps is stable too, which is one modern explanation for why early stopping helps.

The picture below contrasts the two regimes: swap one data point in and out, and watch how much the fitted line jumps. The lightly-regularized model lurches (unstable, big gap); the heavily-regularized one barely flinches (stable, small gap).

# scikit-learn: stability in action — stronger ridge penalty = smaller change
# when one training row is swapped out, and a smaller train/test gap.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=200, n_features=40, noise=10.0, random_state=0)
Xtr, ytr, Xte, yte = X[:150], y[:150], X[150:], y[150:]
for lam in [0.01, 1.0, 100.0]:
    base = Ridge(alpha=lam).fit(Xtr, ytr)
    pert = Ridge(alpha=lam).fit(Xtr[1:], ytr[1:])          # drop one example
    wobble = np.abs(base.predict(Xte) - pert.predict(Xte)).mean()
    gap = base.score(Xtr, ytr) - base.score(Xte, yte)      # train R^2 - test R^2
    print(f"lambda={lam:6.2f}  prediction wobble={wobble:6.3f}  gen-gap={gap:.3f}")

42.9 Online learning and regret: no need to assume a distribution

Everything so far assumed i.i.d. data from a fixed \(\mathcal{D}\). Online learning throws that away: an adversary may choose each example, even reacting to your past predictions. We can no longer talk about “true risk,” so we change the goal to regret — how much worse we did than the single best fixed strategy in hindsight:

\[ \text{Regret}_T = \sum_{t=1}^{T} \ell_t(w_t) \;-\; \min_{w\in\mathcal{W}} \sum_{t=1}^{T} \ell_t(w). \]

In words: add up the loss you actually paid round by round, then subtract the loss of the best single fixed decision you could have committed to from the start knowing everything in advance — that difference is your regret. Also written: \(\text{Regret}_T = \big(\sum_t \ell_t(w_t)\big) - \min_{w}\sum_t \ell_t(w)\); “no-regret” means \(\text{Regret}_T = o(T)\), equivalently \(\tfrac1T\text{Regret}_T \to 0\).

Everyday picture. You bet on a different horse each day. At the end of the season, regret is “how much more I’d have won if I’d just backed the one horse that turned out best all season.” You can’t know that horse in advance — but a no-regret strategy guarantees that, on average per day, you eventually match it.

A learner is no-regret if \(\text{Regret}_T / T \to 0\): over time its average loss matches the best fixed action, even against an adversary.

The experts problem & weighted majority. \(N\) experts each predict daily; you must commit to one. The multiplicative weights algorithm keeps a weight \(w_i\) per expert, predicts by weighted vote, and after seeing the outcome down-weights wrong experts by a factor \((1-\eta)\). Its regret is bounded by \(O(\sqrt{T\ln N})\) — sublinear, hence no-regret, and only logarithmic in the number of experts (the same \(\ln N\) flavor as PAC).

import numpy as np
# multiplicative weights over N experts, T rounds (losses in [0,1])
def mw(losses, eta=0.1):              # losses: shape (T, N)
    T, N = losses.shape
    w = np.ones(N); total = 0.0
    for t in range(T):
        p = w / w.sum()               # play distribution over experts
        total += p @ losses[t]        # expected loss this round
        w *= (1 - eta) ** losses[t]   # punish high-loss experts
    best = losses.sum(0).min()        # best single expert in hindsight
    return total - best               # regret
rng = np.random.default_rng(0)
L = rng.random((500, 10))
print(round(mw(L), 2))                # small vs T=500 -> sublinear regret

Online convex optimization (OCO) & FTRL. Generalize experts to any convex decision set \(\mathcal{W}\) with convex losses. The unifying algorithm is Follow-the-Regularized-Leader: at each step play the point minimizing all past losses plus a regularizer, \[ w_{t+1} = \arg\min_{w\in\mathcal{W}} \;\sum_{s=1}^{t}\ell_s(w) + \frac{1}{\eta} r(w). \]

In words: at every round, pick the decision that would have been best on everything seen so far, but nudged by a regularizer that stops you from over-reacting to the latest data. Also written: equivalently \(w_{t+1}=\arg\min_w \big\langle \sum_{s\le t}\nabla\ell_s(w_s),\,w\big\rangle + \tfrac1\eta r(w)\) in its linearized form; with \(r(w)=\tfrac12\|w\|^2\) this unrolls to the gradient step \(w_{t+1}=w_t-\eta\nabla\ell_t(w_t)\).

The regularizer \(r\) prevents the wild swings that “follow the leader” alone would suffer. Plainly: if you always jump to whatever was best so far, a single noisy day flips your whole strategy and you thrash. The regularizer is a bit of inertia — it keeps you near your previous choice so you only drift when the accumulated evidence is strong. With a strongly convex \(r\), FTRL achieves \(O(\sqrt T)\) regret; online gradient descent is the special case \(r(w)=\tfrac12\|w\|^2\). FTRL is not a toy — it is the workhorse behind large-scale ad click-through recommender systems.

Tip

Rule of thumb. “Follow the regularized leader, never the bare leader.” Any online rule that reacts to the single most recent loss with no inertia (plain follow-the-leader, an unclipped learning rate) will oscillate and rack up linear regret on adversarial sequences. The cure is always the same shape: add a strongly-convex regularizer (or equivalently, a small-enough step size) so each update is a nudge, not a leap.

graph LR
  E["Experts /<br/>Multiplicative Weights"] --> O["Online Convex<br/>Optimization"]
  O --> F["Follow-the-Regularized-Leader<br/>(FTRL)"]
  F --> G["Online Gradient Descent<br/>(r = ½‖w‖²)"]
  F --> H["Exponentiated Gradient<br/>(r = entropy)"]

42.10 Modern deep-learning theory: overparameterization, NTK, double descent

Classical theory predicts disaster for modern nets: with more parameters than data points (\(p \gg n\)), VC dimension dwarfs \(n\) and the bounds go vacuous. Yet these networks fit training data exactly — even random labels — and still generalize on real data. Resolving this paradox is the frontier.

Overparameterization helps optimization. When \(p \gg n\), the loss landscape changes character: the set of global minima becomes a high-dimensional connected manifold, and (S)GD reliably finds one. More parameters make the optimization easier, not just the fit.

The Neural Tangent Kernel (NTK). In the infinite-width limit, a network’s training dynamics linearize around initialization: it behaves like kernel regression with a fixed kernel, the NTK, determined by the architecture (the eigenstructure of this kernel governs which functions are learned fastest). This gives the first regime where wide-net training is provably analyzable — gradient descent converges to a global minimum and the function it learns is characterized in closed form. The limitation: real finite nets move away from initialization and learn features, which the fixed-kernel NTK cannot capture. NTK explains trainability, not the full power of deep learning.

Implicit regularization of (S)GD. Among the infinitely many weight settings that fit the data exactly, GD does not pick an arbitrary one — it is biased toward simple, low-norm, large-margin solutions. For separable logistic regression, gradient descent provably converges to the max-margin solution (connecting straight back to Section 42.6). The optimizer itself is a regularizer; no explicit penalty needed.

Double descent. Plot test error against model size and you do not see the classical U-shape extended — you see something stranger. As capacity rises, error first descends (classical regime), then peaks exactly at the interpolation threshold where the model has just enough parameters to fit the training set perfectly (\(p \approx n\)), then descends again into the overparameterized regime, often reaching a lower error than the classical sweet spot.

Double descent reconciles the apparent contradiction: “more capacity always hurts” is only true up to the interpolation threshold. Beyond it, among many interpolating solutions, implicit regularization selects smooth ones, and error falls again. This is why “just make the model bigger” so often works in practice — and why classical bias–variance intuition (Section 42.7), while not wrong, is incomplete.

Warning

Pitfall. Double descent does not abolish overfitting. Right at the interpolation peak — a model just barely able to memorize — is the worst place to be: highly sensitive to noise. The safe regimes are well below (classical underparameterized) or well above (heavily overparameterized) the threshold. Landing a model at \(p\approx n\) is the danger zone.

42.11 PAC-Bayes: bounds that actually track deep nets

VC and Rademacher bounds become vacuous for deep nets because they charge for the worst hypothesis in a giant class. PAC-Bayes sidesteps this by bounding the average error of a distribution over hypotheses, and — this is the magic — it produces some of the only non-vacuous generalization bounds anyone has computed for real neural networks.

Intuition. Instead of asking “could any network in this class be a disaster?”, PAC-Bayes asks “if I add a little noise to my trained weights, how does the noisy ensemble do, and how far did training drag the weights from where they started?” A model that sits in a wide, flat basin of the loss landscape tolerates weight noise without its loss exploding — and PAC-Bayes rewards exactly that flatness with a tighter bound. This is the rigorous version of the folklore that flat minima generalize better.

Let \(P\) be a prior over hypotheses (chosen before seeing data, e.g. the random initialization) and \(Q\) the posterior (your trained, possibly-noised weights). Then with probability \(\ge 1-\delta\): \[ \mathbb{E}_{h\sim Q}[R(h)] \le \mathbb{E}_{h\sim Q}[\hat R(h)] + \sqrt{\frac{\mathrm{KL}(Q\,\|\,P) + \ln\frac{n}{\delta}}{2(n-1)}}. \]

In words: the average true error over your posterior is bounded by the average training error plus a cushion driven by how far the posterior drifted from the prior (the KL divergence) — move less from initialization, and you generalize better. Also written: the gap obeys \(\mathbb{E}_Q[R] - \mathbb{E}_Q[\hat R] \le \sqrt{\big(\mathrm{KL}(Q\|P)+\ln(n/\delta)\big)/(2(n-1))}\), with \(\mathrm{KL}(Q\|P)=\mathbb{E}_{h\sim Q}[\ln \frac{Q(h)}{P(h)}]\) playing the complexity role that \(\ln|\mathcal H|\) and \(d_{VC}\) played before.

The lesson: the right “complexity” of a learned net is not its parameter count but how much information training had to write into the weights beyond the prior. A network that fits the data while staying close to a generic initialization is, in a precise sense, simple — and PAC-Bayes certifies it.

Where this shows up. PAC-Bayes is not only a proof technique — it is also optimized directly. Dziugaite and Roy (2017) trained a network on MNIST by minimizing the PAC-Bayes bound itself, producing a certificate guaranteeing under ~20% test error before ever touching the test set — the first non-vacuous guarantee for a real net. The same flat-minima intuition powers practical tools like Sharpness-Aware Minimization (SAM), which explicitly seeks weight regions where the loss stays low under perturbation — engineering for exactly the flatness PAC-Bayes rewards.

42.12 Nonconvex landscapes: when local minima are global

Deep learning optimizes nonconvex objectives, where in principle gradient descent could stall in a bad local minimum. In practice it rarely does — and for several important problems we can prove why.

The key structural property is strict saddle: every critical point is either a global minimum or a saddle point with a direction of strict negative curvature (an eigenvalue of the Hessian that is negative). At such saddles, gradient methods with a little noise — like SGD — slide off rather than getting stuck. Combined with the absence of spurious (non-global) local minima, this guarantees convergence to a global optimum.

Several canonical problems provably have this benign structure:

PCA / matrix factorization. Finding the top-\(k\) subspace is nonconvex in the factored parameterization, yet all local minima are global; the only other critical points are strict saddles.
Matrix completion (the Netflix-style problem) under standard incoherence assumptions: no spurious local minima.
Phase retrieval and certain shallow neural networks: similar landscape guarantees.

The lesson: nonconvexity alone is not the enemy. Structured nonconvex problems can be as solvable as convex ones, and characterizing which losses have the strict-saddle / no-spurious-minima property is an active bridge between optimization theory and the empirical success of deep learning.

42.1 — Quick reference

Term / formula	One-line meaning	When / why it matters
Generalization gap \(R(h)-\hat R(h)\)	True error minus measured (training) error	The single quantity all of learning theory bounds
PAC learnability	Probably (\(\ge 1{-}\delta\)) approximately (\(\le\varepsilon\)) correct with poly samples	Defines what “learnable” even means
Realizable sample complexity \(\tfrac1\varepsilon(\ln\lvert\mathcal H\rvert+\ln\tfrac1\delta)\)	Examples needed when a perfect \(h^\star\) exists	Noise-free case; scales as \(1/\varepsilon\)
Agnostic sample complexity \(\tfrac1{2\varepsilon^2}(\ln\lvert\mathcal H\rvert+\ln\tfrac1\delta)\)	Examples needed when no perfect \(h\) exists	Noisy case; the \(1/\varepsilon^2\) penalty for noise
Hoeffding \(\Pr[\lvert\bar X-\mathbb E\bar X\rvert\ge t]\le 2e^{-2nt^2}\)	Bounded averages concentrate exponentially in \(n\)	The workhorse behind every uniform bound
Union bound	\(\Pr[\text{any of }\lvert\mathcal H\rvert\text{ fail}]\le\lvert\mathcal H\rvert\,\delta'\)	Pays the price of searching a finite class
VC dimension \(d_{VC}\)	Largest point set the class can shatter (all \(2^k\) labelings)	Effective capacity of an infinite class
VC gap \(O(\sqrt{d_{VC}/n})\)	True-vs-train gap scales as \(\sqrt{\text{capacity}/\text{data}}\)	“Need several times more rows than columns”
Rademacher complexity \(\hat{\mathfrak R}_S(\mathcal H)\)	How well the class fits random \(\pm1\) noise	Data-dependent, tighter than VC; gives margin bounds
Margin \(\gamma\)	Distance from boundary to nearest point (\(\mathfrak R\sim R/(\gamma\sqrt n)\))	Why wide-margin SVMs generalize regardless of dimension
Approximation–estimation split	Excess risk = bias term + variance term	The U-shaped test-error curve decomposed
Algorithmic stability \(\beta\)	Loss changes \(\le\beta\) when one example is swapped	Why regularization & early stopping generalize
Regret \(_T=\sum_t\ell_t(w_t)-\min_w\sum_t\ell_t(w)\)	Loss vs. best fixed action in hindsight	Goal in online (no-distribution) learning
FTRL / multiplicative weights	Follow the regularized leader; \(O(\sqrt T)\) regret	No-regret online learning; ad-prediction systems
NTK	Infinite-width nets \(\approx\) kernel regression	First provably analyzable wide-net training regime
Double descent	Test error falls again past the interpolation threshold	Why “just make it bigger” works; \(p\approx n\) is the danger zone
PAC-Bayes \(\sqrt{(\mathrm{KL}(Q\Vert P)+\ln\frac n\delta)/2(n{-}1)}\)	Bound the noised-weight average; complexity = drift from prior	Only non-vacuous bounds for real deep nets; flat minima
Strict saddle	Every critical point is global min or escapable saddle	Why nonconvex PCA / matrix completion still solve globally

42.2 — Key takeaways

Generalization is the gap \(R(h)-\hat R(h)\). All of learning theory bounds it with high probability, trading sample size \(n\) against hypothesis-class complexity.
PAC asks for probably (confidence \(1-\delta\)) approximately correct (error \(\le\varepsilon\)) learning; sample complexity scales as \(1/\varepsilon\) (realizable) or \(1/\varepsilon^2\) (agnostic), and only logarithmically in \(\lvert\mathcal{H}\rvert\).
Concentration inequalities are the toolkit. Hoeffding’s exponential tail is the workhorse; its strength is what lets the union bound cover an entire finite class at once.
VC dimension (combinatorial, via shattering) and Rademacher complexity (data-dependent, via fitting random noise) extend bounds to infinite classes. The margin governs SVM/kernel generalization independent of dimension.
Approximation–estimation is the bias–complexity tradeoff: bigger \(\mathcal{H}\) lowers bias, raises variance — the classical U-curve.
Algorithmic stability gives an algorithm-centric guarantee: a learner that barely changes when one example is swapped generalizes, which is why regularization and early stopping work.
Online learning drops the i.i.d. assumption and targets regret; multiplicative weights and FTRL are no-regret (\(O(\sqrt T)\), with \(\ln N\) dependence on experts).
Modern theory explains overparameterized nets: NTK makes wide-net training analyzable, (S)GD has implicit max-margin bias, and double descent shows test error can fall again past the interpolation threshold — but the threshold itself is the danger zone.
PAC-Bayes delivers the only non-vacuous bounds for real nets: complexity is distance moved from the prior (flat minima), not parameter count.
Strict-saddle structure (PCA, matrix completion) explains when nonconvex problems still reach global optima.

42.3 — See also

Chapter on Probability & Statistics — expectations, variance, and the law of large numbers underlying every concentration bound.
Chapter on Optimization — convexity, gradient descent, and SGD, generalized here by online convex optimization and FTRL.
Chapter on SVMs & Kernels — the margin-maximization objective whose generalization is justified by Rademacher/margin bounds (42.6).
Chapter on Deep Learning Foundations — the architectures whose surprising generalization motivates NTK, implicit regularization, and double descent (42.10).
Chapter on Regularization & Model Selection — bias–variance in practice, cross-validation, and capacity control (42.7, 42.8).
Chapter on Frontier & Emerging Directions — where the gap between classical theory and deep-learning practice is still being closed.

↪ The thread continues → Chapter 43 · 🔎 Information Retrieval & Data Mining

Theory tells us when models generalize; at web scale the practical twin problem is finding the right information fast — information retrieval and data mining.

📖 All chapters | ← 41 · 🤖 Robotics & Autonomy | 43 · 🔎 Information Retrieval & Data Mining →