Chapter 12 — 🎯 Model Evaluation & Tuning

Training a model is the easy part; knowing whether it will actually work on data it has never seen is the hard part. This chapter is the discipline that separates a number that looks good in a notebook from a model you can trust in production. It sits at the end of the classical-ML workflow — after you have algorithms (Regression, Classification, Ensembles, Clustering), before you ship them (MLOps & Deployment) — and its core question is always the same: how well does this generalize, and how do I make it generalize better?

🧭 In context: The ML Workflow / Classical ML · used to estimate true (out-of-sample) performance and pick model settings · the one key idea: measure on data the model never touched, and tune to minimize that out-of-sample error.

💡 Remember this: A model is only as good as its score on data it never saw during fitting — so measure on held-out data, mind the train–test gap, and pick the metric that matches the cost of your mistakes.

12.1 — Train/test split & cross-validation (k-fold, stratified, time-series)

The single most important rule in machine learning: never judge a model on the data it learned from. A model can memorize its training set and score 100% on it while being useless on anything new. To estimate real performance you hold out data the model never sees during fitting.

The simplest scheme is a train/test split: randomly partition the data, train on (say) 80%, and report the metric on the held-out 20%, the test set. The test score is your estimate of how the model behaves on fresh data. Often you carve out three sets, not two: a training set to fit on, a validation set to tune on (Sections 12.4–12.9), and a test set you touch exactly once at the very end. Touch the test set more than once and it quietly becomes a second training set.
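
The three-way split can be sketched with nothing but shuffled indices. This is a minimal illustration, not a prescribed recipe: the helper name `three_way_split` and the 60/20/20 ratio are assumptions chosen for the example.

```python
import numpy as np

def three_way_split(n_rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Return shuffled, non-overlapping index arrays for train / validation / test."""
    idx = np.random.default_rng(seed).permutation(n_rows)
    n_test = int(round(n_rows * test_frac))
    n_val = int(round(n_rows * val_frac))
    test = idx[:n_test]                    # touched once, at the very end
    val = idx[n_test:n_test + n_val]       # used for tuning decisions
    train = idx[n_test + n_val:]           # used for fitting
    return train, val, test

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))   # 600 200 200
```

Splitting indices rather than the arrays themselves makes it easy to apply the same partition to features, labels, and any side information.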

The problem with a single split: your estimate is noisy. Get an unlucky 20%, and the number swings. k-fold cross-validation (CV) fixes this by reusing the data. Split into \(k\) equal folds; train on \(k-1\) of them and test on the one left out; rotate so every fold is the test set exactly once; average the \(k\) scores. Every row is used for testing exactly once and for training \(k-1\) times, so you get a stable estimate without wasting data.

The averaged estimate is just the mean of the per-fold scores:

\[\text{CV score} = \frac{1}{k}\sum_{i=1}^{k} \text{score}_i\]

In words: add up the score you got on each of the \(k\) held-out folds and divide by the number of folds — the typical performance across all the held-out trials.

Also written: \(\widehat{\text{CV}} = \mathbb{E}_i[\text{score}_i]\), the sample mean of the fold scores \(\{\text{score}_1,\dots,\text{score}_k\}\).

The figure below makes the rotation concrete: the orange test fold slides along the data, one fold per round, until every block has been the test set exactly once.

flowchart LR
  D[Full dataset] --> F1[Fold 1]
  D --> F2[Fold 2]
  D --> F3[Fold 3]
  D --> F4[Fold 4]
  D --> F5[Fold 5]
  subgraph round1[Round 1]
    T1[test: F1] --- R1[train: F2-F5]
  end
  subgraph round5[Round 5]
    T5[test: F5] --- R5[train: F1-F4]
  end
  F1 --> round1
  F5 --> round5

Worked example. 200 rows, 5-fold CV. Each fold = 40 rows. You fit 5 models; suppose their accuracies are 0.81, 0.79, 0.84, 0.78, 0.83. The CV estimate is the mean \(= 0.81\), with a spread (std \(\approx 0.023\), as computed by np.std) that tells you how stable the model is. A single 80/20 split would have given you just one of those five numbers, and if it happened to be the 0.78 fold, you’d have walked away thinking the model was worse than it is.
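
The arithmetic of this worked example is easy to check directly. The sketch below assumes only the five fold scores quoted above; note that the spread depends slightly on whether you divide by \(k\) or by \(k-1\).

```python
import numpy as np

fold_scores = np.array([0.81, 0.79, 0.84, 0.78, 0.83])

cv_mean = fold_scores.mean()
cv_std_pop = fold_scores.std()             # ddof=0: divide by k (NumPy's default)
cv_std_sample = fold_scores.std(ddof=1)    # ddof=1: divide by k - 1

print(round(cv_mean, 3), round(cv_std_pop, 3), round(cv_std_sample, 3))
# 0.81 0.023 0.025
```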

import numpy as np
def kfold_indices(n, k, seed=0):
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)              # k roughly-equal index groups
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, test

scores = []
for tr, te in kfold_indices(200, 5):
    # model.fit(X[tr], y[tr]); scores.append(model.score(X[te], y[te]))
    pass
# print(np.mean(scores), np.std(scores))

In practice you rarely hand-roll the loop — scikit-learn gives you the three flavours directly, and they all plug into cross_val_score:

from sklearn.model_selection import (
    cross_val_score, KFold, StratifiedKFold, TimeSeriesSplit)
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=0)

# Define the three splitters, then pass ONE of them to cross_val_score.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)            # plain k-fold (shuffles freely)
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # classification: keep class balance in every fold
tscv = TimeSeriesSplit(n_splits=5)                                 # ordered data: train on the past, test on the future

# X, y are your feature matrix and labels; swap cv=strat for kfold or tscv as needed
scores = cross_val_score(model, X, y, cv=strat, scoring="accuracy")
print(scores.mean(), scores.std())   # estimate ± stability

Two important variants exist because the plain shuffle assumes two things that aren’t always true: that classes are balanced, and that rows are independent of order.

Stratified k-fold is for classification: it keeps each fold’s class proportions equal to the whole dataset’s. If 10% of your data is the positive class, every fold should be ~10% positive. Without stratification, a rare class can land entirely outside a fold — that fold then has zero examples of it to test on, and the fold’s score becomes meaningless. The fix is to sample each class separately into the folds.
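
"Sample each class separately into the folds" can be sketched in a few lines of NumPy. The function name `stratified_kfold_indices` is hypothetical, and this simple version deals each class out in near-equal chunks without balancing leftover rows across classes, so treat it as an illustration of the idea rather than a replacement for a library splitter.

```python
import numpy as np

def stratified_kfold_indices(y, k, seed=0):
    """Yield (train, test) index arrays whose class mix matches the full data."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for label in np.unique(y):
        members = rng.permutation(np.flatnonzero(y == label))
        # deal this class's rows across the folds in k near-equal chunks
        for fold, chunk in zip(folds, np.array_split(members, k)):
            fold.extend(chunk.tolist())
    for i in range(k):
        test = np.array(folds[i])
        train = np.array([j for f in folds[:i] + folds[i + 1:] for j in f])
        yield train, test

y = np.array([1] * 10 + [0] * 90)            # 10% positive class
for train, test in stratified_kfold_indices(y, k=5):
    print(len(test), y[test].mean())          # each fold: 20 rows, 10% positive
```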

Time-series CV is for ordered data, where you must never train on the future and test on the past — that’s a fantasy a deployed model never gets. Use expanding (or rolling) windows: train on days 1–30, test on 31–40; train on 1–40, test on 41–50; and so on. The test fold is always later than the training data, mirroring how the model will really be used.

Standard k-fold (shuffles freely):   [test][train][train]   ← order ignored
Time-series CV (forward only):
  train ▓▓▓▓▓·········  test ░░
  train ▓▓▓▓▓▓▓▓·····   test   ░░
  train ▓▓▓▓▓▓▓▓▓▓▓·    test     ░░
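
The expanding-window scheme can be written as a small generator. This is a sketch under stated assumptions: rows are already sorted by time, indices are 0-based, and the function name and its parameters (`initial_train`, `test_size`) are illustrative.

```python
def expanding_window_splits(n_rows, initial_train, test_size):
    """Yield (train, test) index lists; every test block lies after its training data."""
    start = initial_train
    while start + test_size <= n_rows:
        train = list(range(0, start))                  # everything seen so far
        test = list(range(start, start + test_size))   # the next block in time
        yield train, test
        start += test_size                             # grow the window forward

for train, test in expanding_window_splits(n_rows=50, initial_train=30, test_size=10):
    print(f"train rows 0-{train[-1]}, test rows {test[0]}-{test[-1]}")
# train rows 0-29, test rows 30-39
# train rows 0-39, test rows 40-49
```

A rolling window would instead drop the oldest rows as it advances, which keeps the training size fixed when old data stops being representative.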

Tip

Rule of thumb: \(k=5\) or \(k=10\) is the standard. Higher \(k\) → less bias in the estimate but more compute and higher variance between folds. For classification, default to stratified. For anything with a time index, default to time-series CV — a normal shuffle quietly leaks the future.

12.2 — Bias–variance tradeoff (revisited concretely)

Every model’s expected error on new data decomposes into three parts:

\[\text{Error} = \underbrace{\text{Bias}^2}_{\text{wrong assumptions}} + \underbrace{\text{Variance}}_{\text{sensitivity to the sample}} + \underbrace{\sigma^2}_{\text{irreducible noise}}\]

In words: the error you should expect on fresh data is how far off your model is on average (bias, squared), plus how much its predictions wobble from one training sample to another (variance), plus the noise baked into the data that no model can remove.

Also written: for a point \(x\), \(\mathbb{E}\big[(y - \hat f(x))^2\big] = \big(\mathbb{E}[\hat f(x)] - f(x)\big)^2 + \mathrm{Var}\big(\hat f(x)\big) + \sigma^2\), where \(f\) is the true function and \(\hat f\) is the model fit on a random training set.

Bias is error from the model being too simple to capture the truth — a straight line trying to fit a curve. Variance is error from the model being so flexible it chases the noise in this particular training sample, so it would fit a different sample very differently. Irreducible noise (\(\sigma^2\)) is the randomness in the data you can never remove no matter how good the model is.

The intuition is a dartboard. High bias = darts clustered tightly but far from the bullseye (consistently wrong in the same way). High variance = darts scattered all around it (inconsistent, no two throws alike). You want tight and centered.

Worked example. Fit polynomials to a noisy sine wave. A degree-1 line has high bias: it misses the curve on both train and test (both errors high → underfitting). A degree-15 polynomial has high variance: it threads every training point (train error ≈ 0) but wiggles wildly between them (test error high → overfitting). A degree-3 or 4 balances the two and minimizes test error. As you increase capacity, bias falls and variance rises; total error traces a U — high on the left from bias, high on the right from variance, lowest in the middle.
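
The polynomial experiment is small enough to run yourself. The sketch below makes several assumptions that are not in the text: 30 training points, noise standard deviation 0.2, one period of the sine, and degrees 1, 4, and 15. Exact numbers will vary with the seed, so none are quoted here.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def noisy_sine(n):
    """Draw n points from sin(x) on [0, 2*pi] with Gaussian noise."""
    x = rng.uniform(0, 2 * np.pi, n)
    return x, np.sin(x) + rng.normal(0, 0.2, n)

x_train, y_train = noisy_sine(30)
x_test, y_test = noisy_sine(500)

results = {}
for degree in (1, 4, 15):
    # Polynomial.fit rescales x internally, which keeps high degrees numerically stable
    poly = Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((poly(x_train) - y_train) ** 2)
    test_mse = np.mean((poly(x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Training error can only go down as the degree rises, because each polynomial family contains the smaller ones; the interesting column is the test error.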

The animated curve below makes the U concrete: a marble rolls down the total-error curve and settles at the sweet spot where bias and variance balance.

Tip

You cannot eliminate both; you trade them. More capacity / less regularization → less bias, more variance. The whole art of tuning (Sections 12.4–12.9) is finding the bottom of that U.

Note

A modern footnote — double descent. The clean U-shape is the classical story and it holds for the models in this chapter. But for very over-parameterized models (huge neural nets), pushing capacity past the point of memorizing the training set can make test error fall a second time — the so-called double-descent curve. The classical U still governs the regimes you tune by hand here; double descent is covered with deep nets (Neural Networks).

12.3 — Overfitting & underfitting (diagnosing)

These are the two failure modes named by the tradeoff above, and the good news is they have a clean diagnostic signature you read straight off train-vs-test scores.

Underfitting (high bias) is when the model does poorly on the training data and the test data. It hasn’t even learned the training signal — it’s too simple for the job. Train error high, test error high, and the gap between them small.

Overfitting (high variance) is the opposite: the model does great on training data but poorly on test data. It memorized the training set rather than learning the general pattern. Train error low, test error high, and the gap large.

A good fit sits between them: both errors low and close together.

Symptom	Train error	Test error	Gap	Diagnosis
Both bad	high	high	small	Underfitting (more capacity / features)
Train good, test bad	low	high	large	Overfitting (regularize / more data)
Both good	low	low	small	Good fit

The gap between train and test error is the key tell. A large gap screams overfitting regardless of the absolute numbers; a small gap at high error screams underfitting.

Worked example. A decision tree with no depth limit hits 100% train accuracy and 72% test accuracy — a 28-point gap, textbook overfitting. Cap the depth at 4: train drops to 86%, test rises to 83%. You gave up training accuracy you never deserved and bought generalization in exchange. The fixes flow directly from the diagnosis — overfit → simplify the model or add data; underfit → add capacity or better features.
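
The diagnostic table can be turned into a tiny helper. The thresholds below (`high_error`, `large_gap`) are purely illustrative assumptions; what counts as "high" or "large" depends on the task and the metric.

```python
def diagnose(train_error, test_error, high_error=0.20, large_gap=0.10):
    """Classify a fit from its train/test error, following the three table rows."""
    gap = test_error - train_error
    if gap > large_gap:
        return "overfitting"      # train good, test bad
    if train_error > high_error:
        return "underfitting"     # both bad, small gap
    return "good fit"             # both low, small gap

print(diagnose(train_error=0.00, test_error=0.28))   # overfitting (unlimited-depth tree)
print(diagnose(train_error=0.14, test_error=0.17))   # good fit (depth capped at 4)
print(diagnose(train_error=0.35, test_error=0.37))   # underfitting
```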

Warning

A near-zero training error is not a success — it is usually a warning. The question is never “how well does it fit the training data” but “how large is the train–test gap.”

12.4 — Regularization (L1/L2, weight decay, early stopping)

Regularization is any technique that deliberately constrains a model to prevent it from fitting noise — trading a little training accuracy for better generalization. The most common form adds a penalty on the size of the model’s weights to the loss, so the optimizer is rewarded for keeping weights small unless a large weight really earns its keep.

For a linear model with loss \(L(\mathbf{w})\):

\[L_{\text{ridge}} = L(\mathbf{w}) + \lambda \sum_j w_j^2 \quad(\text{L2}), \qquad L_{\text{lasso}} = L(\mathbf{w}) + \lambda \sum_j |w_j| \quad(\text{L1})\]

In words: keep the usual “how wrong are my predictions” loss, but add a fine that grows with how big the weights are — squared size for Ridge, absolute size for Lasso — so the optimizer only buys a large weight if it pays for itself in fit.

Also written: \(\min_{\mathbf{w}} L(\mathbf{w}) + \lambda\lVert\mathbf{w}\rVert_2^2\) (Ridge) and \(\min_{\mathbf{w}} L(\mathbf{w}) + \lambda\lVert\mathbf{w}\rVert_1\) (Lasso), where \(\lVert\mathbf{w}\rVert_2^2=\sum_j w_j^2\) and \(\lVert\mathbf{w}\rVert_1=\sum_j|w_j|\). Equivalently, each is the constrained problem “minimize \(L\) subject to \(\lVert\mathbf{w}\rVert \le t\)” for some budget \(t\) tied to \(\lambda\).

L2 (Ridge) penalizes squared weights. It shrinks all weights smoothly toward zero but rarely to zero. It’s a good default and handles correlated features gracefully, spreading weight across them rather than picking one arbitrarily.

L1 (Lasso) penalizes absolute weights. Its geometry drives many weights exactly to zero, performing automatic feature selection (Dimensionality Reduction covers other ways to cut features) — the surviving nonzero weights are the features the model decided to keep.

The knob \(\lambda\) (the regularization strength) is a hyperparameter: \(\lambda = 0\) is no regularization at all; large \(\lambda\) forces a simpler model with smaller weights. You tune it with CV (Section 12.9).

Why does L1 zero things out but L2 doesn’t? Picture the optimizer as a ball that wants to roll to the lowest point of the loss, but it’s fenced inside a “weight budget” region — and the shape of that fence decides where it stops. L1’s fence is a diamond with sharp corners that sit exactly on the axes (where one weight is zero). The loss tends to bump into the fence at a corner, so a weight snaps to zero. L2’s fence is a circle — smooth, no corners — so the ball settles at a generic edge point where every weight is small but none is exactly zero. Corners create sparsity; round edges don’t.
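
The corner-versus-circle picture has a simple algebraic counterpart when there is a single weight. Assume, for illustration only, a loss of the form \(\tfrac12 (w - z)^2\), where \(z\) is the unregularized optimum, with penalties \(\lambda|w|\) for L1 and \(\tfrac12\lambda w^2\) for L2. The L1 solution is then a soft-threshold and the L2 solution a rescaling:

```python
import numpy as np

def l1_solution(z, lam):
    """argmin_w 0.5*(w - z)**2 + lam*|w|, i.e. soft-thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def l2_solution(z, lam):
    """argmin_w 0.5*(w - z)**2 + 0.5*lam*w**2, i.e. uniform shrinkage."""
    return z / (1.0 + lam)

z = np.array([3.0, 0.4, -0.3])      # unregularized weights (illustrative values)
print(l1_solution(z, lam=0.5))      # weights smaller than lam land exactly on 0
print(l2_solution(z, lam=0.5))      # every weight shrinks, none reaches 0
```

L1 subtracts a fixed amount and clips at zero, so any weight whose unregularized size is below \(\lambda\) is removed entirely; L2 multiplies by a constant factor, which can make a weight small but never exactly zero.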

Worked example. Three correlated features with true coefficients roughly \([3, 0, 0]\): only the first matters. Unregularized least squares might split the signal across all three as \([1.4, 0.9, 0.8]\), chasing noise. L2 shrinks them toward each other and toward zero, say \([1.7, 0.6, 0.5]\): smaller, smoother, but still nonzero. L1 with the right \(\lambda\) snaps the two useless weights to exactly zero, giving \([2.6, 0, 0]\), which recovers the sparse truth and tells you which feature to keep.

from sklearn.linear_model import Ridge, Lasso
# alpha is scikit-learn's name for the regularization strength λ
ridge = Ridge(alpha=1.0).fit(X, y)   # L2: all weights small, none exactly 0
lasso = Lasso(alpha=0.1).fit(X, y)   # L1: many weights driven exactly to 0
print((lasso.coef_ == 0).sum())      # how many features Lasso dropped

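In neural networks, L2 regularization is usually called weight decay, because at each gradient step it pulls every weight a little toward zero: \(w \leftarrow w - \eta(\nabla L + \lambda w) = (1-\eta\lambda)\,w - \eta\nabla L\). The \((1-\eta\lambda)\) factor literally decays the weight each step before the gradient update is applied. (Here the penalty is written as \(\tfrac{\lambda}{2}\sum_j w_j^2\) so that its gradient is exactly \(\lambda w\); with the \(\lambda\sum_j w_j^2\) form above, the factor of 2 is absorbed into \(\lambda\).)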

In words: before applying the usual gradient step, scale every weight down by a tiny factor \((1-\eta\lambda)\) — so weights constantly leak toward zero unless the gradient keeps pushing them back up.

Also written: with learning rate \(\eta\) and decay \(\lambda\), \(w_{t+1} = (1-\eta\lambda)\,w_t - \eta\,\partial L/\partial w\). In PyTorch this is the weight_decay argument of the optimizer:

import torch
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
# weight_decay is the λ above — AdamW applies the decay term correctly,
# decoupled from the adaptive gradient scaling.
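
For plain gradient descent, the two ways of writing the update are the same arithmetic, which a quick numerical check confirms. The learning rate, decay strength, weights, and gradient below are illustrative stand-ins, not values from any real model.

```python
import numpy as np

eta, lam = 0.1, 0.01                      # learning rate and decay strength (illustrative)
w = np.array([2.0, -1.0, 0.5])
grad = np.array([0.3, -0.2, 0.0])         # stand-in for the loss gradient

penalized_step = w - eta * (grad + lam * w)       # gradient of loss + L2 penalty
decay_step = (1 - eta * lam) * w - eta * grad     # shrink first, then step
print(np.allclose(penalized_step, decay_step))    # True

# With zero gradient, a weight simply decays geometrically toward 0.
w_idle = 1.0
for _ in range(100):
    w_idle *= (1 - eta * lam)
print(w_idle)                                      # about 0.905, i.e. 0.999**100
```

The equivalence holds for plain SGD; with adaptive optimizers the penalty's gradient gets rescaled along with everything else, which is why decoupled variants apply the decay separately.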

Early stopping is regularization through time. As you train iteratively, training loss keeps falling, but validation loss eventually turns back up — that turning point is where the model stops generalizing and starts memorizing. Stop there: monitor validation loss, keep the weights from its best epoch, and halt after it fails to improve for a set patience number of epochs.

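# Stand-in for real training: a synthetic validation-loss curve that falls,
# bottoms out at epoch 3, then rises as the model starts to overfit.
val_curve = [0.90, 0.70, 0.62, 0.60, 0.61, 0.63, 0.66, 0.70, 0.75, 0.81]

best, best_epoch, wait, patience = float('inf'), 0, 0, 5
for epoch, val in enumerate(val_curve):
    # in real code: train one epoch, then val = evaluate(val_set)
    if val < best:
        best, best_epoch, wait = val, epoch, 0   # improved → save these weights
    else:
        wait += 1
        if wait >= patience:         # no improvement for `patience` epochs
            break                    # stop before overfitting gets worse
print(best_epoch, best)              # restore the weights saved at best_epoch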

Tip

Start with L2 as a safe default. Reach for L1 when you suspect many features are useless and want the model to pick. In deep nets, weight decay + early stopping together are the workhorse pair. (Linear models in depth: Chapter 8.)

12.5 — Dropout (as regularization)

Dropout is a regularizer designed for neural networks. During each training step it randomly “drops” (sets to zero) a fraction \(p\) of the neurons in a layer — each neuron is silenced independently with probability \(p\). The network must therefore learn redundant, robust features rather than relying on any single neuron, because that neuron might vanish on the next step.

The intuition is that dropout trains an ensemble. Each random mask defines a slightly different sub-network; over training you implicitly train exponentially many of them sharing weights, and at test time using the full network approximates averaging all their predictions. Ensembling reduces variance (Ensemble Methods) — which is exactly what fights overfitting.

The doodle below shows the same little network on three training steps: a different random subset of neurons goes dark each time, so no single unit can be relied on.

flowchart LR
  subgraph Train["Training step (p=0.5)"]
    i1((x1)) --> h1((h1))
    i1 --> h2(("h2 ✗"))
    i2((x2)) --> h3((h3))
    i2 --> h4(("h4 ✗"))
    h1 --> o((y)); h3 --> o
  end

There’s a scaling subtlety. With dropout rate \(p\), only a fraction \((1-p)\) of neurons are active during training, so a layer’s summed input is smaller than it would be at test time when all neurons fire. To keep the expected magnitude consistent, modern implementations use inverted dropout: during training, divide the surviving activations by \((1-p)\). Then at test time you do nothing special — just run the full network as is.

import numpy as np

def dropout(a, p, train=True):
    if not train:
        return a                              # test: use full activations
    mask = (np.random.rand(*a.shape) > p)     # keep with prob (1-p)
    return a * mask / (1 - p)                 # zero some, scale the rest up

In a real framework you never write that by hand — you drop in a layer and toggle train/eval mode, which flips dropout on and off for you:

import torch.nn as nn
net = nn.Sequential(nn.Linear(128, 64), nn.ReLU(),
                    nn.Dropout(p=0.5),       # drops 50% of units in training
                    nn.Linear(64, 10))
net.train()   # dropout ACTIVE (and scaling applied)
net.eval()    # dropout OFF — full network, deterministic predictions

Worked example. A layer outputs activations [2, 4, 6, 8] with \(p=0.5\). A random mask keeps neurons 1 and 3: result [2, 0, 6, 0], then scale by \(1/(1-0.5)=2\) → [4, 0, 12, 0]. The kept activations are boosted so the layer’s expected total is unchanged. At test time the layer simply outputs [2, 4, 6, 8] with no scaling and no dropping.
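
The claim that scaling keeps the expected total unchanged can be checked by averaging over many random masks. This is a simulation sketch; the helper name `inverted_dropout`, the seed, and the number of trials are arbitrary choices.

```python
import numpy as np

def inverted_dropout(a, p, rng):
    """Zero each activation with probability p and scale survivors by 1/(1-p)."""
    mask = rng.random(a.shape) > p
    return a * mask / (1 - p)

rng = np.random.default_rng(0)
a = np.array([2.0, 4.0, 6.0, 8.0])
samples = np.stack([inverted_dropout(a, 0.5, rng) for _ in range(100_000)])

print(samples.mean(axis=0))     # close to [2, 4, 6, 8]
```

Any single pass sees each activation as either 0 or double its value, but the average over masks matches what the full network produces at test time.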

Warning

Dropout is a training-only operation. Forgetting to disable it at inference (e.g. not calling model.eval() in PyTorch) injects random noise into your predictions and silently tanks accuracy. Also: don’t stack heavy dropout on a model that’s already underfitting — you’ll just make the bias worse.

12.6 — Confusion matrix & precision/recall/F1

Accuracy — the fraction of predictions that are correct — is a trap on imbalanced data. If 99% of transactions are legitimate, a model that blindly predicts “legitimate” for everything scores 99% accuracy while catching zero fraud. You need metrics that look at what kind of mistakes happen, and that starts with the confusion matrix: a table cross-tabulating predicted vs. actual classes.

For binary classification it has four cells: TP (true positive — predicted positive, was positive), TN (true negative — predicted negative, was negative), FP (false positive — predicted positive but actually negative; a false alarm), and FN (false negative — predicted negative but actually positive; a miss).

From these four numbers come the three metrics that matter most:

\[\text{Precision} = \frac{TP}{TP+FP}, \qquad \text{Recall} = \frac{TP}{TP+FN}, \qquad F_1 = 2\cdot\frac{\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}\]

In words: precision is “of the things I called positive, how many really were?”; recall is “of the things that really were positive, how many did I catch?”; F1 is a single score that’s only high when both of those are high.

Also written: with \(P\) and \(R\) for precision and recall, \(F_1 = \dfrac{2TP}{2TP+FP+FN} = \left(\dfrac{P^{-1}+R^{-1}}{2}\right)^{-1}\) — the harmonic mean of \(P\) and \(R\).

Put differently, precision is the purity of the alarms and recall (also called sensitivity) is coverage. Because F1 is their harmonic mean, it stays low unless both precision and recall are decent, which makes it a far better summary than accuracy on imbalanced data.

Worked example. 1000 emails, 50 of them spam. The filter flags 40 emails as spam; 30 of those are truly spam.

TP = 30, FP = 10 (good mail wrongly flagged), FN = 20 (spam that slipped through), TN = 940.
Precision = 30/40 = 0.75 — three of four flagged emails really were spam.
Recall = 30/50 = 0.60 — it caught 60% of the actual spam.
F1 = 2·(0.75·0.60)/(0.75+0.60) = 0.667.
Accuracy = (30+940)/1000 = 0.97 — flattering and nearly useless here, because the 940 easy true negatives dominate.
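
The same numbers fall straight out of the four counts. The sketch also adds a "never flag anything" baseline (an illustrative comparison computed from the example's 50 spam emails) to show why accuracy flatters here.

```python
tp, fp, fn, tn = 30, 10, 20, 940

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
# precision=0.750 recall=0.600 f1=0.667 accuracy=0.970

# Baseline that never flags spam: all 950 good emails correct, all 50 spam missed.
baseline_accuracy = (0 + 950) / 1000
print(baseline_accuracy)   # 0.95
```

A filter that does nothing at all scores 0.95 accuracy with zero recall, so the real filter's 0.97 is only a small step above doing nothing.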

from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_true, y_pred))         # [[TN, FP], [FN, TP]]
print(classification_report(y_true, y_pred))    # precision/recall/F1 per class

Which metric to favor depends on the cost of each error type:

Situation	Costly error	Optimize for
Spam filter	FP (real mail in spam folder)	Precision
Cancer screening	FN (missed disease)	Recall
Fraud / search ranking	both matter	F1

Tip

Say the metric in plain words before computing it: precision = “when I shout, am I right?”; recall = “do I catch them all?”. There’s an inherent tension — push recall up by flagging more aggressively, and precision usually drops. F1 keeps you honest about the balance.

Note

Beyond F1: when “both matter” isn’t 50/50. \(F_1\) weights precision and recall equally. When one genuinely costs more, use \(F_\beta = (1+\beta^2)\dfrac{P\cdot R}{\beta^2 P + R}\), where \(\beta>1\) favors recall and \(\beta<1\) favors precision (e.g. \(F_2\) for cancer screening, \(F_{0.5}\) for a spam filter). For multi-class problems you also pick an averaging rule: macro (unweighted mean across classes, so every class counts the same; good when rare classes matter), micro (pool the TP/FP/FN counts across classes before computing the metric), or weighted (per-class scores averaged in proportion to class size). Micro and weighted averages are both dominated by the big classes.
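
To see how \(\beta\) tilts the score, here is the formula applied to the spam-filter example above (precision 0.75, recall 0.60). The function name `f_beta` is just a label for this sketch.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall; beta > 1 favors recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.75, 0.60
print(round(f_beta(p, r, 1.0), 3))   # 0.667  (plain F1)
print(round(f_beta(p, r, 2.0), 3))   # 0.625  (pulled toward the lower recall)
print(round(f_beta(p, r, 0.5), 3))   # 0.714  (pulled toward the higher precision)
```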

12.7 — ROC, AUC & thresholds (and PR curves for imbalance)

Most classifiers don’t output a hard label; they output a score or probability, and you pick a threshold above which you call it positive. Precision and recall are computed at one threshold. Slide the threshold and they change — lower it and you flag more (recall up, precision usually down); raise it and you flag less. So to judge the model itself, independent of any one cutoff, we sweep the threshold across its whole range and plot the result.

The animation below shows that sweep: as the dashed threshold line glides from strict to lenient, more cases get flagged positive and the operating point travels up the ROC curve.

The ROC curve (Receiver Operating Characteristic) plots the True Positive Rate (= recall, \(TP/(TP+FN)\)) against the False Positive Rate (\(FP/(FP+TN)\)) as the threshold varies from strict to lenient. The top-left corner is perfect (catch everything, no false alarms); the diagonal line is random guessing.

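AUC (Area Under the ROC Curve) collapses that whole curve into one number in \([0, 1]\): 1.0 is a perfect ranking, 0.5 is chance level, and a value below 0.5 means the ranking is worse than random (flipping the scores would help). It has a beautiful interpretation: AUC is the probability that the model scores a random positive higher than a random negative.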

\[\text{AUC} = P\big(\,s(x^{+}) > s(x^{-})\,\big)\]

In words: grab one positive example and one negative example at random; AUC is the chance the model gives the positive the higher score. A perfect ranker always does (AUC = 1); a coin flip is right half the time (AUC = 0.5).

Also written: \(\text{AUC} = \dfrac{1}{|P||N|}\sum_{i\in P}\sum_{j\in N}\mathbf{1}\!\left[s_i > s_j\right]\), where \(P\) and \(N\) are the sets of positive and negative examples. This is the fraction of all positive–negative pairs the model ranks correctly, i.e. the Mann–Whitney U statistic divided by \(|P||N|\) (tied pairs are conventionally counted as one half).
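
The pairwise definition translates almost word for word into NumPy. The sketch below is quadratic in memory, so it suits small examples only; counting tied pairs as one half is the usual convention and is an addition to the strict-inequality formula. The scores and labels are made up for illustration.

```python
import numpy as np

def pairwise_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    diff = pos[:, None] - neg[None, :]          # every positive vs every negative
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
labels = [1,   1,   1,   0,   0,   0,   0]
print(pairwise_auc(scores, labels))   # 11 of 12 pairs correct, about 0.917
```

The only misranked pair is the positive scored 0.4 against the negative scored 0.7.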

The imbalance caveat. ROC/AUC can look deceptively good on heavily imbalanced data, because the FPR denominator \((FP+TN)\) is dominated by the huge negative class — a flood of false positives barely moves the FPR. When positives are rare and you care about them, use the Precision–Recall (PR) curve instead: it plots precision against recall and ignores true negatives entirely, so it stays honest. Its summary number is average precision (AP), the area under the PR curve. A no-skill PR baseline isn’t 0.5 — it’s the positive class’s prevalence (e.g. 0.01 for a 1%-positive problem).

Worked example. 10,000 samples, 100 positive (1%). A mediocre fraud model produces 500 false positives while catching all 100 true positives. ROC’s FPR = 500/9900 ≈ 0.05, which looks tiny, and the AUC still reads ~0.9. But precision = 100/(100+500) ≈ 0.17: five of every six alerts are false. The PR curve exposes exactly the pain that the ROC curve hides.
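
The contrast is just two different denominators applied to the same 500 false positives, as this short check of the example's counts shows:

```python
n_pos, n_neg = 100, 9_900
tp, fp = 100, 500                    # all positives caught, 500 false alarms

fpr = fp / n_neg                     # denominator: the huge negative class
recall = tp / n_pos
precision = tp / (tp + fp)           # denominator: everything that was flagged

print(f"FPR={fpr:.3f} recall={recall:.2f} precision={precision:.3f}")
# FPR=0.051 recall=1.00 precision=0.167
```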

import numpy as np
def roc_points(scores, y):
    for t in np.sort(np.unique(scores))[::-1]:   # sweep threshold high→low
        pred = scores >= t
        tpr = ((pred==1)&(y==1)).sum() / max((y==1).sum(),1)
        fpr = ((pred==1)&(y==0)).sum() / max((y==0).sum(),1)
        yield fpr, tpr
# AUC = trapezoidal area under the (fpr, tpr) points

scikit-learn computes both summaries and the full curves for you:

from sklearn.metrics import roc_auc_score, average_precision_score
proba = model.predict_proba(X_test)[:, 1]   # positive-class scores
roc_auc_score(y_test, proba)          # ROC AUC
average_precision_score(y_test, proba)  # AP = area under PR curve

Choosing the threshold. AUC/AP judge the ranking; deployment needs one cutoff. Pick it from the cost of FP vs FN — e.g. choose the threshold that maximizes \(F_1\), or the lowest threshold whose precision still clears a business floor:

from sklearn.metrics import precision_recall_curve
prec, rec, thr = precision_recall_curve(y_test, proba)
f1 = 2 * prec * rec / (prec + rec + 1e-12)
best_threshold = thr[f1[:-1].argmax()]   # operating point, not the lazy 0.5

Warning

On rare-positive problems (fraud, disease, anomalies in Anomaly & Fraud Detection), a high AUC can hide a useless model. Report the PR curve / average precision alongside it, and always pick the operating threshold from the cost of FP vs. FN — not the lazy 0.5 default.

12.8 — Probability calibration

Intuition first: a weather forecaster who says “70% chance of rain” is well calibrated if, across all the days she said 70%, it actually rained on about 70% of them. A classifier’s predict_proba output is exactly such a forecast — and a model can rank cases perfectly (great AUC) while its probabilities are badly miscalibrated, saying “0.9” for cases that are right only 60% of the time. Whenever a downstream decision uses the probability itself — expected-value thresholds, risk scores, pricing, triage — calibration matters as much as accuracy.

Calibration asks whether predicted probabilities match observed frequencies. To check it, bin predictions by their predicted probability and, in each bin, compare the average predicted probability to the actual fraction of positives. Plotting one against the other gives a reliability diagram: the diagonal is perfect calibration, a curve below it means the model is over-confident, above it means under-confident.

A common one-number summary is Expected Calibration Error (ECE) — the average gap between confidence and accuracy across bins:

\[\text{ECE} = \sum_{b=1}^{B} \frac{n_b}{N}\,\big|\,\text{acc}(b) - \text{conf}(b)\,\big|\]

In words: for each confidence bin, take how far its average predicted probability is from the fraction it actually got right, then average those gaps weighted by how many predictions fell in each bin.

Also written: \(\text{ECE} = \mathbb{E}_{\hat p}\big[\,|\,\Pr(y=1\mid \hat p) - \hat p\,|\,\big]\), the expected absolute deviation between predicted and true probability, estimated by binning (\(n_b\) = count in bin \(b\), \(N\) = total).
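
A binned ECE estimate takes only a few lines. This sketch uses the binary-probability form from the "Also written" line (observed positive rate versus mean predicted probability per bin), with ten equal-width bins as an assumed default; the toy predictions are invented for illustration.

```python
import numpy as np

def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Weighted average of |observed positive rate - mean predicted prob| per bin."""
    y_true = np.asarray(y_true, float)
    p_pred = np.asarray(p_pred, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # interior edges map each probability to a bin index in 0 .. n_bins-1
    bin_ids = np.clip(np.digitize(p_pred, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(y_true[in_bin].mean() - p_pred[in_bin].mean())
            ece += in_bin.mean() * gap          # weight by share of predictions
    return ece

# Toy data: the "0.8" group is right only 55% of the time; the "0.2" group is honest.
p_pred = np.array([0.8] * 20 + [0.2] * 20)
y_true = np.array([1] * 11 + [0] * 9 + [1] * 4 + [0] * 16)
print(round(expected_calibration_error(y_true, p_pred), 3))   # 0.125
```

Half the predictions sit in a bin with a 0.25 gap and half in a bin with no gap, so the weighted average is 0.125. The estimate is sensitive to the number of bins, so compare models using the same binning.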

Two standard fixes refit a small mapping from raw scores to calibrated probabilities, learned on a held-out set:

Platt scaling fits a logistic (sigmoid) on the scores — good when the reliability curve is a smooth S, and works with little data.
Isotonic regression fits a free monotonic step function — more flexible, but needs more data or it overfits.

Worked example. A boosted-tree credit-default model has AUC 0.94 but tends to say “0.8” for loans that default only 55% of the time. You price loans off that probability, so the miscalibration costs real money. Fit isotonic regression on a validation fold: the ranking (AUC) is essentially unchanged, because the mapping is monotonic, but loans the calibrated model scores at 0.8 now default about 80% of the time, and ECE drops from 0.12 to 0.02. The decisions built on those numbers are finally trustworthy.

from sklearn.calibration import CalibratedClassifierCV
# wrap an unfitted model: with cv=5 it is cloned and refit on each training split,
# and the score→probability map is learned on the matching held-out split
calibrated = CalibratedClassifierCV(base_model, method="isotonic", cv=5)
calibrated.fit(X_train, y_train)
proba = calibrated.predict_proba(X_test)[:, 1]   # now well-calibrated

Tip

Tree ensembles and SVMs are often poorly calibrated out of the box; plain logistic regression is usually well calibrated already. If anything downstream consumes the probability (not just the rank or the hard label), check a reliability diagram and calibrate if needed. Calibrate on a separate fold from the one you trained on — calibrating on the training set just relearns the same over-confidence.

12.9 — Hyperparameter tuning (grid, random, Bayesian optimization)

Parameters are learned from the data (the weights of a model). Hyperparameters are the knobs you set before training — tree depth, learning rate, the regularization strength \(\lambda\), the number of neighbors \(k\). They aren’t adjusted by the fitting procedure, so you find good values by training many models and scoring each one with cross-validation.

Three strategies, in roughly increasing order of sophistication:

Grid search defines a discrete value list per hyperparameter and tries every combination. It’s exhaustive and dead simple, but cost explodes combinatorially: 4 hyperparameters × 5 values each = \(5^4 = 625\) combinations, and each one is refit once per CV fold. The curse of dimensionality makes grids impractical past a few dimensions.

Random search samples combinations at random from ranges, for a fixed budget of trials. Counterintuitively it usually beats grid search at equal cost. The reason: when only a couple of hyperparameters actually matter, random search tries many distinct values of those important ones, while a grid wastes most of its budget re-testing the same few values of the important knob across irrelevant ones.

Bayesian optimization builds a probabilistic model (a surrogate, often a Gaussian process) of “hyperparameters → CV score” from the trials so far, then uses it to choose the most promising next point to evaluate — balancing exploitation (search near the current best) against exploration (probe uncertain regions). It finds good settings in far fewer trials, at the cost of more bookkeeping and sequential (less parallel) evaluation. (The broader landscape of search and optimization methods lives in Optimization.)

The picture below shows why random beats grid when one axis is irrelevant: the grid samples only 3 distinct values of the important parameter, while random samples 9.

Worked example. Tuning a gradient-boosted tree over learning rate and max depth. A 5×5 grid is 25 fits but probes only 5 learning rates. Spend the same 25 fits on random search and you probe 25 distinct learning rates — if depth barely matters, that’s 5× the resolution on the knob that does. Bayesian optimization goes further: after 10 random trials it notices high scores cluster around learning rate ≈ 0.1, and spends its remaining budget there instead of wasting fits on values it already knows are bad.
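The counting argument can be checked directly. The sketch below uses only itertools and NumPy. The specific value lists and the log-uniform sampling range are assumptions for illustration, not recommended settings.

```python
import itertools
import numpy as np

# Grid cost: 4 hyperparameters with 5 candidate values each.
grid_4d = list(itertools.product(range(5), repeat=4))
print("4-D grid size:", len(grid_4d))

# A 5x5 grid over (learning_rate, max_depth): 25 trials.
learning_rates = np.logspace(-3, -0.5, 5)      # illustrative values
depths = [2, 3, 4, 5, 6]                       # illustrative values
grid = list(itertools.product(learning_rates, depths))
grid_distinct = len({lr for lr, _ in grid})

# Random search with the same 25-trial budget, sampling on a log scale.
rng = np.random.default_rng(0)
random_lrs = 10 ** rng.uniform(-3, -0.5, size=25)
random_distinct = len(set(random_lrs.tolist()))

print("distinct learning rates, grid:  ", grid_distinct)
print("distinct learning rates, random:", random_distinct)
```

The trial budget is identical in both cases. Only the number of distinct values probed along the learning-rate axis differs.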

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, randint
from sklearn.ensemble import GradientBoostingClassifier

space = {
    "learning_rate": loguniform(1e-3, 3e-1),  # sample log-scale, not linear
    "max_depth":     randint(2, 8),
    "n_estimators":  randint(100, 600),
}
search = RandomizedSearchCV(
    GradientBoostingClassifier(), space,
    n_iter=25, cv=5, scoring="f1", random_state=0, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
# For Bayesian optimization, swap in optuna or skopt's BayesSearchCV —
# same idea, fewer trials to reach the same score.
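To show what the surrogate loop does under the hood, here is a hand-rolled sketch of Bayesian optimization over a single hyperparameter. It assumes a Gaussian-process surrogate with an expected-improvement rule, and it is not the algorithm used by optuna or skopt. The function toy_cv_score is a hypothetical stand-in for an expensive cross-validation run, built to peak at learning rate 0.1.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def toy_cv_score(log_lr):
    """Hypothetical CV score as a function of log10(learning_rate); peak at -1."""
    return 0.9 - 0.1 * (log_lr + 1.0) ** 2

rng = np.random.default_rng(0)
candidates = np.linspace(-3.0, 0.0, 301).reshape(-1, 1)  # log10(lr) search range

# Start with a few random trials, as plain random search would.
X_obs = rng.uniform(-3.0, 0.0, size=(4, 1))
y_obs = toy_cv_score(X_obs).ravel()

for _ in range(10):
    # 1. Fit the surrogate to every (hyperparameter, score) pair seen so far.
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6,
                                  normalize_y=True)
    gp.fit(X_obs, y_obs)
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)

    # 2. Expected improvement: large where the predicted mean is high
    #    (exploitation) or the uncertainty is high (exploration).
    best = y_obs.max()
    z = (mu - best) / sigma
    ei = (mu - best) * norm.cdf(z) + sigma * norm.pdf(z)

    # 3. Evaluate the most promising candidate and record the result.
    x_next = candidates[np.argmax(ei)].reshape(1, 1)
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, toy_cv_score(x_next).ravel())

best_log_lr = X_obs[np.argmax(y_obs), 0]
print(f"best learning rate found: {10 ** best_log_lr:.3f}")
print(f"best score found: {y_obs.max():.3f}")
```

Each iteration depends on the results of all previous ones, which is why this style of search parallelizes less readily than grid or random search.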

Method	Trials to good result	Parallel?	Best when
Grid	many (exponential)	yes	≤2–3 hyperparameters, small grids
Random	moderate	yes	many hyperparameters, few that matter
Bayesian	fewest	partly	expensive training, want sample-efficiency

Warning

Tune against a validation set (or inner CV fold), never the test set. If you keep tweaking hyperparameters until the test score looks good, you’ve overfit to the test set and your final number is now optimistic. Use nested CV (an outer loop for honest evaluation, an inner loop for tuning) when you need both an unbiased estimate and tuned hyperparameters. Broader optimization theory: Chapter 3.
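Nested CV can be sketched in a few lines with scikit-learn by passing a search object to cross_val_score. The dataset, the logistic-regression model, and the grid of C values below are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

inner = KFold(n_splits=3, shuffle=True, random_state=0)  # tuning loop
outer = KFold(n_splits=5, shuffle=True, random_state=1)  # evaluation loop

# The tuner picks C using only the rows it is handed.
tuner = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=inner,
)

# Each outer fold reruns the whole search on its training rows,
# then scores the tuned model on rows the search never saw.
nested_scores = cross_val_score(tuner, X, y, cv=outer)
print("nested CV accuracy: %.3f +/- %.3f"
      % (nested_scores.mean(), nested_scores.std()))
```

The outer scores estimate the performance of the whole procedure (tuning plus fitting), not of one fixed hyperparameter setting.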

12.10 — Learning curves (reading them to decide more-data vs more-capacity)

A learning curve plots model performance against the amount of training data: train on 10% of the data, then 20%, … up to 100%, and at each size record both training error and validation error. The shape of the two curves tells you whether your next move should be get more data or build a more powerful model — two expensive choices you really don’t want to guess at.

Read the two curves and the gap between them:

High bias (underfitting) shows up as train and validation errors that converge to each other but at a high error, with both curves flattened out. More data won’t help — the lines have already plateaued, and adding rows just gives you more of the same plateau. You need more capacity: a richer model, more or better features, or less regularization.

High variance (overfitting) shows up as a large gap — low training error, much higher validation error — with the validation curve still falling as data grows. Here more data will help, because the gap shrinks as the model has less room to memorize. Regularizing harder is the alternative.

Worked example. You train an image classifier; at 5k examples train accuracy is 99% and validation 78% — a 21-point gap, and validation is still climbing as you add data. That’s high variance: collecting more labeled images is worth the cost. Contrast a linear model where at 5k examples train is 74% and validation 73%, both flat since 2k — that’s high bias; doubling the data is wasted money, you need a stronger model instead. The learning curve turns “should we spend on data or on modeling?” into a question you can read off a chart.

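from sklearn.model_selection import learning_curve
import numpy as np

# model, X, y: your estimator and dataset, assumed to be defined already
sizes, train_sc, val_sc = learning_curve(
    model, X, y, cv=5, scoring="accuracy",
    train_sizes=np.linspace(0.1, 1.0, 8), n_jobs=-1)
# scores have shape (n_sizes, n_folds): average over folds before plotting
train_mean, val_mean = train_sc.mean(axis=1), val_sc.mean(axis=1)
# plot train_mean and val_mean vs sizes; read the gap and the slope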

Tip

Before you pay for more labels, plot the learning curve. If the validation curve has gone flat right next to the training curve, more data is money burned — change the model instead. If there’s still a gap and the curve is descending, more data is the cheapest win you have.
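The reading rule can also be written down as a rough heuristic. The helper diagnose below is hypothetical, and its thresholds (gap_tol, slope_tol) are arbitrary assumptions rather than standard values. The final points in each curve come from the worked example above, while the earlier points are invented to give the curves a shape.

```python
import numpy as np

def diagnose(train_scores, val_scores, gap_tol=0.05, slope_tol=0.005):
    """Rough read of a learning curve from accuracy scores at growing sizes.

    gap_tol and slope_tol are illustrative thresholds, not standard values.
    """
    train_scores = np.asarray(train_scores, dtype=float)
    val_scores = np.asarray(val_scores, dtype=float)
    gap = train_scores[-1] - val_scores[-1]      # train-validation gap at full size
    slope = val_scores[-1] - val_scores[-2]      # is validation still improving?
    if gap > gap_tol and slope > slope_tol:
        return "high variance: more data should help"
    if gap <= gap_tol and abs(slope) <= slope_tol:
        return "high bias: add capacity"
    return "mixed: inspect the plot"

# Image classifier: large gap, validation still climbing.
image_case = diagnose([0.99, 0.99, 0.99], [0.70, 0.75, 0.78])
# Linear model: curves converged and flat.
linear_case = diagnose([0.75, 0.74, 0.74], [0.72, 0.73, 0.73])

print(image_case)
print(linear_case)
```

A rule like this is only a first pass. The plot itself, with fold-to-fold spread, remains the better guide near the thresholds.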

12.11 — Comparing models honestly (statistical significance)

Intuition first: model B beats model A by 0.3% accuracy on your test set. Is B actually better, or did it just get a luckier draw of test points? Cross-validation gives you a distribution of fold scores, not a single number — and a tiny mean difference that’s swamped by fold-to-fold noise is not evidence of anything. Treating it as a real improvement is how teams ship “upgrades” that don’t survive production.

The disciplined move is to test whether the difference is bigger than the noise. Because both models see the same folds, the scores are paired, so you compare them fold-by-fold rather than as two independent samples.

A paired comparison looks at the per-fold differences \(d_i = \text{score}^B_i - \text{score}^A_i\) and asks whether their mean is meaningfully far from zero relative to their spread:

\[t = \frac{\bar d}{s_d / \sqrt{k}}\]

In words: take the average score difference across folds, and divide by the typical size of that difference’s random wobble — a big ratio means the gap is real, a small one means it’s within noise.

Also written: \(t = \dfrac{\bar d \sqrt{k}}{s_d}\), where \(\bar d\) is the mean of the paired fold differences, \(s_d\) their standard deviation, and \(k\) the number of folds — a paired \(t\)-statistic on \(\{d_i\}\).

The two cases below share the same 0.6-point win, but the noise (spread of the per-fold differences) decides whether it counts: tight spread → real, wide spread → indistinguishable from chance.

A small \(p\)-value (say \(<0.05\)) says the improvement is unlikely to be noise (the machinery of \(p\)-values and significance tests lives in Probability & Statistics). The standard caveats apply: CV folds overlap in training data, so they aren’t fully independent and the naive \(t\)-test is optimistic — practitioners often prefer a corrected variant (e.g. Nadeau–Bengio) or a non-parametric test, and on multiple-dataset benchmarks the Wilcoxon signed-rank test. Also separate statistical from practical significance: a real but 0.05% gain may not be worth a more complex, slower model.

Worked example. Across 10 folds, model B beats A by a mean of 0.6 points, and the per-fold differences have a standard deviation of 0.4. Then \(t = 0.6 / (0.4/\sqrt{10}) \approx 4.7\), which with 9 degrees of freedom gives a two-sided \(p\)-value around 0.001: the gain is real, not luck. Change the std to 1.5 (noisy folds) and \(t \approx 1.3\), \(p \approx 0.24\): the same 0.6-point “win” is now indistinguishable from chance, and you should keep the simpler model.
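The arithmetic of the worked example can be reproduced from the summary statistics alone. This sketch assumes a two-sided test with \(k-1\) degrees of freedom, and the helper name paired_t_from_summary is made up for illustration.

```python
import math
from scipy import stats

def paired_t_from_summary(mean_diff, std_diff, k):
    """Paired t-statistic and two-sided p-value from fold-difference summaries."""
    t = mean_diff / (std_diff / math.sqrt(k))
    p = 2 * stats.t.sf(abs(t), df=k - 1)   # two-sided, k - 1 degrees of freedom
    return t, p

t_tight, p_tight = paired_t_from_summary(0.6, 0.4, 10)   # tight spread
t_noisy, p_noisy = paired_t_from_summary(0.6, 1.5, 10)   # noisy folds

print(f"tight spread: t = {t_tight:.2f}, p = {p_tight:.4f}")
print(f"noisy spread: t = {t_noisy:.2f}, p = {p_noisy:.4f}")
```

The mean difference is the same in both calls. Only the spread changes, and with it the verdict.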

import numpy as np
from scipy import stats
from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(5, shuffle=True, random_state=0)   # SAME folds for both
a = cross_val_score(model_a, X, y, cv=cv)
b = cross_val_score(model_b, X, y, cv=cv)
t, p = stats.ttest_rel(b, a)         # paired t-test on per-fold scores
print(b.mean() - a.mean(), p)        # gain, and whether it's likely real

Warning

A single test-set number with no notion of variability invites self-deception. Report a mean and a spread (or confidence interval), compare models on the same folds with a paired test, and remember that the vanilla CV \(t\)-test understates uncertainty because folds share data. “B is 0.2% higher” is not a result; “B is higher with \(p<0.05\) and the gain clears our deployment bar” is.
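One way to account for shared training data is the Nadeau–Bengio corrected resampled \(t\)-test, which inflates the variance term by the ratio of test to training size. The sketch below is one reading of that correction. The per-fold differences and the fold sizes (900 training rows, 100 test rows) are made-up numbers for illustration.

```python
import numpy as np
from scipy import stats

def corrected_paired_ttest(diffs, n_train, n_test):
    """Naive and Nadeau-Bengio-corrected paired t-tests on per-fold differences."""
    diffs = np.asarray(diffs, dtype=float)
    k = diffs.size
    mean = diffs.mean()
    var = diffs.var(ddof=1)                      # sample variance of differences

    # Naive test: treats the k fold differences as independent.
    t_naive = mean / np.sqrt(var / k)
    # Corrected test: adds n_test / n_train to 1/k to reflect overlapping
    # training sets across folds.
    t_corr = mean / np.sqrt((1.0 / k + n_test / n_train) * var)

    p_naive = 2 * stats.t.sf(abs(t_naive), df=k - 1)
    p_corr = 2 * stats.t.sf(abs(t_corr), df=k - 1)
    return t_naive, p_naive, t_corr, p_corr

# Made-up per-fold score differences (B minus A) from 10-fold CV.
diffs = [0.8, 0.2, 1.1, 0.5, 0.9, 0.1, 0.7, 0.4, 1.0, 0.3]
t_naive, p_naive, t_corr, p_corr = corrected_paired_ttest(
    diffs, n_train=900, n_test=100)

print(f"naive:     t = {t_naive:.2f}, p = {p_naive:.4f}")
print(f"corrected: t = {t_corr:.2f}, p = {p_corr:.4f}")
```

The corrected statistic is always smaller in magnitude than the naive one, so a borderline result under the naive test may no longer clear the threshold.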

12.12 — Data leakage warning

Data leakage is when information that won’t be available at prediction time sneaks into training — so the model looks brilliant in evaluation and then collapses in production. It is the single most common cause of “too good to be true” results, and it’s insidious precisely because every offline metric looks great right up until deployment.

Leakage has a few classic forms.

Preprocessing on the full dataset is the most common: computing a scaler’s mean and standard deviation, an imputation value, or a feature-selection ranking (these transforms are the subject of Data Preprocessing) on all the data before splitting lets the training fold peek at the test fold’s statistics. The fix is to fit every transform on the training fold only, then apply it to validation and test. Inside CV, that means the entire preprocessing pipeline goes inside each fold.

Target leakage is a feature that is a proxy for, or is computed from, the label. The classic case: predicting hospital readmission using a discharge_medication column that only exists because the patient was readmitted. The model isn’t predicting — it’s reading the answer off a feature that wouldn’t exist yet at prediction time.

Temporal leakage is using future information to predict the past — a moving average that includes future days, or shuffling time-ordered data so that tomorrow’s rows train the model that’s then tested on today. The fix is to split by time (Section 12.1).

Duplicate / group leakage is when the same entity (a patient, a user, a near-duplicate image) appears in both train and test, so the model “recognizes” the specific instance rather than generalizing. The fix is group-aware splitting that keeps every entity wholly inside one fold.
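The splitting fixes for the last two forms can be sketched with scikit-learn's GroupKFold and TimeSeriesSplit. The tiny arrays and the patient-style group IDs below are invented for illustration, and the rows are assumed to be in time order already.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, TimeSeriesSplit

X = np.arange(24).reshape(12, 2)        # 12 rows, assumed sorted by time
y = np.array([0, 1] * 6)
groups = np.repeat(np.arange(4), 3)     # hypothetical patient IDs, 3 rows each

# Group-aware split: no entity may appear on both sides of a fold.
group_overlaps = []
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    shared = set(groups[train_idx]) & set(groups[test_idx])
    group_overlaps.append(len(shared))

# Time-aware split: every training row must precede every test row.
future_leaks = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    future_leaks.append(bool(train_idx.max() >= test_idx.min()))

print("shared groups per fold:", group_overlaps)
print("any fold training on the future:", any(future_leaks))
```

Either splitter can be passed as the cv argument to cross-validation helpers. GroupKFold also needs the groups array passed alongside it.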

flowchart TB
  A[Raw data] --> B{Split FIRST}
  B -->|train fold| C[Fit scaler, imputer, selector on TRAIN only]
  C --> D[Transform train]
  B -->|test fold| E[Apply SAME fitted transforms]
  D --> F[Train model]
  E --> G[Evaluate — honest estimate]
  F --> G

Worked example. You standardize features using the mean and standard deviation of the whole dataset, then split and report 95% CV accuracy. In production the model gets 81%. The leak: each test fold’s scaling already encoded that fold’s own distribution, which the model quietly exploited. Refit the scaler inside each fold and the CV number drops to a realistic 82% — uglier, but true. The honest 82% is worth infinitely more than the fake 95%, because it’s the only one that survives contact with production.

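The defense against preprocessing leakage is mechanical: bundle every transform with the estimator in a Pipeline, and hand the pipeline to cross-validation. Then each fold refits the scaler on its own training rows only, so that form of leakage is ruled out by construction. Target, temporal, and group leakage still need the feature audit and the splitting strategies described above.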

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = make_pipeline(StandardScaler(), LogisticRegression())
# scaler is refit INSIDE every fold — test rows never seen during fitting
scores = cross_val_score(pipe, X, y, cv=5)   # honest estimate
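Preprocessing leakage is easiest to see on data with no signal at all. The sketch below assumes pure-noise features and random labels, so any accuracy well above chance can only come from the leak. Feature selection with SelectKBest stands in here for the feature-selection ranking mentioned earlier, and the sizes (100 rows, 2000 features, top 20 kept) are arbitrary.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))        # pure noise features
y = rng.integers(0, 2, size=100)        # labels independent of X

# Leaky: choose features using ALL rows (and their labels), then cross-validate.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(
    LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Honest: the selector sits inside the pipeline and is refit on each training fold.
pipe = make_pipeline(SelectKBest(f_classif, k=20),
                     LogisticRegression(max_iter=1000))
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"selection before CV: {leaky:.2f}")
print(f"selection inside CV: {honest:.2f}")
```

The two runs use the same model, the same folds, and the same number of selected features. The only difference is whether the selector saw the held-out rows.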

Warning

The tell-tale sign of leakage is a result that seems too good. When CV accuracy is suspiciously high, assume leakage until proven otherwise: audit every feature with “could this value exist before the label is known?”, and make sure all preprocessing lives inside the CV loop. Preventing preprocessing leakage is one of the main reasons the Pipeline object exists in scikit-learn (Tools & Frameworks).

12.13 — Quick reference

Term / formula	Meaning	When / why to reach for it
Train/val/test split	Fit on train, tune on val, touch test once	Always; test set is your final honest number
k-fold CV (\(\frac1k\sum_i \text{score}_i\))	Rotate the held-out fold, average \(k\) scores	Stable estimate without wasting data; \(k=5\) or \(10\)
Stratified k-fold	Keep class proportions equal in every fold	Default for classification, especially imbalanced
Time-series CV	Train only on the past, test on the future	Any data with a time index — a shuffle leaks the future
Bias–variance (\(\text{Bias}^2+\text{Var}+\sigma^2\))	Error = wrong-on-average + wobbly + noise	Frame any under/overfit; tune to the bottom of the U
Train–test gap	Distance between train and test error	Big gap = overfit; both bad = underfit
L2 / Ridge (\(\lambda\lVert w\rVert_2^2\))	Shrink all weights smoothly toward zero	Safe default; handles correlated features
L1 / Lasso (\(\lambda\lVert w\rVert_1\))	Drive many weights exactly to zero	Want sparsity / automatic feature selection
Weight decay (\(w\leftarrow(1-\eta\lambda)w-\eta\nabla L\))	L2 applied per gradient step	Standard NN regularizer (`weight_decay` in AdamW)
Early stopping	Halt at the validation-loss minimum	Iterative training; pair with `patience`
Dropout (rate \(p\))	Silence random neurons; implicit ensemble	NN overfitting; disable at inference (`model.eval()`)
Precision (\(TP/(TP+FP)\))	Of what I flagged, how much was right	When false alarms are costly (spam filter)
Recall (\(TP/(TP+FN)\))	Of all positives, how many I caught	When misses are costly (cancer screening)
F1 (harmonic mean of P, R)	One score, high only if both are high	Imbalanced data where both errors matter
ROC / AUC	\(P(\text{score}^+ > \text{score}^-)\), ranking quality	Threshold-free model comparison; balanced-ish data
PR curve / AP	Precision vs recall, ignores true negatives	Rare positives — AUC looks too good there
Calibration / ECE	Do predicted probabilities match frequencies	When a decision consumes the probability itself
Grid / Random / Bayesian	Exhaustive / sampled / surrogate-guided search	Random beats grid; Bayesian for expensive training
Learning curve	Error vs amount of training data	Decide: buy more data (gap closing) vs bigger model (flat)
Paired CV \(t\)-test (\(\bar d\sqrt{k}/s_d\))	Is the fold-to-fold win bigger than the noise	Before declaring model B beats model A
Data leakage	Future/label info sneaks into training	Suspect when results look too good; use a `Pipeline`

12.14 — Key takeaways

Always measure on held-out data. A train/test split gives one noisy estimate; k-fold CV averages several. Use stratified folds for classification and time-series (forward-only) folds for ordered data.
Every model’s error splits into bias (too simple), variance (too sensitive), and irreducible noise; tuning is the search for the bottom of the U-shaped total-error curve.
Diagnose by the train–test gap: both-bad = underfit (add capacity); big-gap = overfit (regularize / add data).
Regularization trades training fit for generalization — L1 selects features (drives weights to zero), L2 / weight decay shrinks them smoothly, early stopping halts at the validation minimum, and dropout ensembles sub-networks.
On imbalanced data accuracy lies. Read the confusion matrix; favor precision when false alarms hurt, recall when misses hurt, F1 when both do; use PR curves / average precision when positives are rare, not just ROC/AUC.
If a decision consumes the probability itself, check calibration (reliability diagram, ECE) and fix it with Platt or isotonic scaling — good ranking does not imply honest probabilities.
Tune hyperparameters with CV — random search usually beats grid; Bayesian optimization is most sample-efficient. Never tune on the test set; use nested CV when you need an unbiased estimate too.
Learning curves tell you whether to buy more data (gap still closing) or a bigger model (curves flat and converged).
Compare models honestly: report a spread, not a lone number, and use a paired significance test on the same folds before declaring a winner.
Data leakage produces results that are too good to be true; put all preprocessing inside the CV loop (use a Pipeline) and audit every feature for future information.

12.15 — See also

Optimization — the gradient methods and search landscapes behind hyperparameter tuning and early stopping.
Probability & Statistics — sampling variance, distributions, significance testing, and the noise term in the bias–variance decomposition.
Regression — L1/L2 (Lasso/Ridge) penalties in their native linear-model home.
Ensemble Methods — why averaging models (and dropout’s implicit ensemble) reduces variance.
Neural Networks (Core) — weight decay, dropout, early stopping, and the double-descent curve as applied in deep learning.
Anomaly & Fraud Detection — evaluation under extreme class imbalance, where PR curves and threshold choice are decisive.
MLOps & Deployment / Tools & Frameworks — pipelines that prevent leakage and automate cross-validated tuning in production.

↪ The thread continues → Chapter 13 · 🕸️ Probabilistic Graphical Models

Evaluation assumes a clean prediction, but often the smarter move is to model the uncertainty and structure explicitly, as graphs of random variables.
