Chapter 04 — 🔥 Information Theory & Loss Functions — measuring surprise and error

📖 All chapters | ← 03 · 🎲 Probability & Statistics | 05 · 🧩 Core ML Concepts →


Chapter 03 gave us probability — the grammar of uncertainty. This chapter takes the next step: how do you measure uncertainty as a single number, and how does that number become the thing a model actually minimizes? Almost every loss function you will ever optimize traces back to one 1948 idea — Claude Shannon’s notion of information — and the bridge from “measuring surprise” to “training a classifier” is exactly what Chapter 05 picks up when it turns loss into the mechanics of optimization, bias-variance, and training.

📍 Timeline: 1948 — Claude Shannon’s A Mathematical Theory of Communication defines information and entropy; decades later, those same quantities become the loss functions that train every modern neural network.

4.1 — Information and surprise

Start with intuition: a surprise is how shocked you are when an event happens. “The sun rose this morning” — no surprise, it was certain. “I won the lottery” — huge surprise, it was rare. So surprise should be large for rare events and zero for certain ones. Information theory makes this precise: the surprise of an event is the log of one over its probability.

\[ I(x) = \log_2 \frac{1}{p(x)} = -\log_2 p(x) \]

If \(p(x) = 1\), surprise is \(0\). If \(p(x) = 1/2\), surprise is exactly \(1\) bit. Halve the probability again and you add one more bit — surprise grows logarithmically, not linearly.

Tip

Why a logarithm? Because independent surprises should add. Two coin flips have probability \(1/4\) together, and \(-\log_2(1/4) = 2\) bits — exactly \(1 + 1\). The log turns multiplying probabilities into adding bits, which is the whole reason it works.

Q: Why use \(-\log p\) and not just \(1/p\) to measure surprise? Both grow as \(p\) shrinks, but only the logarithm is additive for independent events: \(I(a,b) = -\log(p_a p_b) = -\log p_a - \log p_b\). Information from independent observations should pile up by addition, and \(1/p\) would multiply instead. The logarithm is the only continuous function of \(p\) with that property, up to a constant factor, and that factor is just the choice of base.

Q: What does the base of the log change? Only the unit. Base 2 gives bits, base \(e\) gives nats, base 10 gives dits. Deep learning code almost always uses natural log (nats) because it pairs cleanly with exponentials and gradients; the choice never changes which model wins, just the scale of the number.

Q: What is the surprise of a certain event? Zero. \(-\log_2(1) = 0\). If you already knew it would happen, observing it tells you nothing — and “information” is literally a measure of how much you learn from seeing the outcome.
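To make the additivity concrete, here is a minimal sketch in Python. The helper name `surprise_bits` is an illustrative choice, not something from the chapter; it simply evaluates \(\log_2(1/p)\) and checks that two independent fair coin flips cost exactly one bit each.

```python
import math

def surprise_bits(p: float) -> float:
    """Self-information log2(1/p) of an event with probability p, in bits."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log2(1.0 / p)

certain = surprise_bits(1.0)      # a certain event carries no information
one_flip = surprise_bits(0.5)     # one fair coin flip
two_flips = surprise_bits(0.25)   # two independent fair flips, p = 1/2 * 1/2

print(certain, one_flip, two_flips)
```

Multiplying the probabilities (1/2 times 1/2) shows up as adding the bits (1 plus 1), which is the property \(1/p\) lacks.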

4.2 — Entropy: average surprise

If surprise is per-event, entropy is the average surprise of a whole distribution — how uncertain you are before you look. Intuition: a fair coin (50/50) is maximally uncertain, so it has high entropy; a trick coin that lands heads 99% of the time is almost predictable, so low entropy. Entropy is just the expected value of surprise.

\[ H(p) = \mathbb{E}_{x \sim p}[-\log p(x)] = -\sum_x p(x) \log p(x) \]

For a binary variable, entropy peaks at \(p = 0.5\) (exactly 1 bit with \(\log_2\)) and drops to \(0\) at \(p = 0\) or \(p = 1\): uncertainty is highest when you have no idea which way it will go.
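The coin examples can be checked numerically. This is a small sketch assuming NumPy; `entropy_bits` is a hypothetical helper that uses the usual convention that outcomes with zero probability contribute nothing.

```python
import numpy as np

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution given as probabilities."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]                               # convention: 0 * log(0) = 0
    return float(np.sum(p * np.log2(1.0 / p)))

fair_coin = entropy_bits([0.5, 0.5])      # maximally uncertain binary variable
trick_coin = entropy_bits([0.99, 0.01])   # almost predictable
sure_thing = entropy_bits([1.0, 0.0])     # no uncertainty at all

print(fair_coin, trick_coin, sure_thing)
```

The fair coin comes out at 1 bit, the 99% coin at roughly 0.08 bits, and the deterministic coin at 0.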

Q: In one sentence, what is entropy? Entropy is the average number of bits you need to encode an outcome drawn from a distribution — equivalently, your average surprise, equivalently how much uncertainty the distribution holds. High entropy = spread out and unpredictable; low entropy = peaked and predictable.

Q: Which distribution has the highest entropy? Over a fixed finite set of outcomes, the uniform distribution: every outcome equally likely means maximum uncertainty. More generally, the maximum-entropy distribution depends on the constraints: over the whole real line with fixed mean and variance, it is the Gaussian; over the positive reals with fixed mean, it is the exponential. This “no extra assumptions beyond what you’re told” principle is one reason the Gaussian shows up so often (the central limit theorem is another).

Q: How does entropy connect to compression? Shannon’s source coding theorem says you cannot losslessly compress data below its entropy, on average. A predictable source (low entropy) compresses well; a uniform random source (high entropy) barely compresses at all. This is why “information content” and “compressibility” are the same idea.

4.3 — Cross-entropy and KL divergence

Now the two stars of the show. Suppose the true distribution is \(p\) but you model it with \(q\). Cross-entropy asks: if you build your code assuming \(q\), how many bits do you actually pay when reality follows \(p\)? KL divergence asks the follow-up: how many extra bits did your wrong model cost you versus the perfect code?

\[ H(p, q) = -\sum_x p(x) \log q(x) \qquad D_{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \]

The clean relationship that ties the chapter together:

\[ H(p, q) = H(p) + D_{KL}(p \,\|\, q) \]

Cross-entropy = the irreducible entropy of reality + the penalty for using the wrong model. Since \(H(p)\) is fixed by the data, minimizing cross-entropy is exactly minimizing KL divergence.

Tip

Mental model: \(H(p)\) is the cost of the best possible code. \(D_{KL}\) is the tax you pay for being wrong. Cross-entropy is the total bill. Training a model is haggling that tax down toward zero.
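The "total bill = best code + tax" identity is easy to verify on a toy example. The sketch below assumes NumPy and strictly positive probabilities; the two distributions `p` and `q` are made-up numbers for illustration only.

```python
import numpy as np

def entropy(p):
    """H(p) in nats; assumes every entry of p is strictly positive."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p)))

def cross_entropy(p, q):
    """H(p, q) in nats: cost of coding data from p with a code built for q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(-np.sum(p * np.log(q)))

def kl_divergence(p, q):
    """D_KL(p || q) in nats: extra cost paid for using q instead of p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])   # hypothetical "true" distribution
q = np.array([0.5, 0.3, 0.2])   # hypothetical model

best_code = entropy(p)
tax = kl_divergence(p, q)
total_bill = cross_entropy(p, q)

print(total_bill, best_code + tax)        # the two numbers agree
print(tax, kl_divergence(q, p))           # the two KL directions do not
```

Only the `tax` term depends on `q`, which is why pushing cross-entropy down and pushing KL down are the same optimization.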

The two KL directions are not interchangeable, and the difference is something interviewers love to probe on a whiteboard. Against a bimodal target \(p\), forward KL \(D_{KL}(p\|q)\) forces a single-mode \(q\) to spread across both humps (it cannot afford zero probability where \(p\) is positive); reverse KL \(D_{KL}(q\|p)\) lets \(q\) snap onto one hump and ignore the other.

[Figure: a bimodal target \(p\) with a single-mode \(q\) fitted two ways. Forward KL \(D(p\|q)\): mass-covering, \(q\) is wide. Reverse KL \(D(q\|p)\): mode-seeking, \(q\) is narrow.]

Q: Is KL divergence a distance? No — it is a divergence, not a metric. It is always \(\ge 0\) (Gibbs’ inequality) and equals \(0\) only when \(p = q\), but it is not symmetric: \(D_{KL}(p\|q) \ne D_{KL}(q\|p)\) in general, and it violates the triangle inequality. Calling it “KL distance” in an interview is a classic slip.

Q: What is the difference between forward and reverse KL? Forward KL \(D_{KL}(p\|q)\) is mean-seeking (mass-covering): it heavily punishes \(q\) for putting low probability where \(p\) is high, so \(q\) spreads to cover all of \(p\)’s modes (and may put mass in the empty valley between them). Reverse KL \(D_{KL}(q\|p)\) is mode-seeking: \(q\) prefers to lock onto one mode and stay sharp, because it is punished for putting mass where \(p\) is low. Variational inference uses reverse KL, which is why a fitted approximation can collapse to a single mode.
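The mass-covering versus mode-seeking behavior can be reproduced with a brute-force experiment. This is a sketch under stated assumptions: the target is a made-up equal mixture of two narrow humps at -2 and +2, everything is discretized on a grid, and the "fit" is a plain grid search over the mean and width of a single Gaussian rather than any real variational-inference procedure.

```python
import numpy as np

x = np.linspace(-6.0, 6.0, 1201)   # evaluation grid

def discretized_gaussian(mu, sigma):
    """Gaussian shape evaluated on the grid and normalized to sum to 1."""
    w = np.exp(-0.5 * ((x - mu) / sigma) ** 2)
    return w / w.sum()

# Hypothetical bimodal target: two well-separated humps of equal weight.
p = 0.5 * discretized_gaussian(-2.0, 0.5) + 0.5 * discretized_gaussian(2.0, 0.5)

def kl(a, b):
    """Discrete D_KL(a || b); both arrays are strictly positive on this grid."""
    return float(np.sum(a * np.log(a / b)))

best_forward = None   # (kl value, mu, sigma) minimizing D_KL(p || q)
best_reverse = None   # (kl value, mu, sigma) minimizing D_KL(q || p)
for mu in np.linspace(-3.0, 3.0, 25):
    for sigma in np.linspace(0.3, 3.0, 28):
        q = discretized_gaussian(mu, sigma)
        fwd, rev = kl(p, q), kl(q, p)
        if best_forward is None or fwd < best_forward[0]:
            best_forward = (fwd, mu, sigma)
        if best_reverse is None or rev < best_reverse[0]:
            best_reverse = (rev, mu, sigma)

_, mu_forward, sigma_forward = best_forward
_, mu_reverse, sigma_reverse = best_reverse
print("forward KL fit:", mu_forward, sigma_forward)
print("reverse KL fit:", mu_reverse, sigma_reverse)
```

In this setup the forward-KL winner should sit between the humps with a large width (covering both), while the reverse-KL winner should sit on one hump with a width close to that hump's own.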

Q: Which KL does RLHF / DPO use, and why does mode-seeking matter? They use reverse KL: the regularizer is \(\beta\, D_{KL}(\pi_\theta \,\|\, \pi_{ref})\), with the trained policy in the first slot and the frozen reference in the second. In RLHF the term appears explicitly in the objective; DPO’s loss is derived from that same KL-constrained objective, so the penalty is implicit. Because reverse KL is mode-seeking, it lets the aligned model sharpen onto the high-reward modes the reference already supports while being strongly penalized for assigning mass to outputs the reference deemed near-impossible. That asymmetry is a feature: it keeps generations on the reference’s manifold (coherent, on-distribution) instead of smearing probability to cover everything. Chapter 19 goes deep on the alignment uses.

Q: Where else does KL divergence show up in modern LLMs? Two more big places. Pretraining: minimizing next-token cross-entropy is minimizing KL between the data distribution and the model. Distillation: the student minimizes KL to the teacher’s soft output distribution, learning the teacher’s full “dark knowledge” over classes, not just the top-1 label. Same tool, several jobs.

Q: What goes wrong if \(q(x) = 0\) where \(p(x) > 0\)? Cross-entropy and KL blow up to infinity: you assigned zero probability to something that actually happens, an infinitely surprising mistake. This is one reason classifiers end in a softmax, which never outputs an exact zero as long as the logits are finite. Label smoothing addresses the mirror-image problem: it softens the one-hot targets so the model is not pushed toward unboundedly large logits.

4.4 — Mutual information

A quick but important relative. Mutual information measures how much knowing one variable reduces your uncertainty about another — the overlap of information between \(X\) and \(Y\). Intuition: if knowing someone’s height tells you a lot about their shoe size, height and shoe size share high mutual information; if they are independent, it is zero.

\[ I(X; Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} = D_{KL}\big(p(x,y) \,\|\, p(x)p(y)\big) \]

That second form is the punchline: mutual information is just the KL divergence between the joint distribution and the product of the marginals — a measure of how far \(X\) and \(Y\) are from being independent.

Q: When is mutual information zero? Exactly when \(X\) and \(Y\) are independent, because then \(p(x,y) = p(x)p(y)\) and the log term is \(\log 1 = 0\). Any statistical dependence — linear or nonlinear — makes it positive, which is what makes it stronger than correlation.

Q: How is mutual information different from correlation? Correlation only catches linear relationships; mutual information catches any dependence. The coordinates \((x, y)\) of a point drawn uniformly from a circle have zero correlation but high mutual information, because knowing \(x\) pins \(y\) down to two values. The cost is that MI needs the full joint distribution (or a good estimate of it), which is harder to compute.
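For discrete variables the KL form of mutual information can be computed directly from a joint probability table. The sketch below assumes NumPy; `mutual_information_bits` is an illustrative helper, and the two 2x2 tables are toy examples (two independent fair coins, and two coins that always agree).

```python
import numpy as np

def mutual_information_bits(joint):
    """I(X;Y) in bits from a joint probability table p(x, y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)   # marginal p(x) as a column
    py = joint.sum(axis=0, keepdims=True)   # marginal p(y) as a row
    product = px * py                       # the joint under independence
    mask = joint > 0                        # convention: 0 * log(0) = 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / product[mask])))

independent = mutual_information_bits([[0.25, 0.25],
                                       [0.25, 0.25]])
copies = mutual_information_bits([[0.5, 0.0],
                                  [0.0, 0.5]])

print(independent, copies)
```

The independent table gives 0 bits, and the perfectly dependent one gives 1 bit: seeing one coin tells you everything about the other.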

4.5 — From information to loss functions

Here is the bridge the whole chapter has been building toward. A loss function scores how wrong a prediction is, and the model learns by pushing it down. The headline result: the “natural” losses are not arbitrary — they fall straight out of Maximum Likelihood Estimation (MLE), and the noise assumption you make about the data picks the loss.

The logic: maximizing the likelihood of the data is the same as minimizing the negative log-likelihood, and the negative log-likelihood of a categorical model is cross-entropy.

\[ \underbrace{\arg\max_\theta \prod_i q_\theta(y_i)}_{\text{MLE}} = \arg\min_\theta \underbrace{-\sum_i \log q_\theta(y_i)}_{\text{negative log-likelihood = cross-entropy}} \]

flowchart LR
  A["Maximum Likelihood"] --> B["minimize negative log-likelihood"]
  B --> C{"what is the model?"}
  C -->|"Gaussian noise"| D["MSE"]
  C -->|"Laplace noise"| E["MAE"]
  C -->|"Categorical / softmax"| F["Cross-Entropy"]

So MSE, MAE, and cross-entropy are the same principle under different noise assumptions. Pick your assumption about the data, and MLE hands you the loss.

Tip

Deriving MSE in one line. Assume targets are Gaussian around the prediction: \(y \sim \mathcal{N}(\hat y, \sigma^2)\). The negative log-likelihood of one point is \(-\log\!\big[\frac{1}{\sqrt{2\pi\sigma^2}}e^{-(y-\hat y)^2/(2\sigma^2)}\big] = \frac{1}{2\sigma^2}(y-\hat y)^2 + \text{const}\). With \(\sigma\) held fixed, the additive constant and the \(\frac{1}{2\sigma^2}\) factor do not move the minimizer, so drop them and you are left with the squared error: MSE is Gaussian MLE. Swap the Gaussian for a Laplace density and the same move gives you the absolute error (MAE).
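For completeness, the Laplace case can be written out the same way. The scale parameter \(b\) is notation introduced here for illustration; it plays the role \(\sigma\) plays in the Gaussian case. Assume \(y\) follows a Laplace distribution centered on the prediction, with density \(\frac{1}{2b}e^{-|y-\hat y|/b}\). Then the negative log-likelihood of one point is

\[ -\log\!\Big[\frac{1}{2b}\,e^{-|y-\hat y|/b}\Big] = \frac{1}{b}\,|y-\hat y| + \log(2b) \]

With \(b\) held fixed, the \(\log(2b)\) term is a constant and \(\frac{1}{b}\) is a positive scale factor, so minimizing the summed negative log-likelihood is the same as minimizing \(\sum_i |y_i - \hat y_i|\), which is MAE up to the \(\frac{1}{n}\) normalization.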

Q: Show cross-entropy loss for one example and a 3-line implementation. For a true label one-hot \(y\) and predicted probabilities \(\hat{y}\), the loss is \(L = -\sum_c y_c \log \hat{y}_c\). Because \(y\) is one-hot, this collapses to just \(-\log(\hat{y}_{\text{correct}})\) — the surprise of the right answer.

import numpy as np
def cross_entropy(y_true, y_pred):                             # y_true one-hot, y_pred probs
    y_true = np.atleast_2d(y_true)                             # accept one example or a batch
    y_pred = np.clip(np.atleast_2d(y_pred), 1e-12, 1.0)        # avoid log(0) -> -inf
    return -np.sum(y_true * np.log(y_pred)) / y_true.shape[0]  # mean over batch

Q: Why optimize cross-entropy instead of accuracy directly? Because accuracy is flat and non-differentiable — nudging a weight rarely flips a prediction, so the gradient is zero almost everywhere and gradient descent has nothing to follow. Cross-entropy is smooth: it keeps rewarding the model for becoming more confident in the right answer even when the prediction was already correct, giving a usable gradient everywhere. You optimize cross-entropy and report accuracy.

Warning

Common interview trap: “Why not just train on accuracy?” The answer is the gradient, not the accounting. Accuracy is your business metric; cross-entropy is your optimization surrogate. Confusing the two is a red flag.

Q: How does cross-entropy connect to the softmax, and why is its gradient \(\hat{y} - y\)? The network outputs raw scores (logits) \(z\); softmax turns them into a probability distribution \(\hat y\), and cross-entropy compares \(\hat y\) to the true label \(y\). The clean gradient comes from two pieces canceling. With \(c\) the correct class, \(L = -\log \hat y_c\), so \(\partial L / \partial \hat y_c = -1/\hat y_c\), and the softmax Jacobian is \(\partial \hat y_c / \partial z_i = \hat y_c(\delta_{ic} - \hat y_i)\). Multiply them and the \(\hat y_c\) cancels, leaving \(\partial L / \partial z_i = \hat y_i - \delta_{ic} = \hat{y}_i - y_i\): predicted minus true, the same clean form the squared-error gradient has in linear regression. That cancellation is half the reason this combo dominates classification.
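A quick way to trust the \(\hat y - y\) result is to compare it against a numerical derivative. The following is a minimal sketch assuming NumPy; the logits and target are arbitrary illustrative values, and central finite differences stand in for an autodiff check.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)              # shift for stability; the output is unchanged
    e = np.exp(z)
    return e / e.sum()

def loss(z, y):
    """Cross-entropy of softmax(z) against a one-hot target y, in nats."""
    return float(-np.sum(y * np.log(softmax(z))))

z = np.array([2.0, -1.0, 0.5])     # hypothetical logits
y = np.array([0.0, 0.0, 1.0])      # one-hot target: class 2 is correct

analytic = softmax(z) - y          # claimed gradient: predicted minus true

# Central finite differences, perturbing one logit at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (loss(z + step, y) - loss(z - step, y)) / (2 * eps)

print(analytic)
print(numeric)
```

The two printed vectors should agree to several decimal places, and the entry for the correct class is negative, so a gradient step raises that logit.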

Q: Should you pass probabilities or logits to the loss, and why? Pass the logits and use the fused loss (PyTorch CrossEntropyLoss / log_softmax, TF from_logits=True). Computing softmax and then taking its log separately is numerically dangerous: a large logit overflows in exp, and a tiny probability underflows to \(0\) so log(0) returns \(-\infty\). Frameworks fuse softmax + log into one log_softmax that uses the log-sum-exp trick (subtract the max logit first) to stay stable. The clipping in the numpy snippet above is a teaching crutch; the production answer is “never materialize the probabilities — let the fused loss work in logit space.”
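The log-sum-exp trick can be sketched in a few lines of NumPy. This is an illustration of the idea only, not how PyTorch or TensorFlow implement their fused losses internally; the function names and the extreme logits are made up for the example.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax computed with the log-sum-exp trick."""
    logits = np.atleast_2d(np.asarray(logits, dtype=float))
    shifted = logits - logits.max(axis=1, keepdims=True)   # largest entry becomes 0
    log_norm = np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return shifted - log_norm

def cross_entropy_from_logits(logits, labels):
    """Mean cross-entropy in nats; labels are integer class indices."""
    log_probs = log_softmax(logits)
    rows = np.arange(log_probs.shape[0])
    return float(-log_probs[rows, np.asarray(labels)].mean())

# Logits this large would overflow a naive exp() followed by log().
extreme = np.array([[1000.0, 0.0, -1000.0]])
stable_loss = cross_entropy_from_logits(extreme, [1])

# On ordinary logits the result matches the textbook softmax-then-log route.
mild = np.array([[2.0, 1.0, 0.1]])
naive = -np.log(np.exp(mild[0]) / np.exp(mild[0]).sum())[0]
fused = cross_entropy_from_logits(mild, [0])

print(stable_loss, naive, fused)
```

Because the maximum logit is subtracted first, every exponent is at most zero, so nothing overflows, and the probabilities are never materialized, so no `log(0)` can occur.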

4.6 — Regression and margin losses

Not everything is a probability over classes. For regression (predicting a number) the two staples are MSE and MAE, and the difference between them is all about outliers. Huber loss stitches the two together, and for margin-based classifiers like SVMs the hinge loss cares about a safety margin rather than a probability.

\[ \text{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2 \qquad \text{MAE} = \frac{1}{n}\sum_i |y_i - \hat{y}_i| \]

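\[ \text{Hinge} = \frac{1}{n}\sum_i \max\big(0,\; 1 - y_i \cdot \hat{y}_i\big) \]

In the hinge loss, \(y_i \in \{-1, +1\}\) is the class label and \(\hat{y}_i\) is the classifier’s raw score, not a probability.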

Loss	Penalizes	Behavior on outliers	Implied noise (MLE)	Used for
MSE (L2)	error squared	very sensitive (grows quadratically)	Gaussian	regression, smooth gradients
MAE (L1)	error absolute	robust (grows linearly)	Laplace	regression with outliers
Huber	squared near 0, linear in tails	robust, with smooth gradient	Gaussian-ish core, heavy tails	regression wanting both
Hinge	margin violations	linear penalty, like MAE	— (not probabilistic)	SVMs, max-margin
Cross-entropy	wrong-class surprise	n/a (probabilistic)	categorical	classification

Q: MSE vs MAE — which is robust to outliers and why? MAE is robust; MSE is not. MSE squares the error, so a single point that is 10 away contributes 100 — one outlier can dominate the whole loss and drag the fit toward it. MAE grows linearly, so an outlier’s influence is proportional, not explosive. The flip side: MSE has a smooth gradient everywhere, while MAE’s gradient is constant in magnitude and undefined at zero, which can make optimization jiggle near the minimum.

Q: Why does squared error give the mean and absolute error give the median? Set the derivative to zero. For MSE, \(\frac{d}{d\hat y}\sum (y_i - \hat y)^2 = -2\sum (y_i - \hat y) = 0 \Rightarrow \hat y = \frac{1}{n}\sum y_i\) — the mean is the point that balances the signed errors. For MAE, the subgradient of \(\sum |y_i - \hat y|\) counts \(+1\) for every point below \(\hat y\) and \(-1\) for every point above; that sum is zero only when equally many points sit on each side — the median. The median ignores how far outliers are (just which side), which is exactly why MAE shrugs them off.
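The mean-versus-median result can be confirmed by brute force. The sketch below assumes NumPy and a made-up dataset with one large outlier; it tries many constant predictions and keeps whichever minimizes each loss.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])      # hypothetical targets, one outlier
candidates = np.linspace(0.0, 100.0, 10001)     # constant predictions, step 0.01

residuals = y[None, :] - candidates[:, None]    # shape (candidates, points)
mse = (residuals ** 2).mean(axis=1)
mae = np.abs(residuals).mean(axis=1)

best_for_mse = candidates[np.argmin(mse)]
best_for_mae = candidates[np.argmin(mae)]

print(best_for_mse, y.mean())        # MSE minimizer sits at the mean
print(best_for_mae, np.median(y))    # MAE minimizer sits at the median
```

The outlier drags the MSE minimizer up to 22, while the MAE minimizer stays at 3, where two points lie on each side.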

Q: What is Huber loss and when do you reach for it? Huber is the best-of-both regression loss: quadratic for small residuals \(|y-\hat y| \le \delta\) and linear beyond \(\delta\). So near the optimum it behaves like MSE (smooth gradient, no jiggle), but in the tails it behaves like MAE (an outlier contributes a linear, bounded-slope penalty instead of an exploding squared one). The knob \(\delta\) sets where “ordinary error” ends and “outlier” begins. Reach for it when your data has occasional outliers but you still want clean gradients near the minimum.
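A minimal NumPy sketch of Huber loss follows. It uses one common convention (a factor of one half on the quadratic branch, with the linear branch shifted so the two pieces meet at \(\delta\)); other scalings exist, so treat the exact constants as an assumption.

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Mean Huber loss: 0.5*r^2 if |r| <= delta, else delta*(|r| - 0.5*delta)."""
    r = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    quadratic = 0.5 * r ** 2
    linear = delta * (r - 0.5 * delta)
    return float(np.mean(np.where(r <= delta, quadratic, linear)))

small = huber([0.0], [0.5])    # residual inside delta: quadratic branch
large = huber([0.0], [3.0])    # residual beyond delta: linear branch

print(small, large)
```

With the default \(\delta = 1\), a residual of 0.5 costs 0.125 (the MSE-like regime), while a residual of 3 costs 2.5 rather than the 4.5 the quadratic branch would charge.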

Q: When would you actually prefer MSE? When you genuinely want large errors punished hard (a 10-unit miss really is more than twice as bad as a 5-unit miss), when the noise is roughly Gaussian, or when you just want the smooth, well-behaved gradients that make optimization easy. Remember its minimizer is the mean of the targets, so MSE chases the average and is pulled by outliers.

Q: What does the hinge loss actually reward? It rewards being correct with margin. As long as \(y_i \cdot \hat{y}_i \ge 1\) (right side, confidently), the loss is exactly zero — correctly-classified-with-margin points are ignored entirely, contributing no gradient. Inside the margin or on the wrong side it grows linearly (like MAE). This “stop caring once you’re safely right” behavior is what produces the max-margin boundary of an SVM, and it is the big philosophical contrast with cross-entropy, which never stops pushing for more confidence.
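The "stop caring once you are safely right" behavior is visible in a direct implementation. This sketch assumes NumPy, labels encoded as -1 or +1, and raw scores rather than probabilities; the four example points are invented for illustration.

```python
import numpy as np

def hinge_per_point(labels, scores):
    """Hinge loss for each point; labels in {-1, +1}, scores are raw outputs."""
    labels = np.asarray(labels, dtype=float)
    scores = np.asarray(scores, dtype=float)
    return np.maximum(0.0, 1.0 - labels * scores)

labels = [+1, +1, +1, -1]
scores = [2.0, 0.5, -1.0, -3.0]   # safe, inside margin, wrong side, safe

per_point = hinge_per_point(labels, scores)
mean_loss = float(per_point.mean())

print(per_point, mean_loss)
```

The two confidently correct points contribute exactly zero, the point inside the margin contributes 0.5, and the misclassified point contributes 2, growing linearly with how wrong the score is.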

4.7 — Perplexity for language models

Finally, the metric you will hear in every LLM paper. Perplexity is just cross-entropy dressed up for language — and it has a wonderfully concrete meaning. Intuition: perplexity is how many equally-likely choices the model feels it is choosing between at each token. A perplexity of 1 means perfect prediction; a perplexity of 50 means the model is as confused as if it were guessing uniformly among 50 words.

\[ \text{Perplexity} = \exp\!\Big(\!-\frac{1}{N}\sum_i \ln q(x_i)\Big) = \exp\big(\text{cross-entropy in nats}\big) \]

It is literally the exponential of the average cross-entropy, and the log inside the cross-entropy must be the natural log (nats) for \(\exp\) to invert it cleanly. (If you measured cross-entropy in bits with \(\log_2\), you would raise \(2\) to that instead — base-matching is what makes the identity exact.) Because \(\exp\) is monotonic, minimizing perplexity and minimizing cross-entropy are the same optimization — perplexity is just the human-readable version.

Q: What does a perplexity of 20 mean intuitively? The model is, on average, as uncertain as if it were picking uniformly among 20 options for the next token. Lower is better: a perplexity near 1 means it almost always nails the next token; a perplexity equal to the vocabulary size means it learned nothing and is guessing uniformly.
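The "effective number of choices" reading can be checked numerically. The sketch below assumes NumPy; `perplexity` is an illustrative helper that takes the probability the model assigned to each token that actually occurred, and the probability values are invented.

```python
import numpy as np

def perplexity(token_probs):
    """exp of the mean negative natural-log probability of the observed tokens."""
    probs = np.asarray(token_probs, dtype=float)
    return float(np.exp(-np.mean(np.log(probs))))

uniform_20 = perplexity(np.full(100, 1 / 20))    # always 1-in-20 on the right token
confident = perplexity([0.9, 0.8, 0.95, 0.99])   # mostly sure of the right token

# Base matching: cross-entropy in bits must be inverted with 2**, not exp.
bits = -np.mean(np.log2(np.full(100, 1 / 20)))
uniform_20_from_bits = float(2.0 ** bits)

print(uniform_20, confident, uniform_20_from_bits)
```

A model that always gives the correct token probability 1/20 scores a perplexity of 20, whichever log base is used, as long as the exponentiation uses the same base.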

Q: Why report perplexity instead of raw cross-entropy? Same information, friendlier scale. Cross-entropy in nats is abstract; perplexity maps it onto an interpretable “effective branching factor” — a number you can reason about as choices-per-token. It also makes models comparable at a glance, provided they share the same tokenizer and vocabulary.

Q: What is the catch when comparing perplexities across models? Perplexity is only comparable under the same tokenization and the same test set. A model with a bigger vocabulary or different tokenizer splits text into different units, so its per-token perplexity is not directly comparable — a subword model and a character model can report wildly different numbers on identical text. Always check the tokenizer before trusting a perplexity comparison.

Key takeaways

Surprise \(= -\log p(x)\): rare events are surprising, certain events carry zero information; logs make independent surprises add.
Entropy \(H(p) = -\sum p \log p\) is average surprise — uncertainty of a distribution; maximized by the uniform over a finite set, or by the Gaussian over the real line under fixed mean and variance.
Cross-entropy \(= H(p) + D_{KL}(p\|q)\): minimizing it is minimizing KL, since \(H(p)\) is fixed by the data.
KL divergence is non-negative, asymmetric, and not a distance. Forward KL is mass-covering, reverse KL is mode-seeking; pretraining minimizes data-vs-model KL, RLHF/DPO use reverse KL \(D_{KL}(\pi_\theta\|\pi_{ref})\) to stay on the reference manifold, and distillation matches the teacher.
Mutual information is the KL between the joint and the product of marginals — zero iff independent, and catches nonlinear dependence that correlation misses.
Losses come from Maximum Likelihood: Gaussian noise → MSE (its negative log-likelihood is squared error), Laplace noise → MAE, categorical → cross-entropy.
MAE is outlier-robust (minimizer = median); MSE punishes large errors harder (minimizer = mean, smooth gradients); Huber is quadratic near zero and linear in the tails to get both.
Hinge loss rewards a margin then stops (ignores confidently-correct points); cross-entropy never stops pushing for confidence.
Optimize cross-entropy, not accuracy — accuracy has no usable gradient; softmax + cross-entropy gives the clean \(\hat{y} - y\) gradient (softmax Jacobian and \(-\log\) cancel). Pass logits, not probabilities, and use the fused log_softmax for numerical stability.
Perplexity \(= \exp(\text{cross-entropy in nats})\): the effective number of choices per token, comparable only under the same tokenizer.
Next up — Chapter 05 takes these losses and asks how to actually minimize them: gradient descent and optimization, the bias-variance tradeoff, and the mechanics of training.
