Kader Mohideen
  • About
  • Blog
  • Projects
  • Health
  • AI & ML Encyclopedia
  • Extra
    • Interview Guide
    • AI Interview Prep
    • Book References
    • Quest for AGI
    • AI Papers
    • Lupus

In this chapter

  • 8.1 — Linear regression / OLS
  • 8.2 — Polynomial regression (and over/underfitting)
  • 8.3 — The bias–variance decomposition
  • 8.4 — Ridge / Lasso / Elastic Net
  • 8.5 — Logistic regression
  • 8.6 — Generalized Linear Models
  • 8.7 — Regression evaluation metrics
  • 8.8 — Quantile and robust regression
  • 8.9 — Generalized Linear Models from the Exponential Family
  • 8.10 — Quick reference
  • 8.11 — Key takeaways
  • 8.12 — See also

Chapter 08 — 📈 Regression

📖 All chapters  |  ← 07 · 🗜️ Dimensionality Reduction  |  09 · 📐 Classification Algorithms →

📚 Jump to any chapter

🧮 Mathematical Foundations

  • 01 · 🧮 Linear Algebra
  • 02 · ∂ Calculus & Differentiation
  • 03 · 📉 Optimization
  • 04 · 🎲 Probability & Statistics

🧭 The ML Workflow

  • 05 · 🌐 AI, ML & the Learning Process
  • 06 · 🧹 Data Preprocessing
  • 07 · 🗜️ Dimensionality Reduction

🧩 Classical Machine Learning

  • 08 · 📈 Regression
  • 09 · 📐 Classification Algorithms
  • 10 · 🌳 Ensemble Methods
  • 11 · 🔮 Clustering & Unsupervised Learning
  • 12 · 🎯 Model Evaluation & Tuning

🎲 Probabilistic Models

  • 13 · 🕸️ Probabilistic Graphical Models

🧠 Deep Learning

  • 14 · 🧠 Neural Networks (Core)
  • 15 · 🖼️ Convolutional Neural Networks
  • 16 · 🔁 Recurrent & Sequence Models
  • 17 · ⚡ Attention & Transformers
  • 18 · 🎨 Generative Models

🗣️ Applied AI: Vision, Language, Audio & Time

  • 19 · 👁️ Computer Vision
  • 20 · 💬 Natural Language Processing
  • 21 · 🔊 Speech & Audio Processing
  • 22 · ⏳ Time Series & Forecasting
  • 23 · 📚 Large Language Models
  • 24 · 🌈 Multimodal AI

🕹️ Reinforcement Learning

  • 25 · 🕹️ Reinforcement Learning

🛠️ Applied ML Systems & Industries

  • 26 · 🛒 Recommender Systems
  • 27 · 🚨 Anomaly & Fraud Detection
  • 28 · 🏦 ML Across Industries

🚀 Production, Tooling & Infrastructure

  • 29 · 🔧 MLOps & Deployment
  • 30 · 🚀 AI Infrastructure & Efficient Inference
  • 31 · 🧰 Tools & Frameworks

📚 Classical & Symbolic AI

  • 32 · 🧭 Search & Problem Solving
  • 33 · 📖 Knowledge Representation & Reasoning
  • 34 · 🗺️ Planning, Constraint Satisfaction & Game Playing
  • 35 · 🧬 Evolutionary Computation & Metaheuristics

⚖️ Responsible AI & Frontier

  • 36 · 🔍 Explainable AI & Interpretability
  • 37 · 🧷 Causal Inference
  • 38 · ⚖️ AI Ethics, Fairness & Safety
  • 39 · 🌠 Frontier & Emerging Directions

🎓 Advanced & Specialized Topics

  • 40 · 🔗 Graph Machine Learning
  • 41 · 🤖 Robotics & Autonomy
  • 42 · 📐 Learning Theory
  • 43 · 🔎 Information Retrieval & Data Mining
  • 44 · 🏗️ LLM Systems: Building LLMs from Scratch

🎚️ Post-Training & Fine-Tuning

  • 45 · 🎚️ Post-Training I — Transfer, Fine-Tuning & PEFT
  • 46 · 🏅 Post-Training II — Alignment & Evaluation

🚢 Model Serving & Deployment

  • 47 · 🚢 Model Serving & Deployment in Production

Regression is the corner of machine learning that answers “how much?” rather than “which class?” — predicting a continuous number (a price, a temperature, a risk score) from input features by fitting a function to data. It is the oldest, simplest, and most interpretable family of supervised models, and almost every more elaborate method — neural networks included — has a regression layer hiding somewhere inside it. Master this chapter and you own the vocabulary (coefficients, residuals, regularization, link functions) that the rest of the field reuses everywhere.

🧭 In context: Classical supervised learning · used to predict continuous values and model relationships between variables · the one key idea is fit a function by minimizing a loss between predictions and truth, then control its complexity.

💡 Remember this: Every model here fits a weighted sum of features by minimizing a chosen loss, then trades a little bias for less variance to generalize — that single recipe (loss + regularization) underlies all of regression.

parameters (β) → loss best fit

Every model in this chapter, at heart, rolls a ball downhill on a loss surface like this one — the whole job is choosing what the loss measures and how much curvature (complexity) to allow.

8.1 — Linear regression / OLS

Linear regression assumes the target \(y\) is a straight-line (affine) combination of the inputs: each feature gets a weight, you multiply and add, plus an intercept. With one feature it is the line \(y = \beta_0 + \beta_1 x\) you drew in school; with \(p\) features it is a flat plane (a hyperplane) in higher dimensions.

The intuition: imagine scattering points on paper and laying a stiff ruler through them so the ruler sits “as close as possible” to all of them at once. Ordinary Least Squares (OLS) makes “close” precise — it picks the line that minimizes the sum of squared vertical gaps between each point and the line. We square so that positive and negative misses do not cancel, and so that big misses hurt disproportionately.

Formally, with a data matrix \(X\) (one row per example, a leading column of 1s for the intercept) and target vector \(y\), we choose coefficients \(\beta\) to minimize the residual sum of squares:

\[ \text{RSS}(\beta) = \sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 = \lVert y - X\beta \rVert^2 \]

In words: add up the squared vertical misses between every point and the line, and pick the line that makes that total as small as possible.

Also written: \(\text{RSS}(\beta) = (y - X\beta)^\top (y - X\beta)\) — the same sum of squares expressed as a single dot product of the residual vector with itself.

A residual is the leftover error \(y_i - \hat{y}_i\) for one point — what the model failed to explain.

Here is a fitted line over five points, with the residuals drawn as the vertical sticks OLS is shrinking:

x (feature) y fitted line residual

Two ways to solve it. Because RSS is a smooth bowl-shaped (convex) function, it has exactly one bottom, and we can reach it two ways.

The normal equation sets the gradient to zero and solves in closed form:

\[ \hat{\beta} = (X^\top X)^{-1} X^\top y \]

In words: the best coefficients are found by “undoing” the feature correlations (\(X^\top X\)) applied to how each feature lines up with the target (\(X^\top y\)).

Also written: \(\hat{\beta} = X^{+} y\), where \(X^{+} = (X^\top X)^{-1}X^\top\) is the Moore–Penrose pseudoinverse — the same solution stated as a single projection of \(y\) onto the column space of \(X\) (see Linear Algebra for the projection geometry).

This is exact and needs no tuning, but inverting \(X^\top X\) costs roughly \(O(p^3)\) and the matrix can be ill-conditioned when features are collinear. It shines for small-to-medium \(p\).

Gradient descent instead walks downhill: start anywhere, repeatedly nudge \(\beta\) opposite the gradient \(-2X^\top(y - X\beta)\) scaled by a learning rate. It costs \(O(np)\) per step and scales to millions of features, but you must choose the step size and number of iterations. (See Optimization for the full theory.)

import numpy as np
# tiny dataset: hours studied -> exam score
X = np.array([[1,1],[1,2],[1,3],[1,4],[1,5]])  # col of 1s = intercept
y = np.array([52, 60, 67, 76, 84], dtype=float)

# --- normal equation ---
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta)            # [44.4  7.9]  ->  score ≈ 44.4 + 7.9*hours
print(X @ beta - y)    # residuals, should be small

The fitted \(\hat\beta \approx [44.4, 7.9]\) reads cleanly: a student studying zero hours is predicted ~44, and each coefficient is the change in \(y\) per one-unit change in that feature, holding the others fixed — here +7.9 points per extra hour.

In practice you reach for a library rather than hand-rolling the algebra. The scikit-learn equivalent handles the intercept, numerical stability, and prediction in three lines:

from sklearn.linear_model import LinearRegression
import numpy as np

hours = np.array([1,2,3,4,5]).reshape(-1, 1)   # no manual 1s column needed
score = np.array([52, 60, 67, 76, 84.])

model = LinearRegression().fit(hours, score)
print(model.intercept_, model.coef_)   # ~44.4, [~7.9]
print(model.predict([[6]]))            # extrapolate to 6 hours -> ~91.8

Assumptions (the fine print that makes OLS valid and its uncertainty estimates trustworthy): linearity of the relationship; independent errors; homoscedasticity (constant error variance across the range — not a fan that widens); and roughly normal errors for inference. Violate them and the line may still predict, but p-values and confidence intervals lie.

The picture to keep in your head: good residuals form a featureless horizontal band; a widening fan (heteroscedasticity) or a curved trend means an assumption is broken and your error bars cannot be trusted.

healthy: flat band ✓ widening fan ✗
Tip

Always standardize features before comparing coefficient magnitudes. On raw scales, a coefficient of 0.001 on “salary in dollars” may matter far more than a coefficient of 5 on “number of kids.”

8.2 — Polynomial regression (and over/underfitting)

What if the relationship curves? Polynomial regression keeps the model linear in its coefficients but feeds it nonlinear features: instead of just \(x\), give it \(x, x^2, x^3, \dots, x^d\). The math is identical OLS — only the columns of \(X\) change — yet the fitted curve can now bend.

The intuition is a trade. A degree-1 line is too stiff; if the truth wiggles, the line systematically misses — this is underfitting (high bias, the model is too simple to capture the pattern). Crank the degree to 15 and the curve threads every training point exactly, including the noise — overfitting (high variance, the model memorizes quirks that will not repeat). The art is the middle.

graph LR
  A[Degree too low<br/>underfit · high bias] --> B[Right degree<br/>captures signal]
  B --> C[Degree too high<br/>overfit · high variance]
  style B fill:#c6f6d5,stroke:#22543d
  style A fill:#fed7d7,stroke:#742a2a
  style C fill:#fed7d7,stroke:#742a2a

The picture below shows the same noisy points fit three ways — the stiff line, a sensible curve, and a wild high-degree curve that chases every dot:

underfit good overfit

The fix is not “never use high degree” but “test on data the model never saw.” Choose the degree by cross-validation; the right complexity is the one that minimizes error on a held-out set, not on the training set. (Full treatment in Model Evaluation & Tuning.)

In scikit-learn, the clean way to do polynomial regression is to chain feature expansion and a linear fit in a Pipeline, so the squared/cubed columns are generated automatically and consistently at predict time:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np

x = np.linspace(0, 1, 30).reshape(-1, 1)
y = np.sin(2*np.pi*x).ravel() + 0.1*np.random.randn(30)

model = make_pipeline(PolynomialFeatures(degree=4), LinearRegression())
model.fit(x, y)
print(model.predict([[0.25]]))   # curve can now bend to follow the sine
Warning

A polynomial that fits training data perfectly is a red flag, not a trophy. Zero training error with a degree near \(n-1\) almost always means you have fit noise and will generalize terribly.

8.3 — The bias–variance decomposition

Sections 8.1–8.2 kept appealing to “bias” and “variance” as the two enemies of a good fit. Here is the precise statement that names them — the single most quoted equation for why models fail to generalize, and the lens through which every regularizer, ensemble, and cross-validation choice in this book is justified.

The intuition: imagine refitting your model on many different random training sets drawn from the same world, then asking how it does on one fixed test point. Two things can go wrong. The model’s average prediction across all those fits can sit off-target (bias — it is systematically wrong, like a bathroom scale that always reads 2 kg light). Or the predictions can scatter wildly from fit to fit (variance — the scale gives a different number each time you step on it). On top of both sits irreducible noise the data itself carries, which no model can ever remove.

The dartboard picture is the fastest way to lock this in: bias is how far the cluster of shots sits from the bullseye; variance is how spread out the cluster is. You want both small — but the two are usually in tension.

low bias low var ✓ low bias high var high bias low var high bias high var ✗

For squared-error loss, the expected test error at a point \(x\) decomposes exactly into those three pieces:

\[ \mathbb{E}\big[(y - \hat f(x))^2\big] = \underbrace{\big(\text{Bias}[\hat f(x)]\big)^2}_{\text{systematic miss}} + \underbrace{\text{Var}[\hat f(x)]}_{\text{instability}} + \underbrace{\sigma^2}_{\text{irreducible noise}} \]

In words: how wrong you expect to be on a fresh point equals how far off your model is on average, squared, plus how much your model jiggles between training sets, plus the noise baked into the data that nobody can predict away.

Also written: with \(\text{Bias}[\hat f(x)] = \mathbb{E}[\hat f(x)] - f(x)\), the formula is \(\big(\mathbb{E}[\hat f(x)] - f(x)\big)^2 + \mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big] + \sigma^2\) — the same three terms written out as a squared average-miss, an expected squared deviation from the average, and the noise floor.

The decomposition explains the whole over/underfitting story as a tug-of-war: simple models (low-degree polynomials, heavy regularization) have high bias, low variance; flexible models (high degree, no penalty) have low bias, high variance. Total error is a U-shaped curve, and the sweet spot is the bottom of the U — never zero, because \(\sigma^2\) is a hard floor.

model complexity → error bias² variance total error sweet spot

A small worked feel for the numbers: suppose at some test point your model is biased by \(+3\) (predicts 3 too high on average), its predictions vary with standard deviation \(4\) (variance \(16\)), and the data noise has \(\sigma^2 = 5\). Expected squared error \(= 3^2 + 16 + 5 = 9 + 16 + 5 = 30\). Halve the variance (e.g. by adding mild regularization, accepting bias rising to \(4\)) and you get \(4^2 + 8 + 5 = 29\) — a net win even though you made the model more biased. That single trade is the entire justification for Ridge, Lasso, and bagging (the variance-cutting workhorse of Ensemble Methods).

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Estimate bias/variance empirically by refitting on many bootstrap samples.
rng = np.random.default_rng(0)
f = lambda x: np.sin(2*np.pi*x)            # true function
x_test = np.array([[0.4]])
preds = []
for _ in range(200):
    x = rng.random((40, 1))
    y = f(x).ravel() + 0.1*rng.standard_normal(40)
    preds.append(DecisionTreeRegressor(max_depth=8).fit(x, y).predict(x_test)[0])

preds = np.array(preds)
bias2 = (preds.mean() - f(x_test)[0,0])**2
var   = preds.var()
print(round(bias2, 4), round(var, 4))   # deep tree: tiny bias, large variance
Tip

You cannot measure bias and variance from a single train/test split — they are defined over the distribution of training sets. In practice cross-validation estimates their sum (the U-curve), and you tune complexity to minimize it. Lowering one term usually raises the other; the goal is the smallest total, not zero of either.

8.4 — Ridge / Lasso / Elastic Net

When you have many features (especially correlated ones), plain OLS coefficients can blow up to huge, unstable values — flip one data point and they swing wildly. Regularization fights this by adding a penalty on coefficient size to the loss, telling the model “fit the data, but keep yourself small.” This deliberately accepts a little bias to cut a lot of variance — exactly the trade the previous section made precise.

Ridge regression (L2) penalizes the sum of squared coefficients:

\[ \min_\beta \; \lVert y - X\beta \rVert^2 + \lambda \sum_j \beta_j^2 \]

In words: fit the data well and keep the coefficients small at the same time, with \(\lambda\) deciding how much you care about smallness versus fit.

Also written: \(\min_\beta \lVert y - X\beta\rVert_2^2 + \lambda\lVert\beta\rVert_2^2\) — the penalty is the squared L2 norm of the coefficient vector.

The strength \(\lambda \ge 0\) is a knob: \(\lambda = 0\) recovers OLS, large \(\lambda\) shrinks all coefficients smoothly toward zero (but never exactly to zero). Ridge even has a closed form, \(\hat\beta = (X^\top X + \lambda I)^{-1}X^\top y\) — the added \(\lambda I\) also cures the non-invertibility that breaks OLS under collinearity.

Lasso (L1) penalizes the sum of absolute values, \(\lambda \sum_j |\beta_j|\). The kink in the absolute value at zero does something Ridge cannot: it drives many coefficients to exactly zero, performing automatic feature selection — the model hands you a shorter, sparser equation (a lightweight cousin of the methods in Dimensionality Reduction).

In words: Lasso adds up the raw sizes (no squaring) of the coefficients and penalizes that, which has the side effect of switching weak features fully off.

Also written: the full objective is \(\min_\beta \lVert y - X\beta\rVert_2^2 + \lambda\lVert\beta\rVert_1\), where \(\lVert\beta\rVert_1 = \sum_j|\beta_j|\) is the L1 norm.

Why the geometry differs — in plain terms: regularizing is like fitting the line on a budget. The error surface still wants to slide toward the OLS solution, but you can only spend so much “coefficient size.” Ridge’s budget is a round coin; the error contour rolls onto it and stops at a slanted point where both coefficients are still nonzero. Lasso’s budget is a diamond with sharp corners poking out along the axes; the contour almost always first touches one of those corners — and a corner sits exactly where one coefficient is zero. That single shape difference is the whole reason Lasso zeroes features and Ridge merely shrinks them:

Ridge (L2): round → both ≠ 0 Lasso (L1): corner → one = 0

Elastic Net simply adds both penalties, \(\lambda(\alpha \sum|\beta_j| + (1-\alpha)\sum \beta_j^2)\). It gets Lasso’s sparsity while handling groups of correlated features more gracefully (Lasso alone tends to pick one of a correlated group arbitrarily; Elastic Net keeps them together).

In words: mix a little of both penalties — the L1 part still zeroes weak features, the L2 part stops correlated features from being dropped at random.

Also written: \(\min_\beta \lVert y - X\beta\rVert_2^2 + \lambda\big(\alpha\lVert\beta\rVert_1 + (1-\alpha)\lVert\beta\rVert_2^2\big)\), with the mixing parameter \(\alpha\in[0,1]\) sliding from pure Ridge (\(\alpha=0\)) to pure Lasso (\(\alpha=1\)).

import numpy as np
np.random.seed(0)
X = np.random.randn(50, 6)
true = np.array([3., 0, 0, -2., 0, 0])      # only 2 features matter
y = X @ true + 0.3*np.random.randn(50)

def ridge(X, y, lam):
    p = X.shape[1]
    return np.linalg.solve(X.T@X + lam*np.eye(p), X.T@y)

print(ridge(X, y, 0.0).round(2))   # OLS: noisy on the zero features
print(ridge(X, y, 10.0).round(2))  # shrunk toward 0, smaller & stable

The same experiment in scikit-learn shows Lasso’s killer feature — it sets the four useless coefficients to exactly zero, recovering the sparse truth, and LassoCV even picks \(\lambda\) for you by cross-validation:

from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# standardize first — penalties are scale-sensitive
model = make_pipeline(StandardScaler(), LassoCV(cv=5)).fit(X, y)
print(model[-1].coef_.round(2))   # ~[3, 0, 0, -2, 0, 0]: junk features zeroed
print(model[-1].alpha_)           # the λ chosen by cross-validation
Tip

Rule of thumb: reach for Lasso when you suspect most features are useless and want an interpretable subset; reach for Ridge when many features each contribute a little and are correlated; reach for Elastic Net when you want both and are unsure.

Which penalty should I use? This chooser folds the rule of thumb above into a quick decision path:

flowchart TD
  A["Coefficients unstable<br/>or too many features?"] -->|no| OLS["plain OLS — no penalty"]
  A -->|yes| B{"Want a sparse model<br/>(features switched off)?"}
  B -->|no| Ridge["Ridge (L2)<br/>shrink all, keep all"]
  B -->|yes| C{"Features correlated<br/>in groups?"}
  C -->|no| Lasso["Lasso (L1)<br/>zero out weak features"]
  C -->|yes| EN["Elastic Net<br/>sparse + group-stable"]
  style OLS fill:#bee3f8,stroke:#2b6cb0
  style Ridge fill:#c6f6d5,stroke:#22543d
  style Lasso fill:#fed7d7,stroke:#742a2a
  style EN fill:#fefcbf,stroke:#744210

Warning

Regularization penalties depend on feature scale, so you must standardize features first (see Data Preprocessing) — otherwise the penalty silently punishes large-unit features (like “income in dollars”) far more than small-unit ones. Also, never penalize the intercept.

8.5 — Logistic regression

Despite the name, logistic regression is a classification algorithm. The trick: instead of predicting a class directly, it predicts the probability of the positive class, then thresholds. To turn the unbounded linear score \(z = x^\top\beta\) (which can be any real number) into a probability in \([0,1]\), it passes \(z\) through the sigmoid (logistic) function:

\[ \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \hat p = \sigma(x^\top\beta) \]

In words: take the linear score, and squash it smoothly so that very negative scores become near-0 probabilities and very positive scores become near-1 probabilities.

Also written: \(\sigma(z) = \frac{e^{z}}{1 + e^{z}}\) — multiply top and bottom by \(e^{z}\) to get the equivalent form often seen in textbooks.

The sigmoid is an S-curve: very negative \(z \to 0\), very positive \(z \to 1\), and \(z=0 \to 0.5\).

0.5 at z=0 p→1 p→0 z = xᵀβ

The log-odds view makes the coefficients interpretable. Rearranging the sigmoid gives \(\log\frac{p}{1-p} = x^\top\beta\). So the linear part predicts the log-odds of the positive class, and each coefficient \(\beta_j\) is the change in log-odds per unit of feature \(j\); exponentiating, \(e^{\beta_j}\) is the odds ratio — multiply the odds by that factor per unit. A \(\beta_j = 0.7\) means odds roughly double (\(e^{0.7}\approx 2\)) per unit increase.

The decision boundary is where \(\hat p = 0.5\), i.e. \(x^\top\beta = 0\) — a straight line (hyperplane) in feature space. Logistic regression is therefore a linear classifier: points on one side are predicted positive, the other side negative.

We do not fit it with least squares; we maximize the likelihood of the observed labels (see Probability & Statistics), equivalently minimizing the cross-entropy (log) loss — the same loss that trains neural network classifiers — which is convex and solved by gradient descent:

\[ \mathcal{L}(\beta) = -\sum_i \big[ y_i \log \hat p_i + (1-y_i)\log(1-\hat p_i) \big] \]

In words: punish the model by how surprised it is at the true labels — a confident wrong prediction (saying 0.99 when the answer was 0) is penalized enormously, a confident right one barely at all.

Also written: \(\mathcal{L}(\beta) = -\sum_i \log \hat p_i^{(y_i)}\), where \(\hat p_i^{(y_i)}\) is the probability the model assigned to the correct class — the negative log-likelihood of the data.

import numpy as np
def sigmoid(z): return 1/(1+np.exp(-z))

# hours studied -> pass(1)/fail(0)
X = np.c_[np.ones(6), [1,2,3,4,5,6]]
y = np.array([0,0,0,1,1,1.])
beta = np.zeros(2)
for _ in range(5000):
    p = sigmoid(X @ beta)
    grad = X.T @ (p - y)          # gradient of cross-entropy
    beta -= 0.1 * grad            # gradient descent step
print(beta.round(2))             # boundary where xᵀβ=0
print(sigmoid(X @ beta).round(2))  # predicted pass-probabilities

The gradient \(X^\top(\hat p - y)\) is strikingly the same shape as linear regression’s — predicted minus actual, times the inputs. That elegant parallel is no accident; both are members of the next section’s family.

In production you would call scikit-learn, which adds L2 regularization by default, multi-class support, and calibrated probabilities:

from sklearn.linear_model import LogisticRegression
import numpy as np

hours = np.array([1,2,3,4,5,6]).reshape(-1, 1)
passed = np.array([0,0,0,1,1,1])

clf = LogisticRegression().fit(hours, passed)
print(clf.predict_proba([[3.5]]))   # P(fail), P(pass) at the boundary region
print(clf.coef_, clf.intercept_)    # slope = log-odds per study hour

A concrete odds-ratio read. Suppose a credit-default model gives feature “missed payments last year” a coefficient of \(\beta = 0.7\). Then \(e^{0.7}\approx 2.0\): each extra missed payment doubles the odds of default. Going from 0.7 to 1.4 (two coefficients’ worth) multiplies the odds by \(e^{1.4}\approx 4\). This multiplicative-on-odds reading is why logistic coefficients are so loved in medicine and credit scoring — you report “doubles the odds,” not an abstract slope.

Warning

Logistic regression outputs probabilities, but with imbalanced classes the default 0.5 threshold can be terrible. Tune the threshold to your costs, and judge with ROC-AUC or precision/recall, not raw accuracy (see Classification Algorithms and Model Evaluation & Tuning).

8.6 — Generalized Linear Models

Linear and logistic regression look different but share a skeleton. Generalized Linear Models (GLMs) name that skeleton: every GLM keeps a linear predictor \(\eta = x^\top\beta\), then connects it to the mean of the target through a link function \(g\), with the target drawn from a chosen distribution suited to its type.

\[ g(\mathbb{E}[y]) = x^\top\beta \]

In words: run the features through a weighted sum as usual, but first pass the target’s mean through a transform \(g\) so the unbounded linear scale can match a probability, a count, or any constrained quantity.

Also written: \(\mathbb{E}[y] = g^{-1}(x^\top\beta)\) — the same relationship solved for the mean, where \(g^{-1}\) (the inverse link or mean function) maps the linear score back to the response scale.

The link function is the adapter that maps the constrained mean (a probability in \([0,1]\), a count \(\ge 0\)) onto the unbounded linear scale. Pick the distribution to match your target, and the link follows naturally:

Target type Distribution Link \(g\) Model name
Continuous, real Gaussian identity Linear regression
Binary 0/1 Bernoulli logit \(\log\frac{p}{1-p}\) Logistic regression
Counts 0,1,2,… Poisson log Poisson regression
Positive, skewed Gamma log / inverse Gamma regression

Poisson regression is the everyday workhorse for counts — number of website visits, insurance claims, defects per batch. Because counts cannot be negative, it models \(\log \mathbb{E}[y] = x^\top\beta\), so \(\mathbb{E}[y] = e^{x^\top\beta}\) is always positive. A coefficient now reads multiplicatively: \(e^{\beta_j}\) is the factor by which the expected count multiplies per unit of feature \(j\) — e.g. \(\beta=0.4\) means \(e^{0.4}\approx 1.5\), a 50% increase in expected events.

A concrete use: an insurer modeling claim counts per policy-year against driver age and mileage uses Poisson regression precisely because most policies file 0 or 1 claims, the data is right-skewed, and predictions must stay nonnegative — an ordinary linear fit would happily predict \(-0.3\) claims.

graph LR
  X[features x] --> L["linear predictor<br/>η = xᵀβ"]
  L --> G{"link g⁻¹"}
  G -->|identity| M1[mean of y · Gaussian]
  G -->|sigmoid| M2[probability · Bernoulli]
  G -->|exp| M3[rate · Poisson]

The payoff of the GLM view is unification: one fitting procedure (maximum likelihood, solved by iteratively reweighted least squares), one mental model, swap the distribution and link to fit the data you actually have. It is the bridge from this chapter’s specific models to a single principled framework.

Swapping models in code is just as cheap as swapping them in theory — scikit-learn exposes PoissonRegressor, GammaRegressor, and TweedieRegressor with the same API as LinearRegression:

from sklearn.linear_model import PoissonRegressor
import numpy as np

X = np.array([[1.],[2.],[3.],[4.],[5.]])
y = np.array([1, 1, 2, 4, 8])             # counts

glm = PoissonRegressor(alpha=0.0).fit(X, y)
print(glm.coef_, glm.intercept_)          # slope read multiplicatively
print(glm.predict([[6]]))                 # expected count, guaranteed > 0
Tip

Choosing a GLM is mostly choosing the right distribution for your target: bounded 0/1 → Bernoulli/logit; nonnegative counts → Poisson/log; strictly positive skewed amounts (costs, durations) → Gamma. The link then almost picks itself.

8.7 — Regression evaluation metrics

A fitted model is only as good as its error on data it has not seen — but “error” has several flavors, each telling a different story. Suppose a model predicts house prices with these five test cases (truth vs. prediction, in $k):

Truth \(y\) Pred \(\hat y\) Error \(y-\hat y\)
200 190 10
250 270 −20
300 295 5
280 250 30
320 330 −10

MSE (Mean Squared Error) averages the squared errors: \(\frac1n\sum(y_i-\hat y_i)^2 = \frac{100+400+25+900+100}{5} = 305\). Squaring punishes big misses hard, so MSE is dominated by the worst predictions — good when large errors are especially costly, but sensitive to outliers, and its units are dollars-squared, which is hard to feel.

In words: square every miss, average them; one big blunder counts far more than several small ones.

Also written: \(\text{MSE} = \frac1n\lVert y - \hat y\rVert_2^2\) — the squared length of the error vector, divided by the number of points.

RMSE (Root MSE) takes the square root, \(\sqrt{305} \approx 17.5\), returning to the original units (k$). It is the most quoted regression metric: roughly “typical error size,” in the units you care about, still outlier-sensitive.

MAE (Mean Absolute Error) averages absolute errors: \(\frac{10+20+5+30+10}{5} = 15\). It treats all errors proportionally, so it is robust to outliers and reads as “on average we are off by $15k.” When MAE ≪ RMSE, a few large errors are inflating RMSE.

\(R^2\) (coefficient of determination) rescales error into “fraction of variance explained,” from 1 (perfect) down through 0 (no better than always predicting the mean \(\bar y\)) and even negative (worse than the mean):

\[ R^2 = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2} = 1 - \frac{\text{SS}_{\text{res}}}{\text{SS}_{\text{tot}}} \]

In words: compare your model’s leftover error against the error of a dumb baseline that always guesses the average; \(R^2\) is the fraction of that baseline error you managed to remove.

Also written: \(R^2 = 1 - \frac{\text{MSE}_{\text{model}}}{\text{Var}(y)}\) — dividing top and bottom by \(n\) turns the sums into the model’s mean squared error over the variance of the target.

With \(\bar y = 270\), \(\text{SS}_{\text{tot}} = 70^2+20^2+30^2+10^2+50^2 = 9400\) and \(\text{SS}_{\text{res}}=1525\), so \(R^2 = 1 - 1525/9400 \approx 0.84\) — the model explains about 84% of the variation in price.

This little visual maps the four metrics onto the same five test points — each metric is just a different way to summarize the red error bars:

10 20 5 30 10 MAE=15 RMSE≈17.5 (pulled up by the 30) R²≈0.84 = fraction of price-variance explained

Adjusted \(R^2\) fixes a flaw: plain \(R^2\) never decreases when you add a feature, even a useless random one, so it tempts overfitting. Adjusted \(R^2\) docks you for each predictor \(p\) relative to sample size \(n\):

\[ R^2_{\text{adj}} = 1 - (1-R^2)\frac{n-1}{n-p-1} \]

In words: start from \(R^2\), then pay a tax for every feature you added — if a feature did not improve the fit enough to cover its tax, the adjusted score drops.

Also written: \(R^2_{\text{adj}} = 1 - \frac{\text{SS}_{\text{res}}/(n-p-1)}{\text{SS}_{\text{tot}}/(n-1)}\) — the same formula seen as \(1\) minus a ratio of variance estimates that each divide by their degrees of freedom.

It rises only if the new feature earns its keep, making it the honest metric for comparing models of different size.

import numpy as np
y    = np.array([200,250,300,280,320.])
yhat = np.array([190,270,295,250,330.])
err  = y - yhat
mse  = np.mean(err**2);   rmse = np.sqrt(mse)
mae  = np.mean(np.abs(err))
r2   = 1 - np.sum(err**2)/np.sum((y-y.mean())**2)
print(round(mse,1), round(rmse,2), mae, round(r2,3))  # 305.0 17.46 15.0 0.838

scikit-learn ships all of these so you do not re-derive them, and they double as scoring functions for cross-validation and grid search:

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

y    = np.array([200,250,300,280,320.])
yhat = np.array([190,270,295,250,330.])
print(mean_squared_error(y, yhat))                 # 305.0
print(mean_squared_error(y, yhat) ** 0.5)          # RMSE ≈ 17.46
print(mean_absolute_error(y, yhat))                # 15.0
print(r2_score(y, yhat))                           # ≈ 0.838
Tip

Report at least two metrics: an error in real units (RMSE or MAE) for “how wrong on average,” and \(R^2\) for “how much better than guessing the mean.” The gap between RMSE and MAE tells you how outlier-driven your errors are.

Warning

\(R^2\) on the training set always flatters the model and rises with every added feature — never select models by training \(R^2\). Compare on held-out data, and use adjusted \(R^2\) or cross-validated error when models differ in number of features.

8.8 — Quantile and robust regression

Everything so far has quietly assumed two things: that you want to predict the average outcome, and that a few wild data points will not wreck the fit. Both assumptions break in the real world. A delivery service does not care about the average delivery time — it cares that 95% of orders arrive within the promised window. And one fat-fingered data-entry error (a $2,000,000 house typed as $20,000,000) can yank an OLS line clean off the cloud of honest points, because squaring makes that one residual dominate the whole loss. Two close cousins of linear regression fix these.

Quantile regression predicts a chosen percentile instead of the mean. Where OLS minimizes squared error (which targets the mean), quantile regression minimizes an asymmetric, tilted absolute loss called the pinball loss that targets the \(\tau\)-th quantile:

\[ \mathcal{L}_\tau(r) = \begin{cases} \tau \, r & \text{if } r \ge 0 \\ (\tau - 1)\, r & \text{if } r < 0 \end{cases}, \qquad r = y - \hat y \]

In words: charge a price \(\tau\) for every unit you predict too low and a price \(1-\tau\) for every unit you predict too high; the line that minimizes the total bill sits at the \(\tau\)-th quantile.

Also written: \(\mathcal{L}_\tau(r) = r\big(\tau - \mathbb{1}[r < 0]\big) = \max\big(\tau r,\ (\tau-1)r\big)\) — the same V-shaped loss with its two arms tilted by \(\tau\).

Set \(\tau = 0.5\) and the two penalties are equal: you recover ordinary least-absolute-deviations regression, whose target is the median. Set \(\tau = 0.9\) and underestimates cost nine times as much as overestimates, so the fit rides high — exactly the line you want for “the time 90% of deliveries beat.” Fitting the same data at several \(\tau\) values traces out a fan of prediction bands without ever assuming the noise is Gaussian:

τ=0.9 τ=0.5 (median) τ=0.1

Robust regression keeps the goal of predicting the mean but changes the loss so outliers stop dominating. The most famous choice is the Huber loss: squared error for small residuals (so it behaves like OLS near the bulk of the data), switching to linear error once a residual exceeds a threshold \(\delta\) (so a giant outlier contributes its size, not its size squared):

\[ \mathcal{L}_\delta(r) = \begin{cases} \tfrac12 r^2 & |r| \le \delta \\ \delta\,(|r| - \tfrac12\delta) & |r| > \delta \end{cases} \]

In words: treat ordinary misses with the usual squared penalty, but for any miss bigger than \(\delta\) stop squaring and grow only linearly — so one monster outlier can no longer hijack the fit.

Also written: \(\mathcal{L}_\delta(r) = \tfrac12\min(r^2,\ 2\delta|r| - \delta^2)\) — a single expression that automatically picks the smaller of the quadratic and the linear arm.

The two loss shapes side by side make the difference obvious — squared error rockets upward on big residuals, Huber flattens to a constant slope:

squared (OLS) Huber residual r
from sklearn.linear_model import QuantileRegressor, HuberRegressor
import numpy as np

rng = np.random.default_rng(0)
X = np.sort(rng.random((40, 1)), axis=0)
y = (3*X.ravel() + rng.standard_normal(40)*0.3)
y[5] += 8.0      # inject a wild outlier

# robust: outlier barely moves the line vs. plain OLS
print(HuberRegressor().fit(X, y).coef_)              # slope stays near 3

# quantile: predict the 90th-percentile outcome, not the mean
print(QuantileRegressor(quantile=0.9, alpha=0).fit(X, y).coef_)
Tip

Use quantile regression when the spread matters — service-level targets (“p95 latency”), demand planning (“stock for the high-demand day”), or any prediction interval you must guarantee. Use robust regression when you trust the shape of the relationship but not the cleanliness of every data point.

Warning

Robustness is not free. Huber’s threshold \(\delta\) must be chosen (scikit-learn’s epsilon defaults to 1.35, calibrated for Gaussian noise), and if “outliers” are actually a real second regime in your data, downweighting them throws away signal. Always plot residuals before deciding a point is noise.

8.9 — Generalized Linear Models from the Exponential Family

Up to now the regression models in this chapter have looked like separate inventions. Ordinary least squares assumes the response is a continuous number with constant-variance Gaussian noise. Logistic regression assumes the response is a 0/1 label. Poisson regression — the natural tool for counts — has its own likelihood again. It is tempting to memorize each as its own recipe.

The deep insight of generalized linear models (GLMs) is that all three are the same model seen from three angles. They differ only in which member of a single family of probability distributions — the exponential family — you plug in for the response. Once you see the family, you get one fitting algorithm, one notion of “the right link,” and one diagnostic story that covers Gaussian, Bernoulli, Poisson, Gamma, and more.

The intuition: a GLM keeps the familiar linear predictor \(\eta = \mathbf{x}^\top\boldsymbol\beta\) — a weighted sum of features — but inserts two flexible slots. First, the response can be drawn from any exponential-family distribution, not just a Gaussian. Second, a link function decides how the linear predictor connects to the distribution’s mean. Choose Gaussian + identity link and you recover OLS. Choose Bernoulli + logit link and you recover logistic regression. Choose Poisson + log link and you recover Poisson regression. Nothing else changes.

The whole framework really is just two interchangeable slots feeding one fixed engine — picture it as a control panel:

slot 1: distribution Gaussian Bernoulli Poisson Gamma … slot 2: link g identity logit log inverse … one engine: IRLS (max-likelihood)

The exponential-family form

A distribution belongs to the exponential family if its density (or mass) for a single observation \(y\) can be written as

\[ p(y \mid \theta, \phi) = \exp\!\left( \frac{y\,\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right). \]

In words: any exponential-family distribution is “an exponential of (data times a parameter, minus a normalizer), scaled by a spread term” — a single mould that Gaussian, Bernoulli, and Poisson all pour into.

Also written: \(p(y\mid\theta,\phi) = h(y,\phi)\,\exp\!\big(\frac{y\theta - b(\theta)}{a(\phi)}\big)\), pulling the base measure out as a multiplicative factor \(h(y,\phi)=e^{c(y,\phi)}\).

Each piece has a name and a job:

Symbol Name Role
\(\theta\) natural (canonical) parameter the “knob” the linear predictor will ultimately turn
\(y\) sufficient statistic the only thing about the data the parameter interacts with
\(b(\theta)\) log-partition (cumulant) function normalizes the density; its derivatives give the mean and variance
\(a(\phi)\) dispersion / scale spreads or shrinks the distribution (e.g. Gaussian variance)
\(c(y,\phi)\) base measure a term independent of \(\theta\); it does not affect the fit

The magic is in \(b(\theta)\). Two clean identities fall straight out of the requirement that the density integrates to one:

\[ \mathbb{E}[y] = \mu = b'(\theta), \qquad \operatorname{Var}[y] = a(\phi)\,b''(\theta). \]

In words: differentiate the normalizer once to get the mean, twice to get the variance (up to the spread factor) — no integrals required.

Also written: \(\mu = \frac{db}{d\theta}\) and \(\operatorname{Var}[y] = a(\phi)\frac{d^2 b}{d\theta^2}\), the first and second derivatives (cumulants) of \(b\).

So the mean is the first derivative of the log-partition, and the variance is the second derivative times the dispersion. The factor \(b''(\theta)\), written as a function of \(\mu\), is called the variance function \(V(\mu)\) — it tells you how the noise scales with the mean. For a Gaussian it is constant; for a Poisson it equals the mean itself. That single difference is why count data fans out as it grows.

Tip

The whole family is “designed” so that knowing the natural parameter \(\theta\) tells you everything: the mean is \(b'(\theta)\), the variance is \(b''(\theta)\) up to scale. You never integrate anything by hand — you differentiate \(b\).

Three familiar models, one template

Let us actually unpack the algebra for the three workhorses. The trick each time is to take the textbook density and bend it into the \(\exp\!\big(\frac{y\theta - b(\theta)}{a(\phi)} + c\big)\) shape, then read off the pieces.

Gaussian. The density \(\frac{1}{\sqrt{2\pi\sigma^2}}\exp\!\big(-\frac{(y-\mu)^2}{2\sigma^2}\big)\) expands. The cross term \(y\mu/\sigma^2\) is linear in \(y\), so \(\theta = \mu\) and \(b(\theta) = \theta^2/2\). Then \(b'(\theta) = \theta = \mu\) — the mean — and \(b''(\theta) = 1\), a constant variance function. Dispersion \(a(\phi) = \sigma^2\).

Bernoulli. The mass \(\mu^y(1-\mu)^{1-y}\) becomes \(\exp\!\big(y\log\frac{\mu}{1-\mu} + \log(1-\mu)\big)\). The thing multiplying \(y\) is the log-odds, so \(\theta = \log\frac{\mu}{1-\mu}\). Inverting gives \(\mu = \frac{1}{1+e^{-\theta}}\) — the logistic sigmoid appears on its own. Here \(b(\theta) = \log(1+e^{\theta})\) and \(a(\phi)=1\).

Poisson. The mass \(\frac{\mu^y e^{-\mu}}{y!}\) becomes \(\exp\!\big(y\log\mu - \mu - \log y!\big)\). The coefficient of \(y\) is \(\log\mu\), so \(\theta = \log\mu\), giving \(\mu = e^{\theta}\), \(b(\theta) = e^{\theta}\), and \(a(\phi)=1\).

Model Distribution \(\theta\) (natural param) \(b(\theta)\) \(\mu = b'(\theta)\) \(V(\mu) = b''\)
Linear Gaussian \(\mu\) \(\theta^2/2\) \(\theta\) \(1\)
Logistic Bernoulli \(\log\frac{\mu}{1-\mu}\) \(\log(1+e^\theta)\) \(\frac{1}{1+e^{-\theta}}\) \(\mu(1-\mu)\)
Poisson Poisson \(\log\mu\) \(e^\theta\) \(e^\theta\) \(\mu\)

Read the third column. In every case, the natural parameter \(\theta\) is some specific function of the mean \(\mu\). That function — the one that turns the mean into the natural parameter — is exactly the canonical link.

The canonical link

The link function \(g\) is the bridge between the linear world and the mean of the response:

\[ g(\mu) = \eta = \mathbf{x}^\top\boldsymbol\beta. \]

In words: transform the mean by \(g\) so it lands on the same unbounded scale as the weighted sum of features.

Also written: \(\mu = g^{-1}(\mathbf{x}^\top\boldsymbol\beta)\) — solving for the mean with the inverse link, also called the activation or mean function.

You are always free to pick any monotonic, differentiable \(g\). But there is one canonical choice that makes the math line up beautifully: set the link equal to the map from mean to natural parameter, so that

\[ g(\mu) = \theta \quad\Longrightarrow\quad \theta = \mathbf{x}^\top\boldsymbol\beta. \]

When you use the canonical link, the linear predictor is the natural parameter. Reading off the table:

  • Gaussian: \(\theta = \mu\), so the canonical link is the identity, \(g(\mu)=\mu\). Linear regression.
  • Bernoulli: \(\theta = \log\frac{\mu}{1-\mu}\), so the canonical link is the logit. Logistic regression.
  • Poisson: \(\theta = \log\mu\), so the canonical link is the log. Poisson regression.

flowchart LR
    X["features x"] -->|"β"| ETA["linear predictor<br/>η = xᵀβ"]
    ETA -->|"inverse link g⁻¹"| MU["mean μ"]
    MU -->|"exp-family<br/>distribution"| Y["response y"]
    subgraph canonical["canonical link: g(μ) = θ"]
        direction TB
        G1["Gaussian → identity"]
        G2["Bernoulli → logit"]
        G3["Poisson → log"]
    end

The canonical link is not mandatory — you can do Poisson regression with an identity link, or fit a probit model (Gaussian-CDF link) on binary data instead of logit. But the canonical link buys real conveniences: the score equations simplify, the observed and expected information matrices coincide, and the sufficient statistic for \(\boldsymbol\beta\) becomes simply \(\mathbf{X}^\top\mathbf{y}\). That last fact is why logistic and Poisson fits are so well-behaved.

Warning

“Canonical” means mathematically convenient, not always correct. The link should match how your mean actually behaves — a log link is right for counts because it keeps \(\mu>0\), but if your binary outcome is better described by a latent Gaussian threshold, probit may fit better than logit even though logit is canonical.

Fitting by maximum likelihood: the IRLS view

A GLM is fit by maximum likelihood — find the \(\boldsymbol\beta\) that makes the observed data most probable. Stack the per-observation log-densities into the log-likelihood \(\ell(\boldsymbol\beta)\) and set its gradient (the score) to zero. Using the chain rule through \(\eta \to \mu \to \theta\), the score for the whole dataset collapses into a strikingly uniform expression:

\[ \frac{\partial \ell}{\partial \boldsymbol\beta} = \sum_{i=1}^n \frac{(y_i - \mu_i)}{V(\mu_i)\,g'(\mu_i)}\,\mathbf{x}_i = \mathbf{0}. \]

In words: at the best fit, the features must be uncorrelated with the residuals — but each residual is first reweighted by how noisy that point is and how steep the link is there.

Also written: \(\mathbf{X}^\top \mathbf{W}(\mathbf{y} - \boldsymbol\mu)^{*} = \mathbf{0}\), the score in matrix form, where the per-point factor \(\frac{1}{V(\mu_i)g'(\mu_i)}\) rolls into a diagonal weight applied to each scaled residual.

Notice the shape: every term is a residual \(y_i - \mu_i\), reweighted by how noisy that point is (\(V(\mu_i)\)) and how steep the link is there (\(g'(\mu_i)\)). Unlike OLS, there is no closed-form solution in general — the \(\mu_i\) depend nonlinearly on \(\boldsymbol\beta\). So we solve it iteratively.

The standard solver is Iteratively Reweighted Least Squares (IRLS). Here is the plain idea before the symbols: you cannot solve the whole curved problem in one shot, so you fake it as a straight one. Stand at your current guess, pretend the model is locally an ordinary weighted least-squares fit with a made-up target and made-up weights, solve that easy problem, step to the answer, and repeat. Each round the fake gets closer to the truth, and after a handful of rounds the coefficients stop moving. (It is exactly Newton’s method with Fisher scoring, wearing the disguise of “just one more weighted OLS fit.”)

The two made-up ingredients are the working response \(z\) and the working weights \(w\):

\[ z_i = \eta_i + (y_i - \mu_i)\,g'(\mu_i), \qquad w_i = \frac{1}{V(\mu_i)\,g'(\mu_i)^2}, \] \[ \boldsymbol\beta^{\text{new}} = (\mathbf{X}^\top \mathbf{W}\mathbf{X})^{-1}\mathbf{X}^\top \mathbf{W}\mathbf{z}. \]

In words: pretend, for this one step, that the problem is ordinary weighted least squares with a faked-up target \(z\) and weights \(w\), solve it, then recompute and repeat until nothing moves.

Also written: the update is the weighted normal equation \(\boldsymbol\beta^{\text{new}} = \arg\min_\beta \sum_i w_i (z_i - \mathbf{x}_i^\top\beta)^2\) — every IRLS step is literally one weighted OLS fit.

The working response \(z\) is the current linear prediction nudged toward the data; the weights \(w\) downweight observations the model thinks are noisy. You recompute \(\mu\), \(z\), and \(w\) and repeat until \(\boldsymbol\beta\) stops moving. Each iteration is one weighted OLS fit — code you already have.

flowchart TD
    A["initialize β (e.g. β = 0)"] --> B["η = Xβ,  μ = g⁻¹(η)"]
    B --> C["working response<br/>z = η + (y−μ)·g'(μ)"]
    C --> D["weights<br/>w = 1 / (V(μ)·g'(μ)²)"]
    D --> E["weighted least squares<br/>β = (XᵀWX)⁻¹XᵀWz"]
    E --> F{"β converged?"}
    F -->|no| B
    F -->|yes| G["return β̂"]

Worked example: Poisson regression by hand-rolled IRLS

Take a tiny count dataset — say, number of defects \(y\) on a unit versus a single exposure feature \(x\). We fit a Poisson GLM with the canonical log link, where \(\mu = e^{\eta}\), the variance function is \(V(\mu)=\mu\), and the link derivative is \(g'(\mu)=1/\mu\). Plugging into the IRLS formulas, the working weights become simply \(w_i = \mu_i\) and the working response is \(z_i = \eta_i + (y_i-\mu_i)/\mu_i\).

import numpy as np

# tiny dataset: feature x, count response y
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1,    1,    2,    4,    8])
X = np.column_stack([np.ones_like(x), x])   # add intercept column

beta = np.zeros(2)                           # start at 0
for _ in range(25):
    eta = X @ beta
    mu  = np.exp(eta)                        # inverse log link
    z   = eta + (y - mu) / mu                # working response
    w   = mu                                 # working weights (Poisson)
    WX  = X * w[:, None]
    beta = np.linalg.solve(X.T @ WX, X.T @ (w * z))   # weighted LS

print("beta =", beta)                        # intercept, slope
print("fitted mu =", np.exp(X @ beta))
# beta = [-0.401  0.494]
# fitted mu = [0.91 1.49 2.44 3.99 6.54]

In a handful of iterations the slope settles near \(0.49\), meaning each unit of \(x\) multiplies the expected count by \(e^{0.49}\approx 1.64\) — the multiplicative interpretation that the log link guarantees. The fitted means stay positive automatically, something an identity-link linear fit could never promise on counts. Swap two lines — mu = 1/(1+np.exp(-eta)), w = mu*(1-mu) — and the same loop fits logistic regression. That reuse is the entire point of the GLM framework: one algorithm, parameterized by the distribution’s variance function and link.

For real work you would lean on statsmodels, which gives you the same fit plus standard errors, deviance, and the full inference table that maximum likelihood affords:

import numpy as np
import statsmodels.api as sm

x = np.array([1.,2.,3.,4.,5.]); y = np.array([1,1,2,4,8])
X = sm.add_constant(x)
glm = sm.GLM(y, X, family=sm.families.Poisson()).fit()   # IRLS under the hood
print(glm.params)        # ~[-0.40, 0.49], matching the hand-rolled loop
print(glm.summary())     # std errors, deviance, p-values for free
Warning

IRLS can diverge or stall when the model is near-separable (logistic data perfectly split by a feature) or when extreme fitted means make weights overflow. Watch for coefficients sprinting toward \(\pm\infty\), add a tiny ridge penalty to \(\mathbf{X}^\top\mathbf{W}\mathbf{X}\), or fall back to a damped Newton step. These are not flaws in the data so much as the likelihood honestly reporting that the maximum lies at infinity.

8.10 — Quick reference

Term / formula Meaning in one line When / why to use
OLS — \(\hat\beta=(X^\top X)^{-1}X^\top y\) Line minimizing squared residuals; closed-form normal equation Small \(p\), baseline fit, interpretable coefficients
Gradient descent Walk \(\beta\) downhill on the loss Many features/rows where inverting \(X^\top X\) is too costly
Coefficient \(\beta_j\) Change in \(y\) per unit of feature \(j\), others fixed Reading effect sizes — only meaningful on standardized features
Residual \(y_i-\hat y_i\) What the model failed to explain at point \(i\) Diagnose fit; flat band = good, fan/curve = broken assumption
Polynomial features \(x,x^2,\dots,x^d\) Bend the fit while staying linear-in-\(\beta\) Curved relationships; pick degree by held-out error
Bias²+Var+\(\sigma^2\) Expected test error splits into 3 parts Mental model for every over/underfit decision
Ridge (L2) — \(\lambda\sum\beta_j^2\) Shrink all coefficients smoothly toward 0 Many correlated features each contributing a little
Lasso (L1) — \(\lambda\sum|\beta_j|\) Drives weak coefficients to exactly 0 Sparse/interpretable model; automatic feature selection
Elastic Net Blend of L1 + L2 penalties Sparsity wanted but features come in correlated groups
Sigmoid \(\sigma(z)=1/(1+e^{-z})\) Squash a linear score into a \([0,1]\) probability Logistic regression’s link to class probability
Log-odds / odds ratio \(e^{\beta_j}\) Each unit multiplies the odds by \(e^{\beta_j}\) Interpreting logistic coefficients (medicine, credit)
Cross-entropy loss Penalize confident wrong probabilities heavily Training logistic regression / classifiers
GLM — \(g(\mathbb{E}[y])=x^\top\beta\) Linear predictor + link + target distribution Counts (Poisson/log), positive skewed (Gamma), binary (logit)
Canonical link \(g(\mu)=\theta\) Link that makes \(\eta\) equal the natural parameter Convenient ML fit; identity/logit/log for Gaussian/Bernoulli/Poisson
IRLS Fit a GLM as repeated weighted OLS steps The one solver behind all maximum-likelihood GLM fits
Pinball loss \(\mathcal{L}_\tau\) Asymmetric absolute loss targeting the \(\tau\)-quantile Predicting percentiles (p95 latency, demand bands)
Huber loss Squared near 0, linear past \(\delta\) Robust fit when a few outliers would hijack OLS
RMSE / MAE Typical error size in real units Report “how wrong on average”; gap signals outliers
\(R^2\) / adjusted \(R^2\) Fraction of variance explained (adj. taxes extra features) Compare to mean-baseline; adjusted when models differ in size

8.11 — Key takeaways

  • OLS fits a hyperplane by minimizing squared residuals; solve it exactly via the normal equation \((X^\top X)^{-1}X^\top y\) for small \(p\), or by gradient descent at scale.
  • A coefficient is the change in the target per one-unit change in its feature, others held fixed — interpretable only when features are comparably scaled.
  • Polynomial regression is still linear-in-coefficients; degree controls the bias–variance trade — pick it by held-out error, never training error.
  • The bias–variance decomposition splits expected error into bias², variance, and an irreducible noise floor \(\sigma^2\) — the reason every regularizer trades a little bias for a lot of variance.
  • Ridge (L2) shrinks coefficients smoothly; Lasso (L1) zeroes some out for feature selection; Elastic Net blends both. Always standardize first and never penalize the intercept.
  • Logistic regression is classification: sigmoid turns a linear score into a probability, coefficients are log-odds, and the boundary \(x^\top\beta=0\) is linear.
  • GLMs unify all of the above as a linear predictor plus a link function and a target distribution (Gaussian/identity, Bernoulli/logit, Poisson/log), fit by maximum likelihood via IRLS — one weighted-OLS loop per step.
  • Quantile regression (pinball loss) predicts percentiles, not means; robust regression (Huber loss) keeps outliers from hijacking the fit.
  • Evaluate with RMSE/MAE (real units), \(R^2\) (variance explained), and adjusted \(R^2\) when comparing models of different size.

8.12 — See also

  • Optimization — gradient descent, convexity, and the machinery behind fitting these losses.
  • Linear Algebra — matrices, projections, and the normal equation’s geometry.
  • Probability & Statistics — likelihood, distributions, and the inference behind GLMs and OLS assumptions.
  • Classification Algorithms — where logistic regression sits among classifiers and how to evaluate them.
  • Model Evaluation & Tuning — cross-validation, the bias–variance trade-off, and regularization strength selection.
  • Data Preprocessing — feature scaling and encoding that regularized and GLM models depend on.
  • Neural Networks (Core) — how a stack of these linear-plus-nonlinearity units generalizes regression.

↪ The thread continues → Chapter 09 · 📐 Classification Algorithms

Regression predicts a number; flip the question to predicting a category and you get classification — the same machinery aimed at a different target, and the workhorse of applied ML.


📖 All chapters  |  ← 07 · 🗜️ Dimensionality Reduction  |  09 · 📐 Classification Algorithms →

 

© Kader Mohideen