Chapter 09 — 📐 Classification Algorithms

📖 All chapters | ← 08 · 📈 Regression | 10 · 🌳 Ensemble Methods →

📚 Jump to any chapter

🧮 Mathematical Foundations

🧭 The ML Workflow

🧩 Classical Machine Learning

🎲 Probabilistic Models

13 · 🕸️ Probabilistic Graphical Models

🧠 Deep Learning

🗣️ Applied AI: Vision, Language, Audio & Time

🕹️ Reinforcement Learning

25 · 🕹️ Reinforcement Learning

🛠️ Applied ML Systems & Industries

🚀 Production, Tooling & Infrastructure

📚 Classical & Symbolic AI

⚖️ Responsible AI & Frontier

🎓 Advanced & Specialized Topics

🎚️ Post-Training & Fine-Tuning

🚢 Model Serving & Deployment

47 · 🚢 Model Serving & Deployment in Production

Classification is the task of assigning a discrete label to an input: spam or not, dog or cat, which of ten digits. It sits at the heart of classical machine learning — once you can draw a boundary that separates classes, you can power fraud filters, medical screens, and image taggers. This chapter walks through the workhorse classifiers, from the lazy memorizer that just looks at neighbors to the margin-maximizing SVM, building each one from its core intuition up.

🧭 In context: Classical Machine Learning · supervised prediction of discrete labels · every classifier is just a different way to carve the feature space into decision regions.

💡 Remember this: Every classifier in this chapter does the same job — carve the feature space into regions — and differs only in what shape of boundary it bets on; start with logistic regression and climb in complexity only when the metric demands it.

The big picture first. Every classifier in this chapter does the same job — split the feature space into regions, one per class — but each makes a different bet about what the boundary should look like. k-NN bets “you look like your neighbors.” Naive Bayes bets “features are independent given the class.” SVM bets “the widest gap is safest.” Trees bet “a stack of yes/no questions is enough.” Keep that one sentence per method in mind and the rest is detail.

9.1 — k-Nearest Neighbors

The simplest idea in all of classification: to label a new point, look at the points near it and let them vote. k-Nearest Neighbors (k-NN) stores the entire training set and, at prediction time, finds the $k$ closest training points to the query and returns their majority class. There is no “training” beyond memorizing the data — which is why k-NN is called a lazy learner: it defers all work to query time, in contrast to eager learners (trees, SVMs) that build a model up front.

Intuition: it’s the “you are who you hang out with” classifier. Drop a new point on the map, find its $k$ closest companions, and let them outvote each other. No formula is learned — the training data is the model.

The whole method rests on a distance metric. The default is Euclidean distance, the straight-line distance:

\[d(\mathbf{x}, \mathbf{x}') = \sqrt{\sum_{i=1}^{n}(x_i - x'_i)^2}\]

In words: square the gap on each feature, add them up, take the square root — the ordinary “as the crow flies” distance. Also written: $d(\mathbf{x},\mathbf{x}') = \|\mathbf{x}-\mathbf{x}'\|_2 = \sqrt{(\mathbf{x}-\mathbf{x}')^\top(\mathbf{x}-\mathbf{x}')}$ (vector/norm form).

Other choices: Manhattan distance $\sum_i |x_i - x'_i|$ (movement along a grid), and Minkowski distance $\left(\sum_i |x_i - x'_i|^p\right)^{1/p}$, which generalizes both ($p=2$ Euclidean, $p=1$ Manhattan).

Worked example. Three training points, one query at $\mathbf{q}=(2,2)$:

point	coords	class	dist to q
A	(1,1)	🔴	$\sqrt{2}\approx1.41$
B	(3,1)	🔵	$\sqrt{2}\approx1.41$
C	(4,4)	🔵	$\sqrt{8}\approx2.83$

With $k=3$: the three nearest are A, B, C → votes 🔴×1, 🔵×2 → predict 🔵. With $k=1$ there is a genuine problem: A and B sit at the same distance $\sqrt2$ from the query, so the “single nearest neighbor” is a tie between A (🔴) and B (🔵). The prediction is undefined until you break the tie — by a fixed rule (lowest index), by random choice, or by falling back to a larger $k$. The lesson: small $k$ is fragile, and the answer can flip depending on a tie-break you may not even notice.

import numpy as np
def knn_predict(X, y, q, k=3):
    d = np.sqrt(((X - q)**2).sum(axis=1))   # distance to every training point
    idx = d.argsort()[:k]                    # k smallest (argsort breaks ties by index)
    return np.bincount(y[idx]).argmax()      # majority vote; ties -> lowest label

Note the two implicit tie-breaks here: argsort orders equal distances by their original index, and bincount(...).argmax() returns the lowest label on a vote tie. The prose advice to use odd $k$ removes vote ties for two classes, but as the code shows it does not remove distance ties — those are settled by index order. Be aware of both.

In practice you never hand-roll this — scikit-learn’s KNeighborsClassifier builds a spatial index (a KD-tree or ball-tree) so queries don’t scan all $N$ points, and bundles scaling into a pipeline:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# StandardScaler is essential — k-NN is scale-sensitive (see warning below)
clf = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=3, weights="distance"))
clf.fit(X_train, y_train)            # "fit" just stores the (scaled) data + builds the index
print(clf.predict(q.reshape(1, -1)))

weights="distance" is a useful upgrade over plain voting: closer neighbors get a louder vote (weight $\propto 1/d$), which softens the tie problem and often beats uniform voting.

Choosing k is the central knob. Small $k$ (e.g. 1) gives a jagged boundary that fits noise — high variance. Large $k$ smooths the boundary but can wash out real structure — high bias. A common rule of thumb is $k \approx \sqrt{N}$, tuned by cross-validation. Use odd $k$ for two classes to avoid vote ties.

The SVG below shows how the decision regions change. With $k=1$ the boundary hugs every point (note the small island of red intruding into blue territory); larger $k$ produces a smoother frontier.

Warning

k-NN is acutely sensitive to feature scale: a feature measured in thousands (salary) will dominate the distance and drown out a feature in single digits (years of experience). Always standardize features before using k-NN. It also suffers the curse of dimensionality — in high dimensions all points become roughly equidistant.

9.2 — Naive Bayes

Naive Bayes flips the question around. Instead of asking “where is the boundary?”, it asks “given this input, which class makes it most probable?” — a generative view. It builds on Bayes’ rule:

Intuition: you start with how common each class is (the prior), then let the evidence — the words in the email, the pixels in the image — push your belief up or down. Naive Bayes is just bookkeeping for “which explanation fits the evidence best.”

\[P(C \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid C)\, P(C)}{P(\mathbf{x})}\]

In words: the belief in a class after seeing the data equals how well that class predicts the data, times how common the class was to begin with, rescaled so the probabilities add to one. Also written: $\text{posterior} \propto \text{likelihood} \times \text{prior}$, i.e. $P(C\mid\mathbf{x}) \propto P(\mathbf{x}\mid C)\,P(C)$ (the denominator $P(\mathbf{x})$ is constant across classes and drops out of the argmax).

where $P(C)$ is the prior (how common the class is), $P(\mathbf{x}\mid C)$ is the likelihood (how typical the features are for that class), and $P(C\mid\mathbf{x})$ is the posterior we want. Since $P(\mathbf{x})$ is the same across classes, we just pick the class maximizing $P(\mathbf{x}\mid C)P(C)$.

The hard part is $P(\mathbf{x}\mid C)$ for many features. The “naive” conditional independence assumption says: given the class, every feature is independent of the others. That turns a joint probability into a simple product:

\[P(\mathbf{x}\mid C) = \prod_{i} P(x_i \mid C)\]

In words: pretend the features don’t talk to each other once you know the class, so the chance of the whole input is just each feature’s chance multiplied together. Also written: in log space, $\log P(\mathbf{x}\mid C) = \sum_i \log P(x_i\mid C)$ — a sum instead of a product (the form actually used in code, to dodge underflow).

This is almost never literally true (in spam, “free” and “money” co-occur), yet it works remarkably well in practice and makes the math trivial.

Three flavors fit different data:

variant	feature type	example
Gaussian	continuous	iris petal length
Multinomial	counts	word frequencies in a document
Bernoulli	binary	word present / absent

Worked example — spam. Priors: $P(\text{spam})=0.4$, $P(\text{ham})=0.6$. Word likelihoods:

word	P(word\|spam)	P(word\|ham)
“free”	0.6	0.1
“money”	0.5	0.1

Email contains both words. Spam score: $0.4 \times 0.6 \times 0.5 = 0.12$. Ham score: $0.6 \times 0.1 \times 0.1 = 0.006$. Spam wins by $0.12/0.006 = 20\times$, so classify as spam.

The SVG makes the mechanism visible: for each class we stack the prior against the two likelihoods, multiply down the column, and compare the final bars. Spam’s tall likelihoods overwhelm its slightly smaller prior.

# multinomial-style scoring in log space (avoids underflow)
import numpy as np
log_spam = np.log(0.4) + np.log(0.6) + np.log(0.5)   # = -2.12
log_ham  = np.log(0.6) + np.log(0.1) + np.log(0.1)   # = -5.12
print("spam" if log_spam > log_ham else "ham")        # -> spam

The log-space difference $-2.12 - (-5.12) = 3.0$ corresponds to a ratio $e^{3.0}\approx20$ — the same 20× margin, now computed by addition instead of multiplication, which is what keeps long documents from underflowing to zero.

The whole text-classification pipeline — counting words, smoothing, log-space scoring — is two lines in scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# alpha=1.0 IS Laplace smoothing; class_prior left None -> learned from data
clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
clf.fit(["free money now", "lunch at noon"], ["spam", "ham"])
print(clf.predict(["claim your free money"]))   # -> ['spam']
print(clf.predict_proba(["claim your free money"]))  # calibrated-ish class probabilities

Tip

Work in log space and add Laplace smoothing ($+1$ to every count). Without smoothing, a single unseen word gives probability 0, which zeroes out the entire product no matter how strong the other evidence.

9.3 — Logistic Regression as a Classifier

It is impossible to talk about classification without the model that quietly underpins half the chapter: logistic regression. Despite the name it is a classifier, and it is the binary cousin of the softmax you will meet in §9.9. (The full derivation lives in the Regression chapter; here we use it purely as a workhorse linear classifier and probability source.)

Intuition: take the perceptron’s straight-line score $\mathbf{w}\cdot\mathbf{x}+b$, but instead of slamming it to a hard 0/1 with a step function, squash it through an S-curve so it becomes a smooth probability between 0 and 1. Now you don’t just get a label — you get a confidence.

The squash is the logistic (sigmoid) function:

\[P(y=1\mid\mathbf{x}) = \sigma(z) = \frac{1}{1+e^{-z}}, \qquad z = \mathbf{w}\cdot\mathbf{x}+b\]

In words: run the linear score $z$ through an S-shaped curve that maps any real number into $(0,1)$ — large positive $z$ → near 1, large negative → near 0, $z=0$ → exactly $0.5$ (the decision boundary). Also written: equivalently $\log\dfrac{P(y=1)}{P(y=0)} = \mathbf{w}\cdot\mathbf{x}+b$ — the model is linear in the log-odds (logit form).

It is trained by minimizing log loss (cross-entropy), which rewards confident-correct predictions and punishes confident-wrong ones:

\[\mathcal{L} = -\frac{1}{N}\sum_{i}\big[y_i\log\hat p_i + (1-y_i)\log(1-\hat p_i)\big]\]

In words: average penalty across examples — when the true label is 1 you pay $-\log\hat p$ (huge if you said $\hat p\approx 0$), when it’s 0 you pay $-\log(1-\hat p)$; being confidently wrong is expensive. Also written: $\mathcal{L} = \frac{1}{N}\sum_i \log\!\big(1+e^{-y_i' z_i}\big)$ using $\pm1$ labels $y_i'$ — the equivalent softplus form.

Worked example. Weights $\mathbf{w}=(1.0,\ -2.0)$, $b=0.5$, input $\mathbf{x}=(2,1)$. Then $z = 1.0\cdot2 + (-2.0)\cdot1 + 0.5 = 0.5$, and $\hat p = \sigma(0.5) = 1/(1+e^{-0.5}) = 0.622$. Since $0.622 > 0.5$, predict class 1 with 62% confidence — a soft answer, not a bare label.

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(C=1.0)            # C = inverse regularization, like SVM's C
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test)[:3])        # genuine probabilities, not just labels
print(clf.coef_, clf.intercept_)            # interpretable weights & bias

Logistic regression is the default baseline for almost any classification problem: fast, linear, interpretable (a positive weight pushes toward class 1), and it emits well-behaved probabilities out of the box — which is exactly what the next section is about.

Tip

Reach for logistic regression first on a new tabular problem. If a linear model already hits your target metric, you’ve saved yourself the tuning cost of SVMs and trees — and you get interpretable coefficients for free.

9.4 — Decision Trees

A decision tree classifies by asking a sequence of yes/no questions about features, splitting the data at each node until it reaches a leaf that votes a class. It mirrors how a human triages: “Is it raining? → Is it cold? → take the bus.” The appeal is interpretability — you can read the rules straight off the tree.

Intuition: it’s a flowchart the algorithm draws for itself. At every fork it asks the single question that best sorts the remaining examples into “mostly red” and “mostly blue” piles, then repeats inside each pile until the piles are pure enough.

The core question is which split to make. A good split makes the resulting groups purer (more dominated by one class). Two impurity measures:

Entropy (from information theory): $H = -\sum_c p_c \log_2 p_c$. Zero when a node is pure, maximal when classes are 50/50. Information gain = parent entropy minus the weighted child entropy; the tree greedily picks the split with the highest gain.
Gini impurity: $G = 1 - \sum_c p_c^2$. The probability of mislabeling a random element. Cheaper to compute (no logs), and in practice gives nearly identical trees.

For the information gain of a split:

\[\text{Gain} = H(\text{parent}) - \sum_{j}\frac{N_j}{N}\,H(\text{child}_j)\]

In words: how much uncertainty the question removes — the parent’s messiness minus the size-weighted average messiness of the groups it produces. Also written: $\text{Gain} = H(\text{parent}) - H(\text{parent}\mid \text{split})$, i.e. the mutual information between the split feature and the label.

Worked example. A node with 10 samples: 6 🔴, 4 🔵. Gini $= 1 - (0.6^2 + 0.4^2) = 1 - 0.52 = 0.48$. Suppose a split sends all 6 🔴 left (pure, $G=0$) and all 4 🔵 right (pure, $G=0$). Weighted child impurity $= 0$ → this is a perfect split, gain $= 0.48$.

import numpy as np
def gini(y):
    p = np.bincount(y) / len(y)
    return 1 - (p**2).sum()
parent = np.array([0,0,0,0,0,0,1,1,1,1])
left, right = parent[:6], parent[6:]
gain = gini(parent) - (len(left)/10*gini(left) + len(right)/10*gini(right))
print(gain)   # 0.48

The classic algorithms differ mainly in splitting criterion and feature handling:

algorithm	criterion	splits	features
ID3	information gain	multi-way	categorical only
C4.5	gain ratio	multi-way	categorical + continuous, handles missing
CART	Gini (classification)	always binary	both; also does regression

graph TD
  A["Outlook?"] -->|Sunny| B["Humidity?"]
  A -->|Overcast| C["Play ✅"]
  A -->|Rain| D["Wind?"]
  B -->|High| E["No ❌"]
  B -->|Normal| F["Play ✅"]
  D -->|Strong| G["No ❌"]
  D -->|Weak| H["Play ✅"]

The from-scratch Gini above is for understanding; in practice CART is one import, and the trained tree is fully readable:

from sklearn.tree import DecisionTreeClassifier, export_text
clf = DecisionTreeClassifier(criterion="gini", max_depth=3,   # max_depth = pre-pruning
                             min_samples_leaf=5)
clf.fit(X_train, y_train)
print(export_text(clf, feature_names=list(feature_names)))    # human-readable rules
# cost-complexity (post-)pruning path: pick ccp_alpha by cross-validation
print(clf.cost_complexity_pruning_path(X_train, y_train).ccp_alphas)

A fully grown tree memorizes the training set and overfits. Pruning cuts it back: pre-pruning stops growth early (max depth, min samples per leaf), while post-pruning grows the full tree then removes branches that don’t improve validation accuracy (cost-complexity pruning trades tree size against error).

Warning

Single trees are high-variance: shift a few training points and the whole structure can change. This instability is exactly what ensembles exploit — see Random Forests and boosting in the Ensemble Methods chapter.

9.5 — Support Vector Machines

Many lines can separate two classes; which is best? A Support Vector Machine (SVM) answers: the one with the widest margin — the largest gap between the boundary and the nearest points of either class. The intuition is robustness: a fat margin leaves the most room before a new point gets misclassified.

Intuition: imagine the boundary as the widest possible road you can pave between the two classes without driving over any point. The few points touching the curb on each side are the only ones that matter — pin them and the road is fixed.

The boundary is a hyperplane $\mathbf{w}\cdot\mathbf{x} + b = 0$. The only training points that matter are the ones sitting on the edge of the margin — the support vectors. Move any other point and the boundary doesn’t budge; move a support vector and it does. The margin width is $2/\|\mathbf{w}\|$, so maximizing the margin means minimizing $\|\mathbf{w}\|$ subject to every point being on the correct side: $y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1$.

In words: the gap you want to widen shrinks as the weight vector grows, so making the margin big is the same as keeping the weights small — while still parking every point on its correct side of the road with a unit of clearance. Also written: maximize $\dfrac{2}{\|\mathbf{w}\|}$ $\Longleftrightarrow$ minimize $\dfrac{1}{2}\|\mathbf{w}\|^2$ subject to $y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1$ for all $i$.

Real data is rarely perfectly separable, so a soft margin allows violations via slack variables $\xi_i$, controlled by the regularization parameter C:

\[\min_{\mathbf{w},b}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i\]

In words: still keep the weights small (wide road), but now pay a penalty $C$ for every point that strays into or across the margin — so $C$ sets how much you care about violations versus width. Also written: the unconstrained hinge-loss form $\min_{\mathbf{w},b}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \max\!\big(0,\ 1 - y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\big)$.

C tunes the bias–variance tradeoff. Large $C$ punishes violations hard → narrow margin, fits training data tightly (risk: overfit). Small $C$ tolerates violations → wide margin, smoother (risk: underfit).

C	margin	training errors	behavior
large	narrow	few	low bias, high variance
small	wide	more allowed	high bias, low variance

from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# scaling matters as much as for k-NN — the margin is a geometric object
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train, y_train)
print(clf.named_steps["svc"].support_vectors_.shape)  # only the SVs are kept
# for huge / sparse text data, LinearSVC (liblinear) is far faster than kernel SVC

Tip

SVMs shine in high-dimensional spaces (text, genomics) where the number of features exceeds samples, because only the support vectors define the model. Standardize features first — like distances in k-NN, the margin geometry depends on scale.

9.6 — Kernels & the Kernel Trick

A linear SVM fails when classes aren’t linearly separable — think red points in a ring around blue points; no straight line splits them. The fix: map the data into a higher-dimensional space where it becomes linearly separable, then separate it there with a hyperplane.

Intuition: if you can’t separate the points on a flat table, lift them into 3-D — give the inner ring a low height and the outer ring a high one — and now a flat sheet slides cleanly between them. The kernel trick is a way to get the benefit of that lift without ever paying to compute the lifted coordinates.

The genius is the kernel trick. The SVM optimization only ever needs dot products between points, never the points’ coordinates themselves. A kernel function $K(\mathbf{x},\mathbf{x}')$ computes the dot product in the high-dimensional space directly from the original inputs — without ever constructing the high-dimensional vectors. You get the power of a huge (even infinite) feature space at the cost of the original one.

\[K(\mathbf{x},\mathbf{x}') = \phi(\mathbf{x})\cdot\phi(\mathbf{x}')\]

In words: a kernel is a shortcut that returns the dot product of two points after they’ve been lifted by some feature map $\phi$ — computed straight from the raw inputs, skipping the lift entirely. Also written: $K(\mathbf{x},\mathbf{x}') = \langle \phi(\mathbf{x}),\,\phi(\mathbf{x}')\rangle$ (inner-product notation), for any $\phi$ the kernel implicitly defines.

kernel	formula	use
linear	$\mathbf{x}\cdot\mathbf{x}'$	already-separable / high-dim text
polynomial	$(\mathbf{x}\cdot\mathbf{x}' + c)^d$	feature interactions up to degree $d$
RBF (Gaussian)	$\exp(-\gamma\\|\mathbf{x}-\mathbf{x}'\\|^2)$	smooth, local, very flexible boundaries

Worked example — why it’s “implicit.” Take the 2-D polynomial kernel $K(\mathbf{x},\mathbf{z})=(\mathbf{x}\cdot\mathbf{z})^2$. Expand it:

\[(x_1 z_1 + x_2 z_2)^2 = (x_1^2)(z_1^2) + (x_2^2)(z_2^2) + 2x_1x_2 \cdot z_1z_2\]

That is exactly the dot product of $\phi(\mathbf{x}) = (x_1^2,\ x_2^2,\ \sqrt2\,x_1x_2)$ with $\phi(\mathbf{z})$. So computing $K$ with one multiply-and-square gives the same result as mapping both points to 3-D and dotting them — but we never built the 3-D vectors.

import numpy as np
x, z = np.array([1.,2.]), np.array([3.,4.])
K = (x @ z)**2                                   # kernel: cheap
phi = lambda v: np.array([v[0]**2, v[1]**2, np.sqrt(2)*v[0]*v[1]])
print(K, phi(x) @ phi(z))                        # 121.0  121.0  — identical

The RBF kernel’s $\gamma$ controls reach. It is the inverse of the squared Gaussian width: $\gamma = 1/(2\sigma^2)$, so a large $\gamma$ means a narrow bump — each point influences only its immediate neighborhood (wiggly boundary, overfit risk) — while a small $\gamma$ (wide $\sigma$) gives broad influence and a smoother boundary. It pairs with $C$ as the two main RBF-SVM knobs.

In code, swapping kernels and grid-searching the two knobs is trivial:

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(SVC(kernel="rbf"),
                    {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)   # C and gamma tuned together, as the warning below insists

Warning

A high-degree polynomial or large-$\gamma$ RBF can fit any training set perfectly and generalize terribly. Tune $\gamma$ and $C$ together with cross-validation, and prefer the linear kernel when features already vastly outnumber samples.

9.7 — Perceptron

The perceptron (Rosenblatt, 1958) is the original artificial neuron and the ancestor of every neural network. It computes a weighted sum of inputs, adds a bias, and outputs a class by thresholding the sum. Using the $\{0,1\}$ label convention — which is what the code below uses — the rule is a step at zero:

\[\hat{y} = \begin{cases} 1 & \text{if } \mathbf{w}\cdot\mathbf{x} + b > 0 \\ 0 & \text{otherwise} \end{cases}\]

In words: add up the inputs weighted by importance, add a bias offset, and fire a 1 if the total clears zero, otherwise a 0 — a hard yes/no. Also written: $\hat y = \mathbb{1}[\mathbf{w}\cdot\mathbf{x}+b > 0]$, or with $\pm1$ labels $\hat y = \operatorname{sign}(\mathbf{w}\cdot\mathbf{x}+b)$ (same boundary, different label code).

Learning is a beautifully simple mistake-driven update: loop over examples, and whenever the prediction is wrong, nudge the weights and the bias toward the correct answer:

\[\mathbf{w} \leftarrow \mathbf{w} + \eta\,(y - \hat{y})\,\mathbf{x}, \qquad b \leftarrow b + \eta\,(y - \hat{y})\]

In words: when you guess right, change nothing; when you guess wrong, shove the weights a small step $\eta$ in the direction of the input that would have fixed the answer (and nudge the bias the same way). Also written: since the update is gated by the error $(y-\hat y)\in\{-1,0,+1\}$, equivalently “for each misclassified point, $\mathbf{w}\leftarrow\mathbf{w}\pm\eta\mathbf{x}$” — plus on false negatives, minus on false positives.

Here $y\in\{0,1\}$ is the true label and $\hat y\in\{0,1\}$ the prediction, so the error term $(y-\hat y)$ is $+1$ on a false negative, $-1$ on a false positive, and $0$ when correct — meaning the weights only move on a mistake. If a point is misclassified, this shifts the boundary toward classifying it correctly. The perceptron convergence theorem guarantees that if the data is linearly separable, this finds a separating boundary in finite steps. (An equivalent formulation uses $\pm1$ labels with $\hat y=\text{sign}(\mathbf{w}\cdot\mathbf{x}+b)$; the two conventions give the same geometry, but you must keep labels, prediction, and update in one convention — mixing them is a classic bug.)

Worked example — learning AND. Inputs $(0,0)(0,1)(1,0)\to 0$, $(1,1)\to 1$. Starting from zero weights, the update rule walks $\mathbf{w}$ and $b$ until it finds e.g. $w=(1,1), b=-1.5$: only $(1,1)$ gives $1+1-1.5>0$. Done — AND is linearly separable.

import numpy as np
X = np.array([[0,0],[0,1],[1,0],[1,1]]); y = np.array([0,0,0,1])
w, b, lr = np.zeros(2), 0.0, 1.0
for _ in range(10):
    for xi, yi in zip(X, y):
        pred = 1 if w@xi + b > 0 else 0               # {0,1} threshold at 0
        w += lr*(yi - pred)*xi; b += lr*(yi - pred)   # update w AND b, only on error
print(w, b)   # a valid AND boundary

The from-scratch loop is the whole algorithm; scikit-learn wraps it (with averaging and regularization) as Perceptron:

from sklearn.linear_model import Perceptron
clf = Perceptron(eta0=1.0, max_iter=1000)
clf.fit(X, y)
print(clf.coef_, clf.intercept_)   # a separating boundary for AND

Warning

The convergence theorem only promises a boundary when the data is linearly separable — on non-separable data the perceptron never settles, it just cycles forever flipping the same few points. Always cap the iterations (max_iter), and don’t read a stopped-but-not-converged perceptron as having found the “best” line; it merely ran out of patience.

Why it cannot solve XOR. XOR returns 1 for $(0,1)$ and $(1,0)$ but 0 for $(0,0)$ and $(1,1)$. No straight line separates those two diagonal pairs — the classes are not linearly separable. This 1969 observation (Minsky & Papert) stalled neural-net research for a decade. The escape is stacking neurons into hidden layers, which the Neural Networks chapter develops.

9.8 — Discriminant Analysis

Discriminant analysis is a generative classifier in the same family as Naive Bayes, but instead of assuming features are independent, it models each class as a full multivariate Gaussian — a bell-shaped cloud with a mean and a covariance (its shape and tilt). To classify a point, it asks which class’s Gaussian, weighted by the prior, makes the point most likely.

Intuition: picture each class as a fuzzy blob of probability. A new point gets assigned to whichever blob it falls deepest inside — accounting for how big and how common that blob is. Naive Bayes draws those blobs as axis-aligned; discriminant analysis lets them tilt and stretch.

A quick word on what the covariance buys you. The covariance matrix $\Sigma$ encodes not just how wide each feature spreads, but how features co-vary — whether tall points also tend to be heavy, say. Naive Bayes forces $\Sigma$ to be diagonal (off-diagonal zeros), so its blobs are stuck axis-aligned. Discriminant analysis keeps the full matrix, so a blob can tilt along a diagonal and stretch into an ellipse. That extra freedom is exactly why it can place boundaries Naive Bayes cannot.

Linear Discriminant Analysis (LDA) makes one simplifying assumption: all classes share the same covariance matrix $\Sigma$ (same shape, just different centers). Because the blobs are the same shape, the curvy terms in the math cancel and the boundary comes out straight — a line in 2-D, a flat plane in higher dimensions. Where does that line sit? If both classes are equally common, it lands right in the middle, halfway between the two class centers. If one class is rarer, the boundary slides toward the rarer class — you demand stronger evidence before betting on the long shot. The line stays straight either way; only its position moves.

The score LDA assigns to each class (the discriminant function) is:

\[\delta_c(\mathbf{x}) = \mathbf{x}^\top\Sigma^{-1}\boldsymbol\mu_c - \tfrac{1}{2}\boldsymbol\mu_c^\top\Sigma^{-1}\boldsymbol\mu_c + \log P(c)\]

In words: reward a point for lining up with class $c$’s center (first term), penalize how far that center is from the origin (second term), and tilt by how common the class is (the prior term); pick the class with the biggest score. Also written: $\delta_c(\mathbf{x}) = -\tfrac{1}{2}(\mathbf{x}-\boldsymbol\mu_c)^\top\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu_c) + \log P(c) + \text{const}$ — i.e. “negative Mahalanobis distance to the class center, plus log-prior” (the constant is shared, so it drops out of the argmax).

Worked example — one feature. Two classes with the same spread $\sigma^2=1$ but different centers: $\mu_A=0$, $\mu_B=4$. With equal priors the LDA boundary sits at the midpoint $x=2$ — points below 2 go to A, above go to B. Now make B rarer, $P(B)=0.2$ vs $P(A)=0.8$. The prior term $\log(0.8/0.2)=1.39$ shifts the cutoff to $x = 2 + 1.39/(\mu_B-\mu_A) \cdot \sigma^2 = 2 + 1.39/4 \approx 2.35$ — pushed toward the rare class B, exactly as the prose said: you need to be more clearly on B’s side before the model bets on the long shot.

Quadratic Discriminant Analysis (QDA) drops that assumption: each class gets its own covariance $\Sigma_c$. The quadratic terms survive, so the boundary is a curve (parabola, ellipse, hyperbola). More flexible, but it estimates many more parameters and needs more data.

	LDA	QDA
covariance	shared $\Sigma$	per-class $\Sigma_c$
boundary	linear	quadratic (curved)
parameters	fewer	many more
risk	underfit if shapes differ	overfit if data scarce

from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)     # straight boundary
qda = QuadraticDiscriminantAnalysis().fit(X_train, y_train)  # curved boundary
print(lda.transform(X_train).shape)   # LDA also projects to <=K-1 dims (see tip)

Tip

LDA doubles as a supervised dimensionality reduction technique: it projects data onto the axes that best separate the classes. That projection view is covered in the Dimensionality Reduction chapter; here we use it purely as a classifier.

9.9 — Multi-class & Multi-label

Many classifiers (SVM, perceptron, logistic regression) are inherently binary — they draw one boundary. Real problems have many classes (10 digits) and sometimes many labels at once. Two settings, often confused:

Multi-class: each example gets exactly one of $K$ classes (a digit is one of 0–9).
Multi-label: each example can get several labels simultaneously (a photo tagged beach AND sunset AND dog).

For multi-class from binary learners, two reduction strategies:

One-vs-Rest (OvR): train $K$ classifiers, each “class $c$ vs everything else.” At prediction, run all $K$ and pick the highest-scoring. Cheap ($K$ models) but the sub-problems are imbalanced.

One-vs-One (OvO): train one classifier for every pair of classes — $\binom{K}{2}=K(K-1)/2$ of them — and let them vote. More models (10 classes → 45) but each trains on only two classes’ data, so each is small and balanced. SVMs default to OvO for this reason.

graph LR
  subgraph OvR["One-vs-Rest (3 classes)"]
    A1["A vs (B,C)"]
    A2["B vs (A,C)"]
    A3["C vs (A,B)"]
  end
  subgraph OvO["One-vs-One (3 classes)"]
    B1["A vs B"]
    B2["A vs C"]
    B3["B vs C"]
  end

The native multi-class approach is softmax (multinomial logistic regression): one model outputs a score per class and squashes them into a probability distribution that sums to 1:

\[P(y=c\mid \mathbf{x}) = \frac{e^{z_c}}{\sum_{k} e^{z_k}}, \qquad z_c = \mathbf{w}_c\cdot\mathbf{x}+b_c\]

In words: exponentiate every class’s raw score to make it positive, then divide by the total so the scores become probabilities that add to 1 — the biggest score becomes the most likely class. Also written: $\mathrm{softmax}(\mathbf{z})_c = \dfrac{\exp(z_c - \max_k z_k)}{\sum_k \exp(z_k - \max_k z_k)}$ — the numerically stable form (subtract the max before exponentiating to avoid overflow); it is the $K$-class generalization of the sigmoid from §9.3.

Worked example. Raw scores $z=(2.0,\ 1.0,\ 0.1)$. Exponentiate: $(7.39,\ 2.72,\ 1.11)$, sum $=11.22$. Softmax $=(0.659,\ 0.242,\ 0.099)$ — a clean probability vector; predict class 0. This is the standard output layer of neural-net classifiers.

For multi-label, the trick is binary relevance: train one independent binary classifier per label and let each fire independently — replace softmax with a per-label sigmoid so labels don’t compete. When label combinations are correlated, label powerset treats each observed combination as its own single class (turning multi-label back into multi-class), capturing co-occurrence at the cost of an explosion in classes.

from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC
# explicit OvR / OvO wrappers around any binary base learner
ovr = OneVsRestClassifier(SVC()).fit(X_train, y_train)   # K models
ovo = OneVsOneClassifier(SVC()).fit(X_train, y_train)    # K(K-1)/2 models
# multi-label: Y is an (n_samples, n_labels) 0/1 matrix; OvR = binary relevance
ml = OneVsRestClassifier(SVC()).fit(X_train, Y_multilabel)

Warning

Don’t use softmax for multi-label problems: softmax forces the outputs to sum to 1, so the classes compete — boosting one suppresses the others. For independent labels you want independent sigmoids, one per label.

9.10 — Probability Calibration

A classifier can be accurate yet lie about its confidence. If a weather model says “70% chance of rain” on a hundred days, it had better rain on about seventy of them — that property is calibration, and it is separate from getting the label right. It matters enormously when the probability itself drives a decision: a fraud threshold, a medical triage cutoff, an expected-value bet.

Intuition: accuracy asks “did you pick the right class?”; calibration asks “when you said 80%, were you right 80% of the time?” A model can ace the first and flunk the second — and a miscalibrated 0.9 that’s really a 0.6 will quietly wreck any threshold you set on it.

Some models are naturally well-calibrated (logistic regression, because it is trained on log loss). Others are notoriously not: an SVM’s decision-function output is a margin distance, not a probability; Naive Bayes is overconfident because its independence assumption double-counts correlated features, pushing scores toward 0 or 1.

You diagnose calibration with a reliability diagram — bin predictions by confidence, then plot predicted vs. actual frequency. A perfectly calibrated model lies on the diagonal.

We can also put a single number on calibration quality. The Expected Calibration Error (ECE) is the gap between confidence and accuracy, averaged over the bins of the reliability diagram:

\[\text{ECE} = \sum_{b=1}^{B}\frac{n_b}{N}\,\big|\,\text{acc}(b) - \text{conf}(b)\,\big|\]

In words: for each confidence bin, measure how far its average predicted probability sits from its actual hit rate, then take a size-weighted average of those gaps — zero means perfectly calibrated. Also written: $\text{ECE}=\mathbb{E}_{\hat p}\big[\,|\,\Pr(y=1\mid \hat p) - \hat p\,|\,\big]$ — the expected distance between the diagram’s curve and the diagonal.

The two standard fixes wrap a post-hoc calibrator around any trained model, fit on a held-out set:

Platt scaling — fit a 1-D logistic regression on the model’s scores ($p=\sigma(a\cdot s + b)$). Cheap, assumes a sigmoid-shaped distortion; great for SVMs.
Isotonic regression — fit any monotonic step function. More flexible, needs more data, can overfit small sets.

from sklearn.calibration import CalibratedClassifierCV
from sklearn.svm import SVC
# wrap an SVM (margins, not probabilities) -> calibrated probabilities
cal = CalibratedClassifierCV(SVC(), method="sigmoid", cv=5)  # "isotonic" for the flexible fit
cal.fit(X_train, y_train)
print(cal.predict_proba(X_test)[:3])   # now trustworthy as probabilities

Warning

Never calibrate on the training data — the model already fits it, so the curve will look perfect and mean nothing. Use a held-out split (which CalibratedClassifierCV does internally via cv). And remember: calibration changes the probabilities, not the ranking, so it leaves AUC unchanged while fixing log loss and threshold-based decisions.

9.11 — Imbalanced Classes & the Decision Threshold

Most real classification problems are lopsided: fraud is well under 1% of transactions, a rare disease afflicts a handful in a thousand, click-through is a sliver of impressions. On such data the default “predict the majority class” gets 99%+ accuracy while catching zero of the cases you actually care about. This is the single most common way a textbook-correct classifier fails in production, so it deserves its own treatment.

Intuition: if 999 of every 1000 emails are legitimate, a lazy model that screams “legit!” at everything is right 99.9% of the time and utterly useless. Accuracy rewards the lazy answer; the rare class is where all the value (and all the difficulty) lives.

Two levers fix this, and they operate at different stages.

1. Move the decision threshold. A probabilistic classifier outputs $\hat p = P(y=1\mid\mathbf{x})$ and labels positive when $\hat p > t$. The default $t=0.5$ is just a convention — and a bad one when the classes or the error costs are skewed. Lowering $t$ catches more positives (higher recall) at the cost of more false alarms (lower precision); raising it does the reverse. You pick $t$ to optimize the metric the business cares about, not to sit at 0.5.

The principled choice comes from expected cost. If a missed positive (false negative) costs $C_{FN}$ and a false alarm (false positive) costs $C_{FP}$, the cost-minimizing threshold is:

\[t^\star = \frac{C_{FP}}{C_{FP} + C_{FN}}\]

In words: set the cutoff by how the two mistakes trade off — when missing a positive hurts far more than a false alarm, $C_{FN}$ is large, the denominator grows, and the threshold drops well below 0.5 so you flag aggressively. Also written: flag positive when $\hat p \cdot C_{FN} > (1-\hat p)\cdot C_{FP}$, i.e. when the expected cost of not flagging exceeds the expected cost of flagging.

Worked example — fraud. A missed fraud costs $500 ($C_{FN}$); a false alarm costs $5 of review time ($C_{FP}$). Then $t^\star = 5/(5+500) = 0.0099 \approx 0.01$. So you should flag a transaction as fraud once its fraud probability clears just 1%, not 50% — the asymmetry of cost pushes the threshold 50× lower than the default.

2. Rebalance or reweight the training data. Instead of (or alongside) moving the threshold, you can make the rare class louder during fitting:

technique	what it does	watch out for
class weights	penalize errors on the rare class more in the loss	usually the cleanest fix — no data copied
random oversampling	duplicate minority examples	can overfit the duplicates
random undersampling	drop majority examples	throws away data
SMOTE	synthesize new minority points between neighbors	can blur the boundary if classes overlap

The laziest and often best of these is class weighting — most scikit-learn classifiers take class_weight="balanced", which scales each class’s loss inversely to its frequency, no resampling required:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
import numpy as np

# 'balanced' weights the rare class up automatically
clf = LogisticRegression(class_weight="balanced").fit(X_train, y_train)

# tune the threshold on a held-out set instead of accepting 0.5
p = clf.predict_proba(X_val)[:, 1]
prec, rec, thr = precision_recall_curve(y_val, p)
f1 = 2 * prec * rec / (prec + rec + 1e-9)
best_t = thr[np.argmax(f1[:-1])]        # threshold that maximizes F1
print("chosen threshold:", best_t)
y_pred = (clf.predict_proba(X_test)[:, 1] > best_t).astype(int)

Warning

Resample inside cross-validation, never before splitting — oversampling first leaks copies of the same point into both train and validation folds, inflating your score. Use imblearn’s Pipeline so SMOTE/undersampling runs per-fold. And judge imbalanced models with precision-recall curves, not ROC: on extreme imbalance, ROC-AUC looks deceptively rosy because the huge true-negative count dominates.

9.12 — Choosing a Classifier

With seven methods on the table, the practical question is which one first? There is no universal winner (the “no free lunch” theorem makes that formal), but there are strong defaults. The cheat-sheet below maps common situations to a sensible starting point — start simple, and only climb in complexity if the metric demands it.

situation	reach for	why
tabular baseline, want interpretability + probabilities	logistic regression	fast, linear, calibrated, readable weights
text / bag-of-words, spam	multinomial NB or linear SVM	sparse, high-dim, NB is instant
few features, nonlinear boundary, medium data	RBF SVM	flexible margin, robust
need human-readable rules	decision tree	reads off as if-then logic
many features ≫ samples (genomics)	linear SVM / logistic	margin needs few SVs; regularizes well
tiny dataset, Gaussian-ish classes	LDA	few parameters, stable
heavily imbalanced (fraud, rare disease)	logistic / tree + `class_weight`	reweight, then tune the threshold (§9.11)
lots of data, accuracy over everything	ensemble (see Ensemble ch.)	trees + boosting usually win

graph TD
  S["new classification task"] --> T{"text / sparse?"}
  T -->|yes| NB["Multinomial NB / Linear SVM"]
  T -->|no| I{"need interpretability?"}
  I -->|yes| TR["Logistic Regression / Tree"]
  I -->|no| D{"features ≫ samples?"}
  D -->|yes| L["Linear SVM / Logistic"]
  D -->|no| R["RBF SVM, then ensembles"]

Lazy rule of thumb: start with logistic regression. It trains in milliseconds, gives you probabilities and interpretable weights, and sets the bar every fancier model must beat. Only when it plateaus below your target do SVMs, trees, and ensembles earn their tuning cost.

Tip

Whatever you pick, judge it with the right metric, not raw accuracy — on a 99%-negative fraud set, “always negative” scores 99%. Use precision/recall, F1, ROC-AUC, or log loss as the problem demands. The Model Evaluation chapter develops these.

9.13 — Quick reference

Method / term	One-line meaning	When / why
k-NN	label by majority vote of the $k$ closest training points	tiny/medium data, nonlinear boundary; must standardize, dies in high dim
Euclidean distance $\\|\mathbf{x}-\mathbf{x}'\\|_2$	straight-line gap between points	default metric for k-NN; use Manhattan/Minkowski for grid-like features
Naive Bayes	$\text{posterior}\propto\text{likelihood}\times\text{prior}$ with independent features	text/spam, instant training, strong baseline on counts
Laplace smoothing ($+1$)	nudge every count up so no probability hits 0	always on for NB; one unseen word otherwise zeros the product
Logistic regression	sigmoid of a linear score → probability	first thing to try; fast, interpretable, well-calibrated
Log loss $-\sum[y\log\hat p+(1-y)\log(1-\hat p)]$	penalty for confident-wrong predictions	training objective that yields honest probabilities
Decision tree	stack of yes/no splits chosen to maximize purity	need human-readable if-then rules; prune to fight overfitting
Gini / Entropy	impurity measures a split tries to lower	Gini cheaper, near-identical trees; gain = parent − weighted child
SVM (margin)	widest road between classes; only support vectors matter	high-dim (features ≫ samples); standardize first
C	penalty for margin violations	large $C$ → narrow margin/overfit; small $C$ → wide/underfit
Kernel trick $K=\phi(\mathbf{x})\cdot\phi(\mathbf{x}')$	dot product in a lifted space, computed cheaply	nonlinear boundaries (RBF, polynomial) without building features
$\gamma$ (RBF)	inverse bump width, $1/(2\sigma^2)$	large $\gamma$ → wiggly/overfit; tune with $C$ via CV
Perceptron	step-threshold neuron with mistake-driven update	historical root of NNs; converges only if linearly separable (not XOR)
LDA / QDA	per-class Gaussians; shared vs own covariance	LDA linear & stable on small data; QDA curved, needs more data
OvR / OvO	reduce multi-class to many binary problems	OvR cheap ($K$ models); OvO balanced ($K(K-1)/2$), SVM default
Softmax $e^{z_c}/\sum_k e^{z_k}$	normalize scores into a probability distribution	native multi-class; one label per example, classes compete
Sigmoid (per-label)	independent 0/1 probability per label	multi-label — never softmax, labels must not compete
Calibration / ECE	are stated confidences actually right?	when the probability drives a decision; fix with Platt/isotonic
Threshold $t^\star=C_{FP}/(C_{FP}+C_{FN})$	cost-optimal cutoff, not 0.5	imbalanced/asymmetric-cost problems (fraud, disease)
`class_weight="balanced"`	up-weight the rare class in the loss	laziest imbalance fix; resample only inside CV

9.14 — Key takeaways

k-NN is a lazy learner — no training, just memorize and vote among nearest neighbors; sensitive to feature scale, the choice of $k$, and even to distance-tie-breaks at small $k$.
Naive Bayes uses Bayes’ rule with a (false but useful) conditional-independence assumption; fast, strong for text/spam; smooth and work in log space.
Logistic regression squashes a linear score through a sigmoid to emit a probability; the interpretable, well-calibrated default baseline and the binary cousin of softmax.
Decision trees split on impurity (entropy/Gini); interpretable but high-variance — prune to control overfitting.
SVMs maximize the margin; only support vectors matter; $C$ trades margin width against training errors.
Kernels let SVMs draw nonlinear boundaries by computing high-dimensional dot products implicitly — RBF and polynomial are the workhorses.
Perceptron is the original neuron with a mistake-driven update (weights and bias move only on errors); converges if data is linearly separable, but cannot solve XOR.
LDA/QDA are generative Gaussian classifiers — LDA (shared covariance) is linear, QDA (per-class covariance) is curved; unequal priors shift the LDA boundary off the midpoint.
Multi-class via OvR/OvO/softmax; multi-label via independent sigmoids or label powerset — never softmax for multi-label.
Calibration is confidence honesty, separate from accuracy; SVMs and NB are miscalibrated — fix with Platt scaling or isotonic regression on held-out data, and measure the gap with ECE.
Imbalanced classes break accuracy: reweight (class_weight="balanced") or resample inside CV, then move the decision threshold by the cost ratio $t^\star = C_{FP}/(C_{FP}+C_{FN})$ — judge with precision-recall, not ROC.
Choosing a classifier: start with logistic regression as a baseline, climb only when the metric demands it — and judge with the right metric, not raw accuracy.

9.15 — See also

Regression — logistic regression and the linear models that underpin softmax classification.
Ensemble Methods — Random Forests and gradient boosting that fix the high variance of single decision trees.
Model Evaluation & Tuning — cross-validation for choosing $k$, $C$, $\gamma$, and tree depth; precision/recall, ROC-AUC, log loss, reliability diagrams, and threshold selection on imbalanced data.
Neural Networks (Core) — how stacking perceptrons into hidden layers escapes the XOR limitation.
Dimensionality Reduction — LDA as a projection method, and PCA for preprocessing distance-based classifiers.
Probability & Statistics — Bayes’ rule, priors, and the Gaussian distributions behind Naive Bayes and discriminant analysis.

↪ The thread continues → Chapter 10 · 🌳 Ensemble Methods

A single classifier is good; a committee of them is better. Next we see how bagging and boosting combine many weak models into the ensembles that still win on tabular data.

📖 All chapters | ← 08 · 📈 Regression | 10 · 🌳 Ensemble Methods →