Chapter 07 — 🗜️ Dimensionality Reduction

📖 All chapters | ← 06 · 🧹 Data Preprocessing | 08 · 📈 Regression →

📚 Jump to any chapter

🧮 Mathematical Foundations

🧭 The ML Workflow

🧩 Classical Machine Learning

🎲 Probabilistic Models

13 · 🕸️ Probabilistic Graphical Models

🧠 Deep Learning

🗣️ Applied AI: Vision, Language, Audio & Time

🕹️ Reinforcement Learning

25 · 🕹️ Reinforcement Learning

🛠️ Applied ML Systems & Industries

🚀 Production, Tooling & Infrastructure

📚 Classical & Symbolic AI

⚖️ Responsible AI & Frontier

🎓 Advanced & Specialized Topics

🎚️ Post-Training & Fine-Tuning

🚢 Model Serving & Deployment

47 · 🚢 Model Serving & Deployment in Production

High-dimensional data is hard to store, slow to train on, noisy, and impossible to picture. Dimensionality reduction is the craft of squeezing many features down to a handful of well-chosen ones while throwing away as little useful structure as possible. It sits at the end of the ML workflow’s data-shaping stage — after preprocessing, before modeling — and pays off in faster training, less overfitting, denoised signals, and 2-D pictures you can actually look at.

🧭 In context: part of the ML Workflow (data shaping) · used to compress features, denoise, visualize, and fight the curse of dimensionality · the one key idea: most real data of dimension \(D\) actually lives near a much lower-dimensional surface, and finding that surface is the whole game.

💡 Remember this: real high-dimensional data almost always lives near a much lower-dimensional surface, so you can throw away most of the dimensions and keep nearly all the signal — the only question is which method finds that surface.

Before any single method, fix the curse of dimensionality in your head: as \(D\) grows, the volume of space explodes, so any fixed number of points becomes hopelessly sparse, all pairwise distances drift toward equal, and notions like “nearest neighbor” stop meaning much. Reduction fights back by finding the few directions that actually carry the signal.

Here is the curse made concrete. Imagine sprinkling 1,000 points uniformly inside a unit cube. In 1-D (a line segment), the points are 0.001 apart on average — cozy. In 2-D they spread over an area; in 3-D over a volume; by 10-D you would need \(10^{30}\) points to keep that same density. The data you have stays fixed at 1,000, so the space around each point grows emptier and emptier. A quick way to feel it: the fraction of a \(D\)-dimensional cube’s volume that sits within a thin outer shell (the outer 10%) is \(1-0.9^{D}\). In 2-D that is 19%; by \(D=50\) it is 99.5% — almost all the volume hugs the boundary, so “typical” points are all out near the edges, far from the center and from each other.

The animation below watches that shell fill up as \(D\) climbs: the dark core (the roomy middle) shrinks while the colored rind (where points actually end up) swells to swallow the whole cube.

We split reduction into two families. Linear methods (PCA, LDA, ICA) find a new set of axes that are weighted sums of the originals — a rotation and projection. Nonlinear / manifold methods (Kernel PCA, t-SNE, UMAP) bend space to unfold curved structure a straight projection would flatten wrongly.

flowchart TD
    A[High-dimensional data] --> B{Use labels?}
    B -->|Yes, maximize class separation| C[LDA]
    B -->|No| D{Structure linear?}
    D -->|Yes, max variance| E[PCA / SVD]
    D -->|Yes, independent sources| F[ICA]
    D -->|No, curved manifold| G[Kernel PCA]
    D -->|No, just visualize 2D/3D| H[t-SNE / UMAP]

7.1 — Principal Component Analysis (PCA)

PCA is the workhorse. The intuition: imagine a cloud of points shaped like a flattened, tilted cigar. There is one direction along which the cloud is most stretched out — that direction carries the most information about how points differ. PCA finds that direction, then the next-most-stretched direction at right angles to it, and so on. These directions are the principal components, and projecting onto the first few keeps most of the spread while killing the rest.

Variance maximization. “Information” here means variance — how much the data spreads along a direction. Center the data (subtract the mean of each feature so the cloud sits at the origin). For a unit-length direction \(w\), the variance of the projected points is \(w^\top \Sigma\, w\), where \(\Sigma\) is the covariance matrix of the data. PCA asks: which unit \(w\) maximizes that?

\[\Sigma = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^\top, \qquad \max_{\|w\|=1}\; w^\top \Sigma\, w.\]

In words: the covariance is the average of each centered point’s outer product with itself (how features co-vary); PCA looks for the unit direction along which projected points spread the most. Also written: \(\Sigma = \frac{1}{n} X^\top X\) for the centered data matrix \(X\) (rows = samples), and the variance along \(w\) is \(\operatorname{Var}(Xw) = w^\top \Sigma\, w\).

Throughout this chapter we use the \(1/n\) convention for the covariance (the maximum-likelihood estimate). Many libraries divide by \(1/(n-1)\) (the unbiased estimate) instead; this scales every eigenvalue by the same constant, so it changes the eigenvalues but never the eigenvectors or the explained-variance ratios that we actually use.

Eigenvectors of the covariance. Solving that constrained problem (Lagrange multipliers) gives the clean result \(\Sigma w = \lambda w\) — the best direction is an eigenvector of \(\Sigma\), and the variance it captures is its eigenvalue \(\lambda\). So the recipe is: compute \(\Sigma\), take its eigenvectors, sort by eigenvalue, keep the top \(k\). The first component has the largest \(\lambda\), the second the next-largest (and is automatically orthogonal to the first because \(\Sigma\) is symmetric), and so on.

\[\Sigma w = \lambda w.\]

In words: the principal directions are exactly the eigenvectors of the covariance matrix, and each one’s eigenvalue is the amount of variance it carries. Also written: the full eigendecomposition \(\Sigma = W \Lambda W^\top\) with \(W\) holding the eigenvectors as columns and \(\Lambda = \operatorname{diag}(\lambda_1,\dots,\lambda_D)\).

Worked example — by hand. Take five centered 2-D points that already sum to zero in each coordinate:

\[(-2,-1),\ (-1,-1),\ (0,0),\ (1,1),\ (2,1).\]

Because the mean is already zero, each entry of the covariance is just an average of products over the five points. Add them up coordinate by coordinate:

top-left \(= \tfrac15\sum x_1^2 = \tfrac15(4+1+0+1+4) = \tfrac{10}{5} = 2.0\)
bottom-right \(= \tfrac15\sum x_2^2 = \tfrac15(1+1+0+1+1) = \tfrac{4}{5} = 0.8\)
off-diagonal \(= \tfrac15\sum x_1 x_2 = \tfrac15(2+1+0+1+2) = \tfrac{6}{5} = 1.2\)

So the covariance is

\[\Sigma = \begin{bmatrix} 2.0 & 1.2 \\ 1.2 & 0.8 \end{bmatrix}.\]

Its eigenvalues solve \(\lambda^2 - (\text{trace})\lambda + \det = 0\) with \(\text{trace} = 2.8\) and \(\det = (2.0)(0.8) - (1.2)^2 = 1.6 - 1.44 = 0.16\) (positive, as a real covariance must be):

\[\lambda = \frac{2.8 \pm \sqrt{2.8^2 - 4(0.16)}}{2} = \frac{2.8 \pm \sqrt{7.84 - 0.64}}{2} = \frac{2.8 \pm \sqrt{7.2}}{2}.\]

So \(\lambda_1 \approx \tfrac{2.8 + 2.683}{2} \approx 2.74\) and \(\lambda_2 \approx \tfrac{2.8 - 2.683}{2} \approx 0.058\) — both positive, as they must be. The first component captures

\[\frac{\lambda_1}{\lambda_1 + \lambda_2} \approx \frac{2.74}{2.80} \approx 0.979,\]

about 98% of the variance. The top eigenvector points along roughly \((0.84, 0.55)\) — the up-and-to-the-right diagonal the points clearly follow. So we can describe each point by a single number (its position along that diagonal) and lose almost nothing.

Link to SVD. In practice nobody forms \(\Sigma\) explicitly — it is numerically wasteful and can lose precision. Instead use the Singular Value Decomposition of the centered data matrix \(X\) (rows = samples):

\[X = U S V^\top.\]

In words: any matrix can be written as a rotation, a stretch, and another rotation; for centered data the second rotation’s axes (\(V\)) are exactly the principal directions and the stretches (\(S\)) measure how much variance each carries. Also written: \(\Sigma = \frac{1}{n} X^\top X = \frac{1}{n} V S^2 V^\top\), so the covariance eigenvectors are the columns of \(V\) and \(\lambda_j = s_j^2/n\).

The columns of \(V\) are exactly the principal components, and the singular values \(s_j\) relate to the covariance eigenvalues (under our \(1/n\) convention) by \(\lambda_j = s_j^2 / n\). SVD gives the same answer more stably and works even when \(D > n\). The reduced coordinates (the scores) are \(X V_k = U_k S_k\).

import numpy as np
X = np.array([[-2,-1],[-1,-1],[0,0],[1,1],[2,1]], float)
X -= X.mean(0)                      # center: PCA needs zero mean
U, S, Vt = np.linalg.svd(X, full_matrices=False)
pc1 = Vt[0]                         # first principal component
scores = X @ pc1                    # 1-D coordinate per point
var_explained = S**2 / (S**2).sum()
eigvals = S**2 / len(X)            # lambda_j = s_j^2 / n
assert var_explained[0] > 0.95     # PC1 holds >95% of variance
assert np.all(eigvals > 0)        # valid covariance: all eigenvalues positive
print(pc1.round(2), var_explained.round(3), eigvals.round(3))
# -> pc1 ~ [0.84 0.55], var_explained ~ [0.979 0.021], eigvals ~ [2.74 0.058]

With scikit-learn. In real projects you reach for the library, which centers for you, exposes the explained-variance ratio, and gives an inverse_transform to reconstruct. Note it does not standardize — put a StandardScaler in front when features have different units:

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# keep enough components to retain 95% of the variance
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95))
Z = pipe.fit_transform(X_train)            # reduced coordinates
pca = pipe.named_steps["pca"]
print(pca.n_components_, pca.explained_variance_ratio_.round(3))
X_approx = pipe.inverse_transform(Z)       # back-project to original space

Choosing the number of components. Two standard tools. A scree plot charts eigenvalue (or explained-variance ratio) against component index; you keep components up to the “elbow” where the curve flattens. Or pick the smallest \(k\) whose cumulative explained variance crosses a threshold — 90%, 95%, 99% depending on how much loss you tolerate.

\[\text{cumulative}(k) = \frac{\sum_{j=1}^{k}\lambda_j}{\sum_{j=1}^{D}\lambda_j}.\]

In words: the fraction of total spread kept by the top \(k\) components is their eigenvalues added up, divided by all the eigenvalues added up. Also written: \(\frac{\sum_{j\le k} s_j^2}{\sum_j s_j^2}\) in singular values, or \(\frac{\operatorname{tr}(\Lambda_k)}{\operatorname{tr}(\Lambda)}\) since the trace is the total variance.

A real use: eigenfaces. A classic application makes PCA tangible. Take a few thousand \(100\times100\) grayscale face photos — each is a point in \(10{,}000\)-dimensional pixel space. Run PCA and the top components, reshaped back into images, look like ghostly face templates (the eigenfaces): one captures overall lighting, another the nose-vs-cheek contrast, another the smile. Keeping ~150 of them reconstructs any face in the set almost perfectly, so a face that needed \(10{,}000\) numbers now needs \(150\) — a 98% compression that also feeds faster, less-overfit classifiers downstream. The same move powers PCA-based image compression, gene-expression analysis, and noise reduction in sensor data.

Warning

PCA is scale-sensitive: a feature measured in millimetres will dominate one in metres purely because its numbers are bigger. Always standardize (z-score each feature) before PCA unless every feature already shares units and meaning. And always center — forgetting the mean subtraction makes the “principal component” point at the data’s mean, not its spread.

Tip

PCA components are uncorrelated and orthogonal but not necessarily interpretable — a component is a blend of original features. If you need interpretable axes, look at the loadings (entries of \(V\)) to see which original features dominate each component.

7.2 — Kernel PCA & Manifold Learning

Ordinary PCA can only find flat structure — it draws straight axes. But data often lies on a curved surface. Picture points in two concentric rings: an inner ring and an outer ring around it. No straight line separates inner from outer, so linear PCA is helpless. Kernel PCA fixes this with the kernel trick: imagine mapping every point into a much higher-dimensional space where the curved structure becomes flat, run PCA there, but never actually compute the mapping — only the dot products, supplied by a kernel function \(k(x_i, x_j)\).

The mechanics: build the \(n\times n\) kernel matrix \(K_{ij} = k(x_i, x_j)\) (a common choice is the RBF kernel \(k(x,y)=\exp(-\gamma\|x-y\|^2)\)), center it, and take its top eigenvectors. Those eigenvectors directly give the reduced coordinates — you work entirely with similarities between points, never with explicit high-dimensional vectors.

\[k(x,y)=\exp\!\big(-\gamma\,\|x-y\|^2\big).\]

In words: the RBF kernel scores two points as “very similar” (near 1) when they sit close together and “unrelated” (near 0) when they are far apart, with \(\gamma\) setting how quickly similarity drops off with distance. Also written: \(k(x,y)=\exp\!\big(-\|x-y\|^2/(2\sigma^2)\big)\) with bandwidth \(\sigma\), where \(\gamma = 1/(2\sigma^2)\).

Worked example — two tiny rings. Take six points, three on an inner ring of radius \(1\) and three on an outer ring of radius \(4\):

\[\underbrace{(1,0),\ (-0.5,0.87),\ (-0.5,-0.87)}_{\text{inner, }r=1}\qquad \underbrace{(4,0),\ (-2,3.46),\ (-2,-3.46)}_{\text{outer, }r=4}.\]

A linear projection onto any single axis interleaves the two rings — e.g. projecting onto the \(x\)-axis gives inner values \(\{1,-0.5,-0.5\}\) and outer values \(\{4,-2,-2\}\), which overlap in range and cannot be cleanly split by one threshold. The RBF kernel measures closeness: points on the same ring are near each other (kernel near \(1\)), points on different rings are far (kernel near \(0\)). After kernel PCA the inner three collapse to nearly one coordinate and the outer three to another, so a single threshold separates them. The code below builds the kernel, runs KPCA, and asserts that the gap between the two groups on the leading component is large:

import numpy as np
inner = np.array([[1,0],[-0.5,0.87],[-0.5,-0.87]])
outer = np.array([[4,0],[-2,3.46],[-2,-3.46]])
X = np.vstack([inner, outer]); y = np.array([0,0,0,1,1,1])

def rbf_kpca(X, gamma, k=2):
    sq = (X**2).sum(1)
    D2 = sq[:,None] + sq[None,:] - 2*X@X.T          # pairwise squared distances
    K  = np.exp(-gamma*D2)                           # RBF kernel matrix
    n  = len(K); one = np.ones((n,n))/n
    Kc = K - one@K - K@one + one@K@one               # center in feature space
    w, V = np.linalg.eigh(Kc)                         # ascending eigenvalues
    return V[:, ::-1][:, :k]                          # top-k components

# gamma ~ 1/median(D2). Median squared dist here is ~16, so gamma ~ 0.05.
Z = rbf_kpca(X, gamma=0.05, k=1)[:, 0]
inner_vals, outer_vals = Z[y==0], Z[y==1]
gap = abs(inner_vals.mean() - outer_vals.mean())
spread = max(inner_vals.ptp(), outer_vals.ptp())
print("inner:", inner_vals.round(3), "outer:", outer_vals.round(3))
assert gap > spread          # groups separate cleanly on KPCA component 1

The assertion is the self-check: the between-group gap on the first kernel component exceeds the within-group spread, so one cut splits the rings — something no linear axis could do. Notice the bandwidth \(\gamma\) is set from the median pairwise squared distance (\(\gamma \approx 1/\text{median}\)), the standard heuristic; a blind \(\gamma=15\) on these coordinates would push every off-diagonal kernel entry to nearly zero (\(e^{-15\cdot 16}\approx 0\)) and the method would see six isolated points with no structure to find. Bandwidth is everything.

With scikit-learn. The library bundles Kernel PCA, Isomap, LLE, and more behind one consistent API, so swapping methods is a one-line change:

from sklearn.decomposition import KernelPCA
from sklearn.manifold import Isomap, LocallyLinearEmbedding

kpca  = KernelPCA(n_components=2, kernel="rbf", gamma=0.05)
iso   = Isomap(n_neighbors=10, n_components=2)
lle   = LocallyLinearEmbedding(n_neighbors=10, n_components=2)

Z_kpca = kpca.fit_transform(X)   # unfold curved structure in kernel space
Z_iso  = iso.fit_transform(X)    # preserve geodesic (along-surface) distances

Manifold learning is the broader idea: assume the data lies on a low-dimensional manifold (a smooth curved surface) embedded in high-D space, and recover the flat sheet. The iconic example is the Swiss roll — a 2-D sheet rolled up like a pastry. Two points can be close in 3-D (across the gap between layers) yet far apart along the sheet; manifold methods respect the on-sheet distance and “unroll” it:

Methods differ in how they measure “nearby”:

Method	Core idea	Preserves
Kernel PCA	PCA in kernel feature space	global structure (kernel-dependent)
Isomap	shortest-path (geodesic) distances on a neighbor graph	global geodesic distances
LLE	each point = weighted combo of its neighbors	local linear reconstruction
Laplacian Eigenmaps	graph Laplacian eigenvectors	local neighborhoods

The shared move: build a graph connecting each point to its nearest neighbors, then find a low-D layout that respects that graph. Isomap unrolls the Swiss roll by walking distances along the surface (hopping neighbor to neighbor) instead of straight through it — the straight-line distance between the two ends of the roll is small, but the geodesic (path along the sheet) is long, and Isomap uses the long one.

Warning

Kernel PCA has no simple inverse: you cannot easily map a reduced point back to the original space (the pre-image problem). And its results swing hard with the kernel and its bandwidth \(\gamma\) — too large and every point looks isolated, too small and structure smears out. Treat \(\gamma\) as a hyperparameter to tune (the median-distance heuristic is a good starting point), not a constant.

7.3 — Linear Discriminant Analysis (LDA)

PCA is unsupervised — it never looks at labels, so it maximizes total spread even if that spread is useless for telling classes apart. Linear Discriminant Analysis is its supervised cousin: it uses the class labels to find the projection that best separates the classes. The picture: two overlapping blobs. PCA might pick the long axis of the combined cloud; LDA instead picks the axis along which the two blob-centers are far apart and each blob is tight.

LDA formalizes this with two scatter matrices. The between-class scatter \(S_B\) measures how far apart the class means \(\mu_c\) are from the overall mean \(\mu\); the within-class scatter \(S_W\) measures how spread out each class is around its own mean. LDA maximizes the ratio (the Fisher criterion):

\[J(w) = \frac{w^\top S_B\, w}{w^\top S_W\, w}, \qquad S_W = \sum_c \sum_{i \in c}(x_i-\mu_c)(x_i-\mu_c)^\top.\]

In words: find the direction that makes the class centers spread far apart (big numerator) while keeping each class tightly bunched (small denominator) — far-apart-but-tight wins. Also written: signal-to-noise form \(J(w) = \dfrac{\text{between-class variance of } w^\top x}{\text{within-class variance of } w^\top x}\); the optimum solves the generalized eigenproblem \(S_B w = \lambda S_W w\).

Maximizing this leads to the generalized eigenproblem \(S_W^{-1} S_B\, w = \lambda w\). The top eigenvectors are the discriminant directions. A hard limit worth memorizing: with \(C\) classes you get at most \(C-1\) useful directions, because \(S_B\) has rank at most \(C-1\) (it is built from \(C\) mean-differences that sum to zero). So 3 classes give you at most a 2-D LDA projection, no matter how many features you started with.

Worked example. Class A means lie at feature-1 \(\approx\) low, class B at feature-1 \(\approx\) high, but both classes have huge spread along feature-2 (noise). PCA, chasing total variance, would pick feature-2 and smear the classes together. LDA divides by within-class scatter, so it discounts the noisy feature-2 and lines its axis up with feature-1 — projecting onto it leaves the two classes cleanly apart. That division by \(S_W\) is the whole trick: it rewards directions where classes are tight, not just far.

Where this shows up. LDA is a staple in face recognition (the “Fisherfaces” method beats raw eigenfaces by aligning axes to identity differences rather than lighting), in bankruptcy and credit-default scoring (project many financial ratios onto the one axis that best splits “defaults” from “survives”), and as a fast, label-aware front-end before a \(k\)-NN or logistic classifier.

With scikit-learn. LDA doubles as both a dimensionality reducer (transform) and a classifier (predict) — the same fitted object does both:

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=2)   # 3+ classes → up to C-1 dims
Z = lda.fit_transform(X_train, y_train)            # supervised projection
acc = lda.score(X_test, y_test)                    # it also classifies
print(Z.shape, round(acc, 3))

	PCA	LDA
Supervision	unsupervised	supervised (needs labels)
Objective	max total variance	max between/within class ratio
Max components	\(\min(n, D)\)	\(C - 1\)
Best when	denoising, compression	classes are the point

Tip

A common pipeline is PCA then LDA: PCA first to knock \(D\) down and make \(S_W\) invertible (it is singular when features outnumber samples), then LDA on the reduced data to sharpen class separation. PCA cleans, LDA discriminates.

7.4 — t-SNE & UMAP

PCA and LDA are about modeling; t-SNE and UMAP are about seeing — they exist to squash data into 2-D or 3-D so your eyes can spot clusters. Both are nonlinear and both prioritize keeping neighbors together over preserving global distances, which is exactly what makes a scatter plot readable.

t-SNE (t-distributed Stochastic Neighbor Embedding) works in two steps. First, in the original space, it converts distances into probabilities: for each point, nearby points get high probability of being picked as a “neighbor,” far points near zero — a soft notion of who is close, governed by a perplexity setting (roughly, the effective number of neighbors, typically 5–50). Second, it lays points out in 2-D and shuffles them until the neighbor-probabilities in the map match those in the original space, measured by KL divergence. The “t” comes from using a heavy-tailed Student-t distribution in the map, which gives crowded clusters room to breathe and prevents everything collapsing into a blob.

\[\mathrm{KL}(P\,\|\,Q) = \sum_{i\neq j} p_{ij}\,\log\frac{p_{ij}}{q_{ij}}.\]

In words: measure how badly the 2-D map’s neighbor-probabilities \(q_{ij}\) disagree with the original-space ones \(p_{ij}\), and move points to shrink that disagreement. Also written: the per-pair cost \(\sum_{i\neq j} p_{ij}\big(\log p_{ij} - \log q_{ij}\big)\); it is zero only when \(Q\) matches \(P\) exactly and grows as the map distorts neighborhoods.

flowchart LR
    A[High-D points] --> B[Pairwise similarities<br/>as probabilities]
    B --> C[Random 2-D layout]
    C --> D[Minimize KL divergence<br/>via gradient descent]
    D --> E[2-D map: neighbors stay near]

The animation below shows the core move: scattered high-D points (left) drift into tight, color-coded neighborhoods on the 2-D map (right) as the layout settles — neighbors snap together, the rest spreads apart.

UMAP (Uniform Manifold Approximation and Projection) reaches a similar 2-D picture through a different door: it builds a fuzzy nearest-neighbor graph in high-D, then optimizes a low-D graph to match it. In practice UMAP is much faster, scales to millions of points, and tends to preserve more global structure (the relative arrangement of clusters), where t-SNE mostly nails local clusters but scrambles the big picture.

In practice. Both ship as drop-in transformers. A typical workflow visualizes a high-dimensional embedding — say the 768-D vectors a language model produces, or 784-pixel MNIST digits — by reducing to 2-D and plotting:

from sklearn.manifold import TSNE
import umap                                  # pip install umap-learn

# X: e.g. (n_samples, 50) after a PCA pre-reduction
Z_tsne = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(X)
Z_umap = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(X)
# scatter Z_*[:,0] vs Z_*[:,1], colored by label, to eyeball clusters

	t-SNE	UMAP
Speed	slow on large \(n\)	fast, scales to millions
Global structure	poorly preserved	better preserved
Key knob	perplexity	n_neighbors, min_dist
Output stable?	varies by run/seed	more stable

Warning

Read t-SNE and UMAP plots with discipline. Cluster sizes mean nothing — t-SNE inflates dense clusters and shrinks sparse ones. Distances between clusters mean little — two far-apart blobs are not necessarily more different than two near ones. And changing perplexity or n_neighbors can conjure or dissolve clusters, so always view several settings before believing a structure. These are exploration tools, not measurement tools: never feed their 2-D coordinates into a downstream model as features.

Tip

A solid default recipe: PCA down to ~50 dimensions first (kills noise and speeds things up), then t-SNE or UMAP from there to 2-D. Running these on raw 10,000-D data is slow and noisier than running them on a PCA-cleaned 50-D version.

7.5 — Independent Component Analysis (ICA)

PCA finds directions that are uncorrelated; ICA finds directions that are statistically independent — a strictly stronger and different goal. The canonical story is the cocktail-party problem: two people talk at once, two microphones each record a different mixture of both voices. You want to recover the original two voices from the two mixtures. PCA would give you uncorrelated combinations, but uncorrelated is not unmixed. ICA actually separates the speakers.

The model assumes the observed signals \(x\) are linear mixtures of independent sources \(s\):

\[x = A s, \qquad \text{recover } s = W x \ \text{with}\ W \approx A^{-1}.\]

In words: the microphones record unknown blends (\(A\)) of the hidden voices; ICA hunts for an unmixing matrix \(W\) that undoes the blend and hands back the separate voices. Also written: componentwise \(x_i = \sum_j A_{ij}\, s_j\), recovered as \(\hat s = W x\) where \(W\) is the (approximate) inverse of the mixing matrix \(A\).

The deep insight that makes it solvable: a sum of independent signals is more Gaussian than the individual signals (Central Limit Theorem). So to unmix, ICA searches for directions that make the recovered signal as non-Gaussian as possible — maximizing non-Gaussianity (via measures like kurtosis or negentropy) pulls the mixtures apart back into the original independent sources. This is why ICA cannot separate genuinely Gaussian sources: there is no non-Gaussianity to chase.

The picture below shows two sources (a sine and a sawtooth), two microphone signals that are genuine weighted blends of both, and ICA recovering the originals:

Worked sketch. Standard ICA (the FastICA algorithm) runs in two phases. First whiten the data with PCA so features are uncorrelated and unit-variance — this removes the second-order structure PCA already handles. Then rotate the whitened data to maximize non-Gaussianity, one component at a time, until the recovered signals are independent. The code below mixes a sine and a sawtooth, runs the FastICA fixed-point update, and checks that the recovered components correlate strongly with the true sources (up to ICA’s inherent sign/scale ambiguity):

import numpy as np
rng = np.random.default_rng(0)
t = np.linspace(0, 8*np.pi, 600)
s1 = np.sin(t)                                   # source 1: sine
s2 = np.mod(t, 2*np.pi) - np.pi                  # source 2: sawtooth
S = np.vstack([s1, s2]); S -= S.mean(1, keepdims=True)
A = np.array([[0.7, 0.6], [0.5, -0.8]])          # mixing matrix
X = A @ S                                         # 2 mic signals = blends

def whiten(X):                                    # PCA whitening
    cov = np.cov(X)
    d, E = np.linalg.eigh(cov)
    return (E @ np.diag(d**-0.5) @ E.T) @ X
Xw = whiten(X)

def one_unit(X, iters=200):                       # FastICA, one component
    w = rng.standard_normal(X.shape[0]); w /= np.linalg.norm(w)
    for _ in range(iters):
        wx = w @ X
        g, gp = np.tanh(wx), (1 - np.tanh(wx)**2) # nonlinearity g and g'
        w_new = (X * g).mean(1) - gp.mean() * w   # fixed-point update
        w_new /= np.linalg.norm(w_new)
        if abs(abs(w_new @ w) - 1) < 1e-8: break  # converged (direction stable)
        w = w_new
    return w

w1 = one_unit(Xw)
w2 = one_unit(Xw - np.outer(w1, w1) @ Xw)        # deflate: 2nd comp ⟂ 1st
Y = np.vstack([w1 @ Xw, w2 @ Xw])                 # recovered sources
# match each recovered signal to a true source by |correlation|
C = np.abs(np.corrcoef(Y, S)[:2, 2:])
assert C.max(1).min() > 0.95                      # both sources recovered well
print("best |corr| per recovered comp:", C.max(1).round(3))

The assertion is the self-check: each recovered component lines up with one true source at correlation above \(0.95\), so the unmixing worked despite ICA not knowing \(A\).

With scikit-learn. In practice you call FastICA rather than hand-rolling the fixed-point loop — the same whitening and deflation, battle-tested:

from sklearn.decomposition import FastICA

ica = FastICA(n_components=2, whiten="unit-variance", random_state=0)
S_hat = ica.fit_transform(X.T)        # (n_samples, n_sources): recovered signals
A_hat = ica.mixing_                   # estimated mixing matrix A

	PCA	ICA
Goal	uncorrelated, max variance	statistically independent
Components ranked?	yes, by variance	no inherent order
Assumes	nothing about distribution	sources non-Gaussian
Typical use	compress, denoise	unmix sources (audio, EEG)

Tip

Always center and whiten before ICA (FastICA does this for you). Whitening strips out the easy second-order structure so the algorithm can spend all its effort on the hard part — finding the non-Gaussian rotation. Skipping it makes convergence slow and flaky. And if a source really is Gaussian, no amount of tuning will separate it; that is a limit of the math, not a bug to fix.

Warning

ICA has two built-in ambiguities: it cannot recover the original scale of each source (a source and its doubled version mix identically), nor their order — so recovered components come out unranked and arbitrarily scaled. That is fine for separation tasks (the unmixed waveform is what matters) but means you cannot read “component 1 is most important” the way you can with PCA.

7.6 — Autoencoders: Learned Nonlinear Reduction

Everything so far is a fixed recipe — given the data, the answer is computed in one shot. Autoencoders instead learn the reduction by training a neural network, which lets them capture nonlinear structure that PCA misses and scales to images, audio, and text. The intuition is a funnel: data flows into an encoder that squeezes it through a narrow middle layer (the bottleneck, or latent code), then a decoder tries to rebuild the original from that squeezed code. To rebuild well, the network is forced to pack the essential information into the bottleneck — which is exactly a learned low-dimensional representation.

The training objective is just reconstruction error: make the output \(\hat x = g(f(x))\) match the input \(x\).

\[\mathcal{L} = \frac{1}{n}\sum_{i=1}^{n}\big\|x_i - g\big(f(x_i)\big)\big\|^2.\]

In words: push each example through encoder \(f\) then decoder \(g\), and penalize how far the rebuilt version drifts from the original; minimizing this forces the bottleneck to keep what matters. Also written: \(\mathcal{L} = \mathbb{E}_{x}\big[\|x - g(f(x))\|^2\big]\), the expected squared reconstruction loss over the data distribution.

Watch the funnel in motion: a packet of data flows in wide, is squeezed through the narrow bottleneck (the pulse), and re-expands on the way out — anything that does not fit through the waist is lost, so the network learns to send only what matters.

A tiny worked intuition. Say each input is a 3-pixel patch that is always a flat gray ramp: \((a,\ a,\ a)\) for some brightness \(a\). Three numbers, but only one truly varies. A bottleneck of width 1 can store \(a\) and the decoder can rebuild \((a,a,a)\) with zero error — the autoencoder has discovered, on its own, that this data is secretly 1-dimensional. Give it patches that vary in two independent ways and a width-1 code can no longer reconstruct them; you would watch the reconstruction loss refuse to fall until you widen the bottleneck to 2. The narrowest bottleneck that still reconstructs well is the intrinsic dimension.

The deep tie-in: a linear autoencoder with one bottleneck layer and squared-error loss recovers exactly the PCA subspace — same span as the top principal components. Autoencoders are therefore PCA’s nonlinear generalization. Swap the linear layers for deep nonlinear ones and the bottleneck can bend around curved manifolds (the Swiss roll, handwritten digits) that linear PCA would crush. Two important variants you will meet: the denoising autoencoder (feed in corrupted input, ask it to reconstruct the clean original — learns robust features), and the variational autoencoder (VAE), which makes the latent code a probability distribution so you can sample new data from it, bridging reduction and generative modeling.

With PyTorch. A minimal autoencoder is a few lines — two small MLPs and an MSE loss:

import torch, torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, d_in, d_latent=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(),
                                     nn.Linear(64, d_latent))     # → bottleneck
        self.decoder = nn.Sequential(nn.Linear(d_latent, 64), nn.ReLU(),
                                     nn.Linear(64, d_in))         # → reconstruct
    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = AutoEncoder(d_in=784, d_latent=2)        # e.g. MNIST → 2-D code
opt   = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for xb in loader:                                # xb: (batch, 784)
    xhat, z = model(xb)
    loss = loss_fn(xhat, xb)                      # reconstruction error
    opt.zero_grad(); loss.backward(); opt.step()
# model.encoder(X) now gives a learned low-D embedding

Tip

Autoencoders need enough data to train and can overfit — with a wide enough bottleneck they may just memorize, learning the identity function instead of useful structure. Keep the bottleneck genuinely narrow, add noise (denoising) or a regularizer, and prefer PCA when the data is small or the structure is roughly linear. Reach for autoencoders when the manifold is clearly curved and you have the samples to learn it.

7.7 — Random Projection: Dimensionality Reduction Without Looking at the Data

Every method so far studies the data first — PCA computes a covariance, LDA reads labels, autoencoders train for hours. Random projection does something almost insolent: it multiplies the data by a random matrix and walks away. No eigenvectors, no training, no peeking at the structure. The astonishing part is that it works, and there is a theorem that says it must.

Intuition. Imagine you want to flatten a cloud of points from a high-dimensional room onto a much smaller wall, but you do not want pairs of points that were far apart to suddenly collide, or close pairs to fly apart. Picking the best wall is expensive (that is what PCA does). Random projection says: just pick a wall at random. In very high dimensions there is so much room that a random flattening almost always keeps every pairwise distance roughly intact — the geometry survives the squash.

The guarantee is the Johnson–Lindenstrauss (JL) lemma: any \(n\) points in any dimension can be mapped into \(O(\log n / \varepsilon^2)\) dimensions so that all pairwise distances are preserved within a factor of \(1\pm\varepsilon\). Strikingly, the target dimension depends on the number of points and the tolerance \(\varepsilon\) — not on the original dimension \(D\) at all. A million features can collapse to a few thousand random coordinates while every distance stays almost unchanged.

\[(1-\varepsilon)\,\|x_i-x_j\|^2 \;\le\; \|f(x_i)-f(x_j)\|^2 \;\le\; (1+\varepsilon)\,\|x_i-x_j\|^2 , \qquad k \;\ge\; \frac{8\ln n}{\varepsilon^2}.\]

In words: after the random squash \(f\), the distance between any two points changes by at most a factor of \(\varepsilon\) in either direction, and the number of target dimensions \(k\) you need grows only with the log of how many points you have and how tight a tolerance you demand. Also written: \(\big|\,\|f(x_i)-f(x_j)\|^2 / \|x_i-x_j\|^2 - 1\,\big| \le \varepsilon\) for all pairs \(i,j\), with the same sample-count-driven bound \(k = O(\varepsilon^{-2}\log n)\).

The map itself is just \(f(x) = \frac{1}{\sqrt k}\,R\,x\), where \(R\) is a \(k\times D\) matrix of random entries — Gaussian \(\mathcal N(0,1)\), or even sparse \(\{-1,0,+1\}\) values (Achlioptas’s construction) that make the multiply fast and cheap.

When to reach for it. Random projection shines exactly where PCA strains: enormous \(D\) where even computing the covariance is too slow, streaming data where you cannot revisit points, or as a cheap pre-reduction before a heavier method. It is the standard speed trick behind large-scale nearest-neighbor search and many sketching algorithms. The trade-off is honest: it preserves distances on average with high probability but, unlike PCA, makes no attempt to concentrate variance or denoise — it keeps the geometry, not the signal-to-noise.

Worked check — distances really do survive. Project 200-dimensional points down to 50 random dimensions and measure how much pairwise distances drift:

import numpy as np
rng = np.random.default_rng(0)
D, k, n = 200, 50, 300
X = rng.standard_normal((n, D))

R = rng.standard_normal((k, D)) / np.sqrt(k)   # JL random projection
Z = X @ R.T                                     # (n, k) reduced

def pdists(M):                                  # all pairwise distances
    sq = (M**2).sum(1)
    return np.sqrt(np.maximum(sq[:,None] + sq[None,:] - 2*M@M.T, 0))

ratio = pdists(Z)[np.triu_indices(n,1)] / pdists(X)[np.triu_indices(n,1)]
print("distance ratio mean/std:", ratio.mean().round(3), ratio.std().round(3))
assert abs(ratio.mean() - 1) < 0.05            # distances preserved on average

With scikit-learn. The library picks the JL target dimension for you from the sample count and tolerance, and offers both dense Gaussian and sparse variants:

from sklearn.random_projection import GaussianRandomProjection, johnson_lindenstrauss_min_dim

k = johnson_lindenstrauss_min_dim(n_samples=10_000, eps=0.1)   # suggested target dim
rp = GaussianRandomProjection(n_components=k, random_state=0)
Z = rp.fit_transform(X)        # no covariance, no eigendecomposition — just a matmul

Tip

Random projection is the lazy first move on truly massive, high-\(D\) data: it costs one matrix multiply and a fixed memory budget. If results look good, you have saved a PCA; if you need denoising or interpretable axes, fall back to PCA — often on the randomly projected data, which is much smaller and faster to decompose.

7.8 — Choosing a Method in Practice & Evaluating the Result

With seven techniques on the table, the practical question is which one to grab — and once you have a reduced representation, how to tell whether it is any good. This section is the field guide.

Pick by your goal, not by fashion. The single most useful filter is what you are reducing for:

Your goal	Reach for	Why
Compress / denoise / speed up a model	PCA (SVD)	fast, invertible, concentrates variance
Truly huge \(D\), can’t afford PCA	Random projection	one matmul, no eigendecomposition
Maximize class separation (you have labels)	LDA	optimizes between/within class ratio
Unfold a known curved manifold	Kernel PCA, Isomap, LLE	respect along-surface geometry
Make a 2-D picture of clusters	UMAP (or t-SNE)	preserve neighborhoods for the eye
Separate mixed sources (audio, EEG)	ICA	targets independence, not variance
Curved structure + lots of data + reuse	Autoencoder	learns a nonlinear, reusable encoder

flowchart TD
    A[Need to reduce] --> B{Have labels and want separation?}
    B -->|Yes| C[LDA]
    B -->|No| D{Goal is a 2-D picture?}
    D -->|Yes| E[UMAP / t-SNE]
    D -->|No| F{D huge, need it cheap now?}
    F -->|Yes| G[Random projection]
    F -->|No| H{Structure curved?}
    H -->|No| I[PCA / SVD]
    H -->|Yes, lots of data| J[Autoencoder]
    H -->|Yes, modest data| K[Kernel PCA / Isomap]

Two practical splits to keep straight. First, feature selection vs. feature extraction. Everything in this chapter is extraction — it builds brand-new features as combinations of the originals (a principal component is a blend of many inputs). Selection instead keeps a subset of the original columns untouched (drop low-variance features, pick by mutual information, use L1/Lasso). Extraction usually compresses harder; selection keeps features human-readable. Reach for selection when interpretability or audit trails matter; extraction when raw compression and denoising matter.

Second, transform leakage. A reducer like PCA learns parameters (the components) from data, so it must be fit on the training set only and then applied to validation and test — exactly like a scaler. Fitting PCA on the full dataset before splitting leaks test information into training and inflates your scores. In scikit-learn the fix is one line: put the reducer inside a Pipeline so cross-validation re-fits it on each fold.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# PCA is fit fresh on each training fold — no leakage from val/test
pipe = make_pipeline(StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
print("CV accuracy:", scores.mean().round(3))

How to tell a reduction is good. There is no single number, but three lenses cover most cases:

Reconstruction error (for invertible methods: PCA, autoencoders) — push the reduced code back to the original space and measure \(\|x - \hat x\|\). Low error means little was lost. This is the cleanest, most honest score when it applies.
Downstream task score — the pragmatic test: does a model trained on the reduced features match (or beat, via denoising) the model on the full features? If accuracy holds with 20 dimensions instead of 1,000, the reduction earned its keep.
Trustworthiness / neighbor preservation (for visualization methods) — check what fraction of each point’s near neighbors in 2-D were also near neighbors in high-D. This catches the failure mode where t-SNE or UMAP invents clusters that were not there.

\[\text{explained variance ratio} = \frac{\sum_{j\le k}\lambda_j}{\sum_j \lambda_j}, \qquad \text{reconstruction error} = \frac{1}{n}\sum_i \|x_i - \hat x_i\|^2 .\]

In words: the explained-variance ratio is the share of total spread your kept components account for; the reconstruction error is the average squared gap between each original point and its rebuilt version — high explained variance and low reconstruction error both say “little was thrown away.” Also written: for PCA the two are linked — reconstruction error equals the discarded variance \(\sum_{j>k}\lambda_j\), so \(\text{explained ratio} = 1 - \dfrac{\text{reconstruction error}}{\text{total variance}}\).

A small worked comparison. Suppose 1,000-D data and a classifier that scores 0.90 on the full features. You try three reductions to 50-D and read the lenses: PCA keeps 95% explained variance and the classifier holds 0.90 (cheap win — half the noise, same accuracy). Random projection is faster to compute but the classifier slips to 0.86 (geometry preserved, but no denoising). An autoencoder pushes to 30-D with the classifier at 0.91 (it found curved structure PCA missed) but took an hour to train. The right call depends on your budget: PCA for the easy 90% of cases, the autoencoder only when the extra point of accuracy or the reusable encoder pays for the training time.

Tip

Always benchmark against the no-reduction baseline. If the full-feature model is already fast enough and accurate enough, the laziest correct answer is to skip reduction entirely — its main payoffs (speed, denoising, visualization) only matter when one of them is actually a bottleneck.

7.9 — Quick reference

Term / method	Meaning in one line	When / why
Curse of dimensionality	As \(D\) grows, points get sparse and distances all look equal	The reason reduction exists at all
PCA	Eigenvectors of the covariance = directions of max variance	Default reducer: compress, denoise, speed up
Covariance \(\Sigma = \frac1n X^\top X\)	How features co-vary, after centering	The matrix PCA diagonalizes
SVD \(X = USV^\top\)	Stable way to get PCA without forming \(\Sigma\)	Always prefer over eigendecomposition of \(\Sigma\)
Explained-variance ratio	Share of total spread kept by top \(k\) components	Picking \(k\) (95% / 99% threshold or scree elbow)
Standardize + center	Z-score and zero-mean each feature before PCA	Mandatory unless features share units
Kernel PCA	PCA in an implicit high-D feature space via a kernel	Curved structure a straight projection flattens
RBF bandwidth \(\gamma\)	Sets how fast similarity decays with distance	Start from \(1/\text{median pairwise sq. dist}\)
Manifold learning (Isomap/LLE)	Unfold a low-D surface by along-sheet distances	Swiss-roll-style curved data
LDA	Maximize between-class over within-class scatter	Supervised: when class separation is the goal
\(C-1\) rule	LDA yields at most (#classes \(-1\)) directions	Hard cap on LDA output dimensions
t-SNE / UMAP	Neighbor-preserving 2-D/3-D layout	Visualization only — never as model features
Perplexity / n_neighbors	Effective neighbor count knob	Tune and view several settings before trusting
ICA	Find statistically independent (not just uncorrelated) sources	Unmix audio/EEG; needs non-Gaussian sources
Non-Gaussianity (kurtosis/negentropy)	What ICA maximizes to unmix	Why Gaussian sources can’t be separated
Autoencoder	Neural bottleneck trained on reconstruction error	Curved manifold + enough data + reusable encoder
Reconstruction error \(\frac1n\sum\\|x-\hat x\\|^2\)	Average squared rebuild gap	Honest score for PCA / autoencoders
Random projection (JL)	Random matrix multiply preserves pairwise distances	Huge \(D\), streaming, cheap pre-reduction
Fit on train only	Reducers learn parameters → fit inside a `Pipeline`	Avoid leaking test info, inflated scores

7.10 — Key takeaways

Dimensionality reduction trades a tolerable bit of information for speed, denoising, less overfitting, and the ability to visualize — fighting the curse of dimensionality (in high \(D\), almost all of a region’s volume hugs its boundary, so points grow sparse and equidistant).
PCA: unsupervised, finds orthogonal directions of maximum variance via eigenvectors of the covariance (best computed by SVD); standardize and center first; choose \(k\) with a scree elbow or cumulative-variance threshold. Real covariance matrices are PSD, so every eigenvalue is non-negative.
Kernel PCA / manifold learning bend space to unfold curved structure linear PCA can’t capture (the Swiss roll is the iconic case); powerful but sensitive to kernel/neighbor settings — set the RBF bandwidth from the median pairwise distance — and hard to invert.
LDA is supervised — it maximizes between-class over within-class scatter and yields at most \(C-1\) directions; use it when class separation, not raw variance, is the goal.
t-SNE & UMAP are for visualization only: they preserve neighborhoods, not distances or cluster sizes; never trust absolute geometry and never feed their coordinates into a model.
ICA seeks statistical independence (not mere decorrelation) by maximizing non-Gaussianity, solving the cocktail-party unmixing problem; scale and order of components are ambiguous.
Autoencoders learn the reduction with a neural-network bottleneck; a linear one reproduces PCA, deep ones capture curved manifolds, and the VAE variant turns the latent code into a sampler that bridges into generative modeling.
Random projection ignores the data entirely — a random matrix multiply preserves all pairwise distances within \(1\pm\varepsilon\) (Johnson–Lindenstrauss), with target dimension set by the point count, not \(D\); the go-to cheap reducer for truly huge dimensions.
Choosing a method starts from the goal (compress, separate classes, unfold a manifold, visualize, or unmix); fit reducers on training data only (use a Pipeline to avoid leakage), and judge results by reconstruction error, downstream task score, or neighbor preservation — always against a no-reduction baseline.

7.11 — See also

[Mathematical Foundations] Linear Algebra — eigenvectors, SVD, and projections that underpin PCA, LDA, and ICA.
[Mathematical Foundations] Probability & Statistics — covariance, variance, KL divergence, non-Gaussianity, and the Central Limit Theorem behind ICA.
[The ML Workflow] Data Preprocessing — standardization and whitening that reduction methods depend on.
[Classical Machine Learning] Clustering & Unsupervised Learning — the sibling unsupervised family; reduction often precedes clustering.
[Classical Machine Learning] Classification Algorithms — where LDA doubles as a classifier and reduced features feed downstream models.
[Deep Learning] Generative Models — autoencoders and VAEs as the nonlinear, learned descendants of these reduction ideas.

↪ The thread continues → Chapter 08 · 📈 Regression

With tidy, compact features ready, we fit our first models — starting where machine learning itself started: drawing the best line through the data.

📖 All chapters | ← 06 · 🧹 Data Preprocessing | 08 · 📈 Regression →