Chapter 08 — 🗺️ Unsupervised Learning & Dimensionality Reduction — structure without labels

📖 All chapters | ← 07 · 🌲 Ensembles & Boosting | 09 · 🎯 Model Evaluation & Validation →

📚 Jump to any chapter

🧮 Mathematical Foundations

🧩 Classical Machine Learning

🧠 Deep Learning

⚡ The Transformer Era

💬 Using & Adapting LLMs

🤖 The Agentic Frontier

🛠️ The Practical Toolkit

☁️ Cloud AI Platforms

28 · ☁️ Cloud AI Platforms — deploying foundation models on the hyperscalers

Until now every algorithm we met needed a teacher: Chapter 06’s linear and logistic regressions, Chapter 07’s boosted trees — all of them learned from rows of data stamped with the right answer. This chapter throws the answer key away. We ask a harder, more human question: given a pile of unlabelled points, what structure is hiding in it? We will group points (clustering), squeeze high-dimensional data down to a few directions (dimensionality reduction), and spot the oddballs (anomaly detection) — the same instinct that later lets us learn word vectors from raw text in Chapter 14, and the anomaly work hands straight off to Chapter 09’s treatment of imbalanced evaluation.

📍 Timeline: mid-20th century onward — k-means (1957), hierarchical clustering, and PCA (which dates all the way back to Pearson, 1901) matured into the workhorses for finding patterns when nobody labelled the data; t-SNE (2008) and UMAP (2018) are the modern visualization heirs.

8.1 — k-means clustering

Imagine dropping \(k\) flags on a map, assigning every house to its nearest flag, then moving each flag to the centre of the houses that chose it — and repeating until nobody switches. That is k-means: it partitions \(n\) points into \(k\) groups by alternating between assign and update steps, each one lowering the total squared distance from points to their cluster centre.

Formally it minimizes the within-cluster sum of squares (WCSS), also called inertia:

\[ J = \sum_{i=1}^{n} \lVert x_i - \mu_{c_i} \rVert^2 \]

where \(\mu_{c_i}\) is the centroid of the cluster point \(x_i\) belongs to.

flowchart TD
  A["Pick k initial centroids"] --> B["Assign each point to nearest centroid"]
  B --> C["Recompute each centroid as mean of its points"]
  C --> D{"Any assignment changed?"}
  D -->|"yes"| B
  D -->|"no"| E["Done — clusters stable"]

Here is the whole mechanism in a few lines of numpy — no framework call:

import numpy as np

def kmeans(X, k, iters=100):
    # start: k random points as centroids
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # assign: nearest centroid by squared distance
        d = ((X[:, None, :] - centroids[None, :, :])**2).sum(2)
        labels = d.argmin(1)
        # update: move each centroid to the mean of its members
        new = np.array([X[labels == j].mean(0) for j in range(k)])
        if np.allclose(new, centroids):   # converged
            break
        centroids = new
    return labels, centroids

The centroid is just the mean of its cluster — sitting in the middle of its points, not on top of any one of them:

Q: What are the two alternating steps of k-means? The assignment step (each point joins its nearest centroid) and the update step (each centroid moves to the mean of its assigned points). This is a special case of the EM pattern: a hard-assignment expectation followed by a maximization. Each step provably never increases the WCSS, so the algorithm converges.

Q: Does k-means find the global optimum? No. Minimizing WCSS exactly is NP-hard; k-means only reaches a local optimum that depends on where the centroids started. That is why you run it several times with different seeds (n_init in scikit-learn) and keep the run with the lowest inertia.

Q: Why is initialization so important, and what does k-means++ do? Random initial centroids can land two flags inside the same true cluster, giving a bad local minimum. k-means++ spreads the initial centroids out: it picks the first centroid at random, then picks each next one with probability proportional to its squared distance from the nearest existing centroid. Far-apart seeds converge faster and to better solutions.

Q: Why must you scale features before k-means? k-means uses Euclidean distance, so a feature measured in large units (say salary in dollars) dominates one in small units (years of experience). Standardize features (zero mean, unit variance) first, or the clustering just reflects the units. This same caution applies to PCA.

Q: What is the time complexity of k-means, and why does it scale where agglomerative clustering doesn’t? Each iteration compares all \(n\) points against all \(k\) centroids in \(d\) dimensions, so k-means is roughly \(O(n \cdot k \cdot d \cdot \text{iters})\) — linear in the number of points. That is why it happily scales to millions of rows. Hierarchical clustering, by contrast, must compare every pair of clusters, costing \(O(n^3)\) time and \(O(n^2)\) memory — fine for thousands of points, hopeless for millions.

Q: What happens if a cluster ends up with no points assigned to it? You get an empty cluster — its centroid has no members to average, so the mean is undefined. This happens when a centroid gets stranded far from all the data. In practice libraries handle it for you: scikit-learn reinitializes the empty centroid (e.g. to the point farthest from its current cluster) and continues, so \(k\) clusters are preserved.

Q: How do you choose k? Two common heuristics. The elbow method plots inertia against \(k\) and looks for the “elbow” where adding clusters stops helping much. The silhouette score measures, for each point, how much closer it is to its own cluster than the next nearest — averaged, higher is better (range \(-1\) to \(1\)). Both share the same spirit you’ll see again in PCA’s scree plot: find the point of diminishing returns. There is no labelled ground truth, so these are guides, not proofs.

The silhouette of a point combines two distances — \(a\) (mean distance to points in its own cluster) and \(b\) (mean distance to the nearest other cluster):

\[ s = \frac{b - a}{\max(a, b)} \]

A score near \(+1\) means the point sits snugly in its own cluster; near \(0\) it is on a boundary; negative means it was probably assigned to the wrong cluster.

Warning

Interview gotcha: k-means implicitly assumes clusters are roughly spherical and equal-sized (it draws straight Voronoi boundaries). Give it two crescent moons or one tiny cluster next to one huge one and it fails. Reach for DBSCAN or a GMM instead.

8.2 — Hierarchical (agglomerative) clustering

Instead of committing to \(k\) up front, agglomerative clustering builds a family tree: start with every point as its own cluster, then repeatedly merge the two closest clusters until one remains. You read the result off a dendrogram and cut it at whatever height gives the number of clusters you want.

The “closest” depends on the linkage rule — how you measure distance between two groups, not two points:

Linkage	Distance between clusters A and B	Tendency
Single	min distance over all pairs	long, chained clusters
Complete	max distance over all pairs	compact, equal-diameter
Average	mean distance over all pairs	balanced compromise
Ward	increase in within-cluster variance from merging	spherical, like k-means

Q: What is a dendrogram and how do you use it? A dendrogram is the tree of merges, with merge height equal to the distance at which two clusters joined. You “cut” it with a horizontal line: the number of vertical branches it crosses is your number of clusters. This lets you decide \(k\) after seeing the structure, not before.

Q: How does agglomerative clustering differ from k-means? It needs no preset \(k\) and is deterministic (no random init), and it can capture nested structure. The price is cost: naive implementations are \(O(n^3)\) time and \(O(n^2)\) memory, so unlike k-means’ linear scaling it cannot reach millions of points.

Q: When would you pick single linkage vs Ward? Single linkage can trace non-elliptical, stringy shapes but suffers from “chaining” (two real clusters bridged by a few stray points get merged). Ward linkage minimizes variance and gives tidy, roughly spherical clusters — a good default when you expect blob-like groups.

8.3 — DBSCAN: density-based clustering

k-means asks “which centre is nearest?”; DBSCAN asks “is this point in a crowded neighbourhood?” It grows clusters outward from dense regions and labels the sparse leftovers as noise. That single idea buys two things k-means lacks: it finds arbitrarily shaped clusters (crescents, rings) and it needs no preset \(k\).

It has two knobs: eps (the neighbourhood radius) and min_samples (how many points must sit within eps to count as dense). Points are then typed:

Core point — has at least min_samples neighbours within eps.
Border point — within eps of a core point but not itself dense.
Noise point — neither; it belongs to no cluster.

The key mechanism is density-reachability: any two core points that lie within eps of each other are linked, and a cluster is the whole connected chain of mutually reachable core points plus the border points hanging off them. So a single cluster can stretch across an arbitrarily long, winding dense region — points are not classified one at a time in isolation.

flowchart LR
  A["Unvisited point p"] --> B{"≥ min_samples<br/>within eps?"}
  B -->|"yes"| C["Core: link to reachable<br/>core points → one cluster"]
  B -->|"no"| D{"Near a core point?"}
  D -->|"yes"| E["Border: join that cluster"]
  D -->|"no"| F["Noise"]

Q: What are DBSCAN’s two hyperparameters and how do you set eps? eps (radius) and min_samples (density threshold). A standard trick for eps: compute each point’s distance to its \(k\)-th nearest neighbour, sort those distances, and plot them — the “knee” in that curve is a good eps. min_samples is often set near \(2 \times \text{dimensions}\).

Q: What is density-reachability, and how does it form a cluster? A point \(q\) is directly density-reachable from a core point \(p\) if \(q\) lies within eps of \(p\). Chain these links together — core point to core point to core point — and the whole connected blob becomes one cluster. This is why DBSCAN can follow a long crescent: it does not need a centre, just an unbroken trail of dense neighbourhoods.

Q: What is DBSCAN’s biggest advantage over k-means? It finds non-convex clusters of arbitrary shape and automatically labels outliers as noise instead of forcing every point into a cluster. It also discovers the number of clusters from the data rather than requiring \(k\) in advance.

Q: When does DBSCAN struggle? When clusters have very different densities — a single global eps cannot be loose enough for the sparse cluster and tight enough for the dense one at once. It also degrades in high dimensions, where distances concentrate and the notion of a dense neighbourhood breaks down. (HDBSCAN, a variant, relaxes the single-eps limitation.)

8.4 — Gaussian Mixture Models and the EM algorithm

k-means gives every point one hard label. But what if a point sits between two clusters? A Gaussian Mixture Model (GMM) answers softly: it assumes the data was generated by \(k\) overlapping Gaussian “blobs”, and for each point it returns a probability of belonging to each blob. Think of k-means as a GMM where every blob is a perfect sphere and every assignment is forced to 0 or 1.

The density is a weighted sum of Gaussians, with mixing weights \(\pi_k\) that sum to 1:

\[ p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k) \]

The real payoff is geometric. Because each Gaussian carries its own covariance matrix, a GMM can fit tilted, stretched, overlapping ellipses, whereas k-means can only carve space into equal round blobs:

We fit it with Expectation–Maximization (EM), which alternates two steps much like k-means:

flowchart TD
  A["Init means, covariances, weights"] --> B["E-step: compute each point's<br/>responsibility (soft membership)"]
  B --> C["M-step: re-estimate means,<br/>covariances, weights from responsibilities"]
  C --> D{"Log-likelihood<br/>converged?"}
  D -->|"no"| B
  D -->|"yes"| E["Done"]

Q: What is the difference between hard and soft clustering? Hard clustering (k-means) assigns each point to exactly one cluster. Soft clustering (GMM) gives each point a responsibility — a probability for each cluster, e.g. 70% cluster A, 30% cluster B. Soft assignments are useful when clusters overlap or when you want a confidence, not just a label.

Q: What do the E-step and M-step do? The E-step (expectation) fixes the current parameters and computes each point’s responsibility — the posterior probability \(p(\text{cluster } k \mid x)\). The M-step (maximization) fixes those responsibilities and updates each Gaussian’s mean, covariance, and weight to best fit its softly-assigned points. EM provably never decreases the data log-likelihood, so it converges (to a local optimum).

Q: Why is a GMM more flexible than k-means? Because each Gaussian has its own covariance matrix, clusters can be elliptical, rotated, and different sizes — not just equal spheres. With a “full” covariance a GMM captures correlated, stretched clusters that k-means would split incorrectly.

Tip

Intuition: EM is the general pattern behind both GMM and k-means — guess the hidden assignments, fit parameters to them, repeat. k-means is just EM with spherical equal-variance Gaussians and hard (argmax) responsibilities.

8.5 — PCA: principal component analysis

You have data in 100 dimensions but suspect it really lives on a 2D sheet. PCA finds that sheet. Intuitively: rotate the coordinate axes so the first new axis points along the direction of greatest spread in the data, the second along the next-greatest (at right angles to the first), and so on. Keep the first few axes and you have compressed the data while throwing away the least information.

Those directions are the eigenvectors of the covariance matrix, and each one’s eigenvalue is the variance it captures. Equivalently — and more stably in practice — they come from the singular value decomposition (SVD) of the centred data matrix \(X = U\Sigma V^\top\), where the columns of \(V\) are the principal components.

\[ \Sigma_{\text{cov}} = \frac{1}{n-1} X^\top X, \qquad \Sigma_{\text{cov}} v_i = \lambda_i v_i \]

Q: What does PCA actually maximize? The variance captured along each new axis. The first principal component is the unit direction along which the projected data has the largest variance; each subsequent component is the highest-variance direction orthogonal to all the previous ones. Maximizing retained variance is the same as minimizing the squared reconstruction error.

Q: How is PCA related to SVD and eigenvectors? The principal components are the eigenvectors of the covariance matrix, with eigenvalues equal to the variance along each. Computing them via the SVD of the centred data (\(X = U\Sigma V^\top\)) is numerically more stable — you never form \(X^\top X\) — and the right-singular vectors \(V\) give the same components, with singular values \(\sigma_i\) relating to variance by \(\lambda_i = \sigma_i^2/(n-1)\).

Q: How do you choose the number of components? Look at the explained variance ratio \(\lambda_i / \sum_j \lambda_j\) and keep enough components to reach a target cumulative variance (e.g. 95%). A scree plot of eigenvalues — the same “find the elbow” idea you met when picking \(k\) for k-means — often shows where returns flatten out.

Q: When must you standardize before PCA? When features are on different scales or units. PCA chases variance, so a feature with a large numeric range would hijack the first component for no real reason. Standardizing to unit variance (PCA on the correlation matrix rather than covariance) puts every feature on equal footing — skip it only when all features already share meaningful, comparable units.

Q: Is PCA linear, and what’s the consequence? Yes — it can only rotate and project, so it captures linear correlations. Data lying on a curved manifold (a rolled-up sheet) won’t unfold under PCA; that is where nonlinear methods like t-SNE, UMAP, or kernel PCA come in.

Warning

Interview gotcha: PCA components are ranked by variance, not by usefulness for your label. A high-variance direction can be irrelevant noise, and a low-variance direction can be exactly what separates your classes. PCA is unsupervised — it never sees the target.

8.6 — t-SNE and UMAP for visualization

PCA is linear, so a 2D PCA plot often smears tangled high-dimensional structure into a blob. t-SNE and UMAP are nonlinear methods built for one job: produce a 2D or 3D picture where points that were neighbours in high-D stay neighbours on screen. They are visualization tools, not general-purpose preprocessors.

The core idea: define a notion of “neighbour probability” in the original space, then arrange points in 2D so those neighbour relationships are preserved as closely as possible. t-SNE matches pairwise neighbour distributions (minimizing a KL divergence — see Chapter 04); UMAP builds a neighbour graph and lays it out, which is faster and tends to keep more global structure.

Aspect	PCA	t-SNE	UMAP
Linear?	yes	no	no
Preserves	global variance	local neighbourhoods	local + some global
Speed	fast	slow	fast
Main use	compression, preprocessing	visualization	visualization, some preprocessing
Key knob	n_components	perplexity	n_neighbors, min_dist

Q: Why can’t you trust cluster sizes and distances in a t-SNE plot? Because t-SNE distorts space to preserve local neighbourhoods, not global geometry. The size of a blob and the gap between blobs are largely artefacts of the algorithm and its perplexity setting — a tight cluster and a loose one can be drawn the same size. Read t-SNE plots as “these points are neighbours,” never as “this cluster is twice as big” or “these two clusters are far apart.”

Q: What is perplexity in t-SNE? Roughly the effective number of neighbours each point considers (typical range 5–50). Low perplexity emphasizes very local structure and can shatter clusters; high perplexity blends in more global structure. Different perplexities can produce visibly different plots, so it is worth trying a few.

Q: How does UMAP differ from t-SNE in practice? UMAP is usually faster, scales to larger datasets, and tends to preserve more global structure, so relative cluster positions are a bit more meaningful (though still not gospel). Its main knobs are n_neighbors (local vs global balance) and min_dist (how tightly points pack). It can also transform new, unseen points, which classic t-SNE cannot.

Warning

Interview gotcha: Never feed t-SNE/UMAP coordinates into a downstream classifier or run k-means on them as a substitute for real features. They are stochastic, distance-distorting visualizations — the axes have no meaning. Use PCA (or autoencoders) for dimensionality reduction you intend to model on.

8.7 — Anomaly detection

An anomaly is a point that doesn’t fit the pattern the rest of the data follows — a fraudulent transaction, a failing sensor, an intrusion. The unifying intuition: model what “normal” looks like, then flag whatever is unlikely or far from it. Because anomalies are rare and often unlabelled, it is naturally an unsupervised problem.

The methods fall into three families, each with a different definition of “weird”:

Family	“Normal” is defined by	Flag a point when	Examples
Statistical	a fitted distribution	low probability / many σ from the mean	z-score, Gaussian / GMM density
Distance / density	proximity to other points	far from neighbours / in a sparse region	k-NN distance, DBSCAN noise, LOF
Model-based	a learned model of normal data	the model “explains” it poorly	Isolation Forest, one-class SVM, autoencoder reconstruction error

Q: What are the main families of anomaly-detection methods? Statistical (flag points beyond a few standard deviations, or with low probability under a fitted Gaussian/GMM); distance/density-based (DBSCAN’s noise points, or k-NN distance); and model-based like Isolation Forest, which isolates points with random splits — anomalies get isolated in fewer splits because they sit apart. One-class SVM and autoencoder reconstruction error are also common.

Q: Why is accuracy a terrible metric for anomaly detection? Because anomalies are extremely rare — if 0.1% of transactions are fraud, a model that predicts “always normal” scores 99.9% accuracy while catching nothing. You evaluate with precision, recall, and the precision–recall curve / AUPRC instead. This imbalanced-data evaluation is exactly what Chapter 09 takes up next.

Q: How does Isolation Forest work intuitively? It repeatedly splits the data on random features at random thresholds, building short random trees. Normal points sit in dense regions and take many splits to isolate; an anomaly sits alone and gets cut off in just a few. The shorter a point’s average path length to isolation, the more anomalous it is.

Key takeaways

Unsupervised learning finds structure without labels — the two big jobs are clustering (group points) and dimensionality reduction (compress dimensions).
k-means alternates assign/update to minimize within-cluster variance; it needs \(k\) up front, assumes spherical equal clusters, is sensitive to init (use k-means++) and to feature scale (standardize first). It runs in \(O(n \cdot k \cdot d \cdot \text{iters})\) — linear in \(n\), so it scales to millions; empty clusters get reinitialized.
Choose \(k\) with the elbow (inertia) or silhouette score \(s = (b-a)/\max(a,b)\) — guides, not ground truth.
Hierarchical/agglomerative clustering builds a dendrogram you cut later; no preset \(k\), but \(O(n^3)\) cost. DBSCAN is density-based, links density-reachable core points into arbitrarily shaped clusters, labels noise, needs no \(k\), but struggles with varying densities.
GMM + EM gives soft (probabilistic) clustering with elliptical, tilted clusters; k-means is the hard-assignment, spherical special case of EM.
PCA projects onto directions of maximum variance — the eigenvectors of the covariance matrix, best computed via SVD. It is linear and unsupervised; standardize when scales differ, pick components by explained variance (scree plot).
t-SNE / UMAP are nonlinear visualization tools that preserve local neighbourhoods — cluster sizes and inter-cluster distances are not meaningful; don’t model on their output.
Anomaly detection models “normal” and flags the unlikely across statistical, density, and model-based families (Isolation Forest, GMM density, DBSCAN noise); evaluate with precision/recall, never accuracy — the imbalanced-evaluation thread continues in Chapter 09.

📖 All chapters | ← 07 · 🌲 Ensembles & Boosting | 09 · 🎯 Model Evaluation & Validation →