flowchart LR R["R: users × movies<br/>(sparse, has gaps)"] -->|factorize| W["W: users × k<br/>taste factors"] R -->|factorize| H["H: k × movies<br/>movie profiles"] W --> P["R̂ = W·H<br/>(filled in)"] H --> P P --> REC["recommend top<br/>unseen predictions"]
Chapter 01 — 🧮 Linear Algebra
📖 All chapters | 02 · ∂ Calculus & Differentiation →
📚 Jump to any chapter
🧮 Mathematical Foundations
- 01 · 🧮 Linear Algebra
- 02 · ∂ Calculus & Differentiation
- 03 · 📉 Optimization
- 04 · 🎲 Probability & Statistics
🧭 The ML Workflow
🧩 Classical Machine Learning
- 08 · 📈 Regression
- 09 · 📐 Classification Algorithms
- 10 · 🌳 Ensemble Methods
- 11 · 🔮 Clustering & Unsupervised Learning
- 12 · 🎯 Model Evaluation & Tuning
🎲 Probabilistic Models
🧠 Deep Learning
- 14 · 🧠 Neural Networks (Core)
- 15 · 🖼️ Convolutional Neural Networks
- 16 · 🔁 Recurrent & Sequence Models
- 17 · ⚡ Attention & Transformers
- 18 · 🎨 Generative Models
🗣️ Applied AI: Vision, Language, Audio & Time
- 19 · 👁️ Computer Vision
- 20 · 💬 Natural Language Processing
- 21 · 🔊 Speech & Audio Processing
- 22 · ⏳ Time Series & Forecasting
- 23 · 📚 Large Language Models
- 24 · 🌈 Multimodal AI
🕹️ Reinforcement Learning
🛠️ Applied ML Systems & Industries
🚀 Production, Tooling & Infrastructure
📚 Classical & Symbolic AI
- 32 · 🧭 Search & Problem Solving
- 33 · 📖 Knowledge Representation & Reasoning
- 34 · 🗺️ Planning, Constraint Satisfaction & Game Playing
- 35 · 🧬 Evolutionary Computation & Metaheuristics
⚖️ Responsible AI & Frontier
- 36 · 🔍 Explainable AI & Interpretability
- 37 · 🧷 Causal Inference
- 38 · ⚖️ AI Ethics, Fairness & Safety
- 39 · 🌠 Frontier & Emerging Directions
🎓 Advanced & Specialized Topics
- 40 · 🔗 Graph Machine Learning
- 41 · 🤖 Robotics & Autonomy
- 42 · 📐 Learning Theory
- 43 · 🔎 Information Retrieval & Data Mining
- 44 · 🏗️ LLM Systems: Building LLMs from Scratch
🎚️ Post-Training & Fine-Tuning
- 45 · 🎚️ Post-Training I — Transfer, Fine-Tuning & PEFT
- 46 · 🏅 Post-Training II — Alignment & Evaluation
🚢 Model Serving & Deployment
Linear algebra is the language machine learning speaks. Almost every model — from a humble line fit to a billion-parameter transformer — is, under the hood, a pile of numbers arranged into vectors and matrices, transformed and combined by a handful of operations. This chapter builds that vocabulary from the ground up: what these objects are, how they move and stretch space, and why decompositions like eigendecomposition and SVD are the workhorses behind dimensionality reduction, recommenders, and stable training. Master this and the rest of the encyclopedia stops looking like magic and starts looking like arithmetic.
🧭 In context: Mathematical Foundations · the data structures and transformations every ML algorithm runs on · the one key idea — data is vectors, models are transformations, and the right basis makes hard problems easy.
💡 Remember this: data points are vectors, every model layer is a matrix that transforms them, and finding the right basis (eigenvectors, SVD, PCA) is what turns a hard problem into simple scaling.
1.1 — Vectors & vector operations
A vector is an ordered list of numbers. Geometrically, picture an arrow from the origin to a point; algebraically, it’s just coordinates like \(\mathbf{v} = [3, 2]\). In ML, a vector is how you encode one thing: a house becomes \([\text{1500 sqft}, 3\,\text{beds}, 2\,\text{baths}]\), a word becomes a 300-dimensional embedding, an image becomes a flattened list of pixel intensities. Each slot is a feature — one measurable axis of the thing.
Two operations generate everything else. Addition is tip-to-tail: add componentwise, \([3,2] + [1,3] = [4,5]\). Scaling (multiplying by a number, a scalar) stretches or flips the arrow: \(2 \cdot [3,2] = [6,4]\). Combine them — scale several vectors and add — and you get a linear combination, the single most important phrase in this chapter. Everything a linear model does is form linear combinations of features.
The dot product measures alignment. Intuition: it answers “how much do these two arrows agree?” — point the same way and it’s big and positive, stand at right angles and it’s zero, point opposite and it goes negative. Mechanically, multiply matching components and sum: \(\mathbf{v}\cdot\mathbf{w} = \sum_i v_i w_i\). For \([3,2]\cdot[1,3] = 3 + 6 = 9\). Its geometric meaning is the keystone:
\[\mathbf{v}\cdot\mathbf{w} = \|\mathbf{v}\|\,\|\mathbf{w}\|\cos\theta\]
In words: the dot product equals the two lengths multiplied together, scaled by how aligned they are (the cosine of the angle between them). Also written: \(\mathbf{v}\cdot\mathbf{w} = \mathbf{v}^\top\mathbf{w} = \sum_i v_i w_i\) — the same number as a row-times-column matrix product or an explicit sum.
So the dot product is positive when vectors point the same way, zero when orthogonal (perpendicular, \(\theta = 90°\)), negative when opposed. This single number powers cosine similarity in search, the score inside every neuron, and the kernels of Chapter 9 (Classification).
Watch the alignment idea move: a fixed reference arrow and a sweeping arrow. When they overlap the dot product is large and positive; at a right angle it passes through zero; when opposed it goes negative.
A small worked cosine similarity to make it concrete: a search query embedding \(\mathbf{q}=[1,0,1]\) against two documents \(\mathbf{d}_1=[1,0,1]\) (identical topic) and \(\mathbf{d}_2=[0,1,0]\) (unrelated). Then \(\mathbf{q}\cdot\mathbf{d}_1 = 2\) with cosine \(=2/(\sqrt2\sqrt2)=1.0\) (perfect match), while \(\mathbf{q}\cdot\mathbf{d}_2 = 0\) with cosine \(=0\) (no overlap). That is literally how a vector search engine ranks results.
import numpy as np
v, w = np.array([3,2]), np.array([1,3])
print(v + w) # [4 5] addition
print(2 * v) # [6 4] scaling
print(v @ w) # 9 dot product
cos = (v @ w) / (np.linalg.norm(v) * np.linalg.norm(w))
print(round(cos, 3)) # 0.832 -> ~33.7 degrees apartThe same cosine similarity in PyTorch and scikit-learn, the form you actually use for embeddings:
import torch
import torch.nn.functional as F
a = torch.tensor([[3., 2.]])
b = torch.tensor([[1., 3.]])
print(F.cosine_similarity(a, b).item()) # 0.832
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity([[3, 2]], [[1, 3]])[0, 0]) # 0.832A neuron computes \(\mathbf{w}\cdot\mathbf{x} + b\) — one dot product. If you understand the dot product as “how much of \(\mathbf{x}\) points along \(\mathbf{w}\)”, you already understand the core arithmetic of every neural network.
🎮 Try it — Vector Addition
🎮 Try it — Vector Subtraction
🎮 Try it — Dot Product
🎮 Try it — Cross Product
1.2 — Matrices & matrix operations
A matrix is a grid of numbers — a stack of vectors. Its shape is rows × columns: a \(2\times 3\) matrix has 2 rows, 3 columns. In ML the convention is a design matrix \(X\) of shape (samples × features): each row is one example, each column one feature. A dataset of 1000 houses with 5 features is a \(1000\times 5\) matrix.
The deep idea: a matrix is a transformation. Multiplying a vector by a matrix moves it to a new location — rotating, stretching, shearing, or projecting space. Matrix–vector multiplication \(A\mathbf{x}\) takes a linear combination of \(A\)’s columns, weighted by the entries of \(\mathbf{x}\). Matrix–matrix multiplication \(AB\) chains two transformations: do \(B\), then \(A\).
The mechanical rule: entry \((i,j)\) of \(AB\) is the dot product of row \(i\) of \(A\) with column \(j\) of \(B\). Shapes must conform — the inner dimensions must match: \((m\times n)(n\times p) = (m\times p)\).
\[\begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix}\begin{bmatrix} 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 1\cdot3 + 2\cdot4 \\ 0\cdot3 + 1\cdot4 \end{bmatrix} = \begin{bmatrix} 11 \\ 4 \end{bmatrix}\]
In words: each output number is a dot product of one row of the matrix with the input vector — row 1 gives 11, row 2 gives 4. Also written: \(A\mathbf{x} = 3\begin{bmatrix}1\\0\end{bmatrix} + 4\begin{bmatrix}2\\1\end{bmatrix}\) — the output is the input’s entries weighting the matrix’s columns (the linear-combination view).
That matrix is a shear: it slides points horizontally in proportion to their height. The point \([3,4]\) slides to \([11,4]\).
A = np.array([[1,2],[0,1]])
x = np.array([3,4])
print(A @ x) # [11 4] matrix as transformation
B = np.array([[0,-1],[1,0]]) # 90-degree rotation
print(A @ B) # chain: rotate, then shearIn a real model a “layer” is exactly this matrix multiply, applied to a whole batch at once. In PyTorch a linear layer is a matrix of weights times the input batch plus a bias:
import torch
layer = torch.nn.Linear(in_features=784, out_features=128) # a 128x784 weight matrix
x = torch.randn(32, 784) # batch of 32 examples, 784 features each
print(layer(x).shape) # torch.Size([32, 128]) -> one matmul, 32 rows transformedMatrix multiplication is associative (\(A(BC)=(AB)C\)) and distributive, but not commutative: \(AB \neq BA\) in general — rotating then shearing differs from shearing then rotating. The transpose \(A^\top\) flips rows and columns; the identity \(I\) leaves vectors untouched (\(I\mathbf{x}=\mathbf{x}\)), the matrix equivalent of multiplying by 1.
Shape errors are the #1 bug in ML code. Before any multiply, check inner dimensions match. \((32\times 784)(784\times 128)\) works and gives \(32\times 128\); \((32\times 784)(32\times 128)\) throws. Read shapes left to right and cancel the middle.
🎮 Try it — Matrix Multiplication
🎮 Try it — Transpose
1.3 — Matrix inverse & determinant
The inverse \(A^{-1}\) undoes what \(A\) does: \(A^{-1}A = AA^{-1} = I\). If \(A\) rotates space 30°, \(A^{-1}\) rotates it back. Inverses let you solve linear systems: \(A\mathbf{x} = \mathbf{b}\) becomes \(\mathbf{x} = A^{-1}\mathbf{b}\) — the algebra behind the closed-form least-squares solution in §1.9.
The determinant \(\det(A)\) is a single number measuring how much the transformation scales area (2D) or volume (nD). \(\det = 2\) doubles areas; \(\det = 1\) preserves them; \(\det = 0\) collapses space onto a lower dimension. For a \(2\times 2\):
\[\det\begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc\]
In words: the area-scaling factor of a 2×2 matrix is the main-diagonal product minus the off-diagonal product. Also written: \(\det(A) = \lambda_1\lambda_2\) — the determinant is also the product of the eigenvalues (§1.4), and in \(n\) dimensions \(\det(A)=\prod_i\lambda_i\).
The critical fact: \(A\) is invertible if and only if \(\det(A) \neq 0\). A zero determinant means the transformation squashed space flat — information was destroyed, and you cannot uniquely reverse it. Such a matrix is singular.
Worked example — invert \(A = \begin{bmatrix} 2 & 1 \\ 1 & 1 \end{bmatrix}\). Here \(\det = 2\cdot1 - 1\cdot1 = 1\). The \(2\times2\) inverse formula swaps the diagonal, negates the off-diagonal, divides by \(\det\):
\[A^{-1} = \frac{1}{ad-bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix} = \begin{bmatrix} 1 & -1 \\ -1 & 2 \end{bmatrix}\]
A = np.array([[2.,1.],[1.,1.]])
print(np.linalg.det(A)) # 1.0
print(np.linalg.inv(A)) # [[ 1 -1] [-1 2]]
# Solve Ax=b — never actually invert; solve is faster & stabler
b = np.array([5., 3.])
print(np.linalg.solve(A, b)) # [2. 1.]Almost never compute inv(A) to solve \(A\mathbf{x}=\mathbf{b}\). Use np.linalg.solve — it’s faster and far more numerically stable. A near-singular matrix (tiny determinant) makes the inverse explode and amplifies floating-point error; this is ill-conditioning (see §1.8 on the condition number).
🎮 Try it — Inverse
🎮 Try it — Determinant
1.4 — Eigenvalues & eigenvectors
Most vectors get knocked off their line when you apply a matrix. An eigenvector is special: the transformation only stretches it, never rotates it. It stays on its own line. The stretch factor is its eigenvalue \(\lambda\):
\[A\mathbf{v} = \lambda\mathbf{v}, \quad \mathbf{v}\neq\mathbf{0}\]
In words: applying the matrix to this special vector gives back the same direction, just scaled by \(\lambda\) — no rotation, only stretch. Also written: \((A - \lambda I)\mathbf{v} = \mathbf{0}\) — rearranged, the eigenvector lives in the null space of \(A-\lambda I\), which is why a nonzero solution forces \(\det(A-\lambda I)=0\).
Eigenvectors are the natural axes of a transformation — the directions along which it acts as pure scaling. They are everywhere in ML: the principal components of PCA are eigenvectors of the covariance matrix (Chapter 7, Dimensionality Reduction), PageRank is the dominant eigenvector of the web’s link matrix, and the eigenvalues of the loss Hessian govern whether training converges or diverges (Chapter 3, Optimization).
To find them, solve \(\det(A - \lambda I) = 0\) (the characteristic equation). Worked example with \(A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\):
\[\det\begin{bmatrix} 2-\lambda & 1 \\ 1 & 2-\lambda \end{bmatrix} = (2-\lambda)^2 - 1 = \lambda^2 - 4\lambda + 3 = 0 \;\Rightarrow\; \lambda = 3,\,1\]
For \(\lambda=3\): solving \((A-3I)\mathbf{v}=0\) gives \(\mathbf{v}=[1,1]\). For \(\lambda=1\): \(\mathbf{v}=[1,-1]\). So this matrix stretches by 3× along the diagonal \([1,1]\) and leaves the anti-diagonal \([1,-1]\) unchanged.
A = np.array([[2.,1.],[1.,2.]])
vals, vecs = np.linalg.eig(A)
print(vals) # [3. 1.]
print(vecs) # columns are eigenvectors (unit length)The animation below makes “stays on its own line” literal: the blue ordinary vector gets swung off its direction by the matrix, while the green eigenvector only pulses longer and shorter along its fixed line — direction locked, length scaling by \(\lambda\).
A symmetric matrix (like every covariance matrix) always has real eigenvalues and orthogonal eigenvectors — the spectral theorem — which is exactly why PCA’s axes come out perpendicular.
Read eigenvalues as “how much” and eigenvectors as “in which direction”. A large eigenvalue marks a direction of large variance (keep it in PCA); a near-zero eigenvalue marks a direction you can safely discard.
🎮 Try it — Eigenvalues
🎮 Try it — Eigenvectors
🎮 Try it — Spectral Decomposition
1.5 — Singular Value Decomposition (SVD)
Eigendecomposition only works for square matrices. SVD generalizes it to any \(m\times n\) matrix and is, arguably, the most useful factorization in all of applied math. It says any matrix can be written as three simple steps — rotate, scale, rotate:
\[A = U\Sigma V^\top\]
In words: any matrix’s action splits into “rotate the input, stretch along clean axes, rotate the output.” Also written: \(A = \sum_{i} \sigma_i\,\mathbf{u}_i\mathbf{v}_i^\top\) — a sum of rank-1 outer products, each weighted by its singular value (truncating this sum is exactly low-rank approximation).
\(U\) (\(m\times m\)) and \(V\) (\(n\times n\)) are orthogonal (pure rotations/reflections, columns are perpendicular unit vectors). \(\Sigma\) is diagonal, holding the singular values \(\sigma_1 \geq \sigma_2 \geq \cdots \geq 0\) — the scaling factors, sorted biggest-first. Geometrically, \(A\) takes the unit circle, rotates it by \(V^\top\), stretches it into an ellipse with axis lengths \(\sigma_i\), then rotates by \(U\).
Here is that “stretch the circle into an ellipse” step animated — the unit circle breathing out along two clean axes by the singular values \(\sigma_1,\sigma_2\), which is the only thing \(\Sigma\) does between the two rotations.
The killer application is low-rank approximation. Keep only the top \(k\) singular values and their vectors, and you get the best possible rank-\(k\) approximation of \(A\) (the Eckart–Young theorem). This is the engine of compression, denoising, PCA, latent semantic analysis, and the recommenders of §1.7.
A = np.array([[3.,1.,1.],[1.,3.,1.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(s) # singular values, descending
# Rank-1 reconstruction: keep only the largest singular value
A1 = s[0] * np.outer(U[:,0], Vt[0])
print(np.round(A1, 2)) # best rank-1 approximation of AIn practice, for large sparse matrices you reach for scikit-learn’s truncated SVD, which computes only the top \(k\) factors (the engine behind latent semantic analysis on a term–document matrix):
from sklearn.decomposition import TruncatedSVD
import numpy as np
X = np.random.default_rng(0).random((1000, 300)) # 1000 docs x 300 terms
svd = TruncatedSVD(n_components=10) # keep top 10 latent factors
Z = svd.fit_transform(X) # (1000, 10) compressed reps
print(Z.shape, svd.explained_variance_ratio_.sum()) # fraction of energy keptThe fraction of “energy” kept by \(k\) components is \(\sum_{i\le k}\sigma_i^2 / \sum_i \sigma_i^2\) — often you keep 95% of the energy with a tenth of the components. SVD also defines the pseudo-inverse that solves least squares even for non-square or singular matrices (§1.9).
If you learn one decomposition, learn SVD. It always exists, it’s numerically stable, and PCA, recommenders, image compression, and the rank of a matrix all fall straight out of it.
🎮 Try it — SVD
1.6 — QR, LU & Cholesky decomposition
These three factorizations are the computational backbone — the routines that actually run when you call solve. Each rewrites a matrix as a product of simpler matrices that are cheap to work with.
LU decomposition factors \(A = LU\) into a Lower-triangular and Upper-triangular matrix — essentially Gaussian elimination, bookkept. Triangular systems solve instantly by forward/back substitution, so once you have \(LU\) you can solve \(A\mathbf{x}=\mathbf{b}\) for many different \(\mathbf{b}\) cheaply. This is the default workhorse behind np.linalg.solve.
QR decomposition factors \(A = QR\) where \(Q\) is orthogonal (perpendicular unit columns) and \(R\) is upper-triangular. Because orthogonal matrices preserve length and never amplify error, QR is the numerically stable way to solve least-squares problems and the core step of eigenvalue algorithms.
Cholesky decomposition is the specialist: for a symmetric positive-definite matrix (all eigenvalues positive — covariance matrices qualify), \(A = LL^\top\). It’s roughly twice as fast as LU and underpins Gaussian processes, Kalman filters, and sampling from multivariate Gaussians.
Worked Cholesky on small numbers — factor \(A = \begin{bmatrix} 4 & 2 \\ 2 & 3 \end{bmatrix}\). You read \(L\) off entry by entry: \(\ell_{11}=\sqrt4=2\); \(\ell_{21}=2/\ell_{11}=1\); \(\ell_{22}=\sqrt{3-1^2}=\sqrt2\). So \(L=\begin{bmatrix}2 & 0 \\ 1 & \sqrt2\end{bmatrix}\), and indeed \(LL^\top=A\). Where this shows up: to draw a sample from a Gaussian with covariance \(A\), take standard normal noise \(\mathbf{z}\) and compute \(L\mathbf{z}\) — Cholesky is the cheap machine that “colors” white noise with the right correlations.
| Decomposition | Form | Requires | Main use |
|---|---|---|---|
| LU | \(A=LU\) | square | general \(A\mathbf{x}=\mathbf{b}\) |
| QR | \(A=QR\) | any (m≥n) | least squares, eigenvalues |
| Cholesky | \(A=LL^\top\) | symmetric pos.-def. | covariance, GPs, sampling |
from numpy.linalg import qr, cholesky
A = np.array([[4.,2.],[2.,3.]]) # symmetric positive-definite
L = cholesky(A)
print(np.round(L @ L.T, 1)) # reconstructs A
Q, R = qr(np.array([[1.,1.],[1.,2.],[1.,3.]]))
print(Q.shape, R.shape) # (3,2) (2,2) — thin QRCholesky throws if the matrix isn’t positive-definite — and that failure is a feature. It’s the cheapest test that a covariance estimate is valid; a crash means your matrix has a negative or zero eigenvalue and needs regularization (add \(\epsilon I\)).
🎮 Try it — QR Decomposition
🎮 Try it — Positive Definite Matrices
1.7 — Matrix factorization (and its use in ML / recommenders)
Matrix factorization is the general move of writing one matrix as a product of two thinner ones: \(R \approx W H\), where \(R\) is \(m\times n\) but \(W\) is \(m\times k\) and \(H\) is \(k\times n\) with \(k\) small. The shared inner dimension \(k\) is a latent space — a compact set of hidden factors that explain the data. This is the same idea as SVD’s low-rank approximation, now used as a model you learn.
The canonical example is a recommender system (Chapter 26). Take the Netflix-style matrix \(R\) of users × movies, mostly empty because nobody watches everything. Factor it into a user matrix \(W\) (each user → \(k\) taste factors) and a movie matrix \(H\) (each movie → the same \(k\) factors). The predicted rating is a dot product: user \(u\)’s taste against movie \(i\)’s profile.
\[\hat{r}_{ui} = \mathbf{w}_u \cdot \mathbf{h}_i = \sum_{f=1}^{k} w_{uf}\,h_{if}\]
In words: the predicted rating is how strongly user \(u\)’s taste vector aligns with movie \(i\)’s feature vector — a dot product over \(k\) hidden factors. Also written: \(\hat{R} = W H\) — stacking all users and movies, every predicted rating is one entry of the matrix product.
The factors are learned, not labeled, but often turn out interpretable — one might capture “action vs. romance”, another “indie vs. blockbuster”. Crucially, you fit \(W\) and \(H\) only on observed ratings, then multiply them out to fill in the blanks for movies a user hasn’t seen. That’s the recommendation.
# Tiny matrix factorization by gradient descent on observed entries
R = np.array([[5,3,0],[4,0,0],[1,1,5]], float) # 0 = unseen
mask = R > 0
k = 2
rng = np.random.default_rng(0)
W = rng.normal(0, .1, (3,k)); H = rng.normal(0, .1, (k,3))
for _ in range(4000):
err = mask * (R - W @ H) # only score observed cells
W += 0.01 * (err @ H.T)
H += 0.01 * (W.T @ err)
print(np.round(W @ H, 1)) # gaps now filled with predictionsThe latent dimension \(k\) is the dial that trades fit against generalization. Too small and the model can’t capture taste; too large and it memorizes noise in the sparse ratings. It’s the same bias–variance knob you’ll meet again in Chapter 12 (Model Evaluation & Tuning).
1.8 — Norms, rank, trace & orthogonality
These four scalars and properties are the everyday measuring tools of linear algebra.
A norm measures the “size” or length of a vector. The L2 (Euclidean) norm \(\|\mathbf{v}\|_2 = \sqrt{\sum v_i^2}\) is ordinary straight-line length. The L1 norm \(\|\mathbf{v}\|_1 = \sum |v_i|\) is the taxicab distance. The distinction is load-bearing in ML: L2 regularization (ridge) shrinks weights smoothly, while L1 regularization (lasso) drives them to exactly zero, producing sparse, feature-selecting models (Chapter 8, Regression).
\[\|\mathbf{v}\|_2 = \sqrt{\textstyle\sum_i v_i^2}, \qquad \|\mathbf{v}\|_1 = \textstyle\sum_i |v_i|\]
In words: the L2 norm is straight-line “as the crow flies” distance; the L1 norm is the total of the absolute steps, like blocks walked in a grid city. Also written: \(\|\mathbf{v}\|_2 = \sqrt{\mathbf{v}^\top\mathbf{v}}\) (square root of the dot product with itself), the special case \(p=2\) of the general \(p\)-norm \(\|\mathbf{v}\|_p = (\sum_i |v_i|^p)^{1/p}\).
The shape of each “unit ball” (all points at distance 1 from the origin) is the visual reason L1 drives weights to zero: the L1 ball is a diamond with sharp corners on the axes, so a solution constrained to it tends to land on a corner where some coordinates are exactly 0; the L2 ball is a smooth circle with no such preference.
The rank of a matrix is the number of genuinely independent directions in it — how many of its columns aren’t just combinations of the others. A \(1000\times 50\) data matrix where 10 features are duplicates of others has rank 40; the other 10 add nothing. Low rank means redundancy, which is exactly what factorization (§1.7) and dimensionality reduction exploit. The trace is the sum of the diagonal entries, \(\text{tr}(A)=\sum_i a_{ii}\); it equals the sum of the eigenvalues, a handy invariant.
Orthogonality means perpendicular — a dot product of zero. A set of vectors that are mutually orthogonal and unit length is orthonormal; stacked into a matrix they form an orthogonal matrix \(Q\) with the beautiful property \(Q^\top Q = I\), so \(Q^{-1}=Q^\top\). Orthogonal transformations rotate without distorting — they preserve lengths and angles, which is why they’re numerically gentle (the \(U\) and \(V\) of SVD, the \(Q\) of QR).
v = np.array([3.,4.])
print(np.linalg.norm(v), np.linalg.norm(v,1)) # 5.0 7.0 (L2, L1)
M = np.array([[1,2,3],[2,4,6],[0,1,0]], float) # row2 = 2*row1
print(np.linalg.matrix_rank(M)) # 2, not 3
print(np.trace(M)) # 1+4+0 = 5A related diagnostic is the condition number — the ratio of largest to smallest singular value. A large condition number flags an ill-conditioned matrix where tiny input changes cause huge output swings, the source of unstable training and untrustworthy inverses.
“Rank” and “number of columns” are not the same. Perfectly correlated features (multicollinearity) make a matrix rank-deficient, which breaks the matrix inverse in plain linear regression and forces a singular-matrix error. Regularization or SVD-based solvers fix it.
🎮 Try it — L1 Norm
🎮 Try it — L2 Norm
🎮 Try it — Rank
🎮 Try it — Trace
🎮 Try it — Orthogonality
1.9 — Projection & subspaces (column/null space, least squares)
A subspace is a flat region through the origin that’s closed under linear combinations — a line, a plane, or a higher-dimensional analog. Two subspaces of a matrix matter most. The column space is everything you can reach by combining the columns — all possible outputs \(A\mathbf{x}\). The null space is every input the matrix crushes to zero, \(A\mathbf{x}=\mathbf{0}\) — the directions of information it throws away.
A projection drops a vector onto a subspace, finding the closest point in it — like the shadow a point casts on a plane. This is the geometric heart of least squares. When \(A\mathbf{x}=\mathbf{b}\) has no exact solution (more equations than unknowns, the usual case in regression), \(\mathbf{b}\) lies off the column space. The best we can do is project \(\mathbf{b}\) onto that column space and solve for the nearest reachable point. That projection condition gives the normal equations:
\[A^\top A\,\hat{\mathbf{x}} = A^\top \mathbf{b} \quad\Longrightarrow\quad \hat{\mathbf{x}} = (A^\top A)^{-1}A^\top \mathbf{b}\]
In words: pick the coefficients \(\hat{\mathbf{x}}\) that make the leftover error perpendicular to every column you fit on — the closest reachable point to \(\mathbf{b}\). Also written: \(\hat{\mathbf{x}} = A^{+}\mathbf{b}\), where \(A^{+} = (A^\top A)^{-1}A^\top\) is the pseudo-inverse (and SVD computes it stably even when \(A^\top A\) is singular).
The residual \(\mathbf{b} - A\hat{\mathbf{x}}\) ends up orthogonal to the column space — there’s no leftover signal pointing in a direction the model could have used. This single formula is linear regression’s closed-form solution (Chapter 8).
# Fit a line y = a*x + b by least squares via the normal equations
x = np.array([1.,2.,3.,4.])
y = np.array([2.1, 3.9, 6.2, 7.8])
A = np.column_stack([x, np.ones_like(x)]) # design matrix [x | 1]
xhat = np.linalg.solve(A.T @ A, A.T @ y) # solve normal equations
print(np.round(xhat, 3)) # [slope, intercept] ≈ [1.92, 0.15]The stable, production way — let the library use QR/SVD instead of forming \(A^\top A\):
# NumPy's lstsq (QR/SVD under the hood) and scikit-learn's LinearRegression
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(coef, 3)) # same [slope, intercept]
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(x.reshape(-1, 1), y)
print(round(reg.coef_[0], 3), round(reg.intercept_, 3)) # 1.92 0.15Forming \(A^\top A\) squares the condition number — it doubles the numerical damage of multicollinearity. Production solvers (and np.linalg.lstsq) skip the normal equations and use QR or SVD on \(A\) directly, which is far more stable. Reach for the formula to understand it, not to ship it.
🎮 Try it — Projection
🎮 Try it — Subspaces
🎮 Try it — Basis
🎮 Try it — Linear Independence
1.10 — Tensors
A tensor is the generalization of the scalar–vector–matrix ladder to any number of dimensions, called the rank or order (the count of indices needed to address an element). A scalar is rank-0, a vector rank-1, a matrix rank-2, and beyond that we just say “tensor”. The word carries deeper meaning in physics, but in ML it’s pragmatic: a tensor is an n-dimensional array, the universal container that flows through TensorFlow and PyTorch (Chapter 31, Tools & Frameworks).
The reason deep learning needs them is that real data has natural multi-axis structure you don’t want to flatten away. A color image is rank-3: \((\text{height}, \text{width}, \text{channels})\). A batch of images for training is rank-4: \((\text{batch}, \text{height}, \text{width}, \text{channels})\). A batch of text sequences fed to a transformer is rank-3: \((\text{batch}, \text{tokens}, \text{embedding})\). Keeping the axes separate lets a convolution slide over height and width while treating channels specially (Chapter 15, Convolutional Neural Networks).
flowchart LR S["scalar<br/>rank 0<br/>7"] --> V["vector<br/>rank 1<br/>shape (3,)"] V --> M["matrix<br/>rank 2<br/>shape (2,3)"] M --> T["tensor<br/>rank 3<br/>shape (2,3,4)"] T --> B["batch tensor<br/>rank 4<br/>(N,H,W,C)"]
The everyday operations are reshaping (rearranging the same numbers into a new shape) and broadcasting (NumPy/PyTorch automatically stretching a smaller array across a larger one so you can add a length-3 bias vector to a whole batch without writing a loop). These are the two moves you’ll perform constantly when wrangling model inputs.
batch = np.zeros((32, 28, 28, 3)) # 32 RGB images, 28×28
print(batch.shape, batch.ndim) # (32,28,28,3) rank 4
flat = batch.reshape(32, -1) # flatten each image to a vector
print(flat.shape) # (32, 2352) -> ready for a dense layer
bias = np.array([1.,2.,3.]) # length-3 channel bias
print((batch + bias).shape) # (32,28,28,3) via broadcastingThe identical moves in PyTorch, where these tensors live on the GPU and carry gradients:
import torch
batch = torch.zeros(32, 3, 28, 28) # PyTorch convention: (N, C, H, W)
flat = batch.reshape(32, -1) # (32, 2352) for a Linear layer
print(flat.shape)
bias = torch.tensor([1., 2., 3.]).reshape(1, 3, 1, 1) # broadcast over N,H,W
print((batch + bias).shape) # torch.Size([32, 3, 28, 28])When confused by a deep-learning error, print .shape. Ninety percent of tensor bugs are a misplaced axis — a batch dimension where a channel should be, or a transpose you forgot. The shape tuple tells you exactly where the numbers went.
🎮 Try it — Tensor Algebra
1.11 — Iterative Solvers and Why We Avoid Inverting Matrices
Suppose you want to solve \(Ax = b\) for \(x\), where \(A\) is a known matrix and \(b\) a known vector. The textbook move is \(x = A^{-1}b\). The intuition feels clean: undo \(A\), recover \(x\). But in real numerical work, forming \(A^{-1}\) is almost always the wrong thing to do. It is the equivalent of computing a full division table to divide one number.
The first problem is cost. Inverting a dense \(n \times n\) matrix costs about \(\frac{2}{3}n^3\) floating-point operations and stores \(n^2\) numbers. For \(n = 100{,}000\) — a small mesh in physics, a mid-sized recommendation graph — that is \(10^{10}\) stored entries and on the order of \(10^{15}\) operations. You will run out of memory long before you run out of patience.
The second problem is that you rarely need the inverse itself. You need its action on a vector: the product \(A^{-1}b\). And the matrices that show up at scale are usually sparse — mostly zeros, like a graph Laplacian where each node touches a handful of neighbors. The inverse of a sparse matrix is generally dense: inverting destroys the very structure that made the problem tractable. A matrix you could store in megabytes becomes one you cannot store at all.
This is the dividing line between two families of methods:
| Direct methods | Iterative methods | |
|---|---|---|
| Examples | LU, QR, Cholesky, explicit inverse | CG, GMRES, MINRES |
| Cost | \(O(n^3)\) dense, fixed | \(O(\text{nnz})\) per step, \(\times\) #steps |
| Output | Exact (up to roundoff) | Approximate, refined each step |
| Needs | The full matrix \(A\) | Only the product \(v \mapsto Av\) |
| Sparsity | Often destroyed (fill-in) | Preserved |
| Best when | \(n\) small-to-medium, dense | \(n\) huge, sparse, or \(A\) implicit |
The last row of “Needs” is the deep one. An iterative solver never asks to see \(A\). It only asks: “given a vector \(v\), what is \(Av\)?” That product is a matrix-free operation — you can supply it as a function. In a neural network, \(A\) might be a Hessian you never materialize; you only know how to multiply by it via a backprop trick. Iterative methods are the only game in town there.
If your code ever computes inv(A) @ b, replace it with a solve: numpy.linalg.solve(A, b) for dense, scipy.sparse.linalg.spsolve or cg/gmres for sparse. The solve is faster, uses less memory, and is more accurate, because it never forms the inverse.
1.12 — Conjugate Gradient: Solving by Rolling Downhill
The conjugate gradient method (CG) solves \(Ax = b\) when \(A\) is symmetric positive definite (SPD) — symmetric, and with all eigenvalues positive. SPD systems are everywhere: least-squares normal equations, covariance matrices, graph Laplacians (plus a shift), the Newton step of a convex problem.
The key reframe: solving \(Ax = b\) with \(A\) SPD is exactly the same as minimizing the bowl-shaped quadratic
\[ f(x) = \tfrac{1}{2}\,x^\top A x - b^\top x. \]
Its gradient is \(\nabla f(x) = Ax - b\), which is zero precisely at the solution. So solving the linear system is finding the bottom of a bowl. The negative gradient \(r = b - Ax\) is the residual — it points downhill and measures how wrong we are.
Picture it: a ball released on the side of the bowl rolls down and settles at the lowest point. That resting point is the \(x\) solving \(Ax=b\).
The naive idea, steepest descent, walks straight downhill each step. On a stretched bowl (an ill-conditioned \(A\)) this zigzags badly, taking tiny crisscrossing steps. CG’s insight: instead of independent downhill steps, choose search directions that are \(A\)-conjugate, meaning \(p_i^\top A p_j = 0\) for \(i \neq j\). Conjugate directions don’t interfere: progress made along one is never undone by the next. In exact arithmetic, \(n\) conjugate steps land exactly at the solution — but in practice you stop far earlier, once the residual is small enough.
flowchart TD
A["Start: x0, r0 = b - A x0, p0 = r0"] --> B["alpha = (r·r) / (p·A p)"]
B --> C["x = x + alpha · p (step along direction)"]
C --> D["r_new = r - alpha · A p (update residual)"]
D --> E{"||r_new|| small?"}
E -->|yes| F["done: x solves A x = b"]
E -->|no| G["beta = (r_new·r_new)/(r·r)"]
G --> H["p = r_new + beta · p (new conjugate direction)"]
H --> B
Notice the loop touches \(A\) exactly once per iteration, in the single product \(Ap\). Everything else is cheap vector arithmetic. That is why CG scales: each step costs \(O(\text{nnz})\), the number of nonzeros.
Here is CG from scratch, solving a tiny SPD system.
import numpy as np
A = np.array([[4., 1.], [1., 3.]]) # symmetric, both eigenvalues > 0
b = np.array([1., 2.])
x = np.zeros(2)
r = b - A @ x # initial residual
p = r.copy() # first search direction
rs = r @ r # squared residual norm
for i in range(10):
Ap = A @ p # the ONE matrix-vector product
alpha = rs / (p @ Ap) # optimal step length along p
x = x + alpha * p # take the step
r = r - alpha * Ap # cheap residual update (no new A x)
rs_new = r @ r
if np.sqrt(rs_new) < 1e-10:
print(f"converged in {i+1} steps")
break
p = r + (rs_new / rs) * p # next A-conjugate direction
rs = rs_new
print("x =", x) # -> [0.0909..., 0.6363...]
print("check A x - b =", A @ x - b)For this \(2 \times 2\) problem CG converges in two steps, matching the theory. How fast CG converges on big problems is governed by the condition number \(\kappa\) (Section 1.95): the error shrinks roughly like \(\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^k\) after \(k\) steps. A well-conditioned system (\(\kappa\) near 1) converges almost instantly; a badly conditioned one crawls. This is why preconditioning — solving \(M^{-1}Ax = M^{-1}b\) with \(M \approx A\) but easy to invert — is the single most important practical trick in iterative solving.
CG assumes \(A\) is symmetric positive definite. Feed it a nonsymmetric or indefinite matrix and it can divide by a near-zero \(p^\top A p\), stall, or silently return garbage. Check symmetry and positive-definiteness first, or reach for GMRES/MINRES instead.
1.13 — GMRES and MINRES: When the Matrix Isn’t So Nice
CG’s elegance depends on \(A\) being SPD. But plenty of systems aren’t. The discretized equations for fluid flow give nonsymmetric matrices (the flow has a direction). A saddle-point system from constrained optimization is symmetric but indefinite — eigenvalues of both signs. For these, we need solvers that ask less of \(A\).
The shared idea behind both methods is the Krylov subspace. Starting from the residual \(r_0 = b - Ax_0\), repeatedly multiplying by \(A\) generates a sequence of vectors
\[ \mathcal{K}_k = \text{span}\{r_0,\; A r_0,\; A^2 r_0,\; \dots,\; A^{k-1} r_0\}. \]
This is the space of everything you can reach using only matrix-vector products — exactly the operations an iterative method is allowed. Each Krylov method picks the “best” approximate solution from within this growing subspace, for its own definition of best.
flowchart LR
subgraph choose["Which Krylov solver?"]
direction TB
Q1{"A symmetric?"}
Q1 -->|no| GM["GMRES<br/>(general A)"]
Q1 -->|yes| Q2{"positive<br/>definite?"}
Q2 -->|yes| CG["CG<br/>(cheapest)"]
Q2 -->|no, indefinite| MR["MINRES<br/>(symmetric)"]
end
GMRES (Generalized Minimal RESidual) works for any invertible \(A\). At each step it picks, from everything it can currently reach, the guess that makes the leftover error \(\|b - Ax\|_2\) as small as possible. To compare candidates cleanly it keeps a tidy set of perpendicular reference vectors (built by the Arnoldi process, basically Gram-Schmidt on the Krylov vectors). The catch is the bill: GMRES has to remember every one of those reference vectors and line up each new one against all the old ones, so both memory and work climb with every step. The standard fix is restarted GMRES, GMRES(\(m\)): take \(m\) steps, throw the saved vectors away, and start fresh from where you are. A small \(m\) keeps memory flat but can stall — it is a knob you tune.
MINRES (Minimal RESidual) is the specialization for \(A\) symmetric but possibly indefinite. Symmetry lets the Arnoldi process collapse into a short three-term recurrence (the Lanczos process): each new basis vector only needs the previous two, not all of them. So MINRES, like CG, runs in fixed memory per step — no growing storage, no restarts. The rule of thumb is a small decision tree, captured above: symmetric and positive definite leads to CG; symmetric but indefinite leads to MINRES; anything else leads to GMRES.
| Method | Requires | Memory per step | Minimizes |
|---|---|---|---|
| CG | SPD | Constant (few vectors) | \(A\)-norm of error |
| MINRES | Symmetric (any signs) | Constant (3-term recurrence) | Residual \(\|b-Ax\|_2\) |
| GMRES | Any invertible \(A\) | Grows with #steps | Residual \(\|b-Ax\|_2\) |
import numpy as np
from scipy.sparse.linalg import gmres, minres
# nonsymmetric system -> GMRES
A = np.array([[2., 1., 0.],
[0., 3., 1.],
[1., 0., 4.]]) # A != A.T
b = np.array([1., 2., 3.])
x, info = gmres(A, b, rtol=1e-10)
print("GMRES:", x, "residual", np.linalg.norm(A @ x - b))
# symmetric indefinite system -> MINRES
S = np.array([[0., 1.],
[1., 0.]]) # symmetric, eigenvalues +1 and -1
c = np.array([1., 1.])
y, _ = minres(S, c, rtol=1e-10)
print("MINRES:", y, "residual", np.linalg.norm(S @ y - c))GMRES memory grows every iteration. On a large problem that converges slowly, un-restarted GMRES can quietly exhaust RAM. Always set a restart parameter (restart=m in SciPy) on big systems, and pair it with a preconditioner — restarting a poorly-conditioned system without one can stagnate completely.
1.14 — Householder Reflections and Gram-Schmidt: Two Ways to Build QR
QR factorization — writing \(A = QR\) with \(Q\) orthonormal and \(R\) upper-triangular — is the workhorse behind least squares, eigenvalue algorithms, and the Arnoldi/GMRES machinery above. But how you compute \(Q\) matters enormously for numerical accuracy. There are two classic routes, and one of the most instructive cautionary tales in numerical linear algebra lives here.
Gram-Schmidt is the route everyone learns first. Take the columns of \(A\) one at a time, and subtract off their projections onto the directions already fixed, leaving a piece orthogonal to all of them; normalize, repeat. Geometrically: each new column is “straightened” to be perpendicular to the subspace built so far.
The trouble is the classical version (CGS) subtracts all projections using the original vector. In floating point, tiny rounding errors mean the computed vectors drift away from true orthogonality, and the errors compound. The modified version (MGS) fixes this with a one-line change: subtract each projection as you go, updating the working vector before computing the next projection. Mathematically identical, numerically far better.
import numpy as np
def mgs_qr(A):
A = A.astype(float).copy()
m, n = A.shape
Q = np.zeros((m, n))
R = np.zeros((n, n))
for j in range(n):
v = A[:, j].copy()
for i in range(j): # MGS: project against
R[i, j] = Q[:, i] @ v # the UPDATED v,
v = v - R[i, j] * Q[:, i] # not the original column
R[j, j] = np.linalg.norm(v)
Q[:, j] = v / R[j, j]
return Q, R
A = np.array([[1., 1., 0.], [1., 0., 1.], [0., 1., 1.]])
Q, R = mgs_qr(A)
print("orthogonality error:", np.abs(Q.T @ Q - np.eye(3)).max())
print("reconstruction err :", np.abs(Q @ R - A).max())Householder reflections take a completely different, more robust route. A Householder reflection is a matrix \(H = I - 2\,vv^\top / (v^\top v)\) that reflects vectors across a hyperplane. The trick: for any column, you can choose \(v\) so that \(H\) reflects that column onto a single axis — zeroing out every entry below the diagonal in one shot. Apply such reflections column by column and you grind \(A\) down to upper-triangular \(R\); the product of the reflections is \(Q\).
The picture for one reflection: the vector \(x\) is mirrored across a plane so it lands on the first coordinate axis, with its length preserved.
Why prefer Householder? Because reflections are exactly orthogonal by construction — applying them never amplifies error. The resulting \(Q\) is orthonormal to machine precision regardless of how nasty \(A\) is. This is why LAPACK and numpy.linalg.qr use Householder (or the related Givens rotations) under the hood, not Gram-Schmidt.
| Classical GS | Modified GS | Householder | |
|---|---|---|---|
| Idea | Subtract all projections at once | Subtract projections incrementally | Reflect columns onto axes |
| Orthogonality of \(Q\) | Poor (errors compound) | Good | Excellent (machine precision) |
| Cost | \(\sim 2mn^2\) | \(\sim 2mn^2\) | \(\sim 2mn^2 - \tfrac{2}{3}n^3\) |
| Builds \(Q\) column-by-column | Yes | Yes | No (implicit as product) |
| Used in practice | Rarely | Arnoldi/GMRES (cheap, incremental) | LAPACK default |
Classical Gram-Schmidt looks correct and passes on small well-conditioned examples, then silently produces a \(Q\) whose columns are visibly non-orthogonal on nearly-dependent data. If you must build \(Q\) explicitly column-by-column, use modified Gram-Schmidt — or just call a library QR, which uses Householder. Never ship classical GS for production least-squares.
1.15 — Levenberg-Marquardt: Interpolating Between Two Bad Options
Everything so far solves linear systems. But fitting a model to data is usually nonlinear least squares: choose parameters \(\theta\) to minimize the sum of squared residuals
\[ S(\theta) = \sum_i r_i(\theta)^2, \qquad r_i(\theta) = y_i - \text{model}(x_i; \theta), \]
where the model — an exponential decay, a sigmoid, a sum of Gaussians — depends nonlinearly on \(\theta\). There is no closed-form solution; you iterate.
Two classic iterations sit at opposite extremes. Gradient descent steps downhill along \(-\nabla S\). It is robust — it always heads in a sensible direction even when you start far from the answer — but it is slow near the bottom, inching along. Gauss-Newton linearizes the residuals using the Jacobian \(J\) (the matrix of \(\partial r_i / \partial \theta_j\)) and solves the linear least-squares step \(J^\top J\,\delta = -J^\top r\). Near a good solution it converges in a handful of steps, but far away — or when \(J^\top J\) is nearly singular — it can take a wild, divergent leap.
Levenberg-Marquardt (LM) blends the two with a single dial. It solves the damped normal equations
\[ \left(J^\top J + \lambda\, \text{diag}(J^\top J)\right)\delta = -J^\top r. \]
Look at the role of \(\lambda\), the damping parameter:
- \(\lambda \to 0\): the equation becomes Gauss-Newton — fast, trusting the local quadratic model.
- \(\lambda \to \infty\): the \(\lambda\) term dominates, the step shrinks and turns to point straight downhill — it becomes (scaled) gradient descent, small but safe.
LM adapts \(\lambda\) automatically. After each proposed step it checks whether the error actually went down. If yes, it accepts the step and decreases \(\lambda\) (trust the fast Gauss-Newton mode more). If no, it rejects the step and increases \(\lambda\) (back off toward cautious downhill steps), then tries again. It is a self-tuning interpolation between caution and speed.
flowchart TD
A["Start with params theta, damping lambda"] --> B["Compute residuals r and Jacobian J"]
B --> C["Solve (JᵀJ + lambda·diag(JᵀJ)) delta = -Jᵀr"]
C --> D{"Error S(theta+delta)<br/>< S(theta)?"}
D -->|yes: step helped| E["Accept: theta = theta + delta<br/>decrease lambda (trust Gauss-Newton)"]
D -->|no: step hurt| F["Reject step<br/>increase lambda (toward gradient descent)"]
E --> G{"converged?"}
F --> C
G -->|no| B
G -->|yes| H["return theta"]
A concrete fit: recover the rate \(\theta\) in \(y = e^{-\theta x}\) from noisy data, watching \(\lambda\) do its work.
import numpy as np
rng = np.random.default_rng(0)
x = np.linspace(0, 3, 25)
theta_true = 1.4
y = np.exp(-theta_true * x) + 0.02 * rng.standard_normal(x.size)
def resid(t): return y - np.exp(-t * x) # r_i(theta)
def jac(t): return (x * np.exp(-t * x))[:, None] # dr_i/dtheta (n x 1)
theta = np.array([0.2]) # deliberately poor start
lam = 1e-2
for it in range(50):
r = resid(theta)
J = jac(theta)
JtJ = J.T @ J
g = J.T @ r
step = np.linalg.solve(JtJ + lam * np.diag(np.diag(JtJ)), -g)
if np.sum(resid(theta + step)**2) < np.sum(r**2):
theta = theta + step # step helped: accept, trust more
lam *= 0.5
else:
lam *= 2.0 # step hurt: reject, be cautious
if np.linalg.norm(step) < 1e-10:
break
print("recovered theta:", theta[0], " true:", theta_true)In practice you reach for scipy.optimize.least_squares(..., method='lm') or curve_fit, which wrap a battle-tested LM with analytic or finite-difference Jacobians. LM is the default engine behind most curve-fitting you will ever do.
LM finds a local minimum, and nonlinear least squares is full of them. A bad starting guess can land you in the wrong valley. Give the optimizer a sensible initial \(\theta\), scale your parameters so they have comparable magnitudes (the \(\text{diag}(J^\top J)\) damping helps but does not cure bad scaling), and refit from several starts if the fit looks wrong.
1.16 — Condition Number, Stability, and the Tricks That Save a Computation
Every computation above runs in finite-precision arithmetic, where numbers carry roughly 16 significant digits (double precision). Usually that is plenty. Sometimes it is catastrophically not — and the condition number tells you which case you are in before you get burned.
Intuitively, the condition number of a problem measures how much the output can amplify a small wiggle in the input. A problem is well-conditioned if small input changes cause small output changes, and ill-conditioned if a tiny input perturbation can swing the answer wildly. This is a property of the problem, separate from the algorithm used to solve it.
For solving \(Ax = b\), the condition number of \(A\) is
\[ \kappa(A) = \|A\|\,\|A^{-1}\| = \frac{\sigma_{\max}}{\sigma_{\min}}, \]
the ratio of the largest to smallest singular value. The practical reading: if \(\kappa(A) \approx 10^k\), you can lose about \(k\) digits of accuracy in the solution. With 16 digits to start and \(\kappa = 10^{12}\), you keep only about 4 — the rest is noise. A matrix with \(\sigma_{\min} = 0\) is singular (\(\kappa = \infty\)); a matrix with \(\sigma_{\min}\) tiny is nearly singular and almost as dangerous.
import numpy as np
# A nearly-singular matrix: two almost-parallel rows
A = np.array([[1.0, 1.0],
[1.0, 1.0 + 1e-10]])
b = np.array([2.0, 2.0])
print("condition number:", np.linalg.cond(A)) # ~ 4e10
x = np.linalg.solve(A, b)
b2 = b + np.array([0.0, 1e-8]) # wiggle b by 1e-8
x2 = np.linalg.solve(A, b2)
print("tiny input change:", np.linalg.norm(b2 - b)) # 1e-8
print("huge output swing:", np.linalg.norm(x2 - x)) # ~ 100, amplified ~1e10xA near-zero input nudge produces a giant change in \(x\): the condition number \(\approx 4\times10^{10}\) shows up directly as the amplification factor. No algorithm can rescue an ill-conditioned problem — the information is genuinely lost in the input. The cure is reformulation (regularize, rescale, use better-posed equations), not a cleverer solver.
Two everyday tricks fight conditioning damage at the algorithm level.
Pivoting keeps Gaussian elimination stable. When eliminating a column, dividing by a tiny pivot element multiplies every downstream error by a huge factor. Partial pivoting simply swaps rows so the largest-magnitude entry sits on the diagonal before each elimination step — keeping the multipliers \(\le 1\) and the errors bounded. It costs almost nothing, and it is why scipy.linalg.lu and every production solver pivot by default. The lesson: never divide by a small number when a larger one is available by reordering.
Log-sum-exp is the same principle in a different disguise, and it is everywhere in machine learning. The softmax and cross-entropy loss need \(\log \sum_i e^{z_i}\). But \(e^{1000}\) overflows to infinity and \(e^{-1000}\) underflows to zero, even though the answer is a perfectly ordinary number. The fix uses a clean identity: subtract the max \(m = \max_i z_i\) before exponentiating, then add it back.
\[ \log \sum_i e^{z_i} = m + \log \sum_i e^{z_i - m}, \qquad m = \max_i z_i. \]
Now the largest exponent is \(e^0 = 1\) — no overflow — and at least one term is exactly \(1\), so the sum is never zero — no underflow from \(\log(0)\). The result is mathematically identical, numerically bulletproof.
import numpy as np
def logsumexp_naive(z):
return np.log(np.sum(np.exp(z)))
def logsumexp_stable(z):
m = np.max(z)
return m + np.log(np.sum(np.exp(z - m))) # shift down, then back up
z = np.array([1000., 1001., 1002.])
print("naive :", logsumexp_naive(z)) # inf (exp(1000) overflows)
print("stable:", logsumexp_stable(z)) # 1002.407... correctflowchart LR
A["raw scores z<br/>(may be huge/tiny)"] --> B["subtract max:<br/>z - m"]
B --> C["exp: all values in (0, 1]<br/>no overflow"]
C --> D["sum, then log<br/>at least one term = 1, no log(0)"]
D --> E["add m back<br/>exact answer, safe"]
The through-line of this whole section: at scale, how you compute is as important as what you compute. Avoid forming inverses, prefer orthogonal transforms, respect the condition number, and reach for the stabilizing identity (pivot, shift, damp) before the hardware reaches its limits.
A low training loss or a clean residual can hide an ill-conditioned problem — the numbers look fine until the inputs shift slightly and the answer explodes. Check np.linalg.cond on matrices you invert or solve against, watch for singular values spanning many orders of magnitude, and never write your own softmax/cross-entropy without the max-subtraction trick. The library versions (scipy.special.logsumexp, framework log_softmax) already do it.
1.17 — Change of basis & diagonalization
Intuition first: a vector is a fixed arrow in space, but its coordinates depend on the ruler you measure it with. Describe a city block in “north/east” units or in “along-the-street/across-the-street” units and you get different numbers for the same place. A basis is just your chosen set of measuring axes, and change of basis is the act of re-expressing the same vector in a more convenient set of axes. The whole point of eigenvectors and PCA is to find the basis where the problem is easy.
Precisely: if the columns of a matrix \(P\) are a new set of basis vectors, then a vector with coordinates \(\mathbf{x}\) in the standard basis has coordinates \(P^{-1}\mathbf{x}\) in the new basis, and a transformation \(A\) becomes \(P^{-1}AP\) in those new coordinates.
\[A' = P^{-1} A P\]
In words: to apply \(A\) while working in a different coordinate system, translate into that system (\(P^{-1}\)), do the transformation there (\(A\) is now \(A'\)), and translate back (\(P\)). Also written: \(A = P A' P^{-1}\) — solving the same relation the other way; \(A\) and \(A'\) are called similar matrices and share eigenvalues, trace, and determinant.
The magic happens when \(P\)’s columns are the eigenvectors of \(A\). Then \(A' = P^{-1}AP = \Lambda\) is diagonal, holding the eigenvalues. This is diagonalization \(A = P\Lambda P^{-1}\): in the eigenvector basis, the transformation is pure scaling — every messy off-diagonal coupling vanishes.
\[A = P\,\Lambda\,P^{-1}, \qquad \Lambda = \mathrm{diag}(\lambda_1,\dots,\lambda_n)\]
In words: rewrite the matrix in its own eigenvector coordinates and it collapses to a list of stretch factors on the diagonal. Also written: for a symmetric \(A\) the eigenvectors are orthonormal, so \(P\) is orthogonal and \(A = Q\Lambda Q^\top\) (the spectral theorem from §1.4) — the inverse is just a transpose.
Why this matters: powers become trivial. Computing \(A^{10}\) directly is ten matrix multiplies, but in the eigenbasis \(A^{k} = P\Lambda^{k}P^{-1}\), and \(\Lambda^k\) just raises each diagonal eigenvalue to the \(k\). This is the engine behind Markov-chain steady states, PageRank’s repeated link-following, and analyzing whether a recurrent system explodes or settles.
import numpy as np
A = np.array([[2., 1.], [1., 2.]])
vals, P = np.linalg.eig(A) # eigenvalues (Λ) and eigenvectors (P)
Lam = np.diag(vals)
print(np.round(P @ Lam @ np.linalg.inv(P), 3)) # reconstructs A
# Cheap matrix power via the eigenbasis: A^5 = P Λ^5 P⁻¹
A5 = P @ np.diag(vals**5) @ np.linalg.inv(P)
print(np.round(A5, 1)) # same as np.linalg.matrix_power(A, 5)“Find the right basis” is the secret plot of half this chapter. PCA rotates into the basis where features are uncorrelated; diagonalization rotates into the basis where a transformation is pure scaling; SVD does it even for non-square maps. Hard problem? Often the fix is a better set of axes.
1.18 — Gram matrices, kernels & the kernel trick
Here is a small object with outsized importance. Given a set of vectors stacked as the columns of \(X\), the Gram matrix is \(G = X^\top X\) — every entry \(G_{ij}\) is the dot product \(\mathbf{x}_i\cdot\mathbf{x}_j\), the alignment of example \(i\) with example \(j\). Intuition: it is the full table of “how similar is everything to everything else,” built entirely from dot products.
\[G = X^\top X, \qquad G_{ij} = \mathbf{x}_i^\top \mathbf{x}_j\]
In words: the Gram matrix collects the pairwise dot products of a set of vectors into one symmetric grid of similarities. Also written: \(G = \sum_{\text{features}} (\text{column outer products})\); and \(G\) is always symmetric positive semi-definite (\(\mathbf{z}^\top G\mathbf{z} = \|X\mathbf{z}\|^2 \ge 0\)), which is exactly why Cholesky (§1.6) applies to it.
Gram matrices are everywhere: they are the covariance matrix (up to centering and scaling, Chapter 7), they appear as \(A^\top A\) in the normal equations (§1.9), and the style loss in neural style transfer compares the Gram matrices of feature maps to match texture.
The deep payoff is the kernel trick. Many algorithms — ridge regression, SVMs, PCA — touch the data only through dot products \(\mathbf{x}_i\cdot\mathbf{x}_j\), never the raw vectors. So if we secretly replace each dot product with a kernel function \(k(\mathbf{x}_i,\mathbf{x}_j)\) that equals the dot product in some richer, higher-dimensional feature space, we get all the power of that space without ever computing coordinates in it.
\[k(\mathbf{x}_i,\mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)\]
In words: a kernel is a similarity score that equals a dot product in some (possibly enormous) transformed space \(\phi(\cdot)\) — but you compute the score directly, never the transform. Also written: the popular RBF/Gaussian kernel \(k(\mathbf{x},\mathbf{y}) = \exp(-\gamma\|\mathbf{x}-\mathbf{y}\|^2)\) corresponds to an infinite-dimensional \(\phi\), yet is one cheap formula.
A tiny worked example shows the trick paying off. Take 1-D points and the kernel \(k(x,y)=(xy+1)^2\). Expanding, \((xy+1)^2 = x^2y^2 + 2xy + 1 = \phi(x)^\top\phi(y)\) with \(\phi(x)=[x^2,\sqrt2\,x,1]\). So computing one cheap product \(k(2,3)=(6+1)^2=49\) is identical to lifting both points into 3-D and taking the dot product there — we got a quadratic feature space for the cost of a multiply and a square.
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_circles
# Two rings: NOT linearly separable in 2-D, trivial with an RBF kernel
X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)
clf = SVC(kernel="rbf", gamma=1.0).fit(X, y) # kernel trick: no explicit lift
print("train accuracy:", clf.score(X, y)) # ~1.0
# The Gram (kernel) matrix the SVM works through, by hand
G = X @ X.T # linear Gram matrix
print(G.shape, np.allclose(G, G.T)) # (200, 200) True (symmetric)Whenever an algorithm can be written using only dot products between examples, it can be “kernelized” — swap the dot product for a kernel and you gain nonlinear power for free. That single observation turns linear regression into kernel ridge regression and the linear separator into the kernel SVM of Chapter 9.
1.19 — Span, basis, linear independence & dimension
Everything in this chapter rests on one quiet foundation: the idea of a vector space and the handful of words that describe its structure. The chapter has used “linear combination”, “subspace”, “rank”, and “basis” freely — this section makes them precise, because they are the grammar the rest of the math is written in.
Intuition first. Think of a few arrows pinned at the origin. The span of those arrows is every place you can reach by scaling and adding them — the whole region they can “fill in”. Two arrows pointing in different directions in a plane span the entire plane: any point is some mix of them. But two arrows pointing along the same line span only that line — the second one is redundant, it tells you nothing new.
That redundancy is exactly linear dependence. A set of vectors is linearly independent when no one of them is a linear combination of the others — each pulls in a genuinely new direction. Formally, \(\{\mathbf{v}_1,\dots,\mathbf{v}_k\}\) are independent when the only way to combine them to zero is the trivial way:
\[c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k = \mathbf{0} \;\;\Longrightarrow\;\; c_1=c_2=\cdots=c_k=0.\]
In words: the only mix of these vectors that cancels out to nothing is the one where every weight is zero — none of them is secretly a copy of the others. Also written: stacking the vectors as columns of a matrix \(V\), independence means \(V\mathbf{c}=\mathbf{0}\) forces \(\mathbf{c}=\mathbf{0}\), i.e. \(V\) has a trivial null space and full column rank.
A basis for a space is a set of vectors that is both independent (no waste) and spanning (reaches everything) — the minimal complete set of axes. The number of vectors in any basis is the dimension of the space, an honest count of its independent directions. The familiar \([1,0]\) and \([0,1]\) are the standard basis for the plane, but \(\{[1,1],[1,-1]\}\) is an equally valid basis — the same plane, different rulers (which is precisely the change-of-basis story of §1.17).
This ties together threads already running through the chapter. The rank of a matrix (§1.8) is the dimension of its column space — the number of independent columns, i.e. the size of a basis for what it can output. Multicollinearity is just linearly dependent feature columns, which is why it collapses the rank and breaks the inverse. The null space (§1.9) is the space of weight combinations that the columns send to zero — nonzero only when the columns are dependent. And finding a good basis — eigenvectors, principal components, singular vectors — is the recurring punchline of §1.5, §1.17, and Chapter 7.
A worked check: are \(\mathbf{v}_1=[1,2,3]\), \(\mathbf{v}_2=[2,4,6]\), \(\mathbf{v}_3=[1,0,1]\) independent? No — \(\mathbf{v}_2 = 2\mathbf{v}_1\), so they collapse to two independent directions. The span is a 2-D plane inside 3-D space, the rank is 2, and a basis is any two of the non-parallel ones.
import numpy as np
V = np.array([[1., 2., 1.],
[2., 4., 0.],
[3., 6., 1.]]) # columns are v1, v2, v3
print(np.linalg.matrix_rank(V)) # 2 -> only 2 independent directions
# v2 is exactly 2*v1: a dependent column. Confirm the dependence:
print(np.allclose(V[:, 1], 2 * V[:, 0])) # True
# Number of independent columns == dimension of the column space (span)
# A basis: keep the linearly independent columns
_, _, Vt = np.linalg.svd(V)
print("nonzero singular directions =", np.linalg.matrix_rank(V)) # 2“How many independent directions does my data really have?” is the question behind rank, PCA’s component count, and an embedding’s effective dimensionality. Features can number in the thousands while the true dimension — the size of a basis that captures them — is far smaller. Spotting that gap is the whole motivation for dimensionality reduction.
🎮 Try it — Span & Basis
1.20 — Outer products & rank-1 structure
The dot product takes two vectors and returns one number — it collapses. Its mirror image, the outer product, takes two vectors and returns a whole matrix — it expands. Intuition: where the dot product asks “how aligned are these?”, the outer product builds the full multiplication table of one vector’s entries against the other’s.
For a column vector \(\mathbf{u}\) of length \(m\) and \(\mathbf{v}\) of length \(n\), the outer product \(\mathbf{u}\mathbf{v}^\top\) is the \(m\times n\) matrix whose \((i,j)\) entry is \(u_i v_j\):
\[\mathbf{u}\mathbf{v}^\top = \begin{bmatrix} u_1 \\ u_2 \\ u_3 \end{bmatrix}\begin{bmatrix} v_1 & v_2 \end{bmatrix} = \begin{bmatrix} u_1v_1 & u_1v_2 \\ u_2v_1 & u_2v_2 \\ u_3v_1 & u_3v_2 \end{bmatrix}\]
In words: every entry of the result is one entry of \(\mathbf{u}\) times one entry of \(\mathbf{v}\) — the full grid of pairwise products. Also written: \((\mathbf{u}\mathbf{v}^\top)_{ij} = u_i v_j\); note \(\mathbf{u}^\top\mathbf{v}\) (a scalar, the dot product) and \(\mathbf{u}\mathbf{v}^\top\) (a matrix, the outer product) are opposite operations — the transpose lands on the other vector.
The crucial property: every outer product has rank 1. All its columns are scalar multiples of the single vector \(\mathbf{u}\) (column \(j\) is just \(v_j\,\mathbf{u}\)), so it carries exactly one independent direction. A rank-1 matrix is the simplest non-trivial matrix there is — one direction in, one direction out, a single scaling between them.
That is why rank-1 outer products are the atoms of the big decompositions. SVD (§1.5) writes any matrix as a weighted sum of rank-1 outer products, \(A = \sum_i \sigma_i\,\mathbf{u}_i\mathbf{v}_i^\top\), and keeping only the largest terms is low-rank approximation. Matrix factorization for recommenders (§1.7) is the same idea: the prediction \(\hat R = WH\) is a sum of \(k\) rank-1 layers, one per latent factor.
Outer products also show up the moment you take a gradient. The gradient of a linear layer’s loss with respect to its weight matrix is an outer product of the input vector and the upstream error vector — which is exactly why a single backprop step updates the whole weight matrix at once (Chapter 14, Neural Networks). Covariance estimates are averaged outer products of centered data, \(\frac{1}{n}\sum_i (\mathbf{x}_i-\bar{\mathbf{x}})(\mathbf{x}_i-\bar{\mathbf{x}})^\top\) — the link to the Gram matrix of §1.18.
import numpy as np
u = np.array([1., 2., 3.])
v = np.array([4., 5.])
M = np.outer(u, v) # 3x2 outer product
print(M) # [[4 5] [8 10] [12 15]]
print(np.linalg.matrix_rank(M)) # 1 -> always rank 1
# A rank-2 matrix is a SUM of two rank-1 outer products
A = np.outer([1,0,1], [1,1]) + np.outer([0,1,0], [2,-1])
print(np.linalg.matrix_rank(A)) # 2
# This is exactly how SVD rebuilds a matrix, one rank-1 layer at a time
B = np.array([[3.,1.,1.],[1.,3.,1.]])
U, s, Vt = np.linalg.svd(B, full_matrices=False)
recon = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(len(s)))
print(np.allclose(recon, B)) # TrueRead \(\mathbf{u}\mathbf{v}^\top\) as “one direction repeated, scaled across the grid”. Once you see that SVD, PCA, recommender factorization, and the weight-gradient are all sums of rank-1 outer products, those four topics stop being separate tricks and become one idea: build complicated matrices out of the simplest possible building block.
1.21 — Quick reference
| Term / formula | Meaning in one line | When / why it matters |
|---|---|---|
| Dot product \(\mathbf{v}\cdot\mathbf{w}=\|\mathbf{v}\|\|\mathbf{w}\|\cos\theta\) | How much two vectors align | Cosine similarity, the score inside every neuron, kernels |
| Matrix–vector \(A\mathbf{x}\) | A transformation: linear combo of \(A\)’s columns | One layer of a network applied to a batch |
| Determinant \(\det A\) | Volume-scaling factor; \(0\Rightarrow\) singular | Tells you if \(A\) is invertible / collapses space |
| Inverse \(A^{-1}\) | Undoes \(A\); solves \(A\mathbf{x}=\mathbf{b}\) | Use solve, never form inv (cost + instability) |
| Eigen \(A\mathbf{v}=\lambda\mathbf{v}\) | Directions of pure scaling, factor \(\lambda\) | PCA axes, PageRank, training stability |
| SVD \(A=U\Sigma V^\top\) | Rotate–scale–rotate of any matrix | Best low-rank approx: compression, PCA, recommenders |
| LU / QR / Cholesky | Factor \(A\) into easy-to-solve pieces | LU general, QR stable least-squares, Cholesky for SPD |
| Matrix factorization \(R\approx WH\) | Learn \(k\) latent factors filling a sparse matrix | Recommenders; latent-space models |
| Norms \(\|\mathbf{v}\|_1,\|\mathbf{v}\|_2\) | Taxicab vs. straight-line size | L1 = sparse (lasso), L2 = smooth (ridge) |
| Rank | Count of independent directions | Exposes redundancy / multicollinearity |
| Condition number \(\kappa=\sigma_{\max}/\sigma_{\min}\) | How much error gets amplified | Large \(\kappa\) ⇒ lose digits; regularize or rescale |
| Normal equations \(\hat{\mathbf{x}}=(A^\top A)^{-1}A^\top\mathbf{b}\) | Least-squares fit as a projection | Closed-form linear regression (prefer QR/SVD) |
| Conjugate gradient | Iterative SPD solver, matrix-free | Huge sparse systems where inverting is impossible |
| Tensor (rank-\(n\) array) | Generalized matrix with \(n\) axes | Batched images / sequences in deep learning |
| Gram matrix \(G=X^\top X\) | Table of pairwise dot products | Kernel trick: nonlinear power from dot products |
| Outer product \(\mathbf{u}\mathbf{v}^\top\) | Rank-1 matrix, the atom of decompositions | Building block of SVD, factorization, weight gradients |
| Diagonalization \(A=P\Lambda P^{-1}\) | Re-express \(A\) as pure scaling in eigenbasis | Cheap matrix powers, Markov steady states |
1.22 — Key takeaways
- Data is vectors, models are matrices. Examples are vectors of features; a model layer is a matrix that transforms them. The dot product — alignment of two vectors — is the atomic operation underneath neurons, similarity, and kernels.
- A matrix is a transformation. Multiplication chains transformations; the determinant measures volume scaling and flags non-invertibility (zero ⇒ singular); shapes must conform.
- Eigen/SVD reveal the natural axes. Eigenvectors are directions of pure scaling; SVD (\(A=U\Sigma V^\top\)) extends this to any matrix and yields the best low-rank approximation — the basis of PCA, compression, and recommenders.
- Factorizations are the compute layer. LU solves general systems, QR is the stable least-squares workhorse, Cholesky is the fast path for covariance matrices.
- Norms, rank, and orthogonality are the measuring tools. L1 vs. L2 norms drive sparse vs. smooth regularization; rank exposes redundancy; orthogonal matrices transform without distorting error.
- Least squares is a projection. Regression projects \(\mathbf{b}\) onto the column space; the residual is orthogonal. Prefer QR/SVD solvers over the normal equations for stability.
- Tensors generalize matrices to the n-dimensional arrays deep learning runs on — and
.shapeis your best debugging friend. - Independence, span, and basis are the grammar. A basis is the minimal set of independent directions that spans a space; its size is the dimension, and the matrix rank counts exactly those directions — the same idea behind multicollinearity, null spaces, and dimensionality reduction.
- Rank-1 outer products are the atoms. \(\mathbf{u}\mathbf{v}^\top\) is the simplest matrix; SVD, PCA, recommender factorization, and weight gradients are all just weighted sums of these building blocks.
- The right basis makes hard problems easy. Change of basis and diagonalization (\(A=P\Lambda P^{-1}\)) turn coupled transformations into pure scaling; it is the same plot as PCA and SVD — rotate into friendlier axes.
- Everything similarity-based runs on dot products. The Gram matrix collects pairwise dot products; the kernel trick swaps that dot product for a kernel to buy nonlinear power without leaving linear algebra.
1.23 — See also
- Calculus & Differentiation — gradients and the chain rule that train the transformations defined here.
- Optimization — how eigenvalues of the Hessian and conditioning govern convergence.
- Probability & Statistics — covariance matrices, where symmetric eigendecomposition lives.
- Dimensionality Reduction — PCA built directly on eigenvectors and SVD.
- Regression — the normal equations and least squares put to work.
- Recommender Systems — matrix factorization scaled up to production.
- Neural Networks (Core) — layers as matrix multiplications, in bulk.
↪ The thread continues → Chapter 02 · ∂ Calculus & Differentiation
You now hold the objects of machine learning — vectors and matrices — but they just sit there. What makes them move toward a better answer is the mathematics of change, which the next chapter turns into the engine of all learning.