ML Simplified
Linear Algebra
| Piece | In plain words |
|---|---|
| a, b | the two arrows (lists of numbers) you're combining |
| aᵢ + bᵢ | add the matching slots, one position at a time |
| result | a new arrow — the tip-to-tail diagonal of the two |
Stack two arrows tip-to-tail; where you end up is the sum. Order doesn’t matter.
[1,2] + [3,1] = [4,3]. Reference: 3Blue1Brown, Essence of Linear Algebra Ch.1.
Summing gradient contributions across a batch; moving an embedding by a ‘direction’ (king − man + woman ≈ queen).
| Piece | In plain words |
|---|---|
| a − b | subtract matching slots |
| result | the arrow pointing FROM b TO a |
The arrow that gets you from one point to another.
[4,3] − [3,1] = [1,2].
Error vectors (prediction − target); displacement between two GPS points; embedding differences that capture relationships.
| Piece | In plain words |
|---|---|
| a₁b₁ + a₂b₂ + … | multiply matching slots, then add them all up |
| ‖a‖ ‖b‖ | the two arrows' lengths multiplied |
| cos θ | how aligned they are: 1 = same way, 0 = perpendicular, −1 = opposite |
| result | one number: how much the two arrows point the same way |
A single number measuring how much two vectors point the same way. Big = aligned, 0 = unrelated, negative = opposite.
[1,2]·[3,1] = 1·3 + 2·1 = 5. Cosine similarity is the normalised dot product.
Every neuron computes a dot product of weights and inputs; attention scores are query·key dot products; recommender similarity.
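
A minimal pure-Python sketch of the dot product and its normalised form, cosine similarity. The helper names `dot` and `cosine_similarity` are illustrative, not from any library.

```python
import math

def dot(a, b):
    """Multiply matching slots, then add them all up."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Normalised dot product: divide by the product of the two lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [1, 2], [3, 1]
print(dot(a, b))                # 5, matching the worked example
print(cosine_similarity(a, b))  # roughly 0.707, i.e. about 45 degrees apart
```

A neuron's pre-activation is the same `dot` call applied to its weights and inputs, plus a bias.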
| Piece | In plain words |
|---|---|
| a × b | a NEW arrow at right angles to both inputs (3D only) |
| ‖a‖ ‖b‖ sin θ | its length = area of the parallelogram a and b form |
| direction | perpendicular to the plane of a and b (right-hand rule) |
Multiply two 3D arrows to get a third arrow at right angles to both — the ‘normal’ to their plane.
[1,0,0] × [0,1,0] = [0,0,1].
Surface normals in 3D graphics and robotics; torque and angular momentum in physics engines; camera orientation.
| Piece | In plain words |
|---|---|
| \|xᵢ\| | drop the minus signs: take each number's size |
| sum | add up all those sizes |
| result | total 'city-block' distance |
Total distance if you can only move along grid streets, never diagonally.
‖[3,−4]‖₁ = 3 + 4 = 7.
L1 (Lasso) regularisation pushes weights to exactly zero → automatic feature selection; robust error metric (MAE).
| Part | In plain words |
|---|---|
| ▲ Top: √( x₁² + x₂² + … ) | square each number, add them, then take the square root |
| ▼ Bottom: (under the root) | this is just the Pythagorean theorem in n dimensions |
The ‘as-the-crow-flies’ length of the arrow.
‖[3,−4]‖₂ = √(9+16) = √25 = 5.
L2 / weight decay keeps weights small and smooth; normalising embeddings to unit length; gradient clipping by norm.
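
The L1 and L2 norms can be sketched in a few lines of pure Python. The function names here are illustrative, and `normalise` assumes the input is not the zero vector.

```python
import math

def l1_norm(x):
    """City-block length: add up the absolute values."""
    return sum(abs(v) for v in x)

def l2_norm(x):
    """Straight-line length: Pythagoras in n dimensions."""
    return math.sqrt(sum(v * v for v in x))

def normalise(x):
    """Scale x to unit L2 length (assumes x is not the zero vector)."""
    length = l2_norm(x)
    return [v / length for v in x]

x = [3, -4]
print(l1_norm(x))    # 7
print(l2_norm(x))    # 5.0
print(normalise(x))  # [0.6, -0.8]
```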
| Part | In plain words |
|---|---|
| ▲ Top: a · b | how much a points along b (their dot product) |
| ▼ Bottom: ‖b‖² | b's length squared; divides out b's size so it's a fair ratio |
The shadow one arrow casts on another when light shines straight down on it.
Projecting [2,2] onto the x-axis [1,0] gives [2,0].
PCA projects data onto principal directions; least-squares regression projects targets onto the feature space.
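
A small sketch of the projection formula (a · b / ‖b‖²) b, using illustrative helper names and assuming b is non-zero.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(a, b):
    """Project a onto b: (a·b / ‖b‖²) times b. Assumes b is non-zero."""
    scale = dot(a, b) / dot(b, b)
    return [scale * v for v in b]

shadow = project([2, 2], [1, 0])
print(shadow)  # [2.0, 0.0]

# The leftover part is orthogonal to b, so its dot product with b is zero.
residual = [ai - si for ai, si in zip([2, 2], shadow)]
print(dot(residual, [1, 0]))  # 0.0
```

The zero residual check is the same idea least-squares relies on: the error left after projecting is perpendicular to the direction projected onto.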
| Piece | In plain words |
|---|---|
| a · b = 0 | their dot product is exactly zero |
| ⟂ | the arrows meet at 90° — perpendicular |
| meaning | they share no common direction at all |
Two directions that share no overlap — knowing one tells you nothing about the other.
[1,0]·[0,1] = 0 → the axes are orthogonal.
Orthogonal weight init prevents signal blow-up; orthogonal bases (Fourier, wavelets) for compression; decorrelated features.
| Piece | In plain words |
|---|---|
| row i of A | one horizontal strip from the first matrix |
| column j of B | one vertical strip from the second matrix |
| dot product | multiply them slot-by-slot and sum → cell (i,j) of the answer |
| rule | inner sizes must match: (m×n)(n×p) = (m×p) |
Every output cell is a row·column dot product. It glues two linear transformations together.
[[1,2],[3,4]] · [1,1]ᵀ = [3,7]ᵀ.
The heart of deep learning: every dense layer computes Wx + b, and matrix multiplication dominates training compute, which is why GPUs suit the job so well.
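
The row-times-column rule can be written out directly. This is a teaching sketch with matrices as lists of rows; `matmul` is an illustrative name, and real code would normally call a numerical library instead.

```python
def matmul(A, B):
    """Multiply an (m×n) matrix by an (n×p) matrix, both as lists of rows."""
    if any(len(row) != len(B) for row in A):
        raise ValueError("inner sizes must match: (m×n)(n×p)")
    columns = list(zip(*B))  # the columns of B
    # Cell (i, j) is the dot product of row i of A with column j of B.
    return [[sum(a * b for a, b in zip(row, col)) for col in columns]
            for row in A]

A = [[1, 2], [3, 4]]
x = [[1], [1]]       # a column vector, written as a 2×1 matrix
print(matmul(A, x))  # [[3], [7]]
```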
| Piece | In plain words |
|---|---|
| Aⱼᵢ | swap the row and column index of every entry |
| effect | rows become columns, columns become rows (flip across the diagonal) |
Turn rows into columns and columns into rows.
[[1,2],[3,4]]ᵀ = [[1,3],[2,4]].
Backprop multiplies by Wᵀ to push gradients backward; Gram matrices XᵀX for covariance and style transfer.
| Piece | In plain words |
|---|---|
| A⁻¹ | the 'undo' matrix |
| A⁻¹A = I | applying it after A returns you to where you started (identity) |
| det A ≠ 0 | only invertible if the matrix doesn't squash space flat |
The ‘undo’ matrix — apply it to reverse what A did.
[[2,0],[0,2]]⁻¹ = [[0.5,0],[0,0.5]].
Closed-form regression β̂ = (XᵀX)⁻¹Xᵀy; Kalman filters; in practice we solve systems rather than invert (faster, stabler).
| Piece | In plain words |
|---|---|
| a·d | product of the main diagonal |
| b·c | product of the off-diagonal |
| difference | the signed area/volume scaling factor; 0 = space squashed flat |
How much the transformation stretches or shrinks space. Zero means it squashes everything flat.
det [[1,2],[3,4]] = 1·4 − 2·3 = −2.
Normalising flows use log|det J|; checking invertibility; computing Gaussian densities.
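
For the 2×2 case, the determinant and the 'undo' matrix fit in a few lines. The names `det2` and `inverse2` are illustrative; this sketch does not generalise to larger matrices.

```python
def det2(M):
    """Determinant of a 2×2 matrix [[a, b], [c, d]]: a·d − b·c."""
    (a, b), (c, d) = M
    return a * d - b * c

def inverse2(M):
    """Inverse of a 2×2 matrix; fails when the determinant is zero."""
    (a, b), (c, d) = M
    det = det2(M)
    if det == 0:
        raise ValueError("matrix squashes space flat, so no inverse exists")
    return [[d / det, -b / det], [-c / det, a / det]]

print(det2([[1, 2], [3, 4]]))      # -2
print(inverse2([[2, 0], [0, 2]]))  # [[0.5, 0.0], [0.0, 0.5]]
print(det2([[1, 2], [2, 4]]))      # 0: this rank-1 matrix is not invertible
```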
| Piece | In plain words |
|---|---|
| independent rows | rows that can't be built from the others |
| count | how many genuinely different directions A can reach |
| low rank | lots of redundancy — the matrix is 'simpler' than its size |
How many genuinely different directions the matrix can reach. Low rank = redundancy.
[[1,2],[2,4]] has rank 1 (row 2 = 2 × row 1).
Low-rank factorisation (LoRA!) fine-tunes LLMs with tiny matrices; recommenders factor the ratings matrix; PCA.
| Piece | In plain words |
|---|---|
| Aᵢᵢ | the diagonal entries |
| sum | just add them up |
| = Σ eigenvalues | it also equals the sum of all eigenvalues |
Just add up the diagonal entries.
tr [[1,2],[3,4]] = 1 + 4 = 5.
Simplifies gradient derivations (cyclic identities); trace of covariance = total variance; information measures.
| Piece | In plain words |
|---|---|
| Σ cᵢvᵢ = 0 | a weighted combination of the vectors equals the zero vector |
| all cᵢ = 0 | the ONLY way to get zero is to set every weight cᵢ to zero |
| meaning | none of the vectors is redundant |
No vector in the set can be built from the others — none is wasted.
[1,0] and [0,1] are independent; [1,0] and [2,0] are not.
Independent features avoid multicollinearity (unstable coefficients); guarantee a unique least-squares solution.
| Piece | In plain words |
|---|---|
| e₁ … eₙ | the building-block directions (independent + spanning) |
| cᵢ | the unique coordinates of v in that basis |
| any v | every vector in the space can be built this way |
The smallest set of building-block directions you can construct everything else from.
Standard basis of 3D space: [1,0,0], [0,1,0], [0,0,1].
Embedding dimensions are a learned basis for meaning; Fourier/wavelet bases for compression; good basis = good features.
| Piece | In plain words |
|---|---|
| u+v ∈ S | add two members → still inside |
| c·u ∈ S | scale a member → still inside |
| always contains 0 | every subspace passes through the origin |
A flat ‘slice’ of space — a line or plane through the origin — that stays inside itself.
The x–y plane is a 2D subspace of 3D space.
Data often lies close to a low-dimensional subspace (or, more generally, a curved manifold) inside high-dimensional space, which is the basis of dimensionality reduction.
| Piece | In plain words |
|---|---|
| A v | apply the matrix to a special vector v |
| λ v | the result is just v scaled by a number λ (no rotation) |
| λ | the eigenvalue — the stretch factor along that direction |
The scaling factors along the special directions a matrix doesn’t rotate.
[[2,0],[0,3]] has eigenvalues 2 and 3.
PCA’s eigenvalues = variance per component; PageRank is the top eigenvector; spectral norm bounds training stability.
| Piece | In plain words |
|---|---|
| v | a direction the matrix only stretches, never turns |
| λ | how much it stretches that direction |
| A v = λ v | applying A keeps v on the same line |
The directions a transformation only stretches without rotating.
For [[2,0],[0,3]], the eigenvectors are [1,0] and [0,1].
Principal components are eigenvectors of the covariance matrix; eigenfaces for face recognition; vibration modes.
| Piece | In plain words |
|---|---|
| Q | rotation: the orthonormal eigenvectors |
| Λ | scaling: a diagonal of eigenvalues |
| Qᵀ | rotate back |
Rewrite a symmetric matrix as rotate, scale each axis, then rotate back.
Covariance matrices are symmetric, so they always decompose this way.
Whitening / decorrelating data; matrix square-roots and powers; the math behind PCA and Gaussian processes.
| Piece | In plain words |
|---|---|
| Vᵀ | first rotation (in the input space) |
| Σ | stretch along axes by the singular values |
| U | second rotation (in the output space) |
The universal decomposition: rotate, stretch along axes, rotate again — for ANY matrix.
Keeping the top-k singular values gives the best rank-k approximation (Eckart–Young).
Latent semantic analysis, image compression, Netflix-prize recommenders, and the theory behind low-rank LLM adapters.
| Piece | In plain words |
|---|---|
| Q | orthonormal columns — a clean rotation part |
| R | upper-triangular — the 'bookkeeping' part |
| use | solve A x = b stably without forming A⁻¹ |
Split a matrix into a clean rotation part and a triangular bookkeeping part.
Used to solve A x = b stably; the QR algorithm finds eigenvalues.
Numerically stable least-squares; eigenvalue computation; orthogonalising layers in deep nets.
| Piece | In plain words |
|---|---|
| xᵀ A x | a quadratic 'energy' score for any input x |
| > 0 | positive for every non-zero x, so the surface curves upward everywhere |
| eigenvalues > 0 | equivalent condition: every stretch factor is positive |
A matrix that always curves upward — like a perfect bowl with a single lowest point.
The identity matrix is positive definite; any covariance matrix is at least positive semi-definite.
Guarantees a unique optimum in convex problems; enables Cholesky factorisation; kernel matrices in SVMs and GPs.
| Piece | In plain words |
|---|---|
| scalar | 0 axes — a single number |
| vector / matrix | 1 / 2 axes |
| tensor | 3+ axes — e.g. a batch of colour images |
| contraction | the einsum generalisation of matrix multiply |
Arrays with more than two axes — e.g. a batch of colour images is a 4D tensor.
A batch of 32 RGB images at 224×224 pixels has shape (32, 224, 224, 3) in channels-last layout; channels-first frameworks store it as (32, 3, 224, 224).
Every DL framework (PyTorch, TensorFlow) is a tensor engine; attention runs on 4D tensors; einsum expresses contractions.
Calculus
| Piece | In plain words |
|---|---|
| x → a | let the input creep toward a |
| f(x) → L | the output settles toward L |
| meaning | the value the function heads to, even if it never arrives |
What value a function is heading toward as the input approaches a point.
lim (x→0) sin(x)/x = 1.
Underpins why gradient-descent steps work; learning-rate decay ‘in the limit’; defining derivatives that power backprop.
| Part | In plain words |
|---|---|
| ▲ Top: f(x+h) − f(x) | how much the output changes over a tiny step h |
| ▼ Bottom: h | the size of that tiny step (shrunk toward zero) |
How fast the output changes when you nudge the input — the steepness right here.
d/dx (x²) = 2x; at x=3 the slope is 6.
The gradient (vector of derivatives) tells each weight which way and how much to move to cut the loss.
| Part | In plain words |
|---|---|
| ▲ Top: ∂f | tiny change in the output |
| ▼ Bottom: ∂xᵢ | tiny change in ONE input, holding all the rest frozen |
Change in the output when you wiggle one input and freeze the rest.
f = x²y → ∂f/∂x = 2xy, ∂f/∂y = x².
Nets have millions of parameters; each weight update uses the partial derivative of the loss w.r.t. that weight.
| Piece | In plain words |
|---|---|
| g′(x) | slope of the inner function |
| f′(g(x)) | slope of the outer function, measured at g(x) |
| multiply | chain the slopes together, outside-in |
For nested functions, multiply the slopes of each layer together, outside-in.
d/dx sin(x²) = cos(x²) · 2x.
Backpropagation IS the chain rule applied layer by layer — the single most important formula in deep learning.
| Piece | In plain words |
|---|---|
| each ∂f/∂xᵢ | the slope in one input direction |
| vector | bundle all the slopes into one arrow |
| ∇f | points toward the steepest increase |
An arrow pointing toward the fastest increase; step opposite to go downhill.
f = x² + y² → ∇f = [2x, 2y].
Gradient descent moves parameters along −∇Loss; the whole training loop is ‘compute gradient, take a step’.
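
The 'compute gradient, take a step' loop can be sketched on the example f = x² + y². The step size of 0.1 and the 100 iterations are arbitrary choices for illustration.

```python
def grad_f(point):
    """Gradient of f(x, y) = x² + y², which is [2x, 2y]."""
    x, y = point
    return [2 * x, 2 * y]

def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient."""
    point = list(start)
    for _ in range(steps):
        g = grad(point)
        point = [p - learning_rate * gi for p, gi in zip(point, g)]
    return point

print(grad_f([3, -4]))  # [6, -8]
minimum = gradient_descent(grad_f, start=[3.0, -4.0])
print(minimum)          # both coordinates end up very close to 0
```

On this bowl each step shrinks every coordinate by a factor of 0.8, so the iterate heads straight for the bottom at the origin.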
| Part | In plain words |
|---|---|
| ▲ Top: ∂fᵢ | change in output number i |
| ▼ Bottom: ∂xⱼ | change in input number j, filled in for every (i,j) pair |
A table of how every output reacts to every input.
f(x,y) = [x², xy] → J = [[2x, 0],[y, x]].
Normalising flows use log|det J|; sensitivity analysis; vector-Jacobian products power autodiff backprop.
| Part | In plain words |
|---|---|
| ▲ Top: ∂²f | the SECOND change in the output |
| ▼ Bottom: ∂xᵢ ∂xⱼ | with respect to a pair of inputs, i.e. how the slope itself bends |
Tells you how the slope itself is changing — bowl, dome, or saddle.
f = x² + y² → H = [[2,0],[0,2]] (positive definite → a bowl).
Second-order optimisers (Newton, L-BFGS) use it; its eigenvalues diagnose saddle points that slow training.
| Piece | In plain words |
|---|---|
| f(a) | start from the value at a |
| f′(a)(x−a) | add the slope term (linear correction) |
| ½ f″(a)(x−a)² | add the curvature term (quadratic correction) |
Approximate any smooth curve near a point using a polynomial built from its derivatives.
eˣ ≈ 1 + x + x²/2 near 0.
Justifies gradient descent (first-order) and Newton’s method (second-order); solver approximations.
| Piece | In plain words |
|---|---|
| f(x) dx | a thin slice: height × tiny width |
| ∫ₐᵇ | add up infinitely many slices from a to b |
| result | total accumulated area / quantity / probability |
Add up infinitely many thin slices to get a total — area, accumulated quantity, or probability.
∫₀¹ x dx = ½.
Expected values and probabilities are integrals; the ELBO in VAEs; diffusion models integrate an SDE to generate images.
| Part | In plain words |
|---|---|
| ▲ Top: f(x+h) − f(x−h) | measure the function a little ahead and a little behind |
| ▼ Bottom: 2h | divide by the total gap between those two points |
Estimate a slope by measuring the function at nearby points — no formula needed.
With h=0.01, central difference of x² at 3 ≈ 6.00.
Gradient-checking to verify hand-written backprop; finite-difference sensitivities when analytic gradients are missing.
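
A sketch of the central difference, used here as a gradient check against the chain-rule answer for sin(x²). The test point 1.3 and the default h are arbitrary illustrative choices.

```python
import math

def central_difference(f, x, h=1e-5):
    """Estimate f'(x) from one point ahead and one point behind."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Slope of x² at 3 with h = 0.01.
print(central_difference(lambda x: x * x, 3.0, h=0.01))  # close to 6.0

# Gradient check: compare with the chain-rule answer cos(x²) · 2x.
x = 1.3
analytic = math.cos(x * x) * 2 * x
numeric = central_difference(lambda t: math.sin(t * t), x)
print(abs(analytic - numeric))  # tiny if the hand-derived formula is right
```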
| Piece | In plain words |
|---|---|
| start θ | an initial guess |
| search direction | which way improves the objective (e.g. −gradient) |
| repeat | keep stepping until it stops improving |
Search for the best answer step by step, improving each iteration.
Minimising a non-linear least-squares loss with Levenberg–Marquardt.
Training every neural network; calibrating physics models; portfolio optimisation; hyperparameter search.
Optimization
| Piece | In plain words |
|---|---|
| θ | the knobs the model can change |
| f(θ) | a single score measuring how good those settings are |
| min / max | push that score the best direction |
The single score you’re trying to make as good as possible.
Maximise reward in RL; minimise prediction error in supervised learning.
Defines what ‘good’ means — choosing it wrong (clicks not satisfaction) causes misaligned systems.
| Piece | In plain words |
|---|---|
| y | the true answer |
| ŷ | the model's prediction |
| (y−ŷ)² or −y log ŷ | penalty that grows as the prediction gets worse |
| average | mean penalty over all examples |
A number that’s big when the model is wrong and small when it’s right.
Cross-entropy for classification; MSE for regression; contrastive loss for embeddings.
Loss choice shapes behaviour: MSE punishes outliers hard; cross-entropy suits probabilities; focal loss for imbalance.
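
Two of the losses named above, sketched for tiny inputs. The function names and the example numbers are made up for illustration; `cross_entropy` handles a single example whose label is a class index.

```python
import math

def mse(targets, predictions):
    """Mean squared error: average of (y − ŷ)²."""
    return sum((y - p) ** 2 for y, p in zip(targets, predictions)) / len(targets)

def cross_entropy(true_class, probabilities):
    """Cross-entropy for one example: −log of the probability given to the true class."""
    return -math.log(probabilities[true_class])

print(mse([1.0, 2.0], [1.0, 4.0]))          # 2.0
print(cross_entropy(0, [0.9, 0.05, 0.05]))  # about 0.105: confident and right
print(cross_entropy(0, [0.1, 0.45, 0.45]))  # about 2.303: confident and wrong
```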
| Piece | In plain words |
|---|---|
| convex | the surface is a single smooth valley (f″ ≥ 0) |
| local min | wherever you stop rolling downhill |
| = global min | that stop is guaranteed to be THE best point |
A single smooth valley — wherever you roll downhill you reach the one true bottom.
Linear/logistic regression and SVMs are convex.
Convex problems solve reliably and provably — used in finance, control, and as the well-behaved core of larger systems.
| Piece | In plain words |
|---|---|
| local minima | many 'good enough' valleys |
| saddle points | flat spots that aren't minima |
| no guarantee | you might not find the single deepest valley |
A bumpy mountain range full of dips — you might settle in a ‘good enough’ valley.
Essentially every deep neural network has a non-convex loss surface.
Deep learning works anyway: in high dimensions most local minima tend to be nearly as good as the best one, and SGD noise helps escape bad spots.
| Piece | In plain words |
|---|---|
| ∇Loss(all data) | average advice from every single example |
| η | the learning rate (step size) |
| θ ← θ − … | step downhill once per full pass |
Look at every example, average the advice, then take one careful step. Accurate but slow.
Used when datasets fit in memory and a smooth path matters.
Rare for large data (one step = a full pass); used in classical ML and as a baseline.
| Piece | In plain words |
|---|---|
| one example | a single randomly-picked data point |
| ∇Loss | its noisy gradient |
| step | update after every single example |
Take a step after every example — noisy but fast, and able to escape traps.
Robbins–Monro stochastic approximation (1951).
The noise acts as a regulariser; foundation of modern training, usually replaced by mini-batches in practice.
| Part | In plain words |
|---|---|
| ▲ Top: Σ ∇Loss over B examples | add up the gradients of a small handful of examples |
| ▼ Bottom: B | divide by the batch size to average them |
Look at a small handful each step — stable enough, and perfect for GPU parallelism.
Batch size 256 is a common default.
The actual workhorse of deep learning; batch size trades gradient noise vs hardware throughput.
| Piece | In plain words |
|---|---|
| β v | keep a fraction β of the last step's velocity (90% when β = 0.9) |
| + ∇Loss | add the new gradient |
| θ ← θ − η v | move along the built-up velocity |
Build up speed downhill like a rolling ball — push through bumps and flat spots.
Polyak’s heavy-ball method.
Speeds convergence in ravines; SGD+momentum often beats fancier optimisers for vision.
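
The three table rows translate almost line for line into code. This sketch assumes a single scalar parameter and a toy loss f(θ) = θ²; η = 0.1 and β = 0.9 are illustrative values.

```python
def sgd_momentum(grad, theta, learning_rate=0.1, beta=0.9, steps=300):
    """Heavy-ball update: v ← βv + ∇Loss, then θ ← θ − ηv."""
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(theta)  # keep old speed, add new push
        theta = theta - learning_rate * velocity  # move along the velocity
    return theta

# Toy loss f(θ) = θ², whose gradient is 2θ (an assumption for illustration).
print(sgd_momentum(lambda t: 2 * t, theta=5.0))  # ends very close to 0
```

With these settings the iterate overshoots and swings back a few times before settling, which is the rolling-ball behaviour the description refers to.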
| Piece | In plain words |
|---|---|
| θ − ηβv | peek where momentum is about to carry you |
| gradient there | measure the slope at that future spot |
| correct early | adjust before overshooting |
Peek where momentum is about to carry you and correct early — an anticipatory ball.
Nesterov accelerated gradient (1983).
Slightly faster and more stable than plain momentum; a flag in every framework.
| Part | In plain words |
|---|---|
| ▲ Top: learning rate | the base step size |
| ▼ Bottom: √(Σ g²) | grows with accumulated gradient size → shrinks the step per parameter |
Big steps for rarely updated parameters, small steps for frequently updated ones.
Great for sparse features (text, recommendations).
Strong early, but the ever-growing denominator eventually kills the step — which motivated RMSProp/Adam.
| Part | In plain words |
|---|---|
| ▲ Top: learning rate | base step size |
| ▼ Bottom: √E[g²] | root of a DECAYING average of squared gradients, which forgets old ones |
Like AdaGrad but forgets old gradients, so the step doesn’t die out.
Proposed by Hinton in a Coursera lecture.
Default for RNNs and reinforcement learning; handles non-stationary objectives well.
| Part | In plain words |
|---|---|
| ▲ Top: m̂ (momentum) | a decaying average of gradients: the direction with inertia |
| ▼ Bottom: √v̂ + ε | a decaying average of squared gradients, which adapts the step size; ε avoids ÷0 |
Adaptive step size AND momentum together — robust defaults that ‘just work’.
Kingma & Ba 2014; defaults β₁=0.9, β₂=0.999, η=1e-3.
The most widely used optimiser in deep learning; trains transformers, GANs, and most published models.
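
A scalar sketch of the Adam update with the defaults quoted above. It follows the usual bias-corrected form; the toy loss f(θ) = θ² and the function name `adam` are assumptions for illustration, not a framework API.

```python
import math

def adam(grad, theta, steps, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam for a single scalar parameter, with bias correction."""
    m = 0.0  # decaying average of gradients
    v = 0.0  # decaying average of squared gradients
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # undo the bias toward zero at the start
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta

# Toy loss f(θ) = θ² with gradient 2θ (an assumption for illustration).
print(adam(lambda t: 2 * t, theta=5.0, steps=1))     # about 4.999
print(adam(lambda t: 2 * t, theta=5.0, steps=2000))  # noticeably closer to 0
```

The first step has size close to `lr` regardless of how large the gradient is, because m̂ / √v̂ reduces to the gradient's sign; that scale-invariance is a large part of why the defaults transfer across problems.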
| Piece | In plain words |
|---|---|
| Adam step | the usual adaptive momentum update |
| + λθ | pull every weight toward zero, applied directly (decoupled) |
| decoupled | the shrink is NOT folded into the gradient — so it works correctly |
Adam done right — keeps the weight-shrinking penalty separate so it works as intended.
Loshchilov & Hutter 2017.
The standard optimiser for training LLMs (BERT, GPT, LLaMA) — better generalisation than vanilla Adam.
| Piece | In plain words |
|---|---|
| η₀ | the starting learning rate |
| cos(π t/T) | smoothly falls from +1 to −1 over training |
| ηₜ | so the rate eases from η₀ down to 0 |
Start cautious, go fast, then slow down to settle.
Warmup + cosine decay is standard for transformers.
Critical for LLM training stability; warmup prevents early divergence; cosine decay squeezes out final accuracy.
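
One common way to combine warmup with cosine decay, sketched as a plain function of the step number. The linear warmup shape and the parameter names are assumptions; the decay part is ηₜ = ½ η₀ (1 + cos(π t/T)).

```python
import math

def learning_rate(step, total_steps, base_lr, warmup_steps=0):
    """Linear warmup to base_lr, then cosine decay down to 0."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

# Rises to the base rate by step 9, is halved at step 55, reaches 0 at step 100.
for step in (0, 9, 10, 55, 100):
    print(step, round(learning_rate(step, 100, 1e-3, warmup_steps=10), 6))
```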
| Part | In plain words |
|---|---|
| ▲ Top: (λ/2) ‖θ‖² | a penalty that grows with the size of the weights |
| ▼ Bottom: added to Loss | so training is pushed to keep weights small AND fit the data |
Gently pull every weight toward zero so the model stays simple.
λ = 0.01 is a common value.
Reduces overfitting in nearly every trained model; in AdamW it’s the main regulariser for LLMs.
| Piece | In plain words |
|---|---|
| validation loss | error on held-out data |
| patience p | how many stalled epochs to tolerate |
| stop | quit before the model starts memorising |
Quit while you’re ahead — stop before the model memorises the training set.
Patience of 5–10 epochs is typical.
Cheap, effective overfitting guard; saves compute; used everywhere from Kaggle to production.
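
The patience rule can be sketched as a function over a recorded validation curve. The function name and the loss values are made up; a real training loop would apply the same check once per epoch and restore the best checkpoint.

```python
def early_stopping(val_losses, patience=5):
    """Return (stop_epoch, best_epoch) for a list of per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = None
    stalled = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, stalled = loss, epoch, 0
        else:
            stalled += 1
            if stalled >= patience:
                return epoch, best_epoch  # ran out of patience
    return len(val_losses) - 1, best_epoch  # never triggered

losses = [1.00, 0.80, 0.70, 0.72, 0.71, 0.73, 0.90]  # made-up validation curve
print(early_stopping(losses, patience=3))  # (5, 2)
```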
| Piece | In plain words |
|---|---|
| hyperparameters | settings you choose, not learned by gradient |
| search | grid / random / model-based exploration |
| best result | the config with the best validation score |
Tune the ‘knobs’ you set by hand to get the best model.
Random search often beats grid search (Bergstra & Bengio 2012).
Optuna / Ray Tune in practice; can swing accuracy by points — the gap between mediocre and winning.
| Piece | In plain words |
|---|---|
| surrogate | a cheap model guessing how each config performs |
| acquisition | picks the most promising config to try next |
| update | retrain the guess after each real trial |
Learn a cheap guess of performance, then test the most promising setting next.
GPyOpt, Optuna’s TPE, Google Vizier.
Tunes expensive models (each trial = hours of GPU) in few trials; used for AutoML and even chemistry/hardware design.
Probability & Statistics
| Piece | In plain words |
|---|---|
| outcome | a result of a chance experiment |
| X | maps that outcome to a number |
| p(x) / f(x) | how likely each value is (discrete / continuous) |
A number whose value depends on chance — a die roll, tomorrow’s temperature.
X = sum of two dice, ranging 2–12.
Model inputs, labels, and predictions as random variables; uncertainty estimates; sampling in generative models.
| Piece | In plain words |
|---|---|
| p | chance of a '1' (success) |
| 1 − p | chance of a '0' (failure) |
| mean = p | the long-run fraction of 1s |
One coin flip with a possibly-biased coin.
p = 0.5 for a fair coin.
Binary classification output (spam/not); each pixel of a binarised image; click/no-click modelling.
| Piece | In plain words |
|---|---|
| C(n,k) | number of ways to pick which k trials succeed |
| pᵏ | probability those k succeed |
| (1−p)ⁿ⁻ᵏ | probability the other n−k fail |
| mean = np | expected number of successes |
How many heads in n coin flips.
10 flips, p=0.5: expect 5 heads.
Conversion counts in A/B tests; defect counts in QA; correct predictions out of n.
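
The binomial formula written out directly. `binomial_pmf` is an illustrative name, and `math.comb` assumes Python 3.8 or newer.

```python
import math

def binomial_pmf(k, n, p):
    """C(n, k) · pᵏ · (1 − p)ⁿ⁻ᵏ: chance of exactly k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.5
print(binomial_pmf(5, n, p))  # 0.24609375, i.e. 252 / 1024
mean = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
print(mean)                   # 5.0, matching mean = np
```

So 5 heads is the expected count, yet it happens on only about a quarter of 10-flip runs.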
| Part | In plain words |
|---|---|
| ▲ Top: λᵏ · e^(−λ) | weight for seeing k events when the average rate is λ |
| ▼ Bottom: k! | divide by k-factorial (the count's arrangements) |
How many rare things happen in a window — calls per hour, typos per page.
λ = 3 emails/hour.
Arrival rates (web traffic, network packets); count data in NLP; queueing systems.
| Part | In plain words |
|---|---|
| ▲ Top: 1 | equal weight for every value |
| ▼ Bottom: b − a | spread that weight evenly across the whole range |
Total fairness — every value in the range is just as probable.
A random float in [0, 1).
Random weight init, dropout masks, random seeds, Monte-Carlo sampling.
| Part | In plain words |
|---|---|
| ▲ Top: e^( −(x−μ)² / (2σ²) ) | peaks at the mean μ and falls off as you move away (curve shape) |
| ▼ Bottom: √(2πσ²) | a constant that makes the whole area equal 1 |
The classic bell curve — most values near the average, few far out.
Heights, measurement noise; 68% fall within ±1σ.
Weight init, noise models, VAE priors, diffusion noise, Gaussian processes — the most important distribution in ML.
| Piece | In plain words |
|---|---|
| i.i.d. variables | many independent samples from the same source |
| their average | sum them and divide by n |
| → Gaussian | that average looks bell-shaped, whatever the source (given finite variance) |
Average enough random things and the result looks like a bell curve.
The average of 30+ dice rolls is approximately normal.
Why Gaussian assumptions work so often; the basis of confidence intervals and many statistical tests.
| Piece | In plain words |
|---|---|
| x | each possible value |
| p(x) | its probability (the weight) |
| Σ x·p(x) | probability-weighted average → the long-run mean |
The average value you’d get if you repeated the experiment forever.
Fair die: E[X] = 3.5.
Expected loss is what training minimises; expected reward in RL; expected value drives every risk/decision calc.
| Part | In plain words |
|---|---|
| ▲ Top: (X − μ)² | squared distance of each value from the mean |
| ▼ Bottom: averaged | take the expected (mean) of those squared distances |
How spread out the values are around the average — small = consistent, large = erratic.
Fair die variance ≈ 2.92.
Bias–variance tradeoff governs over/underfitting; gradient variance affects stability; risk in finance.
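
The fair-die numbers for expected value and variance can be checked by applying the two definitions directly:

```python
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6  # fair die: every face equally likely

# Expected value: probability-weighted average of the values.
mean = sum(x * p for x, p in zip(values, probs))

# Variance: probability-weighted average of squared distance from the mean.
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))

print(round(mean, 2))      # 3.5
print(round(variance, 2))  # 2.92
```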
| Piece | In plain words |
|---|---|
| (X−μₓ) | how far X is from its mean |
| (Y−μᵧ) | how far Y is from its mean |
| their product, averaged | positive if they move together, negative if oppositely |
Do two quantities tend to rise and fall together (positive) or oppositely (negative)?
Height and weight have positive covariance.
The covariance matrix drives PCA; portfolio risk; feature decorrelation and whitening.
| Part | In plain words |
|---|---|
| ▲ Top: Cov(X,Y) | how much X and Y move together |
| ▼ Bottom: σₓ · σᵧ | divide by their spreads → a clean score from −1 to +1 |
Covariance normalised to −1…+1; ±1 = perfect line, 0 = no linear link.
ρ = 0.9 is a strong positive relationship.
Feature selection, detecting redundancy/leakage, exploratory analysis — but correlation ≠ causation.
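
Covariance and correlation sketched on a small sample. The height and weight numbers are invented for illustration, and the population form (divide by n) is used for simplicity.

```python
import math

def covariance(xs, ys):
    """Average product of deviations from the two means (population form)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by both standard deviations: a score in [−1, +1]."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

heights = [150, 160, 170, 180, 190]  # made-up illustrative data
weights = [52, 58, 69, 77, 90]
print(covariance(heights, weights) > 0)  # True: they rise together
print(correlation(heights, weights))     # close to +1
```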
| Piece | In plain words |
|---|---|
| distribution | the probability pattern to draw from |
| x ~ | generate values that follow that pattern |
| methods | inverse-CDF, rejection, Markov-chain Monte Carlo |
Generating example values that follow a chosen probability pattern.
Sampling from N(0,1) with np.random.randn.
Bootstrapping for uncertainty; mini-batch selection; generating text/images from a model’s distribution.
| Part | In plain words |
|---|---|
| ▲ Top: σ | the spread of the data |
| ▼ Bottom: √n | shrinks with more samples → a tighter interval |
A plausible range for the true value, with a stated level of trust.
A 95% CI uses z ≈ 1.96.
Reporting metric uncertainty (accuracy ± CI); A/B-test result ranges; scientific reproducibility.
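
A sketch of a normal-approximation interval, mean ± z · s / √n, using the sample standard deviation s in place of σ. The accuracy scores are made up, and for a sample this small a t critical value would usually be preferred over 1.96.

```python
import math
import statistics

def mean_confidence_interval(samples, z=1.96):
    """Normal-approximation interval: mean ± z · s / √n."""
    n = len(samples)
    centre = statistics.mean(samples)
    standard_error = statistics.stdev(samples) / math.sqrt(n)
    return centre - z * standard_error, centre + z * standard_error

accuracies = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.84, 0.81]  # made-up scores
low, high = mean_confidence_interval(accuracies)
print(f"{low:.3f} to {high:.3f}")  # roughly 0.796 to 0.824
```

Quadrupling the number of samples halves the width, since the standard error shrinks with √n.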
| Piece | In plain words |
|---|---|
| H₀ | the boring default: nothing is happening |
| H₁ | the claim: there's a real effect |
| test statistic | reject H₀ if the evidence is strong enough |
A formal way to decide whether an observed effect is real or just luck.
t-test, chi-square test, z-test.
Deciding if model B truly beats model A; if a feature matters; if an A/B variant won.
| Piece | In plain words |
|---|---|
| H₀ true | assume nothing is really going on |
| data this extreme | chance of seeing a result at least as extreme as yours |
| small p | such data would be surprising under H₀ → evidence against H₀ |
The chance of seeing a result at least as extreme as yours if nothing were going on. A small value is evidence against H₀, not the probability that the effect is real.
p = 0.03 < 0.05 → statistically significant.
Gatekeeper for A/B-test decisions; widely misused (p-hacking) — significance ≠ practical importance.
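
For a z-test, the two-sided p-value follows from the standard normal tail. This sketch assumes the test statistic is already a z-score and uses `math.erfc` from the standard library.

```python
import math

def two_sided_p_value(z):
    """Chance of a standard-normal statistic at least as extreme as z, under H₀."""
    return math.erfc(abs(z) / math.sqrt(2))

print(two_sided_p_value(1.96))  # about 0.05
print(two_sided_p_value(2.17))  # about 0.03, below the 0.05 threshold
print(two_sided_p_value(0.5))   # about 0.62, no evidence against H₀
```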
| Piece | In plain words |
|---|---|
| random split | half see A, half see B — fair comparison |
| metric | the number you care about (conversions, revenue) |
| compare | test if B's lift is significant, not luck |
Show version A to half your users and B to the other half, then measure which wins.
Testing a new recommendation model against the current one.
How tech companies ship product and ML changes; needs proper sample size, randomisation, and no peeking.
| Piece | In plain words |
|---|---|
| p(xᵢ\|θ) | how probable each data point is, given settings θ |
| Π (product) | multiply across all data points |
| argmax | pick the θ that makes the observed data most probable |
Pick the parameters that make the data you actually observed most probable.
MLE of a Gaussian’s mean is the sample average.
Training many models IS maximum likelihood — minimising cross-entropy = maximising label likelihood.
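
A small check that the sample average maximises a Gaussian likelihood. The data, the fixed σ = 1 and the candidate means are all made up for illustration; the log is used so the product over data points becomes a sum.

```python
import math

def gaussian_log_likelihood(data, mu, sigma=1.0):
    """Sum of log p(xᵢ | μ, σ): the log turns the product into a sum."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )

data = [2.1, 1.9, 2.4, 2.0, 1.6]  # made-up observations
sample_mean = sum(data) / len(data)

# Try a few candidate means: the sample average scores highest.
candidates = [1.5, sample_mean, 2.5]
best = max(candidates, key=lambda mu: gaussian_log_likelihood(data, mu))
print(best == sample_mean)  # True
```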
| Piece | In plain words |
|---|---|
| p(x\|θ) | likelihood: how well θ explains the data (as in MLE) |
| p(θ) | prior: your belief about reasonable θ before seeing data |
| argmax of product | balance the two → like MLE plus regularisation |
Like MLE, but you also bring in prior beliefs about reasonable parameter values.
A Gaussian prior on weights ⟺ L2 regularisation.
Explains why weight decay works; injects domain knowledge; stabilises estimates with little data.
| Part | In plain words |
|---|---|
| ▲ Top: p(x \| θ) · p(θ) | likelihood × prior: evidence combined with belief |
| ▼ Bottom: p(x) | a normaliser so the posterior sums/integrates to 1 |
Start with a belief, see evidence, revise it — and keep a full distribution, not just one number.
Bayesian A/B testing; Bayesian neural networks.
Uncertainty-aware predictions (medicine, autonomy); Bayesian optimisation for tuning; naive-Bayes spam filters.
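
Bayes' rule on a small discrete set of hypotheses, as a sketch. The three candidate coin biases, the uniform prior and the observed flips are invented for illustration.

```python
def posterior(prior, likelihood):
    """Bayes' rule on a discrete set: prior × likelihood, divided by p(x)."""
    unnormalised = {h: prior[h] * likelihood(h) for h in prior}
    evidence = sum(unnormalised.values())  # p(x), the normaliser
    return {h: w / evidence for h, w in unnormalised.items()}

# Three candidate coin biases, equally believable before seeing data.
prior = {0.25: 1 / 3, 0.5: 1 / 3, 0.75: 1 / 3}

# Observed data (made up): 3 heads and 1 tail.
def likelihood(theta):
    return theta ** 3 * (1 - theta)

belief = posterior(prior, likelihood)
for theta, prob in belief.items():
    print(theta, round(prob, 3))  # 0.25 0.065, then 0.5 0.348, then 0.75 0.587
```

The result is a full distribution over the three biases: 0.75 is the most believable, but 0.5 still keeps about a third of the weight.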