ML Simplified

The math behind machine learning, explained four ways — every formula read as a plain-language table, each with an interactive demo. A static OneNote-style reference.

📓 How to read this notebook

Pick a subject tab, then a topic tab. Each topic has four panes (🧮 The Formula, 💡 Simplified, 📘 Examples & References, ⚙️ Real-World Values & Applications), and the formula is shown as a plain-language table (fractions become top-over-bottom). In the live notebook, every Mathematical (formula) pane also has a 🎮 interactive demo you can drag and play with.

Linear Algebra

🧮 The Formula — read as a table

a + b = [ a₁+b₁ , a₂+b₂ , … , aₙ+bₙ ]

Piece	In plain words
a , b	the two arrows (lists of numbers) you're combining
aᵢ + bᵢ	add the matching slots, one position at a time
result	a new arrow — the tip-to-tail diagonal of the two

📖 Reads as: Add two arrows by adding their matching numbers; the answer is where you land if you walk the first arrow then the second.

💡 Simplified

Stack two arrows tip-to-tail; where you end up is the sum. Order doesn’t matter.

📘 Examples & References

[1,2] + [3,1] = [4,3]. Reference: 3Blue1Brown, Essence of Linear Algebra Ch.1.

⚙️ Real-World Values & Applications

Summing gradient contributions across a batch; moving an embedding by a ‘direction’ (king − man + woman ≈ queen).

🧮 The Formula — read as a table

a − b = [ a₁−b₁ , a₂−b₂ , … , aₙ−bₙ ]

Piece	In plain words
a − b	subtract matching slots
result	the arrow pointing FROM b TO a

📖 Reads as: Subtract matching numbers; the answer is the arrow that takes you from b to a — 'where you are minus where you were'.

💡 Simplified

The arrow that gets you from one point to another.

📘 Examples & References

[4,3] − [3,1] = [1,2].

⚙️ Real-World Values & Applications

Error vectors (prediction − target); displacement between two GPS points; embedding differences that capture relationships.

🧮 The Formula — read as a table

a · b = a₁b₁ + a₂b₂ + … + aₙbₙ = ‖a‖ ‖b‖ cos θ

Piece	In plain words
a₁b₁ + a₂b₂ + …	multiply matching slots, then add them all up
‖a‖ ‖b‖	the two arrows' lengths multiplied
cos θ	how aligned they are: 1 = same way, 0 = perpendicular, −1 = opposite
result	one number: how much the two arrows point the same way

📖 Reads as: Multiply-and-sum the matching numbers; for arrows of a given length, the bigger the result, the more the two arrows agree in direction.

💡 Simplified

A single number measuring how much two vectors point the same way. Big = aligned, 0 = unrelated, negative = opposite.

📘 Examples & References

[1,2]·[3,1] = 1·3 + 2·1 = 5. Cosine similarity is the normalised dot product.
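
The worked example above can be checked in a few lines. This is a minimal sketch in plain Python; the helper names `dot`, `norm` and `cosine_similarity` are made up for illustration and are not taken from any framework.

```python
import math

def dot(a, b):
    """Multiply matching slots, then add them all up."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Dot product divided by both lengths, i.e. cos θ."""
    return dot(a, b) / (norm(a) * norm(b))

a, b = [1, 2], [3, 1]
print(dot(a, b))                # 1*3 + 2*1 = 5
print(cosine_similarity(a, b))  # 5 / (sqrt(5) * sqrt(10)), about 0.707
```

Because cosine similarity divides out both lengths, it measures direction only: doubling either arrow leaves it unchanged.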

⚙️ Real-World Values & Applications

Every neuron computes a dot product of weights and inputs; attention scores are query·key dot products; recommender similarity.

🧮 The Formula — read as a table

a × b ⟂ both, ‖a × b‖ = ‖a‖ ‖b‖ sin θ

Piece	In plain words
a × b	a NEW arrow at right angles to both inputs (3D only)
‖a‖ ‖b‖ sin θ	its length = area of the parallelogram a and b form
direction	perpendicular to the plane of a and b (right-hand rule)

📖 Reads as: Combine two 3D arrows to get a third arrow that sticks straight out of the surface they span.

💡 Simplified

Multiply two 3D arrows to get a third arrow at right angles to both — the ‘normal’ to their plane.

📘 Examples & References

[1,0,0] × [0,1,0] = [0,0,1].
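
The same example as a small sketch in plain Python. The function names are illustrative; the component formula used in `cross` is the standard one for 3D vectors.

```python
def cross(a, b):
    """Cross product of two 3D vectors given as length-3 lists."""
    ax, ay, az = a
    bx, by, bz = b
    return [ay * bz - az * by,
            az * bx - ax * bz,
            ax * by - ay * bx]

def dot(a, b):
    """Multiply matching slots, then add them all up."""
    return sum(x * y for x, y in zip(a, b))

n = cross([1, 0, 0], [0, 1, 0])
print(n)  # [0, 0, 1]
# The result is at right angles to both inputs, so both dot products are zero.
print(dot(n, [1, 0, 0]), dot(n, [0, 1, 0]))  # 0 0
```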

⚙️ Real-World Values & Applications

Surface normals in 3D graphics and robotics; torque and angular momentum in physics engines; camera orientation.

🧮 The Formula — read as a table

‖x‖₁ = |x₁| + |x₂| + … + |xₙ|

Piece	In plain words
|xᵢ|	drop the minus signs: take each number's size
sum	add up all those sizes
result	total 'city-block' distance

📖 Reads as: Add up the absolute sizes of every component — the distance if you can only travel along a grid.

💡 Simplified

Total distance if you can only move along grid streets, never diagonally.

📘 Examples & References

‖[3,−4]‖₁ = 3 + 4 = 7.

⚙️ Real-World Values & Applications

L1 (Lasso) regularisation can push many weights to exactly zero → automatic feature selection; robust error metric (MAE).

🧮 The Formula — read as a table

‖x‖₂ = √( x₁² + x₂² + … + xₙ² )

Piece	In plain words
x₁² + x₂² + …	square each number and add them up
√( … )	take the square root of that total
meaning	this is just the Pythagorean theorem in n dimensions

📖 Reads as: Square every component, add them, square-root the total — the straight-line length of the arrow.

💡 Simplified

The ‘as-the-crow-flies’ length of the arrow.

📘 Examples & References

‖[3,−4]‖₂ = √(9+16) = √25 = 5.
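
A quick check of this example, shown next to the city-block (L1) norm of the same vector for comparison. A minimal sketch in plain Python; `l1_norm` and `l2_norm` are illustrative names.

```python
import math

def l1_norm(x):
    """City-block length: add up the absolute sizes."""
    return sum(abs(v) for v in x)

def l2_norm(x):
    """Straight-line length: square, add, square-root."""
    return math.sqrt(sum(v * v for v in x))

x = [3, -4]
print(l1_norm(x))  # 3 + 4 = 7
print(l2_norm(x))  # sqrt(9 + 16) = 5.0
```

The L2 length is never longer than the L1 length, because the straight line is the shortest route between two points.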

⚙️ Real-World Values & Applications

L2 / weight decay keeps weights small and smooth; normalising embeddings to unit length; gradient clipping by norm.

🧮 The Formula — read as a table

proj_b a = ( a · b / ‖b‖² ) b

a · b

‖b‖²

Part	In plain words
▲ Top a · b	how much a points along b (their dot product)
▼ Bottom ‖b‖²	b's length squared — divides out b's size so it's a fair ratio

📖 Reads as: Divide 'how much a leans toward b' by b's length squared, then scale b by that ratio; the result is the shadow a casts on b.

💡 Simplified

The shadow one arrow casts on another when light shines straight down on it.

📘 Examples & References

Projecting [2,2] onto the x-axis [1,0] gives [2,0].
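
The projection formula can be sketched directly. This is plain Python with illustrative function names; the zero-vector check is an addition of this sketch, since the formula divides by ‖b‖².

```python
def dot(a, b):
    """Multiply matching slots, then add them all up."""
    return sum(x * y for x, y in zip(a, b))

def project(a, b):
    """Projection of a onto b: (a·b / ‖b‖²) times b."""
    bb = dot(b, b)  # ‖b‖², the bottom of the fraction
    if bb == 0:
        raise ValueError("cannot project onto the zero vector")
    scale = dot(a, b) / bb
    return [scale * x for x in b]

print(project([2, 2], [1, 0]))  # [2.0, 0.0]
```

What is left over after removing the shadow (a minus its projection) is perpendicular to b, which is why projections appear in least-squares fitting.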

⚙️ Real-World Values & Applications

PCA projects data onto principal directions; least-squares regression projects targets onto the feature space.

🧮 The Formula — read as a table

a ⟂ b ⟺ a · b = 0

Piece	In plain words
a · b = 0	their dot product is exactly zero
⟂	the arrows meet at 90° — perpendicular
meaning	they share no common direction at all

📖 Reads as: If the dot product is zero, the two arrows are at right angles and completely unrelated in direction.

💡 Simplified

Two directions that share no overlap — knowing one tells you nothing about the other.

📘 Examples & References

[1,0]·[0,1] = 0 → the axes are orthogonal.

⚙️ Real-World Values & Applications

Orthogonal weight init prevents signal blow-up; orthogonal bases (Fourier, wavelets) for compression; decorrelated features.

🧮 The Formula — read as a table

(AB)ᵢⱼ = Aᵢ₁B₁ⱼ + Aᵢ₂B₂ⱼ + … + AᵢₙBₙⱼ

Piece	In plain words
row i of A	one horizontal strip from the first matrix
column j of B	one vertical strip from the second matrix
dot product	multiply them slot-by-slot and sum → cell (i,j) of the answer
rule	inner sizes must match: (m×n)(n×p) = (m×p)

📖 Reads as: Each output cell is the dot product of a row from A and a column from B; it chains two transformations into one.

💡 Simplified

Every output cell is a row·column dot product. It glues two linear transformations together.

📘 Examples & References

[[1,2],[3,4]] · [1,1]ᵀ = [3,7]ᵀ.
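
The row-times-column rule can be sketched in plain Python, with matrices stored as lists of rows. The name `matmul` is illustrative; real frameworks use heavily optimised routines rather than nested loops like these.

```python
def matmul(A, B):
    """Multiply an (m×n) matrix by an (n×p) matrix, both lists of rows."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("inner sizes must match: (m×n)(n×p)")
    p = len(B[0])
    # Cell (i, j) is the dot product of row i of A with column j of B.
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]

A = [[1, 2], [3, 4]]
x = [[1], [1]]        # the column vector [1, 1]ᵀ
print(matmul(A, x))   # [[3], [7]]
```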

⚙️ Real-World Values & Applications

The heart of deep learning: every dense layer is Wx+b, and the bulk of training compute is matrix multiplication, which is why GPUs (built for massively parallel arithmetic) suit it so well.

🧮 The Formula — read as a table

(Aᵀ)ᵢⱼ = Aⱼᵢ ; (AB)ᵀ = Bᵀ Aᵀ

Piece	In plain words
Aⱼᵢ	swap the row and column index of every entry
effect	rows become columns, columns become rows (flip across the diagonal)

📖 Reads as: Flip the matrix across its diagonal so rows turn into columns.

💡 Simplified

Turn rows into columns and columns into rows.

📘 Examples & References

[[1,2],[3,4]]ᵀ = [[1,3],[2,4]].

⚙️ Real-World Values & Applications

Backprop multiplies by Wᵀ to push gradients backward; Gram matrices XᵀX for covariance and style transfer.

🧮 The Formula — read as a table

A⁻¹ A = I (exists if det A ≠ 0)

Piece	In plain words
A⁻¹	the 'undo' matrix
A⁻¹A = I	applying it after A returns you to where you started (identity)
det A ≠ 0	only invertible if the matrix doesn't squash space flat

📖 Reads as: The inverse is the matrix that undoes A; it only exists when A doesn't collapse any dimension.

💡 Simplified

The ‘undo’ matrix — apply it to reverse what A did.

📘 Examples & References

[[2,0],[0,2]]⁻¹ = [[0.5,0],[0,0.5]].

⚙️ Real-World Values & Applications

Closed-form regression β̂ = (XᵀX)⁻¹Xᵀy; Kalman filters; in practice we solve systems rather than invert (faster, stabler).

🧮 The Formula — read as a table

det [[a,b],[c,d]] = a·d − b·c

Piece	In plain words
a·d	product of the main diagonal
b·c	product of the off-diagonal
difference	the signed area/volume scaling factor; 0 = space squashed flat

📖 Reads as: Multiply the diagonals and subtract; the result tells you how much the matrix stretches area (zero = it flattens space).

💡 Simplified

How much the transformation stretches or shrinks space. Zero means it squashes everything flat.

📘 Examples & References

det [[1,2],[3,4]] = 1·4 − 2·3 = −2.
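
A small sketch of the 2×2 determinant, plus the 2×2 inverse that depends on it. The names `det2` and `inv2` are illustrative, and the closed-form inverse used here (swap the diagonal, negate the off-diagonal, divide by the determinant) only applies to 2×2 matrices.

```python
def det2(M):
    """Determinant of a 2×2 matrix [[a, b], [c, d]]: a·d − b·c."""
    (a, b), (c, d) = M
    return a * d - b * c

def inv2(M):
    """Inverse of a 2×2 matrix; exists only when the determinant is non-zero."""
    (a, b), (c, d) = M
    det = det2(M)
    if det == 0:
        raise ValueError("matrix squashes space flat, so it has no inverse")
    return [[d / det, -b / det], [-c / det, a / det]]

print(det2([[1, 2], [3, 4]]))  # 1*4 - 2*3 = -2
print(inv2([[2, 0], [0, 2]]))  # [[0.5, 0.0], [0.0, 0.5]]
```

A determinant of zero, as for [[1,2],[2,4]], is exactly the case where `inv2` refuses to continue.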

⚙️ Real-World Values & Applications

Normalising flows use log|det J|; checking invertibility; computing Gaussian densities.

🧮 The Formula — read as a table

rank(A) = number of independent rows (or columns)

Piece	In plain words
independent rows	rows that can't be built from the others
count	how many genuinely different directions A can reach
low rank	lots of redundancy — the matrix is 'simpler' than its size

📖 Reads as: Count how many rows are truly different; that's how many independent directions the matrix spans.

💡 Simplified

How many genuinely different directions the matrix can reach. Low rank = redundancy.

📘 Examples & References

[[1,2],[2,4]] has rank 1 (row 2 = 2 × row 1).

⚙️ Real-World Values & Applications

Low-rank factorisation (LoRA!) fine-tunes LLMs with tiny matrices; recommenders factor the ratings matrix; PCA.

🧮 The Formula — read as a table

tr(A) = A₁₁ + A₂₂ + … + Aₙₙ = Σ eigenvalues

Piece	In plain words
Aᵢᵢ	the diagonal entries
sum	just add them up
= Σ eigenvalues	it also equals the sum of all eigenvalues

📖 Reads as: Add up the diagonal entries — which happens to equal the sum of the eigenvalues.

💡 Simplified

Just add up the diagonal entries.

📘 Examples & References

tr [[1,2],[3,4]] = 1 + 4 = 5.

⚙️ Real-World Values & Applications

Simplifies gradient derivations (cyclic identities); trace of covariance = total variance; information measures.

🧮 The Formula — read as a table

c₁v₁ + c₂v₂ + … + cₙvₙ = 0 only when all cᵢ = 0

Piece	In plain words
Σ cᵢvᵢ = 0	a weighted combination of the vectors equals the zero vector
all cᵢ = 0	the ONLY way to get zero is to give every vector a weight of zero
meaning	none of the vectors is redundant

📖 Reads as: If the only way to combine the vectors into zero is to multiply them all by zero, none is redundant — they're independent.

💡 Simplified

No vector in the set can be built from the others — none is wasted.

📘 Examples & References

[1,0] and [0,1] are independent; [1,0] and [2,0] are not.

⚙️ Real-World Values & Applications

Independent features avoid multicollinearity (unstable coefficients); guarantee a unique least-squares solution.

🧮 The Formula — read as a table

any v = c₁e₁ + c₂e₂ + … + cₙeₙ (unique cᵢ)

Piece	In plain words
e₁ … eₙ	the building-block directions (independent + spanning)
cᵢ	the unique coordinates of v in that basis
any v	every vector in the space can be built this way

📖 Reads as: A basis is the smallest set of directions from which every vector can be uniquely rebuilt.

💡 Simplified

The smallest set of building-block directions you can construct everything else from.

📘 Examples & References

Standard basis of 3D space: [1,0,0], [0,1,0], [0,0,1].

⚙️ Real-World Values & Applications

Embedding dimensions are a learned basis for meaning; Fourier/wavelet bases for compression; good basis = good features.

🧮 The Formula — read as a table

closed: u,v ∈ S ⟹ u+v ∈ S and c·u ∈ S

Piece	In plain words
u+v ∈ S	add two members → still inside
c·u ∈ S	scale a member → still inside
always contains 0	every subspace passes through the origin

📖 Reads as: A subspace is a flat slice (line or plane through the origin) that you can't escape by adding or scaling its members.

💡 Simplified

A flat ‘slice’ of space — a line or plane through the origin — that stays inside itself.

📘 Examples & References

The x–y plane is a 2D subspace of 3D space.

⚙️ Real-World Values & Applications

Data often lies on or near a low-dimensional subspace (or, more generally, a curved manifold) inside high-dimensional space, which is the basis of dimensionality reduction.

🧮 The Formula — read as a table

A v = λ v ; λ found via det(A − λI) = 0

Piece	In plain words
A v	apply the matrix to a special vector v
λ v	the result is just v scaled by a number λ (no rotation)
λ	the eigenvalue — the stretch factor along that direction

📖 Reads as: An eigenvalue λ is the amount a matrix stretches one of its special, un-rotated directions.

💡 Simplified

The scaling factors along the special directions a matrix doesn’t rotate.

📘 Examples & References

[[2,0],[0,3]] has eigenvalues 2 and 3.
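
For a 2×2 matrix, det(A − λI) = 0 expands to the quadratic λ² − tr(A)·λ + det(A) = 0, which can be solved directly. The sketch below assumes that 2×2 special case; `eigenvalues_2x2` is an illustrative name, and larger matrices need an iterative routine from a numerical library instead.

```python
import cmath

def eigenvalues_2x2(M):
    """Roots of λ² − tr(A)·λ + det(A) = 0 for a 2×2 matrix A."""
    (a, b), (c, d) = M
    trace = a + d
    det = a * d - b * c
    root = cmath.sqrt(trace * trace - 4 * det)  # complex-safe square root
    return (trace - root) / 2, (trace + root) / 2

lo, hi = eigenvalues_2x2([[2, 0], [0, 3]])
print(lo.real, hi.real)  # 2.0 3.0
```

The results are returned as complex numbers on purpose: a pure rotation such as [[0,−1],[1,0]] has no real direction that stays un-rotated, and its eigenvalues come out imaginary.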

⚙️ Real-World Values & Applications

PCA’s eigenvalues = variance per component; PageRank is the top eigenvector; spectral norm bounds training stability.

🧮 The Formula — read as a table

A v = λ v , v ≠ 0

Piece	In plain words
v	a direction the matrix only stretches, never turns
λ	how much it stretches that direction
A v = λ v	applying A keeps v on the same line

📖 Reads as: An eigenvector is a direction that a transformation merely stretches without turning.

💡 Simplified

The directions a transformation only stretches without rotating.

📘 Examples & References

For [[2,0],[0,3]], the eigenvectors are [1,0] and [0,1].

⚙️ Real-World Values & Applications

Principal components are eigenvectors of the covariance matrix; eigenfaces for face recognition; vibration modes.

🧮 The Formula — read as a table

A = Q Λ Qᵀ (A symmetric)

Piece	In plain words
Q	rotation: the orthonormal eigenvectors
Λ	scaling: a diagonal of eigenvalues
Qᵀ	rotate back

📖 Reads as: A symmetric matrix can be rewritten as: rotate → stretch each axis → rotate back.

💡 Simplified

Rewrite a symmetric matrix as rotate, scale each axis, then rotate back.

📘 Examples & References

Covariance matrices are symmetric, so they always decompose this way.

⚙️ Real-World Values & Applications

Whitening / decorrelating data; matrix square-roots and powers; the math behind PCA and Gaussian processes.

🧮 The Formula — read as a table

A = U Σ Vᵀ (works for ANY matrix)

Piece	In plain words
Vᵀ	first rotation (in the input space)
Σ	stretch along axes by the singular values
U	second rotation (in the output space)

📖 Reads as: Any matrix at all is: rotate → stretch → rotate. The stretch amounts are the singular values.

💡 Simplified

The universal decomposition: rotate, stretch along axes, rotate again — for ANY matrix.

📘 Examples & References

Keeping the top-k singular values gives the best rank-k approximation (Eckart–Young).

⚙️ Real-World Values & Applications

Latent semantic analysis, image compression, Netflix-prize recommenders, and the theory behind low-rank LLM adapters.

🧮 The Formula — read as a table

A = Q R

Piece	In plain words
Q	orthonormal columns — a clean rotation part
R	upper-triangular — the 'bookkeeping' part
use	solve A x = b stably without forming A⁻¹

📖 Reads as: Split a matrix into a clean rotation (Q) and a triangular remainder (R).

💡 Simplified

Split a matrix into a clean rotation part and a triangular bookkeeping part.

📘 Examples & References

Used to solve A x = b stably; the QR algorithm finds eigenvalues.

⚙️ Real-World Values & Applications

Numerically stable least-squares; eigenvalue computation; orthogonalising layers in deep nets.

🧮 The Formula — read as a table

xᵀ A x > 0 for all x ≠ 0 ⟺ all eigenvalues > 0 (A symmetric)

Piece	In plain words
xᵀ A x	a quadratic 'energy' score for any input x
> 0	always positive — the surface curves upward everywhere
eigenvalues > 0	equivalent condition: every stretch factor is positive

📖 Reads as: If the matrix gives a positive 'energy' for every input, it's a perfect bowl with one lowest point.

💡 Simplified

A matrix that always curves upward — like a perfect bowl with a single lowest point.

📘 Examples & References

The identity matrix is positive definite; any covariance matrix is at least positive semi-definite.

⚙️ Real-World Values & Applications

Guarantees a unique optimum in convex problems; enables Cholesky factorisation; kernel matrices in SVMs and GPs.

🧮 The Formula — read as a table

shape (batch, height, width, channels) — n axes

Piece	In plain words
scalar	0 axes — a single number
vector / matrix	1 / 2 axes
tensor	3+ axes — e.g. a batch of colour images
contraction	the einsum generalisation of matrix multiply

📖 Reads as: Tensors are arrays with any number of axes; deep-learning data and weights are tensors.

💡 Simplified

Arrays with more than two axes — e.g. a batch of colour images is a 4D tensor.

📘 Examples & References

A batch of 32 RGB images, each 224×224, has shape (32, 224, 224, 3) in the (batch, height, width, channels) layout.

⚙️ Real-World Values & Applications

Every DL framework (PyTorch, TensorFlow) is a tensor engine; attention runs on 4D tensors; einsum expresses contractions.

Calculus

Foundations

🧮 The Formula — read as a table

lim (x→a) f(x) = L

Piece	In plain words
x → a	let the input creep toward a
f(x) → L	the output settles toward L
meaning	the value the function heads to, even if it never arrives

📖 Reads as: As the input approaches a, the output approaches L — the value the function is heading toward.

💡 Simplified

What value a function is heading toward as the input approaches a point.

📘 Examples & References

lim (x→0) sin(x)/x = 1.

⚙️ Real-World Values & Applications

Underpins why gradient-descent steps work; learning-rate decay ‘in the limit’; defining derivatives that power backprop.

🧮 The Formula — read as a table

f′(x) = lim (h→0) [ f(x+h) − f(x) ] / h

f(x+h) − f(x)

h

Part	In plain words
▲ Top f(x+h) − f(x)	how much the output changes over a tiny step h
▼ Bottom h	the size of that tiny step (shrunk toward zero)

📖 Reads as: Change in output divided by a vanishingly small change in input — the slope right at this point.

💡 Simplified

How fast the output changes when you nudge the input — the steepness right here.

📘 Examples & References

d/dx (x²) = 2x; at x=3 the slope is 6.

⚙️ Real-World Values & Applications

The gradient (vector of derivatives) tells each weight which way and how much to move to cut the loss.

🧮 The Formula — read as a table

∂f/∂xᵢ = slope in the xᵢ direction (others fixed)

∂f

∂xᵢ

Part	In plain words
▲ Top ∂f	tiny change in the output
▼ Bottom ∂xᵢ	tiny change in ONE input, holding all the rest frozen

📖 Reads as: The output's slope when you wiggle just one input and freeze every other.

💡 Simplified

Change in the output when you wiggle one input and freeze the rest.

📘 Examples & References

f = x²y → ∂f/∂x = 2xy, ∂f/∂y = x².

⚙️ Real-World Values & Applications

Nets have millions of parameters; each weight update uses the partial derivative of the loss w.r.t. that weight.

🧮 The Formula — read as a table

d/dx f(g(x)) = f′(g(x)) · g′(x)

Piece	In plain words
g′(x)	slope of the inner function
f′(g(x))	slope of the outer function, measured at g(x)
multiply	chain the slopes together, outside-in

📖 Reads as: To differentiate nested functions, multiply the outer slope by the inner slope.

💡 Simplified

For nested functions, multiply the slopes of each layer together, outside-in.

📘 Examples & References

d/dx sin(x²) = cos(x²) · 2x.

⚙️ Real-World Values & Applications

Backpropagation IS the chain rule applied layer by layer — the single most important formula in deep learning.

🧮 The Formula — read as a table

∇f = [ ∂f/∂x₁ , ∂f/∂x₂ , … , ∂f/∂xₙ ]

Piece	In plain words
each ∂f/∂xᵢ	the slope in one input direction
vector	bundle all the slopes into one arrow
∇f	points toward the steepest increase

📖 Reads as: Bundle every partial slope into one arrow; it points uphill the fastest — so step the other way to go down.

💡 Simplified

An arrow pointing toward the fastest increase; step opposite to go downhill.

📘 Examples & References

f = x² + y² → ∇f = [2x, 2y].

⚙️ Real-World Values & Applications

Gradient descent moves parameters along −∇Loss; the whole training loop is ‘compute gradient, take a step’.

🧮 The Formula — read as a table

Jᵢⱼ = ∂fᵢ/∂xⱼ (matrix of all partials)

∂fᵢ

∂xⱼ

Part	In plain words
▲ Top ∂fᵢ	change in output number i
▼ Bottom ∂xⱼ	change in input number j — filled in for every (i,j) pair

📖 Reads as: A full table of how every output reacts to every input — the gradient generalised to many outputs.

💡 Simplified

A table of how every output reacts to every input.

📘 Examples & References

f(x,y) = [x², xy] → J = [[2x, 0],[y, x]].

⚙️ Real-World Values & Applications

Normalising flows use log|det J|; sensitivity analysis; vector-Jacobian products power autodiff backprop.

🧮 The Formula — read as a table

Hᵢⱼ = ∂²f / ∂xᵢ ∂xⱼ (curvature)

∂²f

∂xᵢ ∂xⱼ

Part	In plain words
▲ Top ∂²f	the second-order change in the output (the change of the change)
▼ Bottom ∂xᵢ ∂xⱼ	with respect to a pair of inputs — how the slope itself bends

📖 Reads as: The second-derivative table: tells you whether you're in a bowl, a dome, or a saddle.

💡 Simplified

Tells you how the slope itself is changing — bowl, dome, or saddle.

📘 Examples & References

f = x² + y² → H = [[2,0],[0,2]] (positive definite → a bowl).

⚙️ Real-World Values & Applications

Newton's method uses it directly and quasi-Newton methods such as L-BFGS approximate it; its eigenvalues diagnose saddle points that slow training.

🧮 The Formula — read as a table

f(x) ≈ f(a) + f′(a)(x−a) + ½ f″(a)(x−a)² + …

Piece	In plain words
f(a)	start from the value at a
f′(a)(x−a)	add the slope term (linear correction)
½ f″(a)(x−a)²	add the curvature term (quadratic correction)

📖 Reads as: Approximate any smooth curve near a point using its value, slope, curvature, and so on.

💡 Simplified

Approximate any smooth curve near a point using a polynomial built from its derivatives.

📘 Examples & References

eˣ ≈ 1 + x + x²/2 near 0.
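
To see how good the approximation is, the polynomial can be compared with the true exponential. A minimal sketch in plain Python; `taylor_exp` is an illustrative name, and it relies on the fact that every derivative of eˣ at 0 equals 1, so the k-th term is xᵏ / k!.

```python
import math

def taylor_exp(x, terms=3):
    """Taylor polynomial of e**x around 0, using the first `terms` terms."""
    return sum(x ** k / math.factorial(k) for k in range(terms))

x = 0.1
print(taylor_exp(x))  # 1 + x + x²/2, about 1.105
print(math.exp(x))    # the true value, about 1.10517
```

Close to 0 the three-term version is already accurate to about four digits; further away from 0 more terms are needed.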

⚙️ Real-World Values & Applications

Justifies gradient descent (first-order) and Newton’s method (second-order); solver approximations.

🧮 The Formula — read as a table

∫ₐᵇ f(x) dx = signed area under the curve

Piece	In plain words
f(x) dx	a thin slice: height × tiny width
∫ₐᵇ	add up infinitely many slices from a to b
result	total accumulated area / quantity / probability

📖 Reads as: Add up infinitely many thin slices under the curve to get a total.

💡 Simplified

Add up infinitely many thin slices to get a total — area, accumulated quantity, or probability.

📘 Examples & References

∫₀¹ x dx = ½.
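
The 'thin slices' picture can be tried numerically. The sketch below uses the midpoint rule, which is one of several possible slicing schemes and is chosen here only for simplicity; `integrate` and its `slices` parameter are illustrative names.

```python
def integrate(f, a, b, slices=1000):
    """Midpoint rule: sum of (height at slice centre) × (slice width)."""
    width = (b - a) / slices
    return sum(f(a + (i + 0.5) * width) for i in range(slices)) * width

print(integrate(lambda x: x, 0.0, 1.0))  # close to 0.5
```

More slices give a better estimate for curved functions; for a straight line like this one, the midpoint rule is already exact up to rounding.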

⚙️ Real-World Values & Applications

Expected values and probabilities are integrals; the ELBO in VAEs; diffusion models integrate an SDE to generate images.

🧮 The Formula — read as a table

f′(x) ≈ [ f(x+h) − f(x−h) ] / (2h)

f(x+h) − f(x−h)

2h

Part	In plain words
▲ Top f(x+h) − f(x−h)	measure the function a little ahead and a little behind
▼ Bottom 2h	divide by the total gap between those two points

📖 Reads as: Estimate a slope by sampling the function just ahead and just behind, then dividing by the gap.

💡 Simplified

Estimate a slope by measuring the function at nearby points — no formula needed.

📘 Examples & References

With h=0.01, central difference of x² at 3 ≈ 6.00.
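
The central-difference estimate takes only one line of arithmetic. A minimal sketch in plain Python; `central_difference` is an illustrative name and the default h = 0.01 simply mirrors the example above.

```python
def central_difference(f, x, h=0.01):
    """Estimate f'(x) from one sample just ahead and one just behind."""
    return (f(x + h) - f(x - h)) / (2 * h)

slope = central_difference(lambda x: x ** 2, 3.0)
print(round(slope, 6))  # 6.0
```

This is the idea behind gradient checking: compare the number from a hand-written derivative with this estimate, and investigate if they disagree by more than a small tolerance.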

⚙️ Real-World Values & Applications

Gradient-checking to verify hand-written backprop; finite-difference sensitivities when analytic gradients are missing.

🧮 The Formula — read as a table

θ ← θ − (step) · (search direction), repeat

Piece	In plain words
start θ	an initial guess
search direction	which way improves the objective (e.g. −gradient)
repeat	keep stepping until it stops improving

📖 Reads as: When you can't solve for the best answer directly, search for it step by step, improving each round.

💡 Simplified

Search for the best answer step by step, improving each iteration.

📘 Examples & References

Minimising a non-linear least-squares loss with Levenberg–Marquardt.

⚙️ Real-World Values & Applications

Training every neural network; calibrating physics models; portfolio optimisation; hyperparameter search.

Optimization

🧮 The Formula — read as a table

minimise (or maximise) f(θ) over parameters θ

Piece	In plain words
θ	the knobs the model can change
f(θ)	a single score measuring how good those settings are
min / max	push that score the best direction

📖 Reads as: A single 'goodness' score the whole search tries to make as good as possible.

💡 Simplified

The single score you’re trying to make as good as possible.

📘 Examples & References

Maximise reward in RL; minimise prediction error in supervised learning.

⚙️ Real-World Values & Applications

Defines what ‘good’ means — choosing it wrong (clicks not satisfaction) causes misaligned systems.

🧮 The Formula — read as a table

MSE = (1/n) Σ (y − ŷ)² ; CE = − Σ y log ŷ

Piece	In plain words
y	the true answer
ŷ	the model's prediction
(y−ŷ)² or y log ŷ	penalty that grows as the prediction gets worse
average	mean penalty over all examples

📖 Reads as: A number that's big when the model is wrong and small when it's right; training drives it down.

💡 Simplified

A number that’s big when the model is wrong and small when it’s right.

📘 Examples & References

Cross-entropy for classification; MSE for regression; contrastive loss for embeddings.
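
Both formulas can be written out in plain Python. This is a sketch under two assumptions that are not in the text: cross-entropy is computed for a single example with a one-hot target, and a small `eps` guards against log(0). The function names are illustrative.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of (y − ŷ)² over all examples."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def cross_entropy(y_true, y_pred, eps=1e-12):
    """CE = −Σ y log ŷ for one example; y is one-hot, ŷ are probabilities.

    eps is an assumption of this sketch, used to avoid log(0).
    """
    return -sum(y * math.log(max(p, eps)) for y, p in zip(y_true, y_pred))

print(mse([1.0, 2.0], [1.0, 4.0]))                # (0 + 4) / 2 = 2.0
print(cross_entropy([0, 1, 0], [0.1, 0.8, 0.1]))  # -log(0.8), about 0.223
```

In both cases a better prediction gives a smaller number: raising the probability of the correct class from 0.8 to 0.9 lowers the cross-entropy.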

⚙️ Real-World Values & Applications

Loss choice shapes behaviour: MSE punishes outliers hard; cross-entropy suits probabilities; focal loss for imbalance.

🧮 The Formula — read as a table

any local minimum = the global minimum

Piece	In plain words
convex	the surface is a single smooth valley (f″ ≥ 0)
local min	wherever you stop rolling downhill
= global min	that stop is guaranteed to be THE best point

📖 Reads as: On a convex (bowl-shaped) surface, any bottom you reach is the one true bottom.

💡 Simplified

A single smooth valley — wherever you roll downhill you reach the one true bottom.

📘 Examples & References

Linear/logistic regression and SVMs are convex.

⚙️ Real-World Values & Applications

Convex problems solve reliably and provably — used in finance, control, and as the well-behaved core of larger systems.

🧮 The Formula — read as a table

many local minima, saddles, plateaus — no global guarantee

Piece	In plain words
local minima	many 'good enough' valleys
saddle points	flat spots that aren't minima
no guarantee	you might not find the single deepest valley

📖 Reads as: A bumpy landscape of many dips; you may settle in a good valley, not necessarily the deepest.

💡 Simplified

A bumpy mountain range full of dips — you might settle in a ‘good enough’ valley.

📘 Examples & References

The loss surface of essentially every deep neural network is non-convex.

⚙️ Real-World Values & Applications

Deep learning works anyway: in high dimensions many local minima turn out to be nearly as good as the best one, and SGD noise helps escape bad spots.

🧮 The Formula — read as a table

θ ← θ − η · ∇Loss(whole dataset)

Piece	In plain words
∇Loss(all data)	average advice from every single example
η	the learning rate (step size)
θ ← θ − …	step downhill once per full pass

📖 Reads as: Look at every example, average their advice, then take one careful step downhill.

💡 Simplified

Look at every example, average the advice, then take one careful step. Accurate but slow.

📘 Examples & References

Used when datasets fit in memory and a smooth path matters.
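
The update rule can be sketched on a toy problem: fitting the slope w in y ≈ w·x with a mean-squared-error loss. The model, the data and the values of `lr` and `steps` are all assumptions chosen for illustration.

```python
def batch_gradient_descent(xs, ys, lr=0.05, steps=200):
    """Fit y ≈ w·x by stepping against the full-dataset MSE gradient."""
    w = 0.0  # initial guess for θ
    n = len(xs)
    for _ in range(steps):
        # d/dw of (1/n) Σ (w·x − y)² is (2/n) Σ (w·x − y)·x, over every example.
        grad = (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= lr * grad  # θ ← θ − η · ∇Loss
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x
print(batch_gradient_descent(xs, ys))  # converges towards 2.0
```

Every step touches all four data points, which is what makes this approach slow once the dataset has millions of examples.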

⚙️ Real-World Values & Applications

Rare for large data (one step = a full pass); used in classical ML and as a baseline.

🧮 The Formula — read as a table

θ ← θ − η · ∇Loss(one random example)

Piece	In plain words
one example	a single randomly-picked data point
∇Loss	its noisy gradient
step	update after every single example

📖 Reads as: Step after every single example — noisy and jittery, but fast and able to escape traps.

💡 Simplified

Take a step after every example — noisy but fast, and able to escape traps.

📘 Examples & References

Robbins–Monro stochastic approximation (1951).

⚙️ Real-World Values & Applications

The noise acts as a regulariser; foundation of modern training, usually replaced by mini-batches in practice.

🧮 The Formula — read as a table

θ ← θ − η · (1/B) Σ ∇Loss(batch of B)

Σ ∇Loss over B examples

B

Part	In plain words
▲ Top Σ ∇Loss over B examples	add up the gradients of a small handful of examples
▼ Bottom B	divide by the batch size to average them

📖 Reads as: Average the gradient over a small batch (e.g. 256), then step — stable enough and GPU-friendly.

💡 Simplified

Look at a small handful each step — stable enough, and perfect for GPU parallelism.

📘 Examples & References

Batch size 256 is a common default.
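
A toy sketch of the mini-batch update, fitting the slope w in y ≈ w·x. The data, the tiny batch size of 2, the learning rate and the fixed random seed are assumptions made so the example is small and repeatable.

```python
import random

def minibatch_sgd(xs, ys, lr=0.02, batch_size=2, epochs=200, seed=0):
    """Fit y ≈ w·x, averaging the gradient over a small random batch per step."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # new random batches every pass
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # (1/B) Σ gradient of (w·x − y)² over the B examples in the batch.
            grad = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x
print(minibatch_sgd(xs, ys))  # converges towards 2.0
```

Setting `batch_size` to 1 gives one-example SGD, and setting it to the dataset size gives full-batch gradient descent, so the same loop covers all three variants.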

⚙️ Real-World Values & Applications

The actual workhorse of deep learning; batch size trades gradient noise vs hardware throughput.

🧮 The Formula — read as a table

v ← β v + ∇Loss ; θ ← θ − η v (β ≈ 0.9)

Piece	In plain words
β v	keep 90% of last step's velocity
+ ∇Loss	add the new gradient
θ ← θ − η v	move along the built-up velocity

📖 Reads as: Build up speed downhill like a rolling ball, pushing through small bumps and flat spots.

💡 Simplified

Build up speed downhill like a rolling ball — push through bumps and flat spots.

📘 Examples & References

Polyak’s heavy-ball method.
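
The two-line update above translates almost word for word into code. A minimal sketch on a one-parameter toy loss; the loss, the starting point and the values of `lr` and `steps` are assumptions for illustration.

```python
def momentum_descent(grad, theta, lr=0.1, beta=0.9, steps=300):
    """Heavy-ball update: v ← β·v + ∇Loss, then θ ← θ − η·v."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad(theta)  # keep 90% of the old velocity, add new slope
        theta = theta - lr * v      # move along the built-up velocity
    return theta

# Loss(θ) = (θ − 3)², so ∇Loss = 2(θ − 3) and the minimum sits at θ = 3.
print(momentum_descent(lambda t: 2.0 * (t - 3.0), theta=0.0))  # approaches 3.0
```

With these settings the ball overshoots the minimum a few times before settling, which is the rolling behaviour the analogy describes.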

⚙️ Real-World Values & Applications

Speeds convergence in ravines; SGD+momentum often beats fancier optimisers for vision.

🧮 The Formula — read as a table

gradient measured at θ − η β v (look ahead)

Piece	In plain words
θ − ηβv	peek where momentum is about to carry you
gradient there	measure the slope at that future spot
correct early	adjust before overshooting

📖 Reads as: Peek where momentum is about to take you, measure the slope there, and correct early.

💡 Simplified

Peek where momentum is about to carry you and correct early — an anticipatory ball.

📘 Examples & References

Nesterov accelerated gradient (1983).

⚙️ Real-World Values & Applications

Often slightly faster and more stable than plain momentum; available as a flag in the major frameworks.

🧮 The Formula — read as a table

step ∝ 1 / √( Σ past gradients² )

learning rate

√(Σ g²)

Part	In plain words
▲ Top learning rate	the base step size
▼ Bottom √(Σ g²)	grows with accumulated gradient size → shrinks the step per parameter

📖 Reads as: Give rarely-updated parameters big steps and frequently-updated ones small steps.

💡 Simplified

Big steps for rare parameters, small steps for frequent ones.

📘 Examples & References

Great for sparse features (text, recommendations).

⚙️ Real-World Values & Applications

Strong early, but the ever-growing denominator eventually kills the step — which motivated RMSProp/Adam.

🧮 The Formula — read as a table

E[g²] ← γ E[g²] + (1−γ) g² ; step ∝ 1/√E[g²]

learning rate

√E[g²]

Part	In plain words
▲ Top learning rate	base step size
▼ Bottom √E[g²]	root of a DECAYING average of squared gradients — forgets old ones

📖 Reads as: Like AdaGrad but forgets old gradients, so the step size never dies out.

💡 Simplified

Like AdaGrad but forgets old gradients, so the step doesn’t die out.

📘 Examples & References

Proposed by Hinton in a Coursera lecture.
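
The difference from AdaGrad shows up clearly when both denominators are fed the same gradients. The sketch below tracks a single parameter; the constant gradient, `lr`, `gamma` and `eps` are illustrative assumptions, and `eps` is the usual small constant that avoids division by zero.

```python
import math

def adagrad_steps(grads, lr=0.1, eps=1e-8):
    """Step sizes when the denominator sums ALL past squared gradients."""
    total, steps = 0.0, []
    for g in grads:
        total += g * g
        steps.append(lr * g / (math.sqrt(total) + eps))
    return steps

def rmsprop_steps(grads, lr=0.1, gamma=0.9, eps=1e-8):
    """Step sizes when the denominator is a decaying average of g²."""
    avg, steps = 0.0, []
    for g in grads:
        avg = gamma * avg + (1 - gamma) * g * g  # E[g²] ← γ E[g²] + (1−γ) g²
        steps.append(lr * g / (math.sqrt(avg) + eps))
    return steps

grads = [1.0] * 100  # a constant gradient, for illustration
print(adagrad_steps(grads)[-1])  # about 0.01: shrinks like 1/sqrt(t)
print(rmsprop_steps(grads)[-1])  # about 0.1: settles near the base rate
```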

⚙️ Real-World Values & Applications

A common choice for RNNs and reinforcement learning; handles non-stationary objectives well.

🧮 The Formula — read as a table

θ ← θ − η · m̂ / ( √v̂ + ε )

m̂ (momentum)

√v̂ + ε

Part	In plain words
▲ Top m̂ (momentum)	a decaying average of gradients — the direction with inertia
▼ Bottom √v̂ + ε	a decaying average of squared gradients — adapts the step size; ε avoids ÷0

📖 Reads as: Combine momentum (direction) and RMSProp (adaptive size) into one robust update.

🎮 Try it — interactive demo

💡 Simplified

Adaptive step size AND momentum together — robust defaults that ‘just work’.

📘 Examples & References

Kingma & Ba 2014; defaults β₁=0.9, β₂=0.999, η=1e-3.
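
A minimal scalar sketch of one Adam update, using the defaults quoted above plus ε = 1e-8 (the value suggested in the paper). The bias-correction lines are what turn m and v into the m̂ and v̂ of the formula; the toy loss f(θ) = θ² is an assumption for illustration.

```python
import math

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * g          # decaying average of gradients
    v = beta2 * v + (1.0 - beta2) * g * g      # decaying average of squared gradients
    m_hat = m / (1.0 - beta1 ** t)             # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Toy loss f(theta) = theta**2, so the gradient is 2 * theta.
theta, m, v = 5.0, 0.0, 0.0
theta_after_one, m, v = adam_step(theta, 2.0 * theta, m, v, t=1)
print(theta - theta_after_one)
```

On the first step m̂ / √v̂ is g / |g|, so the move is about η in size whatever the gradient's scale; that scale-invariance is a large part of why the defaults transfer between problems.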

⚙️ Real-World Values & Applications

The most widely used optimiser in deep learning; trains transformers, GANs, and most published models.

🧮 The Formula — read as a table

Adam step + separate λθ weight shrink

Piece	In plain words
Adam step	the usual adaptive momentum update
+ λθ	pull every weight toward zero, applied directly (decoupled)
decoupled	the shrink is NOT folded into the gradient — so it works correctly

📖 Reads as: Adam, but the 'keep weights small' penalty is kept separate so it actually does its job.

🎮 Try it — interactive demo

💡 Simplified

Adam done right — keeps the weight-shrinking penalty separate so it works as intended.

📘 Examples & References

Loshchilov & Hutter 2017.
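
The difference between "folded into the gradient" and "decoupled" shows up already on the first step. The sketch below is a simplified scalar version under stated assumptions: only step t = 1 is modelled, the decay is scaled by the learning rate (as common implementations do), and `lr` and `wd` are illustrative values.

```python
import math

def first_step(theta, data_grad, lr=0.1, wd=0.01, decoupled=True,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """First update (t = 1) of Adam with weight decay, coupled or decoupled."""
    # Coupled (Adam + L2): the shrink term is folded into the gradient.
    g = data_grad if decoupled else data_grad + wd * theta
    m_hat = ((1.0 - beta1) * g) / (1.0 - beta1)        # bias-corrected, equals g at t = 1
    v_hat = ((1.0 - beta2) * g * g) / (1.0 - beta2)    # bias-corrected, equals g*g at t = 1
    theta_new = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # Decoupled (AdamW): shrink the weight directly, outside the adaptive scaling.
        theta_new -= lr * wd * theta
    return theta_new

# With no data gradient, only the weight shrink acts on theta = 2.0.
adamw_theta = first_step(2.0, 0.0, decoupled=True)
adam_l2_theta = first_step(2.0, 0.0, decoupled=False)
print(adamw_theta, adam_l2_theta)
```

The decoupled version shrinks θ by the intended η·λ·θ = 0.002. The coupled version divides the shrink by √v̂, so the weight moves by about η = 0.1 regardless of λ, which is the distortion the decoupling removes.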

⚙️ Real-World Values & Applications

The standard optimiser for training LLMs (BERT, GPT, LLaMA); it often generalises better than Adam with an L2 penalty folded into the gradient.

🧮 The Formula — read as a table

ηₜ = ½ η₀ ( 1 + cos(π t / T) ) (cosine decay)

Piece	In plain words
η₀	the starting learning rate
cos(π t/T)	smoothly falls from +1 to −1 over training
ηₜ	so the rate eases from η₀ down to 0

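📖 Reads as: Cosine decay starts at the full rate η₀ and eases smoothly down to 0, like easing off the gas near your destination; the 'start cautious, speed up' part comes from a separate warmup ramp placed before it.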

🎮 Try it — interactive demo

💡 Simplified

Start cautious, go fast, then slow down to settle.

📘 Examples & References

Warmup + cosine decay is standard for transformers.
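
A sketch of the combined schedule, assuming a linear warmup (the formula above covers only the cosine part). The function name and the values of `base_lr`, `warmup_steps` and `total_steps` are illustrative.

```python
import math

def lr_at(step, base_lr=1e-3, warmup_steps=100, total_steps=1000):
    """Linear warmup to base_lr, then cosine decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps          # ramp up from 0
    t = step - warmup_steps                           # steps into the decay phase
    T = total_steps - warmup_steps                    # length of the decay phase
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t / T))

for s in (0, 50, 100, 550, 1000):
    print(s, lr_at(s))
```

The rate is 0 at step 0, reaches η₀ when warmup ends, passes through η₀/2 halfway through the decay, and returns to 0 at the final step.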

⚙️ Real-World Values & Applications

Critical for LLM training stability; warmup prevents early divergence; cosine decay squeezes out final accuracy.

🧮 The Formula — read as a table

Loss + (λ/2) ‖θ‖²

(λ/2) ‖θ‖²

added to Loss

Part	In plain words
▲ Top (λ/2) ‖θ‖²	a penalty that grows with the size of the weights
▼ Bottom added to Loss	so training is pushed to keep weights small AND fit the data

📖 Reads as: Add a penalty for big weights so the model stays simple and generalises.

🎮 Try it — interactive demo

💡 Simplified

Gently pull every weight toward zero so the model stays simple.

📘 Examples & References

λ = 0.01 is a common value.
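
A short sketch of the penalty and its gradient, using λ = 0.01 from above; the weight values are made up. The gradient of (λ/2)‖θ‖² is simply λθ, which is why the penalty acts as a steady pull toward zero.

```python
def l2_penalty(weights, lam=0.01):
    """(lam / 2) * squared length of the weight vector."""
    return 0.5 * lam * sum(w * w for w in weights)

def l2_penalty_grad(weights, lam=0.01):
    """Gradient of the penalty: lam * w for each weight."""
    return [lam * w for w in weights]

weights = [3.0, -4.0]
print(l2_penalty(weights))        # 0.5 * 0.01 * 25
print(l2_penalty_grad(weights))
```

Larger weights are pulled harder, since the pull on each weight is proportional to its own size.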

⚙️ Real-World Values & Applications

Reduces overfitting in many trained models; its decoupled form, the weight decay in AdamW, is a main regulariser for LLMs.

🧮 The Formula — read as a table

stop when validation loss hasn't improved for p epochs

Piece	In plain words
validation loss	error on held-out data
patience p	how many stalled epochs to tolerate
stop	quit before the model starts memorising

📖 Reads as: Quit while you're ahead — stop once held-out error stops improving.

🎮 Try it — interactive demo

💡 Simplified

Quit while you’re ahead — stop before the model memorises the training set.

📘 Examples & References

Patience of 5–10 epochs is typical.
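
The stopping rule can be sketched as a small function over a list of per-epoch validation losses. The function name, the 0-based epoch numbering, and the example losses are assumptions for illustration.

```python
def early_stop_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops at the first epoch where the validation loss has
    failed to improve for `patience` epochs in a row.
    """
    best_loss = float("inf")
    best_epoch = -1
    stalled = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, stalled = loss, epoch, 0
        else:
            stalled += 1
            if stalled >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch  # never triggered: ran to the end

losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.65, 0.70]
stop_epoch, best_epoch = early_stop_epoch(losses, patience=3)
print(stop_epoch, best_epoch)
```

In practice the weights from `best_epoch` are the ones to keep, so a checkpoint is usually saved whenever the validation loss improves.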

⚙️ Real-World Values & Applications

Cheap, effective overfitting guard; saves compute; used everywhere from Kaggle to production.

🧮 The Formula — read as a table

search over { LR, depth, batch size, … } for best result

Piece	In plain words
hyperparameters	settings you choose, not learned by gradient
search	grid / random / model-based exploration
best result	the config with the best validation score

📖 Reads as: Tune the hand-set knobs (learning rate, depth, …) to find the best-performing model.

🎮 Try it — interactive demo

💡 Simplified

Tune the ‘knobs’ you set by hand to get the best model.

📘 Examples & References

Random search often beats grid search (Bergstra & Bengio 2012).
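
A minimal random-search sketch. The objective `validation_score` is a hypothetical stand-in for "train a model and return its validation score", and the search ranges are made up; the pattern of sampling the learning rate on a log scale is the part worth noting.

```python
import math
import random

def validation_score(lr, depth):
    """Stand-in objective (hypothetical): peaks at lr = 1e-2, depth = 6."""
    return -((math.log10(lr) + 2.0) ** 2) - 0.1 * (depth - 6) ** 2

def random_search(n_trials=50, seed=0):
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-5, -1)   # sample the learning rate on a log scale
        depth = rng.randint(2, 12)       # integer depth, inclusive bounds
        trials.append((validation_score(lr, depth), lr, depth))
    return max(trials), trials           # best trial = highest score

best, trials = random_search()
print(best)
```

Unlike a grid, every trial tries a fresh value of every knob, so the important knobs get many distinct values for the same budget.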

⚙️ Real-World Values & Applications

Optuna / Ray Tune in practice; can swing accuracy by points — the gap between mediocre and winning.

🧮 The Formula — read as a table

surrogate model + acquisition(next trial)

Piece	In plain words
surrogate	a cheap model guessing how each config performs
acquisition	picks the most promising config to try next
update	retrain the guess after each real trial

📖 Reads as: Learn a cheap guess of how settings perform, then test the most promising one next.

🎮 Try it — interactive demo

💡 Simplified

Learn a cheap guess of performance, then test the most promising setting next.

📘 Examples & References

GPyOpt, Optuna’s TPE, Google Vizier.

⚙️ Real-World Values & Applications

Tunes expensive models (each trial = hours of GPU) in few trials; used for AutoML and even chemistry/hardware design.

Probability & Statistics

Random Variables

🧮 The Formula — read as a table

X : outcome → number (PMF p(x) or PDF f(x))

Piece	In plain words
outcome	a result of a chance experiment
X	maps that outcome to a number
p(x) / f(x)	how likely each value is (discrete / continuous)

📖 Reads as: A number whose value depends on chance, with a rule for how likely each value is.

🎮 Try it — interactive demo

💡 Simplified

A number whose value depends on chance — a die roll, tomorrow’s temperature.

📘 Examples & References

X = sum of two dice, ranging 2–12.
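
The two-dice example can be worked out by enumerating all 36 equally likely outcomes and counting how many land on each sum; a short sketch using exact fractions:

```python
from collections import Counter
from fractions import Fraction

# Enumerate all 36 equally likely outcomes and map each to X = sum of the dice.
counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))
pmf = {x: Fraction(n, 36) for x, n in sorted(counts.items())}

for x, p in pmf.items():
    print(x, p)
```

The most likely value is 7 (six of the 36 outcomes), and the probabilities add up to exactly 1.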

⚙️ Real-World Values & Applications

Model inputs, labels, and predictions as random variables; uncertainty estimates; sampling in generative models.

🧮 The Formula — read as a table

P(X=1) = p , P(X=0) = 1 − p

Piece	In plain words
p	chance of a '1' (success)
1 − p	chance of a '0' (failure)
mean = p	the long-run fraction of 1s

📖 Reads as: One yes/no trial with success probability p.

🎮 Try it — interactive demo

💡 Simplified

One coin flip with a possibly-biased coin.

📘 Examples & References

p = 0.5 for a fair coin.

⚙️ Real-World Values & Applications

Binary classification output (spam/not); each pixel of a binarised image; click/no-click modelling.

🧮 The Formula — read as a table

P(X=k) = C(n,k) · pᵏ · (1−p)ⁿ⁻ᵏ

Piece	In plain words
C(n,k)	number of ways to pick which k trials succeed
pᵏ	probability those k succeed
(1−p)ⁿ⁻ᵏ	probability the other n−k fail
mean = np	expected number of successes

📖 Reads as: The chance of getting exactly k successes in n independent yes/no trials.

🎮 Try it — interactive demo

💡 Simplified

How many heads in n coin flips.

📘 Examples & References

10 flips, p=0.5: expect 5 heads.
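
A direct transcription of the formula, checked on the 10-flip example (requires Python 3.8+ for `math.comb`):

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

n, p = 10, 0.5
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))
print(pmf[5], mean)
```

Exactly 5 heads is the single most likely outcome, yet it happens only about a quarter of the time (252/1024); the mean comes out as np = 5.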

⚙️ Real-World Values & Applications

Conversion counts in A/B tests; defect counts in QA; correct predictions out of n.

🧮 The Formula — read as a table

P(X=k) = ( λᵏ · e^(−λ) ) / k!

λᵏ · e^(−λ)

k!

Part	In plain words
▲ Top λᵏ · e^(−λ)	weight for seeing k events when the average rate is λ
▼ Bottom k!	divide by k-factorial (the count's arrangements)

📖 Reads as: The chance of k rare events in a fixed window when the average count is λ.

🎮 Try it — interactive demo

💡 Simplified

How many rare things happen in a window — calls per hour, typos per page.

📘 Examples & References

λ = 3 emails/hour.
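
For the λ = 3 emails/hour example, the formula gives the chance of each count directly:

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) when the average count per window is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)

lam = 3.0
for k in range(7):
    print(k, round(poisson_pmf(k, lam), 4))
```

An hour with no email at all has probability e^(−3), about 5%, and an hour with exactly 3 emails about 22%.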

⚙️ Real-World Values & Applications

Arrival rates (web traffic, network packets); count data in NLP; queueing systems.

🧮 The Formula — read as a table

f(x) = 1 / (b − a) on [a, b]

1

b − a

Part	In plain words
▲ Top 1	equal weight for every value
▼ Bottom b − a	spread that weight evenly across the whole range

📖 Reads as: Every value in the range is equally likely — total fairness.

🎮 Try it — interactive demo

💡 Simplified

Total fairness — every value in the range is just as probable.

📘 Examples & References

A random float in [0, 1).

⚙️ Real-World Values & Applications

Random weight init, dropout masks, random seeds, Monte-Carlo sampling.

🧮 The Formula — read as a table

f(x) = ( 1 / √(2πσ²) ) · e^( −(x−μ)² / (2σ²) )

e^( −(x−μ)² / (2σ²) )

√(2πσ²)

Part	In plain words
▲ Top e^( −(x−μ)² / (2σ²) )	peaks at the mean μ and falls off as you move away (curve shape)
▼ Bottom √(2πσ²)	a constant that makes the whole area equal 1

📖 Reads as: The bell curve: most values cluster near the mean μ, fewer appear as you go out by multiples of σ.

🎮 Try it — interactive demo

💡 Simplified

The classic bell curve — most values near the average, few far out.

📘 Examples & References

Heights, measurement noise; 68% fall within ±1σ.
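
The 68% figure can be checked with the error function from the standard library. A short sketch; the function names are illustrative:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi * sigma ** 2)

def prob_within(k):
    """P(|X - mu| < k * sigma) for any normal distribution."""
    return math.erf(k / math.sqrt(2.0))

print(normal_pdf(0.0))
print(prob_within(1), prob_within(2))
```

The same function gives about 95.4% within ±2σ, the other half of the familiar rule of thumb.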

⚙️ Real-World Values & Applications

Weight init, noise models, VAE priors, diffusion noise, Gaussian processes — the most important distribution in ML.

🧮 The Formula — read as a table

mean of many i.i.d. variables → Gaussian

Piece	In plain words
i.i.d. variables	many independent samples from the same source
their average	sum them and divide by n
→ Gaussian	that average looks bell-shaped, whatever the source (as long as its variance is finite)

📖 Reads as: Average enough independent random things and the result looks like a bell curve, whatever the source, provided its variance is finite.

🎮 Try it — interactive demo

💡 Simplified

Average enough random things and the result looks like a bell curve.

📘 Examples & References

The average of 30+ dice rolls is approximately normal.
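
A quick simulation of the dice example. A single roll is flat (uniform on 1 to 6), but the average of 30 rolls should cluster around 3.5 with spread √(35/12 ÷ 30) ≈ 0.31. The seed and the number of repetitions are arbitrary choices.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

def mean_of_rolls(n=30):
    """Average of n fair die rolls."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

means = [mean_of_rolls() for _ in range(10_000)]
print(statistics.mean(means))
print(statistics.stdev(means))
```

Plotting a histogram of `means` would show the bell shape; the two printed numbers confirm its centre and width.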

⚙️ Real-World Values & Applications

Why Gaussian assumptions work so often; the basis of confidence intervals and many statistical tests.

🧮 The Formula — read as a table

E[X] = Σ x · p(x) ( ∫ x f(x) dx )

Piece	In plain words
x	each possible value
p(x)	its probability (the weight)
Σ x·p(x)	probability-weighted average → the long-run mean

📖 Reads as: The probability-weighted average — what you'd get on average over endless repeats.

🎮 Try it — interactive demo

💡 Simplified

The average value you’d get if you repeated the experiment forever.

📘 Examples & References

Fair die: E[X] = 3.5.
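
The fair-die value follows from the formula by weighting each face by 1/6; a sketch with exact fractions:

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}      # fair die
expected = sum(x * p for x, p in pmf.items())       # sum of x * p(x)
print(expected, float(expected))
```

The result, 7/2, is not a value the die can actually show: the expectation is a long-run average, not a typical outcome.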

⚙️ Real-World Values & Applications

Expected loss is what training minimises; expected reward in RL; expected value drives every risk/decision calc.

🧮 The Formula — read as a table

Var(X) = E[ (X − μ)² ]

(X − μ)²

averaged

Part	In plain words
▲ Top (X − μ)²	squared distance of each value from the mean
▼ Bottom averaged	take the expected (mean) of those squared distances

📖 Reads as: The average squared distance from the mean — how spread out the values are.

🎮 Try it — interactive demo

💡 Simplified

How spread out the values are around the average — small = consistent, large = erratic.

📘 Examples & References

Fair die variance ≈ 2.92.
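
The 2.92 figure is 35/12, obtained by averaging the squared distances from the mean 3.5:

```python
from fractions import Fraction

pmf = {x: Fraction(1, 6) for x in range(1, 7)}                 # fair die
mu = sum(x * p for x, p in pmf.items())                        # mean, 7/2
variance = sum((x - mu) ** 2 * p for x, p in pmf.items())      # E[(X - mu)^2]
print(variance, float(variance))
```

Taking the square root gives the standard deviation, about 1.71, which is back in the same units as the die faces.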

⚙️ Real-World Values & Applications

Bias–variance tradeoff governs over/underfitting; gradient variance affects stability; risk in finance.

🧮 The Formula — read as a table

Cov(X,Y) = E[ (X − μₓ)(Y − μᵧ) ]

Piece	In plain words
(X−μₓ)	how far X is from its mean
(Y−μᵧ)	how far Y is from its mean
their product, averaged	positive if they move together, negative if oppositely

📖 Reads as: Average the product of each pair's deviations — positive means they rise and fall together.

🎮 Try it — interactive demo

💡 Simplified

Do two quantities tend to rise and fall together (positive) or oppositely (negative)?

📘 Examples & References

Height and weight have positive covariance.

⚙️ Real-World Values & Applications

The covariance matrix drives PCA; portfolio risk; feature decorrelation and whitening.

🧮 The Formula — read as a table

ρ = Cov(X,Y) / ( σₓ · σᵧ )

Cov(X,Y)

σₓ · σᵧ

Part	In plain words
▲ Top Cov(X,Y)	how much X and Y move together
▼ Bottom σₓ · σᵧ	divide by their spreads → a clean score from −1 to +1

📖 Reads as: Covariance scaled by the two spreads, giving a tidy −1…+1 strength of linear relationship.

🎮 Try it — interactive demo

💡 Simplified

Covariance normalised to −1…+1; ±1 = perfect line, 0 = no linear link.

📘 Examples & References

ρ = 0.9 is a strong positive relationship.
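
A sketch computing covariance and then ρ from paired data. The height and weight numbers are made up, and the population (divide by n) form of covariance is assumed.

```python
import math

def covariance(xs, ys):
    """Population covariance: average product of deviations."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

heights = [150.0, 160.0, 170.0, 180.0, 190.0]   # made-up data
weights = [52.0, 58.0, 69.0, 77.0, 90.0]
print(covariance(heights, weights), correlation(heights, weights))
```

The covariance (190) depends on the units; the correlation (about 0.99) does not, which is why ρ is the number usually reported.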

⚙️ Real-World Values & Applications

Feature selection, detecting redundancy/leakage, exploratory analysis — but correlation ≠ causation.

🧮 The Formula — read as a table

draw x ~ distribution (inverse-CDF, rejection, MCMC)

Piece	In plain words
distribution	the probability pattern to draw from
x ~	generate values that follow that pattern
methods	inverse-CDF, rejection, Markov-chain Monte Carlo

📖 Reads as: Generate example values that follow a chosen probability pattern.

🎮 Try it — interactive demo

💡 Simplified

Generating example values that follow a chosen probability pattern.

📘 Examples & References

Sampling from N(0,1) with np.random.randn.
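
As an illustration of the inverse-CDF method, the sketch below turns uniform random numbers into exponential ones. The choice of the exponential distribution (whose CDF is easy to invert) and the rate 2.0 are assumptions for the example.

```python
import math
import random

def sample_exponential(rate, rng):
    """Inverse-CDF sampling: solve u = 1 - exp(-rate * x) for x."""
    u = rng.random()                  # uniform in [0, 1)
    return -math.log(1.0 - u) / rate  # 1 - u is in (0, 1], so the log is finite

rng = random.Random(0)
samples = [sample_exponential(2.0, rng) for _ in range(100_000)]
print(sum(samples) / len(samples))
```

The sample mean should land near 1/rate = 0.5. Rejection sampling and MCMC are the fallbacks when the CDF cannot be inverted in closed form.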

⚙️ Real-World Values & Applications

Bootstrapping for uncertainty; mini-batch selection; generating text/images from a model’s distribution.

🧮 The Formula — read as a table

x̄ ± z · ( σ / √n )

σ

√n

Part	In plain words
▲ Top σ	the spread of the data
▼ Bottom √n	grows with more samples → shrinks the margin, giving a tighter interval

📖 Reads as: The sample mean plus/minus a margin that shrinks as you collect more data.

🎮 Try it — interactive demo

💡 Simplified

A plausible range for the true value, with a stated level of trust.

📘 Examples & References

A 95% CI uses z ≈ 1.96.
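
A sketch of the interval for a known (or well-estimated) σ. The sample mean, σ and sample sizes are made-up numbers chosen to show how the margin shrinks.

```python
import math

def mean_confidence_interval(sample_mean, sigma, n, z=1.96):
    """z-interval for the mean when sigma is known (or n is large)."""
    margin = z * sigma / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

print(mean_confidence_interval(50.0, 10.0, 100))
print(mean_confidence_interval(50.0, 10.0, 400))
```

Quadrupling n from 100 to 400 only halves the margin (1.96 to 0.98), because the margin shrinks with √n rather than n.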

⚙️ Real-World Values & Applications

Reporting metric uncertainty (accuracy ± CI); A/B-test result ranges; scientific reproducibility.

🧮 The Formula — read as a table

compare H₀ (no effect) vs H₁ (effect)

Piece	In plain words
H₀	the boring default: nothing is happening
H₁	the claim: there's a real effect
test statistic	reject H₀ if the evidence is strong enough

📖 Reads as: A formal way to decide whether an observed effect is real or just luck.

🎮 Try it — interactive demo

💡 Simplified

A formal way to decide whether an observed effect is real or just luck.

📘 Examples & References

t-test, chi-square test, z-test.

⚙️ Real-World Values & Applications

Deciding if model B truly beats model A; if a feature matters; if an A/B variant won.

🧮 The Formula — read as a table

p = P( data this extreme | H₀ is true )

Piece	In plain words
H₀ true	assume nothing is really going on
data this extreme	chance of seeing a result at least as surprising as yours
small p	such data would be rare under H₀ → evidence against H₀ (not the probability that H₀ is true)

📖 Reads as: The chance of seeing a result at least as extreme as yours if nothing were really going on; a small value is evidence of a real effect, not proof.

🎮 Try it — interactive demo

💡 Simplified

The chance of seeing a result at least this extreme if nothing were going on; small means evidence for a real effect, not certainty.

📘 Examples & References

p = 0.03 < 0.05 → statistically significant.
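
For a test statistic that is standard normal under H₀ (a z-test, assumed here for simplicity), the two-sided p-value is the area in both tails beyond the observed value:

```python
import math

def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z, i.e. under H0."""
    return math.erfc(abs(z) / math.sqrt(2.0))

for z in (1.0, 1.96, 3.0):
    print(z, round(two_sided_p_value(z), 4))
```

The value z = 1.96 sits right at the conventional 0.05 threshold, which is the same 1.96 that appears in a 95% confidence interval.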

⚙️ Real-World Values & Applications

Gatekeeper for A/B-test decisions; widely misused (p-hacking) — significance ≠ practical importance.

🧮 The Formula — read as a table

randomly split users → A vs B → compare metric

Piece	In plain words
random split	half see A, half see B — fair comparison
metric	the number you care about (conversions, revenue)
compare	test if B's lift is significant, not luck

📖 Reads as: Show A to half your users and B to the other half, then measure which performs better — fairly.

🎮 Try it — interactive demo

💡 Simplified

Show version A to half your users and B to the other half, then measure which wins.

📘 Examples & References

Testing a new recommendation model against the current one.
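
One common way to run the "compare" step for conversion rates is a two-proportion z-test; the sketch below assumes that test, and the conversion counts are made up.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)           # rate under H0: no difference
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# A: 200 of 2000 users converted (10%); B: 250 of 2000 (12.5%).
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)
print(z, p_value)
```

Here z is about 2.5 and the p-value is below 0.05, so B's lift would count as significant, provided the sample size was fixed in advance rather than chosen by peeking.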

⚙️ Real-World Values & Applications

How tech companies ship product and ML changes; needs proper sample size, randomisation, and no peeking.

🧮 The Formula — read as a table

θ̂ = argmax Π p(xᵢ | θ) (max log-likelihood)

Piece	In plain words
p(xᵢ | θ)	how probable each data point is, given settings θ
Π (product)	multiply across all data points
argmax	pick the θ that makes the observed data most probable

📖 Reads as: Choose the parameters that make the data you actually observed as likely as possible.

🎮 Try it — interactive demo

💡 Simplified

Pick the parameters that make the data you actually observed most probable.

📘 Examples & References

MLE of a Gaussian’s mean is the sample average.
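
That claim can be checked numerically: scan candidate means, score each by its log-likelihood, and compare the winner with the sample average. The data, the fixed σ = 1, and the grid are illustrative assumptions.

```python
def log_likelihood(mu, data, sigma=1.0):
    """Gaussian log-likelihood up to a constant that does not depend on mu."""
    return -sum((x - mu) ** 2 for x in data) / (2.0 * sigma ** 2)

data = [2.0, 3.0, 4.0, 7.0]
grid = [i / 100 for i in range(0, 801)]                 # candidate means 0.00 .. 8.00
mu_hat = max(grid, key=lambda mu: log_likelihood(mu, data))
sample_mean = sum(data) / len(data)
print(mu_hat, sample_mean)
```

Working with the log turns the product Π into a sum, which is both easier to differentiate and far less prone to numerical underflow.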

⚙️ Real-World Values & Applications

Training many models IS maximum likelihood — minimising cross-entropy = maximising label likelihood.

🧮 The Formula — read as a table

θ̂ = argmax p(x | θ) · p(θ)

Piece	In plain words
p(x | θ)	likelihood: how well θ explains the data (as in MLE)
p(θ)	prior: your belief about reasonable θ before seeing data
argmax of product	balance the two → like MLE plus regularisation

📖 Reads as: Like MLE, but also weigh in prior beliefs about which parameters are reasonable.

🎮 Try it — interactive demo

💡 Simplified

Like MLE, but you also bring in prior beliefs about reasonable parameter values.

📘 Examples & References

A Gaussian prior on weights ⟺ L2 regularisation.
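
A sketch of that equivalence for the simplest case, estimating a Gaussian mean μ with a zero-mean prior μ ~ N(0, τ²). Minimising the negative log-posterior Σ(xᵢ − μ)²/(2σ²) + μ²/(2τ²) gives a closed form in which λ = σ²/τ² plays the role of the L2 strength. The data and the values σ = τ = 1 are assumptions.

```python
def map_mean(data, sigma=1.0, tau=1.0):
    """MAP estimate of a Gaussian mean with prior mu ~ N(0, tau^2)."""
    lam = sigma ** 2 / tau ** 2          # plays the role of the L2 strength
    return sum(data) / (len(data) + lam)

data = [2.0, 3.0, 4.0, 7.0]
print(sum(data) / len(data))   # MLE: the sample mean
print(map_mean(data))          # MAP: pulled toward the prior mean 0
```

The estimate moves from 4.0 to 3.2. A tighter prior (smaller τ) means a larger λ and a stronger pull, while more data makes the prior matter less.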

⚙️ Real-World Values & Applications

Explains why weight decay works; injects domain knowledge; stabilises estimates with little data.

🧮 The Formula — read as a table

p(θ | x) = ( p(x | θ) · p(θ) ) / p(x)

p(x | θ) · p(θ)

p(x)

Part	In plain words
▲ Top p(x | θ) · p(θ)	likelihood × prior — evidence combined with belief
▼ Bottom p(x)	a normaliser so the posterior sums/integrates to 1

📖 Reads as: Update beliefs with data: posterior ∝ likelihood × prior; keep a full distribution, not one number.

🎮 Try it — interactive demo

💡 Simplified

Start with a belief, see evidence, revise it — and keep a full distribution, not just one number.

📘 Examples & References

Bayesian A/B testing; Bayesian neural networks.
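
A minimal sketch of a Bayesian update for a conversion rate, assuming a Beta prior with yes/no data (a conjugate pair, so the posterior is again a Beta and the normaliser p(x) never has to be computed explicitly). The uniform prior and the 7-of-10 data are made up.

```python
def beta_binomial_update(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing Bernoulli data."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (1.0, 1.0)                       # uniform prior over the conversion rate
posterior = beta_binomial_update(*prior, successes=7, failures=3)
print(posterior, beta_mean(*posterior))
```

The posterior Beta(8, 4) has mean 2/3, slightly below the raw 7/10 because the prior still counts for two pseudo-observations, and it remains a full distribution from which credible intervals can be read.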

⚙️ Real-World Values & Applications

Uncertainty-aware predictions (medicine, autonomy); Bayesian optimisation for tuning; naive-Bayes spam filters.