Kader Mohideen

Topics

  • Linear Algebra
  • Calculus
  • Optimization
  • Probability & Statistics

ML Simplified

The math behind machine learning, explained four ways — every formula read as a plain-language table, each with an interactive demo. A static OneNote-style reference.
Note: 📓 How to read this notebook

Pick a subject tab, then a topic tab. Each topic has four panes — and the formula is shown as a plain-language table (fractions become top-over-bottom). Every Mathematical pane also has a 🎮 live interactive demo you can drag and play with.

Linear Algebra

  • Vectors
  • Matrices
  • Core Concepts
  • Vector Addition
  • Vector Subtraction
  • Dot Product
  • Cross Product
  • L1 Norm
  • L2 Norm
  • Projection
  • Orthogonality
🧮 The Formula — read as a table
a + b = [ a₁+b₁ , a₂+b₂ , … , aₙ+bₙ ]
Piece In plain words
a , b the two arrows (lists of numbers) you're combining
aᵢ + bᵢ add the matching slots, one position at a time
result a new arrow — the tip-to-tail diagonal of the two
📖 Reads as: Add two arrows by adding their matching numbers; the answer is where you land if you walk the first arrow then the second.

🎮 Try it — interactive demo

💡 Simplified

Stack two arrows tip-to-tail; where you end up is the sum. Order doesn’t matter.

📘 Examples & References

[1,2] + [3,1] = [4,3]. Reference: 3Blue1Brown, Essence of Linear Algebra Ch.1.
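
A minimal Python sketch of the slot-by-slot rule, using plain lists as vectors. The helper name `vec_add` is made up for illustration and does not come from any library.

```python
def vec_add(a, b):
    """Add two equal-length vectors one matching slot at a time."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [x + y for x, y in zip(a, b)]


print(vec_add([1, 2], [3, 1]))  # [4, 3]
print(vec_add([3, 1], [1, 2]))  # [4, 3], order doesn't matter
```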

⚙️ Real-World Values & Applications

Summing gradient contributions across a batch; moving an embedding by a ‘direction’ (king − man + woman ≈ queen).

🧮 The Formula — read as a table
a − b = [ a₁−b₁ , a₂−b₂ , … , aₙ−bₙ ]
Piece In plain words
a − b subtract matching slots
result the arrow pointing FROM b TO a
📖 Reads as: Subtract matching numbers; the answer is the arrow that takes you from b to a — 'where you are minus where you were'.

🎮 Try it — interactive demo

💡 Simplified

The arrow that gets you from one point to another.

📘 Examples & References

[4,3] − [3,1] = [1,2].

⚙️ Real-World Values & Applications

Error vectors (prediction − target); displacement between two GPS points; embedding differences that capture relationships.

🧮 The Formula — read as a table
a · b = a₁b₁ + a₂b₂ + … + aₙbₙ = ‖a‖ ‖b‖ cos θ
Piece In plain words
a₁b₁ + a₂b₂ + … multiply matching slots, then add them all up
‖a‖ ‖b‖ the two arrows' lengths multiplied
cos θ how aligned they are: 1 = same way, 0 = perpendicular, −1 = opposite
result one number: how much the two arrows point the same way
📖 Reads as: Multiply-and-sum the matching numbers; for arrows of a fixed length, the bigger the result, the more the two arrows agree in direction.

🎮 Try it — interactive demo

💡 Simplified

A single number measuring how much two vectors point the same way. Big = aligned, 0 = unrelated, negative = opposite.

📘 Examples & References

[1,2]·[3,1] = 1·3 + 2·1 = 5. Cosine similarity is the normalised dot product.
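
A small Python sketch of the multiply-and-sum rule and of cosine similarity as the normalised dot product. The function names `dot` and `cosine_similarity` are illustrative choices, not a library API.

```python
import math


def dot(a, b):
    """Multiply matching slots, then add everything up."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by both lengths, so only direction matters."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


print(dot([1, 2], [3, 1]))                # 5
print(cosine_similarity([1, 2], [3, 1]))  # about 0.7071
print(cosine_similarity([1, 0], [0, 1]))  # 0.0, perpendicular arrows
```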

⚙️ Real-World Values & Applications

Every neuron computes a dot product of weights and inputs; attention scores are query·key dot products; recommender similarity.

🧮 The Formula — read as a table
a × b ⟂ both, ‖a × b‖ = ‖a‖ ‖b‖ sin θ
Piece In plain words
a × b a NEW arrow at right angles to both inputs (3D only)
‖a‖ ‖b‖ sin θ its length = area of the parallelogram a and b form
direction perpendicular to the plane of a and b (right-hand rule)
📖 Reads as: Combine two 3D arrows to get a third arrow that sticks straight out of the surface they span.

🎮 Try it — interactive demo

💡 Simplified

Multiply two 3D arrows to get a third arrow at right angles to both — the ‘normal’ to their plane.

📘 Examples & References

[1,0,0] × [0,1,0] = [0,0,1].
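
A sketch of the 3D cross product written out component by component. The helper names are hypothetical; the second check confirms that the result is at right angles to both inputs, using a dot product of zero as the test.

```python
def cross(a, b):
    """Cross product of two 3D vectors given as length-3 lists."""
    ax, ay, az = a
    bx, by, bz = b
    return [ay * bz - az * by,
            az * bx - ax * bz,
            ax * by - ay * bx]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


print(cross([1, 0, 0], [0, 1, 0]))  # [0, 0, 1]

a, b = [1, 2, 3], [4, 5, 6]
n = cross(a, b)
print(n, dot(n, a), dot(n, b))      # [-3, 6, -3] 0 0
```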

⚙️ Real-World Values & Applications

Surface normals in 3D graphics and robotics; torque and angular momentum in physics engines; camera orientation.

🧮 The Formula — read as a table
‖x‖₁ = |x₁| + |x₂| + … + |xₙ|
Piece In plain words
|xᵢ| drop the minus signs — take each number's size
sum add up all those sizes
result total 'city-block' distance
📖 Reads as: Add up the absolute sizes of every component — the distance if you can only travel along a grid.

🎮 Try it — interactive demo

💡 Simplified

Total distance if you can only move along grid streets, never diagonally.

📘 Examples & References

‖[3,−4]‖₁ = 3 + 4 = 7.

⚙️ Real-World Values & Applications

L1 (Lasso) regularisation pushes weights to exactly zero → automatic feature selection; robust error metric (MAE).

🧮 The Formula — read as a table
‖x‖₂ = √( x₁² + x₂² + … + xₙ² )
Piece In plain words
x₁² + x₂² + … + xₙ² square each number, then add them up
√( … ) take the square root of that total; this is just the Pythagorean theorem in n dimensions
📖 Reads as: Square every component, add them, square-root the total — the straight-line length of the arrow.

🎮 Try it — interactive demo

💡 Simplified

The ‘as-the-crow-flies’ length of the arrow.

📘 Examples & References

‖[3,−4]‖₂ = √(9+16) = √25 = 5.
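
A short Python sketch comparing the city-block (L1) and straight-line (L2) lengths of the same arrow. The names `l1_norm` and `l2_norm` are illustrative only.

```python
import math


def l1_norm(x):
    """Sum of absolute sizes: grid-street distance."""
    return sum(abs(v) for v in x)


def l2_norm(x):
    """Square, add, square-root: as-the-crow-flies length."""
    return math.sqrt(sum(v * v for v in x))


print(l1_norm([3, -4]))  # 7
print(l2_norm([3, -4]))  # 5.0
```

For the same arrow the L1 value is never smaller than the L2 value, because the grid route cannot be shorter than the diagonal.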

⚙️ Real-World Values & Applications

L2 / weight decay keeps weights small and smooth; normalising embeddings to unit length; gradient clipping by norm.

🧮 The Formula — read as a table
proj_b a = ( (a · b) / ‖b‖² ) b
Part In plain words
▲ Top (a · b): how much a points along b (their dot product)
▼ Bottom (‖b‖²): b's length squared; divides out b's size so it's a fair ratio
📖 Reads as: Divide 'how much a leans toward b' by b's squared length, then scale b by that ratio; the result is the shadow a casts on b.

🎮 Try it — interactive demo

💡 Simplified

The shadow one arrow casts on another when light shines straight down on it.

📘 Examples & References

Projecting [2,2] onto the x-axis [1,0] gives [2,0].
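
A minimal sketch of the projection formula in Python. The helper name `project` is hypothetical; the last lines check that what is left over after removing the shadow is perpendicular to b.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def project(a, b):
    """Shadow of a on b: scale b by (a . b) / (b . b)."""
    scale = dot(a, b) / dot(b, b)
    return [scale * x for x in b]


shadow = project([2, 2], [1, 0])
print(shadow)  # [2.0, 0.0]

# The leftover part of a has no component along b.
leftover = [p - q for p, q in zip([2, 2], shadow)]
print(dot(leftover, [1, 0]))  # 0.0
```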

⚙️ Real-World Values & Applications

PCA projects data onto principal directions; least-squares regression projects targets onto the feature space.

🧮 The Formula — read as a table
a ⟂ b ⟺ a · b = 0
Piece In plain words
a · b = 0 their dot product is exactly zero
⟂ the arrows meet at 90° — perpendicular
meaning they share no common direction at all
📖 Reads as: If the dot product is zero, the two arrows are at right angles and completely unrelated in direction.

🎮 Try it — interactive demo

💡 Simplified

Two directions that share no overlap — knowing one tells you nothing about the other.

📘 Examples & References

[1,0]·[0,1] = 0 → the axes are orthogonal.

⚙️ Real-World Values & Applications

Orthogonal weight init prevents signal blow-up; orthogonal bases (Fourier, wavelets) for compression; decorrelated features.

  • Matrix Multiplication
  • Transpose
  • Inverse
  • Determinant
  • Rank
  • Trace
🧮 The Formula — read as a table
(AB)ᵢⱼ = Aᵢ₁B₁ⱼ + Aᵢ₂B₂ⱼ + … + AᵢₙBₙⱼ
Piece In plain words
row i of A one horizontal strip from the first matrix
column j of B one vertical strip from the second matrix
dot product multiply them slot-by-slot and sum → cell (i,j) of the answer
rule inner sizes must match: (m×n)(n×p) = (m×p)
📖 Reads as: Each output cell is the dot product of a row from A and a column from B; it chains two transformations into one.

🎮 Try it — interactive demo

💡 Simplified

Every output cell is a row·column dot product. It glues two linear transformations together.

📘 Examples & References

[[1,2],[3,4]] · [1,1]ᵀ = [3,7]ᵀ.
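
A plain-Python sketch of the row-times-column rule, with matrices as lists of rows. `matmul` is an illustrative name; real code would normally call a numerical library instead of looping by hand.

```python
def matmul(A, B):
    """Cell (i, j) of the result is row i of A dotted with column j of B."""
    inner = len(B)
    if any(len(row) != inner for row in A):
        raise ValueError("inner sizes must match: (m x n)(n x p)")
    cols = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(len(A))]


# The column vector [1, 1]^T is written as a 2 x 1 matrix.
print(matmul([[1, 2], [3, 4]], [[1], [1]]))  # [[3], [7]]
```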

⚙️ Real-World Values & Applications

The heart of deep learning: every dense layer computes Wx+b, and the bulk of training compute is matrix multiplication, which is the workload GPUs accelerate well.

🧮 The Formula — read as a table
(Aᵀ)ᵢⱼ = Aⱼᵢ ; (AB)ᵀ = Bᵀ Aᵀ
Piece In plain words
Aⱼᵢ swap the row and column index of every entry
effect rows become columns, columns become rows (flip across the diagonal)
📖 Reads as: Flip the matrix across its diagonal so rows turn into columns.

🎮 Try it — interactive demo

💡 Simplified

Turn rows into columns and columns into rows.

📘 Examples & References

[[1,2],[3,4]]ᵀ = [[1,3],[2,4]].

⚙️ Real-World Values & Applications

Backprop multiplies by Wᵀ to push gradients backward; Gram matrices XᵀX for covariance and style transfer.

🧮 The Formula — read as a table
A⁻¹ A = I (exists if det A ≠ 0)
Piece In plain words
A⁻¹ the 'undo' matrix
A⁻¹A = I applying it after A returns you to where you started (identity)
det A ≠ 0 only invertible if the matrix doesn't squash space flat
📖 Reads as: The inverse is the matrix that undoes A; it only exists when A doesn't collapse any dimension.

🎮 Try it — interactive demo

💡 Simplified

The ‘undo’ matrix — apply it to reverse what A did.

📘 Examples & References

[[2,0],[0,2]]⁻¹ = [[0.5,0],[0,0.5]].

⚙️ Real-World Values & Applications

Closed-form regression β̂ = (XᵀX)⁻¹Xᵀy; Kalman filters; in practice we solve systems rather than invert (faster, stabler).

🧮 The Formula — read as a table
det [[a,b],[c,d]] = a·d − b·c
Piece In plain words
a·d product of the main diagonal
b·c product of the off-diagonal
difference the signed area/volume scaling factor; 0 = space squashed flat
📖 Reads as: Multiply the diagonals and subtract; the result tells you how much the matrix stretches area (zero = it flattens space).

🎮 Try it — interactive demo

💡 Simplified

How much the transformation stretches or shrinks space. Zero means it squashes everything flat.

📘 Examples & References

det [[1,2],[3,4]] = 1·4 − 2·3 = −2.
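
A sketch for the 2×2 case only, tying the determinant to invertibility: the inverse divides by the determinant, so a zero determinant means there is nothing to undo with. The names `det2` and `inverse2` are made up for this example.

```python
def det2(M):
    """Determinant of a 2 x 2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = M
    return a * d - b * c


def inverse2(M):
    """Inverse of a 2 x 2 matrix; fails when the matrix squashes space flat."""
    (a, b), (c, d) = M
    det = det2(M)
    if det == 0:
        raise ValueError("matrix is singular (det = 0), no inverse exists")
    return [[d / det, -b / det],
            [-c / det, a / det]]


print(det2([[1, 2], [3, 4]]))      # -2
print(inverse2([[2, 0], [0, 2]]))  # [[0.5, 0.0], [0.0, 0.5]]
```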

⚙️ Real-World Values & Applications

Normalising flows use log|det J|; checking invertibility; computing Gaussian densities.

🧮 The Formula — read as a table
rank(A) = number of independent rows (or columns)
Piece In plain words
independent rows rows that can't be built from the others
count how many genuinely different directions A can reach
low rank lots of redundancy — the matrix is 'simpler' than its size
📖 Reads as: Count how many rows are truly different; that's how many independent directions the matrix spans.

🎮 Try it — interactive demo

💡 Simplified

How many genuinely different directions the matrix can reach. Low rank = redundancy.

📘 Examples & References

[[1,2],[2,4]] has rank 1 (row 2 = 2 × row 1).

⚙️ Real-World Values & Applications

Low-rank factorisation (LoRA!) fine-tunes LLMs with tiny matrices; recommenders factor the ratings matrix; PCA.

🧮 The Formula — read as a table
tr(A) = A₁₁ + A₂₂ + … + Aₙₙ = Σ eigenvalues
Piece In plain words
Aᵢᵢ the diagonal entries
sum just add them up
= Σ eigenvalues it also equals the sum of all eigenvalues
📖 Reads as: Add up the diagonal entries — which happens to equal the sum of the eigenvalues.

🎮 Try it — interactive demo

💡 Simplified

Just add up the diagonal entries.

📘 Examples & References

tr [[1,2],[3,4]] = 1 + 4 = 5.

⚙️ Real-World Values & Applications

Simplifies gradient derivations (cyclic identities); trace of covariance = total variance; information measures.

  • Linear Independence
  • Basis
  • Subspaces
  • Eigenvalues
  • Eigenvectors
  • Spectral Decomposition
  • SVD
  • QR Decomposition
  • Positive Definite Matrices
  • Tensor Algebra
🧮 The Formula — read as a table
c₁v₁ + c₂v₂ + … + cₙvₙ = 0 only when all cᵢ = 0
Piece In plain words
Σ cᵢvᵢ = 0 a weighted combination of the vectors equals the zero vector
all cᵢ = 0 the ONLY way to get zero is to use no vectors at all
meaning none of the vectors is redundant
📖 Reads as: If the only way to combine the vectors into zero is to multiply them all by zero, none is redundant — they're independent.

🎮 Try it — interactive demo

💡 Simplified

No vector in the set can be built from the others — none is wasted.

📘 Examples & References

[1,0] and [0,1] are independent; [1,0] and [2,0] are not.

⚙️ Real-World Values & Applications

Independent features avoid multicollinearity (unstable coefficients); guarantee a unique least-squares solution.

🧮 The Formula — read as a table
any v = c₁e₁ + c₂e₂ + … + cₙeₙ (unique cᵢ)
Piece In plain words
e₁ … eₙ the building-block directions (independent + spanning)
cᵢ the unique coordinates of v in that basis
any v every vector in the space can be built this way
📖 Reads as: A basis is the smallest set of directions from which every vector can be uniquely rebuilt.

🎮 Try it — interactive demo

💡 Simplified

The smallest set of building-block directions you can construct everything else from.

📘 Examples & References

Standard basis of 3D space: [1,0,0], [0,1,0], [0,0,1].

⚙️ Real-World Values & Applications

Embedding dimensions are a learned basis for meaning; Fourier/wavelet bases for compression; good basis = good features.

🧮 The Formula — read as a table
closed: u,v ∈ S ⟹ u+v ∈ S and c·u ∈ S
Piece In plain words
u+v ∈ S add two members → still inside
c·u ∈ S scale a member → still inside
always contains 0 every subspace passes through the origin
📖 Reads as: A subspace is a flat slice (line or plane through the origin) that you can't escape by adding or scaling its members.

🎮 Try it — interactive demo

💡 Simplified

A flat ‘slice’ of space — a line or plane through the origin — that stays inside itself.

📘 Examples & References

The x–y plane is a 2D subspace of 3D space.

⚙️ Real-World Values & Applications

Data often lives on a low-dimensional subspace (manifold) inside high-dim space — the basis of dimensionality reduction.

🧮 The Formula — read as a table
A v = λ v found via det(A − λI) = 0
Piece In plain words
A v apply the matrix to a special vector v
λ v the result is just v scaled by a number λ (no rotation)
λ the eigenvalue — the stretch factor along that direction
📖 Reads as: An eigenvalue λ is the amount a matrix stretches one of its special, un-rotated directions.

🎮 Try it — interactive demo

💡 Simplified

The scaling factors along the special directions a matrix doesn’t rotate.

📘 Examples & References

[[2,0],[0,3]] has eigenvalues 2 and 3.
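
For a 2×2 matrix, det(A − λI) = 0 expands to λ² − tr(A)·λ + det(A) = 0, which the quadratic formula solves. The sketch below assumes real eigenvalues and uses the illustrative name `eigenvalues_2x2`; larger matrices need a numerical routine.

```python
import math


def eigenvalues_2x2(A):
    """Real eigenvalues of a 2 x 2 matrix from its characteristic polynomial."""
    (a, b), (c, d) = A
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("eigenvalues are complex for this matrix")
    root = math.sqrt(disc)
    return ((trace - root) / 2, (trace + root) / 2)


print(eigenvalues_2x2([[2, 0], [0, 3]]))  # (2.0, 3.0)
print(eigenvalues_2x2([[2, 1], [1, 2]]))  # (1.0, 3.0)
```

In both cases the two eigenvalues add up to the trace of the matrix.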

⚙️ Real-World Values & Applications

PCA’s eigenvalues = variance per component; PageRank is the top eigenvector; spectral norm bounds training stability.

🧮 The Formula — read as a table
A v = λ v , v ≠ 0
Piece In plain words
v a direction the matrix only stretches, never turns
λ how much it stretches that direction
A v = λ v applying A keeps v on the same line
📖 Reads as: An eigenvector is a direction that a transformation merely stretches without turning.

🎮 Try it — interactive demo

💡 Simplified

The directions a transformation only stretches without rotating.

📘 Examples & References

For [[2,0],[0,3]], the eigenvectors are [1,0] and [0,1].

⚙️ Real-World Values & Applications

Principal components are eigenvectors of the covariance matrix; eigenfaces for face recognition; vibration modes.

🧮 The Formula — read as a table
A = Q Λ Qᵀ (A symmetric)
Piece In plain words
Q rotation: the orthonormal eigenvectors
Λ scaling: a diagonal of eigenvalues
Qᵀ rotate back
📖 Reads as: A symmetric matrix can be rewritten as: rotate → stretch each axis → rotate back.

🎮 Try it — interactive demo

💡 Simplified

Rewrite a symmetric matrix as rotate, scale each axis, then rotate back.

📘 Examples & References

Covariance matrices are symmetric, so they always decompose this way.

⚙️ Real-World Values & Applications

Whitening / decorrelating data; matrix square-roots and powers; the math behind PCA and Gaussian processes.

🧮 The Formula — read as a table
A = U Σ Vᵀ (works for ANY matrix)
Piece In plain words
Vᵀ first rotation (in the input space)
Σ stretch along axes by the singular values
U second rotation (in the output space)
📖 Reads as: Any matrix at all is: rotate → stretch → rotate. The stretch amounts are the singular values.

🎮 Try it — interactive demo

💡 Simplified

The universal decomposition: rotate, stretch along axes, rotate again — for ANY matrix.

📘 Examples & References

Keeping the top-k singular values gives the best rank-k approximation (Eckart–Young).

⚙️ Real-World Values & Applications

Latent semantic analysis, image compression, Netflix-prize recommenders, and the theory behind low-rank LLM adapters.

🧮 The Formula — read as a table
A = Q R
Piece In plain words
Q orthonormal columns — a clean rotation part
R upper-triangular — the 'bookkeeping' part
use solve A x = b stably without forming A⁻¹
📖 Reads as: Split a matrix into a clean rotation (Q) and a triangular remainder (R).

🎮 Try it — interactive demo

💡 Simplified

Split a matrix into a clean rotation part and a triangular bookkeeping part.

📘 Examples & References

Used to solve A x = b stably; the QR algorithm finds eigenvalues.

⚙️ Real-World Values & Applications

Numerically stable least-squares; eigenvalue computation; orthogonalising layers in deep nets.

🧮 The Formula — read as a table
xᵀ A x > 0 for all x ≠ 0 ⟺ all eigenvalues > 0 (A symmetric)
Piece In plain words
xᵀ A x a quadratic 'energy' score for any input x
> 0 always positive — the surface curves upward everywhere
eigenvalues > 0 equivalent condition: every stretch factor is positive
📖 Reads as: If the matrix gives a positive 'energy' for every input, it's a perfect bowl with one lowest point.

🎮 Try it — interactive demo

💡 Simplified

A matrix that always curves upward — like a perfect bowl with a single lowest point.

📘 Examples & References

The identity matrix is positive definite; any covariance matrix is at least semi-definite.

⚙️ Real-World Values & Applications

Guarantees a unique optimum in convex problems; enables Cholesky factorisation; kernel matrices in SVMs and GPs.

🧮 The Formula — read as a table
shape (batch, height, width, channels) — n axes
Piece In plain words
scalar 0 axes — a single number
vector / matrix 1 / 2 axes
tensor 3+ axes — e.g. a batch of colour images
contraction the einsum generalisation of matrix multiply
📖 Reads as: Tensors are arrays with any number of axes; deep-learning data and weights are tensors.

🎮 Try it — interactive demo

💡 Simplified

Arrays with any number of axes; for example, a batch of colour images is a 4D tensor.

📘 Examples & References

A batch of 32 RGB images, each 224×224, has shape (32, 224, 224, 3).

⚙️ Real-World Values & Applications

Every DL framework (PyTorch, TensorFlow) is a tensor engine; attention runs on 4D tensors; einsum expresses contractions.

Calculus

  • Foundations
  • Limits
  • Derivatives
  • Partial Derivatives
  • Chain Rule
  • Gradients
  • Jacobian
  • Hessian
  • Taylor Series
  • Integration
  • Numerical Differentiation
  • Numerical Optimization
🧮 The Formula — read as a table
lim (x→a) f(x) = L
Piece In plain words
x → a let the input creep toward a
f(x) → L the output settles toward L
meaning the value the function heads to, even if it never arrives
📖 Reads as: As the input approaches a, the output approaches L — the value the function is heading toward.

🎮 Try it — interactive demo

💡 Simplified

What value a function is heading toward as the input approaches a point.

📘 Examples & References

lim (x→0) sin(x)/x = 1.

⚙️ Real-World Values & Applications

Underpins why gradient-descent steps work; learning-rate decay ‘in the limit’; defining derivatives that power backprop.

🧮 The Formula — read as a table
f′(x) = lim (h→0) [ f(x+h) − f(x) ] / h
Part In plain words
▲ Top (f(x+h) − f(x)): how much the output changes over a tiny step h
▼ Bottom (h): the size of that tiny step (shrunk toward zero)
📖 Reads as: Change in output divided by a vanishingly small change in input — the slope right at this point.

🎮 Try it — interactive demo

💡 Simplified

How fast the output changes when you nudge the input — the steepness right here.

📘 Examples & References

d/dx (x²) = 2x; at x=3 the slope is 6.

⚙️ Real-World Values & Applications

The gradient (vector of derivatives) tells each weight which way and how much to move to cut the loss.

🧮 The Formula — read as a table
∂f/∂xᵢ = slope in the xᵢ direction (others fixed)
Part In plain words
▲ Top (∂f): tiny change in the output
▼ Bottom (∂xᵢ): tiny change in ONE input, holding all the rest frozen
📖 Reads as: The output's slope when you wiggle just one input and freeze every other.

🎮 Try it — interactive demo

💡 Simplified

Change in the output when you wiggle one input and freeze the rest.

📘 Examples & References

f = x²y → ∂f/∂x = 2xy, ∂f/∂y = x².

⚙️ Real-World Values & Applications

Nets have millions of parameters; each weight update uses the partial derivative of the loss w.r.t. that weight.

🧮 The Formula — read as a table
d/dx f(g(x)) = f′(g(x)) · g′(x)
Piece In plain words
g′(x) slope of the inner function
f′(g(x)) slope of the outer function, measured at g(x)
multiply chain the slopes together, outside-in
📖 Reads as: To differentiate nested functions, multiply the outer slope by the inner slope.

🎮 Try it — interactive demo

💡 Simplified

For nested functions, multiply the slopes of each layer together, outside-in.

📘 Examples & References

d/dx sin(x²) = cos(x²) · 2x.
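
A quick sanity check of the example in Python: the chain-rule slope (outer slope times inner slope) is compared with a slope measured numerically. The function names and the step size h = 1e-5 are choices made for this sketch.

```python
import math


def inner(x):
    return x * x


def inner_slope(x):
    return 2 * x


def outer(u):
    return math.sin(u)


def outer_slope(u):
    return math.cos(u)


def chain_rule_slope(x):
    """Outer slope measured at g(x), times the inner slope."""
    return outer_slope(inner(x)) * inner_slope(x)


def numeric_slope(x, h=1e-5):
    """Slope of sin(x^2) estimated from two nearby samples."""
    return (outer(inner(x + h)) - outer(inner(x - h))) / (2 * h)


print(chain_rule_slope(1.0))  # cos(1) * 2, about 1.0806
print(numeric_slope(1.0))     # should agree to several decimal places
```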

⚙️ Real-World Values & Applications

Backpropagation IS the chain rule applied layer by layer — the single most important formula in deep learning.

🧮 The Formula — read as a table
∇f = [ ∂f/∂x₁ , ∂f/∂x₂ , … , ∂f/∂xₙ ]
Piece In plain words
each ∂f/∂xᵢ the slope in one input direction
vector bundle all the slopes into one arrow
∇f points toward the steepest increase
📖 Reads as: Bundle every partial slope into one arrow; it points uphill the fastest — so step the other way to go down.

🎮 Try it — interactive demo

💡 Simplified

An arrow pointing toward the fastest increase; step opposite to go downhill.

📘 Examples & References

f = x² + y² → ∇f = [2x, 2y].

⚙️ Real-World Values & Applications

Gradient descent moves parameters along −∇Loss; the whole training loop is ‘compute gradient, take a step’.
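
The 'compute gradient, take a step' loop can be sketched on the bowl f = x² + y² from the example. The learning rate 0.1, the 100 steps, and the starting point are arbitrary values picked for illustration.

```python
def grad_f(x, y):
    """Gradient of f(x, y) = x^2 + y^2."""
    return 2 * x, 2 * y


def gradient_descent(start, learning_rate=0.1, steps=100):
    x, y = start
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        # Step opposite to the gradient to go downhill.
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y


x, y = gradient_descent((3.0, -4.0))
print(x, y)  # both very close to 0, the bottom of the bowl
```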

🧮 The Formula — read as a table
Jᵢⱼ = ∂fᵢ/∂xⱼ (matrix of all partials)
Part In plain words
▲ Top (∂fᵢ): change in output number i
▼ Bottom (∂xⱼ): change in input number j, filled in for every (i,j) pair
📖 Reads as: A full table of how every output reacts to every input — the gradient generalised to many outputs.

🎮 Try it — interactive demo

💡 Simplified

A table of how every output reacts to every input.

📘 Examples & References

f(x,y) = [x², xy] → J = [[2x, 0],[y, x]].
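
A sketch that fills in the Jacobian table numerically, one input column at a time, and compares it with the hand-derived [[2x, 0],[y, x]] at the point (2, 3). `numerical_jacobian` and the step size are assumptions of this example, not part of any framework.

```python
def f(x, y):
    return [x * x, x * y]


def numerical_jacobian(func, point, h=1e-6):
    """Estimate J[i][j] = d f_i / d x_j with central differences."""
    n_inputs = len(point)
    n_outputs = len(func(*point))
    J = [[0.0] * n_inputs for _ in range(n_outputs)]
    for j in range(n_inputs):
        plus, minus = list(point), list(point)
        plus[j] += h    # nudge only input j, freeze the rest
        minus[j] -= h
        f_plus, f_minus = func(*plus), func(*minus)
        for i in range(n_outputs):
            J[i][j] = (f_plus[i] - f_minus[i]) / (2 * h)
    return J


for row in numerical_jacobian(f, [2.0, 3.0]):
    print(row)  # close to [4, 0] and then [3, 2]
```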

⚙️ Real-World Values & Applications

Normalising flows use log|det J|; sensitivity analysis; vector-Jacobian products power autodiff backprop.

🧮 The Formula — read as a table
Hᵢⱼ = ∂²f / ∂xᵢ ∂xⱼ (curvature)
Part In plain words
▲ Top (∂²f): the SECOND change in the output
▼ Bottom (∂xᵢ ∂xⱼ): with respect to a pair of inputs; how the slope itself bends
📖 Reads as: The second-derivative table: tells you whether you're in a bowl, a dome, or a saddle.

🎮 Try it — interactive demo

💡 Simplified

Tells you how the slope itself is changing — bowl, dome, or saddle.

📘 Examples & References

f = x² + y² → H = [[2,0],[0,2]] (positive definite → a bowl).

⚙️ Real-World Values & Applications

Newton's method uses it directly and quasi-Newton methods such as L-BFGS approximate it; its eigenvalues diagnose saddle points that slow training.

🧮 The Formula — read as a table
f(x) ≈ f(a) + f′(a)(x−a) + ½ f″(a)(x−a)² + …
Piece In plain words
f(a) start from the value at a
f′(a)(x−a) add the slope term (linear correction)
½ f″(a)(x−a)² add the curvature term (quadratic correction)
📖 Reads as: Approximate any smooth curve near a point using its value, slope, curvature, and so on.

🎮 Try it — interactive demo

💡 Simplified

Approximate any smooth curve near a point using a polynomial built from its derivatives.

📘 Examples & References

eˣ ≈ 1 + x + x²/2 near 0.
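
A small sketch showing that the quadratic approximation of eˣ is good near 0 and drifts further away. The helper `taylor_exp` is a hypothetical name; `order=2` reproduces 1 + x + x²/2.

```python
import math


def taylor_exp(x, order=2):
    """Taylor polynomial of e^x around 0, up to the given order."""
    return sum(x ** k / math.factorial(k) for k in range(order + 1))


for x in (0.1, 1.0):
    approx = taylor_exp(x)
    exact = math.exp(x)
    print(x, approx, exact, abs(approx - exact))
```

Raising `order` adds more derivative terms and shrinks the gap at a given x.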

⚙️ Real-World Values & Applications

Justifies gradient descent (first-order) and Newton’s method (second-order); solver approximations.

🧮 The Formula — read as a table
∫ₐᵇ f(x) dx = signed area under the curve
Piece In plain words
f(x) dx a thin slice: height × tiny width
∫ₐᵇ add up infinitely many slices from a to b
result total accumulated area / quantity / probability
📖 Reads as: Add up infinitely many thin slices under the curve to get a total.

🎮 Try it — interactive demo

💡 Simplified

Add up infinitely many thin slices to get a total — area, accumulated quantity, or probability.

📘 Examples & References

∫₀¹ x dx = ½.
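
The 'add up thin slices' idea can be sketched with a midpoint sum: each slice is height times width, with the height sampled in the middle of the slice. The function name and the default of 1000 slices are illustrative choices.

```python
def midpoint_integral(f, a, b, slices=1000):
    """Approximate the integral of f from a to b by summing thin slices."""
    width = (b - a) / slices
    heights = (f(a + (i + 0.5) * width) for i in range(slices))
    return sum(heights) * width


print(midpoint_integral(lambda x: x, 0.0, 1.0))      # about 0.5
print(midpoint_integral(lambda x: x * x, 0.0, 1.0))  # about 0.3333
```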

⚙️ Real-World Values & Applications

Expected values and probabilities are integrals; the ELBO in VAEs; diffusion models integrate an SDE to generate images.

🧮 The Formula — read as a table
f′(x) ≈ [ f(x+h) − f(x−h) ] / (2h)
Part In plain words
▲ Top (f(x+h) − f(x−h)): measure the function a little ahead and a little behind
▼ Bottom (2h): divide by the total gap between those two points
📖 Reads as: Estimate a slope by sampling the function just ahead and just behind, then dividing by the gap.

🎮 Try it — interactive demo

💡 Simplified

Estimate a slope by measuring the function at nearby points — no formula needed.

📘 Examples & References

With h=0.01, central difference of x² at 3 ≈ 6.00.
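
A sketch of the central difference next to the simpler one-sided (forward) difference, both at x = 3 with h = 0.01. The function names are made up for this example.

```python
def central_difference(f, x, h=0.01):
    """Slope from one sample ahead and one behind."""
    return (f(x + h) - f(x - h)) / (2 * h)


def forward_difference(f, x, h=0.01):
    """Slope from one sample ahead only; less accurate for the same h."""
    return (f(x + h) - f(x)) / h


def square(x):
    return x * x


print(central_difference(square, 3.0))  # about 6.0
print(forward_difference(square, 3.0))  # about 6.01
```

The one-sided version is off by roughly h here, while the central version lands on the true slope up to rounding.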

⚙️ Real-World Values & Applications

Gradient-checking to verify hand-written backprop; finite-difference sensitivities when analytic gradients are missing.

🧮 The Formula — read as a table
θ ← θ − (step) · (search direction), repeat
Piece In plain words
start θ an initial guess
search direction which way improves the objective (e.g. −gradient)
repeat keep stepping until it stops improving
📖 Reads as: When you can't solve for the best answer directly, search for it step by step, improving each round.

🎮 Try it — interactive demo

💡 Simplified

Search for the best answer step by step, improving each iteration.

📘 Examples & References

Minimising a non-linear least-squares loss with Levenberg–Marquardt.

⚙️ Real-World Values & Applications

Training every neural network; calibrating physics models; portfolio optimisation; hyperparameter search.

Optimization

  • Objectives & Losses
  • Gradient Descent
  • Optimizers
  • Tuning
  • Objective Functions
  • Loss Functions
  • Convex Optimization
  • Non-Convex Optimization
🧮 The Formula — read as a table
minimise (or maximise) f(θ) over parameters θ
Piece In plain words
θ the knobs the model can change
f(θ) a single score measuring how good those settings are
min / max push that score the best direction
📖 Reads as: A single 'goodness' score the whole search tries to make as good as possible.

🎮 Try it — interactive demo

💡 Simplified

The single score you’re trying to make as good as possible.

📘 Examples & References

Maximise reward in RL; minimise prediction error in supervised learning.

⚙️ Real-World Values & Applications

Defines what ‘good’ means — choosing it wrong (clicks not satisfaction) causes misaligned systems.

🧮 The Formula — read as a table
MSE = (1/n) Σ (y − ŷ)² ; CE = − Σ y log ŷ
Piece In plain words
y the true answer
ŷ the model's prediction
(y−ŷ)² or y log ŷ penalty that grows as the prediction gets worse
average mean penalty over all examples
📖 Reads as: A number that's big when the model is wrong and small when it's right; training drives it down.

🎮 Try it — interactive demo

💡 Simplified

A number that’s big when the model is wrong and small when it’s right.

📘 Examples & References

Cross-entropy for classification; MSE for regression; contrastive loss for embeddings.
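
A plain-Python sketch of the two formulas. The cross-entropy here is for a single example with a one-hot true label, and terms where the true label is 0 are skipped so that log is never taken of a zero prediction; both are simplifying assumptions of this example.

```python
import math


def mse(y_true, y_pred):
    """Mean of squared differences."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def cross_entropy(y_true, y_pred):
    """Minus the sum of y * log(prediction) over the classes."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_pred) if y > 0)


print(mse([1, 2, 3], [1, 2, 5]))                    # 4/3, about 1.333
print(cross_entropy([0, 1, 0], [0.1, 0.7, 0.2]))    # -log(0.7), about 0.357
print(cross_entropy([0, 1, 0], [0.05, 0.9, 0.05]))  # smaller: a better prediction
```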

⚙️ Real-World Values & Applications

Loss choice shapes behaviour: MSE punishes outliers hard; cross-entropy suits probabilities; focal loss for imbalance.

🧮 The Formula — read as a table
any local minimum = the global minimum
Piece In plain words
convex the surface is a single smooth valley (f″ ≥ 0)
local min wherever you stop rolling downhill
= global min that stop is guaranteed to be THE best point
📖 Reads as: On a convex (bowl-shaped) surface, any bottom you reach is the one true bottom.

🎮 Try it — interactive demo

💡 Simplified

A single smooth valley — wherever you roll downhill you reach the one true bottom.

📘 Examples & References

Linear/logistic regression and SVMs are convex.

⚙️ Real-World Values & Applications

Convex problems solve reliably and provably — used in finance, control, and as the well-behaved core of larger systems.

🧮 The Formula — read as a table
many local minima, saddles, plateaus — no global guarantee
Piece In plain words
local minima many 'good enough' valleys
saddle points flat spots that aren't minima
no guarantee you might not find the single deepest valley
📖 Reads as: A bumpy landscape of many dips; you may settle in a good valley, not necessarily the deepest.

🎮 Try it — interactive demo

💡 Simplified

A bumpy mountain range full of dips — you might settle in a ‘good enough’ valley.

📘 Examples & References

The loss surface of a deep neural network is, in general, non-convex.

⚙️ Real-World Values & Applications

Deep learning works anyway: empirically, in high dimensions most local minima are nearly as good as the best one, and SGD noise helps escape bad spots.

  • Batch Gradient Descent
  • SGD
  • Mini-Batch SGD
🧮 The Formula — read as a table
θ ← θ − η · ∇Loss(whole dataset)
Piece In plain words
∇Loss(all data) average advice from every single example
η the learning rate (step size)
θ ← θ − … step downhill once per full pass
📖 Reads as: Look at every example, average their advice, then take one careful step downhill.

🎮 Try it — interactive demo

💡 Simplified

Look at every example, average the advice, then take one careful step. Accurate but slow.

📘 Examples & References

Used when datasets fit in memory and a smooth path matters.

⚙️ Real-World Values & Applications

Rare for large data (one step = a full pass); used in classical ML and as a baseline.

🧮 The Formula — read as a table
θ ← θ − η · ∇Loss(one random example)
Piece In plain words
one example a single randomly-picked data point
∇Loss its noisy gradient
step update after every single example
📖 Reads as: Step after every single example — noisy and jittery, but fast and able to escape traps.

🎮 Try it — interactive demo

💡 Simplified

Take a step after every example — noisy but fast, and able to escape traps.

📘 Examples & References

Robbins–Monro stochastic approximation (1951).

⚙️ Real-World Values & Applications

The noise acts as a regulariser; foundation of modern training, usually replaced by mini-batches in practice.

🧮 The Formula — read as a table
θ ← θ − η · (1/B) Σ ∇Loss(batch of B)
Part In plain words
▲ Top (Σ ∇Loss over B examples): add up the gradients of a small handful of examples
▼ Bottom (B): divide by the batch size to average them
📖 Reads as: Average the gradient over a small batch (e.g. 256), then step — stable enough and GPU-friendly.

🎮 Try it — interactive demo

💡 Simplified

Look at a small handful each step — stable enough, and perfect for GPU parallelism.

📘 Examples & References

Batch size 256 is a common default.
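
A toy sketch of the update on a one-parameter model y ≈ w·x, using synthetic noise-free data with true slope 2. Everything here (the function name, the data, the learning rate, the tiny batch size) is invented for illustration. Setting `batch_size=1` gives plain SGD and `batch_size=len(xs)` gives batch gradient descent.

```python
import random


def minibatch_sgd(xs, ys, batch_size, learning_rate=0.1, epochs=50, seed=0):
    """Fit y = w * x by averaging squared-error gradients over small batches."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the examples in a new random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # d/dw of (w*x - y)^2 is 2*x*(w*x - y); average it over the batch.
            grad = sum(2 * xs[i] * (w * xs[i] - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad
    return w


xs = [0.1 * i for i in range(1, 21)]
ys = [2.0 * x for x in xs]

print(minibatch_sgd(xs, ys, batch_size=4))        # mini-batch, close to 2
print(minibatch_sgd(xs, ys, batch_size=1))        # plain SGD, close to 2
print(minibatch_sgd(xs, ys, batch_size=len(xs)))  # full batch, close to 2
```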

⚙️ Real-World Values & Applications

The actual workhorse of deep learning; batch size trades gradient noise vs hardware throughput.

  • Momentum
  • Nesterov Momentum
  • AdaGrad
  • RMSProp
  • Adam
  • AdamW
  • Learning Rate Scheduling
  • Weight Decay
  • Early Stopping
🧮 The Formula — read as a table
v ← β v + ∇Loss ; θ ← θ − η v (β ≈ 0.9)
Piece In plain words
β v keep 90% of last step's velocity
+ ∇Loss add the new gradient
θ ← θ − η v move along the built-up velocity
📖 Reads as: Build up speed downhill like a rolling ball, pushing through small bumps and flat spots.

🎮 Try it — interactive demo

💡 Simplified

Build up speed downhill like a rolling ball — push through bumps and flat spots.

📘 Examples & References

Polyak’s heavy-ball method.
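
A one-dimensional sketch of the velocity update v ← βv + ∇Loss, θ ← θ − ηv, run on the toy loss θ². The function name, starting point, and step count are assumptions of this example; β = 0.9 follows the formula table.

```python
def momentum_descent(grad, theta, learning_rate=0.1, beta=0.9, steps=500):
    """Gradient descent with a velocity that carries over between steps."""
    velocity = 0.0
    for _ in range(steps):
        velocity = beta * velocity + grad(theta)  # keep 90% of the old velocity
        theta = theta - learning_rate * velocity  # move along the velocity
    return theta


# Toy loss theta^2 has gradient 2 * theta and its minimum at 0.
print(momentum_descent(lambda t: 2 * t, 3.0))  # very close to 0
```

On this toy loss the ball overshoots and swings back a few times before settling, which is the usual behaviour of a heavy-ball update.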

⚙️ Real-World Values & Applications

Speeds convergence in ravines; SGD+momentum often beats fancier optimisers for vision.

🧮 The Formula — read as a table
gradient measured at θ − η β v (look ahead)
Piece In plain words
θ − ηβv peek where momentum is about to carry you
gradient there measure the slope at that future spot
correct early adjust before overshooting
📖 Reads as: Peek where momentum is about to take you, measure the slope there, and correct early.

🎮 Try it — interactive demo

💡 Simplified

Peek where momentum is about to carry you and correct early — an anticipatory ball.

📘 Examples & References

Nesterov accelerated gradient (1983).

⚙️ Real-World Values & Applications

Slightly faster and more stable than plain momentum; a flag in every framework.

🧮 The Formula — read as a table
step ∝ 1 / √( Σ past gradients² )
Part In plain words
▲ Top (learning rate): the base step size
▼ Bottom (√(Σ g²)): grows with accumulated gradient size → shrinks the step per parameter
📖 Reads as: Give rarely-updated parameters big steps and frequently-updated ones small steps.

🎮 Try it — interactive demo

💡 Simplified

Big steps for rare parameters, small steps for frequent ones.

📘 Examples & References

Great for sparse features (text, recommendations).
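
A small sketch of how the step shrinks as squared gradients pile up, assuming a single parameter and a base learning rate of 0.1. The helper `adagrad_step_sizes` and the ε term (added to avoid dividing by zero) are illustrative assumptions.

```python
import math


def adagrad_step_sizes(grads, lr=0.1, eps=1e-8):
    """Effective step size lr / sqrt(sum of g^2) after each gradient."""
    accum = 0.0
    sizes = []
    for g in grads:
        accum += g * g                              # running sum of squared gradients
        sizes.append(lr / (math.sqrt(accum) + eps))
    return sizes


frequent = adagrad_step_sizes([1.0] * 100)  # a parameter updated 100 times
rare = adagrad_step_sizes([1.0] * 4)        # a parameter updated only 4 times
print(frequent[-1], rare[-1])               # the rare parameter keeps the bigger step
```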

⚙️ Real-World Values & Applications

Strong early, but the ever-growing denominator eventually kills the step — which motivated RMSProp/Adam.

🧮 The Formula — read as a table
E[g²] ← γ E[g²] + (1−γ) g² ; step ∝ 1/√E[g²]
learning rate
√E[g²]
Part In plain words
▲ Top
learning rate
base step size
▼ Bottom
√E[g²]
root of a DECAYING average of squared gradients — forgets old ones
📖 Reads as: Like AdaGrad but forgets old gradients, so the step size never dies out.

🎮 Try it — interactive demo

💡 Simplified

Like AdaGrad but forgets old gradients, so the step doesn’t die out.

📘 Examples & References

Proposed by Hinton in a Coursera lecture.
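
A small sketch of the decaying average, assuming γ = 0.9, a base learning rate of 0.1 and a constant gradient of 1. The helper `rmsprop_step_sizes` and the ε term are illustrative assumptions.

```python
import math


def rmsprop_step_sizes(grads, lr=0.1, gamma=0.9, eps=1e-8):
    """Effective step size lr / sqrt(E[g^2]) after each gradient."""
    avg_sq = 0.0
    sizes = []
    for g in grads:
        avg_sq = gamma * avg_sq + (1.0 - gamma) * g * g  # decaying average of g^2
        sizes.append(lr / (math.sqrt(avg_sq) + eps))
    return sizes


sizes = rmsprop_step_sizes([1.0] * 200)
print(sizes[-1])  # settles near lr = 0.1 instead of shrinking toward zero
```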

⚙️ Real-World Values & Applications

A common choice for RNNs and reinforcement learning; handles non-stationary objectives well.

🧮 The Formula — read as a table
θ ← θ − η · m̂ / ( √v̂ + ε )
m̂ (momentum)
√v̂ + ε
Part In plain words
▲ Top
m̂ (momentum)
a decaying average of gradients — the direction with inertia
▼ Bottom
√v̂ + ε
a decaying average of squared gradients — adapts the step size; ε avoids ÷0
📖 Reads as: Combine momentum (direction) and RMSProp (adaptive size) into one robust update.

🎮 Try it — interactive demo

💡 Simplified

Adaptive step size AND momentum together — robust defaults that ‘just work’.

📘 Examples & References

Kingma & Ba 2014; defaults β₁=0.9, β₂=0.999, η=1e-3.
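
A minimal sketch of the full update with the defaults quoted above, assuming a toy loss Loss(θ) = θ² and the usual bias correction for m̂ and v̂. The function name `adam_minimise`, ε = 1e-8 and the step count are illustrative.

```python
import math


def adam_minimise(theta, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    """Adam on the toy loss Loss(theta) = theta**2."""
    m = 0.0
    v = 0.0
    for t in range(1, steps + 1):
        g = 2.0 * theta                        # gradient of theta**2
        m = beta1 * m + (1 - beta1) * g        # decaying average of gradients
        v = beta2 * v + (1 - beta2) * g * g    # decaying average of squared gradients
        m_hat = m / (1 - beta1 ** t)           # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        theta -= eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta


print(adam_minimise(1.0))  # close to the minimum at 0
```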

⚙️ Real-World Values & Applications

The most widely used optimiser in deep learning; trains transformers, GANs, and most published models.

🧮 The Formula — read as a table
θ ← θ − η · ( Adam step + λθ ) (decoupled weight decay)
Piece In plain words
Adam step the usual adaptive momentum update, m̂ / ( √v̂ + ε )
+ λθ pull every weight toward zero, applied directly (decoupled)
decoupled the shrink is NOT folded into the gradient, so Adam's per-parameter scaling cannot distort it
📖 Reads as: Adam, but the 'keep weights small' penalty is kept separate so it actually does its job.

🎮 Try it — interactive demo

💡 Simplified

Adam done right — keeps the weight-shrinking penalty separate so it works as intended.

📘 Examples & References

Loshchilov & Hutter 2017.
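
A sketch of a single decoupled step, assuming the update is written as θ ← θ − η · (m̂ / (√v̂ + ε) + λθ). The helper `adamw_step` and the values η = 1e-3, λ = 0.01 are illustrative. With a zero gradient the Adam part vanishes, which isolates the shrink.

```python
import math


def adamw_step(theta, g, m, v, t, eta=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, lam=0.01):
    """One AdamW step for a single parameter; returns (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    adam_part = m_hat / (math.sqrt(v_hat) + eps)
    # The weight shrink is added outside the adaptive scaling (decoupled).
    theta = theta - eta * (adam_part + lam * theta)
    return theta, m, v


theta, m, v = adamw_step(theta=2.0, g=0.0, m=0.0, v=0.0, t=1)
print(theta)  # 2.0 shrunk by the factor (1 - eta * lam)
```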

⚙️ Real-World Values & Applications

The standard optimiser for training LLMs (BERT, GPT, LLaMA); it often generalises better than Adam with an L2 penalty folded into the gradient.

🧮 The Formula — read as a table
ηₜ = ½ η₀ ( 1 + cos(π t / T) ) (cosine decay)
Piece In plain words
η₀ the starting learning rate
cos(π t/T) smoothly falls from +1 to −1 over training
ηₜ so the rate eases from η₀ down to 0
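📖 Reads as: Start at the full rate η₀ and ease smoothly down to 0, like easing off the gas near your destination. In practice a short warmup usually comes first, so training starts cautious, speeds up, then slows down to settle.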

🎮 Try it — interactive demo

💡 Simplified

With warmup in front: start cautious, go fast, then slow down to settle.

📘 Examples & References

Warmup + cosine decay is standard for transformers.
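
A minimal sketch of the cosine formula, plus one possible warmup variant. The linear warmup shape, the function names and the parameter `warmup_steps` are assumptions for illustration; the notes only give the cosine part.

```python
import math


def cosine_lr(t, total_steps, eta0):
    """Cosine decay: eta_t = 0.5 * eta0 * (1 + cos(pi * t / T))."""
    return 0.5 * eta0 * (1.0 + math.cos(math.pi * t / total_steps))


def warmup_cosine_lr(t, total_steps, eta0, warmup_steps):
    """Linear warmup to eta0, then cosine decay over the remaining steps."""
    if t < warmup_steps:
        return eta0 * (t + 1) / warmup_steps
    progress = (t - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * eta0 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100):
    print(step, cosine_lr(step, total_steps=100, eta0=1e-3))
```

Without warmup the schedule starts at η₀, passes η₀/2 at the halfway point and reaches 0 at t = T.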

⚙️ Real-World Values & Applications

Critical for LLM training stability; warmup prevents early divergence; cosine decay squeezes out final accuracy.

🧮 The Formula — read as a table
Loss + (λ/2) ‖θ‖²
Piece In plain words
(λ/2) ‖θ‖² a penalty that grows with the size of the weights
added to Loss so training is pushed to keep weights small AND fit the data
📖 Reads as: Add a penalty for big weights so the model stays simple and generalises.

🎮 Try it — interactive demo

💡 Simplified

Gently pull every weight toward zero so the model stays simple.

📘 Examples & References

λ = 0.01 is a common value.

⚙️ Real-World Values & Applications

Reduces overfitting in nearly every trained model; in AdamW it’s the main regulariser for LLMs.

🧮 The Formula — read as a table
stop when validation loss hasn't improved for p epochs
Piece In plain words
validation loss error on held-out data
patience p how many stalled epochs to tolerate
stop quit before the model starts memorising
📖 Reads as: Quit while you're ahead — stop once held-out error stops improving.

🎮 Try it — interactive demo

💡 Simplified

Quit while you’re ahead — stop before the model memorises the training set.

📘 Examples & References

Patience of 5–10 epochs is typical.
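
A minimal sketch of the patience rule, assuming one validation loss per epoch and that only a strict improvement resets the counter. The function name and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop."""
    best = float("inf")
    stalled = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss        # new best: reset the counter
            stalled = 0
        else:
            stalled += 1       # another epoch without improvement
            if stalled >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran to the end


losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74, 0.69]
print(early_stopping_epoch(losses, patience=3))  # 5
```

In practice you would also restore the weights saved at the best epoch (epoch 2 here).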

⚙️ Real-World Values & Applications

Cheap, effective overfitting guard; saves compute; used everywhere from Kaggle to production.

  • Hyperparameter Optimization
  • Bayesian Optimization
🧮 The Formula — read as a table
search over { LR, depth, batch size, … } for best result
Piece In plain words
hyperparameters settings you choose, not learned by gradient
search grid / random / model-based exploration
best result the config with the best validation score
📖 Reads as: Tune the hand-set knobs (learning rate, depth, …) to find the best-performing model.

🎮 Try it — interactive demo

💡 Simplified

Tune the ‘knobs’ you set by hand to get the best model.

📘 Examples & References

Random search often beats grid search (Bergstra & Bengio 2012).
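
A minimal sketch of random search over two knobs. The scoring function `validation_score` is a hypothetical stand-in for "train a model and score it on held-out data", and the search ranges are illustrative assumptions.

```python
import random


def validation_score(lr, depth):
    """Hypothetical score that peaks at lr = 0.01 and depth = 6."""
    return -((lr - 0.01) ** 2) * 1e4 - (depth - 6) ** 2


def random_search(n_trials, seed=0):
    """Sample configs at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(n_trials):
        config = {
            "lr": 10 ** rng.uniform(-4, -1),  # log-uniform learning rate
            "depth": rng.randint(2, 12),      # integer depth, inclusive range
        }
        score = validation_score(**config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


print(random_search(50))
```

Sampling the learning rate on a log scale is a common choice because useful values span several orders of magnitude.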

⚙️ Real-World Values & Applications

Optuna / Ray Tune in practice; can swing accuracy by several percentage points, often the gap between a mediocre model and a winning one.

🧮 The Formula — read as a table
surrogate model + acquisition(next trial)
Piece In plain words
surrogate a cheap model guessing how each config performs
acquisition picks the most promising config to try next
update retrain the guess after each real trial
📖 Reads as: Learn a cheap guess of how settings perform, then test the most promising one next.

🎮 Try it — interactive demo

💡 Simplified

Learn a cheap guess of performance, then test the most promising setting next.

📘 Examples & References

GPyOpt, Optuna’s TPE, Google Vizier.

⚙️ Real-World Values & Applications

Tunes expensive models (each trial = hours of GPU) in few trials; used for AutoML and even chemistry/hardware design.

Probability & Statistics

  • Foundations
  • Distributions
  • Moments & Relationships
  • Inference
  • Random Variables
🧮 The Formula — read as a table
X : outcome → number (PMF p(x) or PDF f(x))
Piece In plain words
outcome a result of a chance experiment
X maps that outcome to a number
p(x) / f(x) how likely each value is (discrete / continuous)
📖 Reads as: A number whose value depends on chance, with a rule for how likely each value is.

🎮 Try it — interactive demo

💡 Simplified

A number whose value depends on chance — a die roll, tomorrow’s temperature.

📘 Examples & References

X = sum of two dice, ranging 2–12.
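
The two-dice example can be worked out exactly by listing all 36 equally likely outcomes. A small sketch, with `two_dice_pmf` as an illustrative name:

```python
from fractions import Fraction
from itertools import product


def two_dice_pmf():
    """PMF of X = sum of two fair dice, as exact fractions."""
    pmf = {}
    for a, b in product(range(1, 7), repeat=2):   # all 36 outcomes
        total = a + b                             # X maps the outcome to a number
        pmf[total] = pmf.get(total, 0) + Fraction(1, 36)
    return pmf


pmf = two_dice_pmf()
print(pmf[7])  # 1/6
```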

⚙️ Real-World Values & Applications

Model inputs, labels, and predictions as random variables; uncertainty estimates; sampling in generative models.

  • Bernoulli
  • Binomial
  • Poisson
  • Uniform
  • Gaussian
🧮 The Formula — read as a table
P(X=1) = p , P(X=0) = 1 − p
Piece In plain words
p chance of a '1' (success)
1 − p chance of a '0' (failure)
mean = p the long-run fraction of 1s
📖 Reads as: One yes/no trial with success probability p.

🎮 Try it — interactive demo

💡 Simplified

One coin flip with a possibly-biased coin.

📘 Examples & References

p = 0.5 for a fair coin.

⚙️ Real-World Values & Applications

Binary classification output (spam/not); each pixel of a binarised image; click/no-click modelling.

🧮 The Formula — read as a table
P(X=k) = C(n,k) · pᵏ · (1−p)ⁿ⁻ᵏ
Piece In plain words
C(n,k) number of ways to pick which k trials succeed
pᵏ probability those k succeed
(1−p)ⁿ⁻ᵏ probability the other n−k fail
mean = np expected number of successes
📖 Reads as: The chance of getting exactly k successes in n independent yes/no trials.

🎮 Try it — interactive demo

💡 Simplified

How many heads in n coin flips.

📘 Examples & References

10 flips, p=0.5: expect 5 heads.
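
A direct translation of the formula, using `math.comb` for C(n, k). The function name `binomial_pmf` is illustrative.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


print(binomial_pmf(5, 10, 0.5))                                 # 0.24609375
print(sum(k * binomial_pmf(k, 10, 0.5) for k in range(11)))     # mean n*p = 5
```

So 5 heads is the expected count, yet it happens in only about a quarter of 10-flip runs.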

⚙️ Real-World Values & Applications

Conversion counts in A/B tests; defect counts in QA; correct predictions out of n.

🧮 The Formula — read as a table
P(X=k) = ( λᵏ · e^(−λ) ) / k!
λᵏ · e^(−λ)
k!
Part In plain words
▲ Top
λᵏ · e^(−λ)
weight for seeing k events when the average rate is λ
▼ Bottom
k!
divide by k-factorial, the number of ways to order the k events
📖 Reads as: The chance of k rare events in a fixed window when the average count is λ.

🎮 Try it — interactive demo

💡 Simplified

How many rare things happen in a window — calls per hour, typos per page.

📘 Examples & References

λ = 3 emails/hour.
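
A direct translation of the formula for the λ = 3 example. The function name `poisson_pmf` is illustrative.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) when the average count per window is lam."""
    return lam ** k * math.exp(-lam) / math.factorial(k)


for k in range(7):
    print(k, round(poisson_pmf(k, 3.0), 4))
```

With λ = 3, seeing 2 or 3 emails in the hour is most likely, and an empty hour happens about 5% of the time.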

⚙️ Real-World Values & Applications

Arrival rates (web traffic, network packets); count data in NLP; queueing systems.

🧮 The Formula — read as a table
f(x) = 1 / (b − a) on [a, b]
1
b − a
Part In plain words
▲ Top
1
equal weight for every value
▼ Bottom
b − a
spread that weight evenly across the whole range
📖 Reads as: Every value in the range is equally likely — total fairness.

🎮 Try it — interactive demo

💡 Simplified

Total fairness — every value in the range is just as probable.

📘 Examples & References

A random float in [0, 1).

⚙️ Real-World Values & Applications

Random weight init, dropout masks, random seeds, Monte-Carlo sampling.

🧮 The Formula — read as a table
f(x) = ( 1 / √(2πσ²) ) · e^( −(x−μ)² / (2σ²) )
e^( −(x−μ)² / (2σ²) )
√(2πσ²)
Part In plain words
▲ Top
e^( −(x−μ)² / (2σ²) )
peaks at the mean μ and falls off as you move away (curve shape)
▼ Bottom
√(2πσ²)
a constant that makes the whole area equal 1
📖 Reads as: The bell curve: most values cluster near the mean μ, fewer appear as you go out by multiples of σ.

🎮 Try it — interactive demo

💡 Simplified

The classic bell curve — most values near the average, few far out.

📘 Examples & References

Heights, measurement noise; 68% fall within ±1σ.
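
A short sketch of the density and of the 68% rule. The probability of landing within k standard deviations is computed with the error function, erf(k / √2); the function names are illustrative.

```python
import math


def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    top = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    bottom = math.sqrt(2 * math.pi * sigma ** 2)
    return top / bottom


def prob_within(k):
    """P(|X - mu| <= k * sigma) for any Gaussian."""
    return math.erf(k / math.sqrt(2))


print(round(prob_within(1), 4))  # 0.6827
print(round(prob_within(2), 4))  # 0.9545
```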

⚙️ Real-World Values & Applications

Weight init, noise models, VAE priors, diffusion noise, Gaussian processes — the most important distribution in ML.

  • Central Limit Theorem
  • Expectation
  • Variance
  • Covariance
  • Correlation
🧮 The Formula — read as a table
mean of many i.i.d. variables → Gaussian
Piece In plain words
i.i.d. variables many independent samples from the same source
their average sum them and divide by n
→ Gaussian that average looks bell-shaped, whatever the source (as long as its variance is finite)
📖 Reads as: Average enough independent random things and the result looks like a bell curve, whatever the source, as long as its variance is finite.

🎮 Try it — interactive demo

💡 Simplified

Average enough random things and the result looks like a bell curve.

📘 Examples & References

The average of 30+ dice rolls is approximately normal.
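
A small simulation of the dice example, assuming 30 rolls per average and 5,000 repeats with a fixed seed. The helper `dice_averages` and those counts are illustrative choices.

```python
import random
import statistics


def dice_averages(n_rolls=30, n_repeats=5000, seed=0):
    """Average of n_rolls fair dice, repeated n_repeats times."""
    rng = random.Random(seed)
    return [
        sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls
        for _ in range(n_repeats)
    ]


avgs = dice_averages()
mean = statistics.fmean(avgs)
sd = statistics.pstdev(avgs)
within_1sd = sum(abs(a - mean) <= sd for a in avgs) / len(avgs)
print(mean, sd, within_1sd)
```

A single die is flat, not bell-shaped, yet the averages cluster around 3.5 with roughly two thirds of them inside one standard deviation, as a Gaussian would predict.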

⚙️ Real-World Values & Applications

Why Gaussian assumptions work so often; the basis of confidence intervals and many statistical tests.

🧮 The Formula — read as a table
E[X] = Σ x · p(x) ( ∫ x f(x) dx )
Piece In plain words
x each possible value
p(x) its probability (the weight)
Σ x·p(x) probability-weighted average → the long-run mean
📖 Reads as: The probability-weighted average — what you'd get on average over endless repeats.

🎮 Try it — interactive demo

💡 Simplified

The average value you’d get if you repeated the experiment forever.

📘 Examples & References

Fair die: E[X] = 3.5.
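
The fair-die value follows directly from the sum Σ x · p(x). A minimal sketch using exact fractions, with `expectation` as an illustrative name:

```python
from fractions import Fraction


def expectation(pmf):
    """Probability-weighted average of a discrete distribution."""
    return sum(x * p for x, p in pmf.items())


fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
print(expectation(fair_die))  # 7/2
```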

⚙️ Real-World Values & Applications

Expected loss is what training minimises; expected reward in RL; expected value drives every risk/decision calc.

🧮 The Formula — read as a table
Var(X) = E[ (X − μ)² ]
Piece In plain words
(X − μ)² squared distance of each value from the mean
E[ · ] take the expected value (mean) of those squared distances
📖 Reads as: The average squared distance from the mean — how spread out the values are.

🎮 Try it — interactive demo

💡 Simplified

How spread out the values are around the average — small = consistent, large = erratic.

📘 Examples & References

Fair die variance ≈ 2.92.
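
The fair-die figure can be checked exactly from the definition. A minimal sketch, with `variance` as an illustrative name:

```python
from fractions import Fraction


def variance(pmf):
    """E[(X - mu)^2] for a discrete distribution."""
    mu = sum(x * p for x, p in pmf.items())
    return sum((x - mu) ** 2 * p for x, p in pmf.items())


fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
print(variance(fair_die))  # 35/12, about 2.92
```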

⚙️ Real-World Values & Applications

Bias–variance tradeoff governs over/underfitting; gradient variance affects stability; risk in finance.

🧮 The Formula — read as a table
Cov(X,Y) = E[ (X − μₓ)(Y − μᵧ) ]
Piece In plain words
(X−μₓ) how far X is from its mean
(Y−μᵧ) how far Y is from its mean
their product, averaged positive if they move together, negative if oppositely
📖 Reads as: Average the product of each pair's deviations — positive means they rise and fall together.

🎮 Try it — interactive demo

💡 Simplified

Do two quantities tend to rise and fall together (positive) or oppositely (negative)?

📘 Examples & References

Height and weight have positive covariance.

⚙️ Real-World Values & Applications

The covariance matrix drives PCA; portfolio risk; feature decorrelation and whitening.

🧮 The Formula — read as a table
ρ = Cov(X,Y) / ( σₓ · σᵧ )
Cov(X,Y)
σₓ · σᵧ
Part In plain words
▲ Top
Cov(X,Y)
how much X and Y move together
▼ Bottom
σₓ · σᵧ
divide by their spreads → a clean score from −1 to +1
📖 Reads as: Covariance scaled by the two spreads, giving a tidy −1…+1 strength of linear relationship.

🎮 Try it — interactive demo

💡 Simplified

Covariance normalised to −1…+1; ±1 = perfect line, 0 = no linear link.

📘 Examples & References

ρ = 0.9 is a strong positive relationship.
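
A minimal sketch that computes covariance and then scales it into a correlation, using population (divide by n) formulas. The height and weight numbers are made up for illustration.

```python
import math


def covariance(xs, ys):
    """Average product of deviations from the two means."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


heights = [150, 160, 170, 180, 190]   # hypothetical, in cm
weights = [52, 58, 69, 77, 90]        # hypothetical, in kg
print(covariance(heights, weights), correlation(heights, weights))
```

Covariance depends on the units (cm times kg here); correlation does not, which is why it is easier to compare across feature pairs.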

⚙️ Real-World Values & Applications

Feature selection, detecting redundancy/leakage, exploratory analysis — but correlation ≠ causation.

  • Sampling
  • Confidence Intervals
  • Hypothesis Testing
  • p-values
  • A/B Testing
  • MLE
  • MAP
  • Bayesian Statistics
🧮 The Formula — read as a table
draw x ~ distribution (inverse-CDF, rejection, MCMC)
Piece In plain words
distribution the probability pattern to draw from
x ~ generate values that follow that pattern
methods inverse-CDF, rejection, Markov-chain Monte Carlo
📖 Reads as: Generate example values that follow a chosen probability pattern.

🎮 Try it — interactive demo

💡 Simplified

Generating example values that follow a chosen probability pattern.

📘 Examples & References

Sampling from N(0,1) with np.random.randn.
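
A minimal sketch of the inverse-CDF method, assuming an exponential target distribution because its CDF, F(x) = 1 − e^(−rate·x), inverts in closed form. The function name, rate and sample count are illustrative.

```python
import math
import random


def sample_exponential(rate, rng):
    """Draw from Exponential(rate) by pushing a uniform draw through F^-1."""
    u = rng.random()                    # uniform in [0, 1)
    return -math.log(1.0 - u) / rate    # solve u = 1 - exp(-rate * x) for x


rng = random.Random(0)
draws = [sample_exponential(2.0, rng) for _ in range(20000)]
print(sum(draws) / len(draws))  # close to the true mean 1 / rate = 0.5
```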

⚙️ Real-World Values & Applications

Bootstrapping for uncertainty; mini-batch selection; generating text/images from a model’s distribution.

🧮 The Formula — read as a table
x̄ ± z · ( σ / √n )
σ
√n
Part In plain words
▲ Top
σ
the spread of the data
▼ Bottom
√n
grows with more samples → shrinks the margin, giving a tighter interval
📖 Reads as: The sample mean plus/minus a margin that shrinks as you collect more data.

🎮 Try it — interactive demo

💡 Simplified

A plausible range for the true value, with a stated level of trust.

📘 Examples & References

A 95% CI uses z ≈ 1.96.
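
A direct translation of the formula, assuming σ is known (or well estimated) and the sample is large enough for the normal approximation. The function name and the example numbers are illustrative.

```python
import math


def mean_confidence_interval(xbar, sigma, n, z=1.96):
    """Return (low, high) for xbar +/- z * sigma / sqrt(n)."""
    margin = z * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin


print(mean_confidence_interval(xbar=0.90, sigma=0.3, n=100))
print(mean_confidence_interval(xbar=0.90, sigma=0.3, n=400))
```

Going from 100 to 400 samples halves the margin, because the margin scales with 1/√n.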

⚙️ Real-World Values & Applications

Reporting metric uncertainty (accuracy ± CI); A/B-test result ranges; scientific reproducibility.

🧮 The Formula — read as a table
compare H₀ (no effect) vs H₁ (effect)
Piece In plain words
H₀ the boring default: nothing is happening
H₁ the claim: there's a real effect
test statistic reject H₀ if the evidence is strong enough
📖 Reads as: A formal way to decide whether an observed effect is real or just luck.

🎮 Try it — interactive demo

💡 Simplified

A formal way to decide whether an observed effect is real or just luck.

📘 Examples & References

t-test, chi-square test, z-test.

⚙️ Real-World Values & Applications

Deciding if model B truly beats model A; if a feature matters; if an A/B variant won.

🧮 The Formula — read as a table
p = P( data at least this extreme | H₀ is true )
Piece In plain words
H₀ true assume nothing is really going on
data this extreme chance of seeing a result at least as surprising as yours
small p the result would be unusual under H₀ → evidence of a real effect
📖 Reads as: The chance of seeing a result at least as surprising as yours if nothing were really going on; small means luck alone is a poor explanation. It is not the probability that H₀ is true.

🎮 Try it — interactive demo

💡 Simplified

The chance of seeing a result at least this extreme if nothing were going on; small means luck alone is a poor explanation.

📘 Examples & References

p = 0.03 < 0.05 → statistically significant.
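
A minimal sketch of how such a p-value arises, assuming a two-sided z-test where the test statistic is standard normal under H₀. The function name is illustrative.

```python
import math


def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z, i.e. assuming H0 is true."""
    return math.erfc(abs(z) / math.sqrt(2))


print(round(two_sided_p_value(1.96), 3))  # 0.05
print(round(two_sided_p_value(2.17), 3))  # 0.03
```

A z-score of about 2.17 gives the p = 0.03 of the example, and z = 1.96 sits right at the 0.05 threshold.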

⚙️ Real-World Values & Applications

Gatekeeper for A/B-test decisions; widely misused (p-hacking) — significance ≠ practical importance.

🧮 The Formula — read as a table
randomly split users → A vs B → compare metric
Piece In plain words
random split half see A, half see B — fair comparison
metric the number you care about (conversions, revenue)
compare test if B's lift is significant, not luck
📖 Reads as: Show A to half your users and B to the other half, then measure which performs better — fairly.

🎮 Try it — interactive demo

💡 Simplified

Show version A to half your users and B to the other half, then measure which wins.

📘 Examples & References

Testing a new recommendation model against the current one.
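
A minimal sketch of the "compare" step, assuming the metric is a conversion rate and using a pooled two-proportion z-test. The function name and the counts are made up for illustration.

```python
import math


def ab_test(conv_a, n_a, conv_b, n_b):
    """Return (lift, z, two-sided p-value) for B versus A."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)   # shared rate under H0
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


lift, z, p = ab_test(conv_a=200, n_a=2000, conv_b=250, n_b=2000)
print(lift, z, p)  # a 2.5-point lift with p below 0.05
```

The test is only valid if the sample size was fixed in advance; checking repeatedly and stopping at the first p below 0.05 inflates false positives.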

⚙️ Real-World Values & Applications

How tech companies ship product and ML changes; needs proper sample size, randomisation, and no peeking.

🧮 The Formula — read as a table
θ̂ = argmax Π p(xᵢ | θ) (equivalently, argmax Σ log p(xᵢ | θ))
Piece In plain words
p(xᵢ|θ) how probable each data point is, given settings θ
Π (product) multiply across all data points
argmax pick the θ that makes the observed data most probable
📖 Reads as: Choose the parameters that make the data you actually observed as likely as possible.

🎮 Try it — interactive demo

💡 Simplified

Pick the parameters that make the data you actually observed most probable.

📘 Examples & References

MLE of a Gaussian’s mean is the sample average.
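
A small numerical check of that claim, assuming a Gaussian with known σ = 1 and a brute-force search over candidate means on a grid. The data values and function name are made up for illustration.

```python
import math


def gaussian_log_likelihood(data, mu, sigma=1.0):
    """Sum of log p(x_i | mu) under N(mu, sigma^2)."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2)
        - (x - mu) ** 2 / (2 * sigma ** 2)
        for x in data
    )


data = [2.1, 1.9, 2.4, 2.0, 1.6]
candidates = [i / 100 for i in range(100, 301)]   # 1.00, 1.01, ..., 3.00
mu_hat = max(candidates, key=lambda mu: gaussian_log_likelihood(data, mu))
print(mu_hat, sum(data) / len(data))  # both are 2.0 up to rounding
```

Working with the log-likelihood turns the product into a sum, which avoids multiplying many tiny probabilities together.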

⚙️ Real-World Values & Applications

Training many models IS maximum likelihood — minimising cross-entropy = maximising label likelihood.

🧮 The Formula — read as a table
θ̂ = argmax p(x | θ) · p(θ)
Piece In plain words
p(x|θ) likelihood: how well θ explains the data (as in MLE)
p(θ) prior: your belief about reasonable θ before seeing data
argmax of product balance the two → like MLE plus regularisation
📖 Reads as: Like MLE, but also weigh in prior beliefs about which parameters are reasonable.

🎮 Try it — interactive demo

💡 Simplified

Like MLE, but you also bring in prior beliefs about reasonable parameter values.

📘 Examples & References

A Gaussian prior on weights ⟺ L2 regularisation.
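
A minimal sketch of that link for the simplest case: estimating a Gaussian mean with known σ under a zero-mean Gaussian prior with standard deviation `prior_sigma`. Setting the derivative of log-likelihood plus log-prior to zero gives μ = Σx / (n + σ² / prior_sigma²). The names and data are illustrative.

```python
def map_gaussian_mean(data, sigma=1.0, prior_sigma=1.0):
    """MAP estimate of the mean under a N(0, prior_sigma^2) prior."""
    n = len(data)
    return sum(data) / (n + sigma ** 2 / prior_sigma ** 2)


data = [2.1, 1.9, 2.4, 2.0, 1.6]
print(sum(data) / len(data))                     # MLE: the sample average
print(map_gaussian_mean(data))                   # MAP: pulled toward 0
print(map_gaussian_mean(data, prior_sigma=1e6))  # very wide prior: back to MLE
```

The extra term in the denominator plays the role of the L2 penalty strength: a tighter prior means more shrinkage.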

⚙️ Real-World Values & Applications

Explains why weight decay works; injects domain knowledge; stabilises estimates with little data.

🧮 The Formula — read as a table
p(θ | x) = ( p(x | θ) · p(θ) ) / p(x)
p(x | θ) · p(θ)
p(x)
Part In plain words
▲ Top
p(x | θ) · p(θ)
likelihood × prior — evidence combined with belief
▼ Bottom
p(x)
a normaliser so the posterior sums/integrates to 1
📖 Reads as: Update beliefs with data: posterior ∝ likelihood × prior; keep a full distribution, not one number.

🎮 Try it — interactive demo

💡 Simplified

Start with a belief, see evidence, revise it — and keep a full distribution, not just one number.

📘 Examples & References

Bayesian A/B testing; Bayesian neural networks.
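
A minimal sketch of a Bayesian update for a conversion rate, assuming a Beta prior with a yes/no likelihood, where the posterior is again a Beta with the counts added in. The uniform Beta(1, 1) prior and the counts are illustrative.

```python
def beta_posterior(alpha, beta, successes, failures):
    """Posterior Beta parameters after observing the given counts."""
    return alpha + successes, beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


a, b = beta_posterior(1, 1, successes=7, failures=3)
print(a, b, beta_mean(a, b))  # Beta(8, 4), posterior mean 2/3
```

The result is a whole distribution over the rate, so you can also read off how uncertain the estimate still is, not just its mean.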

⚙️ Real-World Values & Applications

Uncertainty-aware predictions (medicine, autonomy); Bayesian optimisation for tuning; naive-Bayes spam filters.

 

© Kader Mohideen