In this chapter

  • 18.1 — Autoencoders
  • 18.2 — Variational Autoencoders
  • 18.3 — Generative Adversarial Networks
  • 18.4 — GAN Variants
  • 18.5 — Diffusion Models
  • 18.6 — Autoregressive & Flow-Based Models
  • 18.7 — Boltzmann Machines & Restricted Boltzmann Machines
  • 18.8 — Energy-Based Models and the Partition-Function Problem
  • 18.9 — Contrastive Divergence: Cheap Negative Samples
  • 18.10 — Score Matching: Learning the Gradient of Log-Density
  • 18.11 — Langevin Dynamics: Sampling by Following the Score
  • 18.12 — The Score-SDE View: One Theory Behind Diffusion and Score Matching
  • 18.13 — Vector-Quantized Models: Discrete Latent Codes
  • 18.14 — Fast Sampling: Distillation and Consistency Models
  • 18.15 — Quick reference
  • 18.16 — Key takeaways
  • 18.17 — See also

Chapter 18 — 🎨 Generative Models

Most of the models in this encyclopedia learn to discriminate — given an input, predict a label. Generative models do something harder and stranger: they learn the shape of the data itself, so that they can produce brand-new samples that look like they came from the original distribution. A generative model that has seen enough faces can paint a face that never existed; one that has read enough text can write a new sentence. This chapter walks the family tree — from the humble autoencoder, through the probabilistic VAE, the adversarial GAN, the diffusion models that now dominate image synthesis, and back to the energy-based machines that helped start deep learning.

🧭 In context: Deep Learning · used to synthesize new images, audio, and text by modeling the data distribution · the one key idea: instead of mapping input → label, learn \(p(x)\) (or a way to sample from it) so you can generate fresh \(x\).

💡 Remember this: Every generative model in this chapter is one answer to the same question — how do you learn \(p(x)\) well enough to draw new samples from it — and they differ only in how they trade off sample quality, sampling speed, and whether you get an exact likelihood.

One picture is worth keeping in your head for the whole chapter: a discriminative model draws a fence between classes, while a generative model learns to re-draw the animals. The fence only needs to know where cats end and dogs begin; the painter needs to know what a cat actually looks like.

[Figure: two panels. Discriminative: learn p(y | x), a decision boundary between classes. Generative: learn p(x), modeling the shape of each blob.]

18.1 — Autoencoders

An autoencoder is a neural network trained to copy its input to its output — which sounds useless until you notice the trick: in the middle, the network is forced through a narrow bottleneck far smaller than the input. To reconstruct a 784-pixel digit from only 32 numbers, the network must learn what matters and throw away the rest. That squeeze is where the learning lives.

The architecture has three parts. The encoder \(f\) maps the input \(x\) to a compact latent code (or latent vector) \(z = f(x)\). The decoder \(g\) maps the code back to a reconstruction \(\hat{x} = g(z)\). Training minimizes the reconstruction loss, typically squared error:

\[\mathcal{L}(x) = \lVert x - g(f(x)) \rVert^2\]

In words: push the input through the squeeze and back out, then measure how far the output drifted from the original — that gap is what training shrinks. Also written: \(\mathcal{L}(x) = \sum_{i=1}^{d} (x_i - \hat{x}_i)^2\) with \(\hat{x} = g(f(x))\) — the same squared error spelled out coordinate by coordinate.

Because there are no labels — the target is the input — autoencoders are a form of self-supervised, unsupervised learning. The latent code is a learned, nonlinear compression; with linear layers and squared loss, an autoencoder in fact recovers the same subspace as PCA (see Chapter 7). The encoder, decoder, generator, and discriminator throughout this chapter are all ordinary neural networks trained by gradient descent.
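
The PCA connection can be checked numerically. The sketch below is an illustration under stated assumptions rather than a recipe from this chapter: the toy data, the variable names, and the choice of tied encoder/decoder weights are all made up for the demonstration. It takes the top-k principal directions from an SVD, uses them as a linear encoder and decoder, and compares the reconstruction error against a random subspace of the same size.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data (assumed): 200 points in 5-D that mostly live on a 2-D plane
basis = rng.normal(size=(2, 5))
X = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 5))
X = X - X.mean(axis=0)                       # PCA assumes centered data

k = 2
_, _, Vt = np.linalg.svd(X, full_matrices=False)
W = Vt[:k]                                   # top-k principal directions, shape (k, 5)

Z = X @ W.T                                  # linear "encoder": project to k dims
X_hat = Z @ W                                # linear "decoder": map back to 5 dims
pca_err = np.mean((X - X_hat) ** 2)

# A random k-dim subspace cannot beat the PCA subspace (Eckart-Young theorem)
Q, _ = np.linalg.qr(rng.normal(size=(5, k)))
rand_err = np.mean((X - X @ Q @ Q.T) ** 2)
print("PCA-subspace error:", pca_err, " random-subspace error:", rand_err)
```

A linear autoencoder trained by gradient descent may land on a rotated basis of the same subspace, so its weights need not equal the principal directions even though the reconstruction matches.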

flowchart LR
  X["input x<br/>(784 dims)"] --> E["encoder f"]
  E --> Z["latent z<br/>(32 dims)"]
  Z --> D["decoder g"]
  D --> Xh["reconstruction x̂<br/>(784 dims)"]
  Xh -. "‖x − x̂‖²" .-> X

[Figure: x → encoder → z → decoder → x̂. The bottleneck forces the network to keep only what matters.]

Here is a tiny from-scratch encoder/decoder pass — a single hidden layer each way — to make the data flow concrete.

import numpy as np
# x: one 8-dim input; squeeze to 3 dims and back
x = np.array([0.2, 0.9, 0.1, 0.7, 0.4, 0.8, 0.3, 0.6])
We1 = np.random.randn(3, 8) * 0.1      # encoder weights
Wd1 = np.random.randn(8, 3) * 0.1      # decoder weights
z   = np.tanh(We1 @ x)                 # latent code (3 numbers)
xh  = Wd1 @ z                          # reconstruction
loss = np.mean((x - xh) ** 2)          # what training minimizes
print(z.round(2), "loss:", round(loss, 3))

And here is the same idea as an actual trainable model in PyTorch — the form you would really use, with a 784→32 bottleneck for MNIST-sized digits.

import torch, torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, d_in=784, d_lat=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(),
                                 nn.Linear(128, d_lat))
        self.dec = nn.Sequential(nn.Linear(d_lat, 128), nn.ReLU(),
                                 nn.Linear(128, d_in), nn.Sigmoid())
    def forward(self, x):
        z = self.enc(x)
        return self.dec(z), z

model = Autoencoder()
opt   = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)                 # a batch of 64 flattened images
xh, z = model(x)
loss = loss_fn(xh, x)                    # target IS the input
opt.zero_grad(); loss.backward(); opt.step()

Two important variants change what the bottleneck must learn. A denoising autoencoder receives a corrupted input \(\tilde{x}\) (e.g. with noise added or pixels masked) but is scored against the clean \(x\); to undo the corruption it must learn the underlying structure rather than memorize a copy. A sparse autoencoder keeps the code wide but adds a penalty (an L1 term or a KL penalty on average activation) so that only a few latent units fire for any input — forcing each unit to specialize into an interpretable feature.
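
The two variants differ only in what goes into the loss, which a small sketch can make concrete. This is a hedged illustration with untrained random weights; the noise level `0.3`, the penalty weight `lam`, and the 16-unit code width are assumptions chosen for readability, not tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(8)                            # clean input
We = rng.normal(size=(16, 8)) * 0.1          # wide (overcomplete) encoder
Wd = rng.normal(size=(8, 16)) * 0.1          # decoder

def forward(inp):
    z = np.maximum(0.0, We @ inp)            # ReLU code
    return Wd @ z, z

# Denoising: feed a corrupted input, but score against the CLEAN x
x_noisy = x + 0.3 * rng.normal(size=x.shape)
x_hat, z = forward(x_noisy)
denoise_loss = np.mean((x - x_hat) ** 2)

# Sparse: ordinary reconstruction plus an L1 penalty on the code
x_hat2, z2 = forward(x)
recon = np.mean((x - x_hat2) ** 2)
lam = 1e-2                                   # penalty weight (assumed value)
sparse_loss = recon + lam * np.sum(np.abs(z2))
print("denoising loss:", denoise_loss, " sparse loss:", sparse_loss)
```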

The latent space itself is the prize. Points that are close in latent space tend to decode to similar inputs, so the code is a useful representation for downstream tasks, anomaly detection (high reconstruction error = unusual input), or denoising. But a plain autoencoder’s latent space is holey: sample a random \(z\) and the decoder often produces garbage, because nothing forced the codes to fill the space smoothly. Fixing that is exactly what the VAE does next.

A concrete use: fraud and defect detection. Train an autoencoder only on normal credit-card transactions (or only on good parts on a factory line). It learns to reconstruct normal patterns with tiny error. When a fraudulent transaction or a cracked part comes in, it does not match the learned patterns, so the reconstruction error spikes — and that error, thresholded, becomes your anomaly alarm. No labeled fraud examples needed.
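
The thresholding step can be sketched as follows. To keep the example short and deterministic, a linear autoencoder fitted by SVD stands in for a trained network; the synthetic "normal" data, the 99th-percentile threshold, and all names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
basis = rng.normal(size=(2, 6))
# "Normal" records (assumed): points near a 2-D plane in 6-D
normal = rng.normal(size=(500, 2)) @ basis + 0.05 * rng.normal(size=(500, 6))
mean = normal.mean(axis=0)

# Stand-in for a trained autoencoder: a linear one fitted by SVD (PCA)
_, _, Vt = np.linalg.svd(normal - mean, full_matrices=False)
W = Vt[:2]

def recon_error(x):
    centered = np.atleast_2d(x) - mean
    x_hat = centered @ W.T @ W               # encode, then decode
    return np.mean((centered - x_hat) ** 2, axis=1)

# Threshold chosen so roughly 1% of normal data raises a false alarm
threshold = np.percentile(recon_error(normal), 99)

typical = rng.normal(size=2) @ basis         # lies on the normal plane
odd = typical + 3.0 * Vt[-1]                 # pushed off the plane
print("typical flagged:", recon_error(typical)[0] > threshold)
print("odd flagged:    ", recon_error(odd)[0] > threshold)
```

The percentile is the knob that trades false alarms against missed anomalies; in practice it would be set on held-out normal data.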

Tip

Think of the encoder as a zip compressor that is allowed to be lossy and learns its own file format tuned to your data. The bottleneck size is the compression ratio — too wide and it just copies, too narrow and it blurs.

18.2 — Variational Autoencoders

A plain autoencoder learns points; a variational autoencoder (VAE) learns distributions. Instead of encoding \(x\) to a single code, the encoder outputs the parameters of a probability distribution over latent codes — a mean \(\mu\) and a variance \(\sigma^2\) — and the actual code is sampled from \(\mathcal{N}(\mu, \sigma^2)\). This single change turns the holey latent space into a smooth, continuous one you can sample from, which makes the VAE a true generative model: draw \(z \sim \mathcal{N}(0, I)\), run the decoder, get a new sample.

The intuition for why this fills the holes: a plain autoencoder is allowed to drop each input onto a single razor-thin point, leaving gulfs of empty space between codes. The VAE instead forces every input to claim a small fuzzy cloud of codes, and pushes all those clouds toward one shared blob (a unit Gaussian). Overlapping fuzzy clouds tile the space with no gaps — so any random point you pick decodes to something sensible.

[Figure: two panels. Plain AE: lonely points, so sampling between them gives garbage. VAE: overlapping clouds, so any point decodes to something sensible.]

The training objective is the ELBO (evidence lower bound), which leans on probability and statistics — KL divergence, Gaussian priors, and expectations — and has two terms pulling in tension:

\[\mathcal{L} = \underbrace{\mathbb{E}_{q(z|x)}[\log p(x|z)]}_{\text{reconstruction}} - \underbrace{D_{\mathrm{KL}}\big(q(z|x)\,\Vert\,p(z)\big)}_{\text{regularizer}}\]

In words: decode the sampled code and reward it for rebuilding the input, but subtract a penalty for how far the encoder’s cloud has wandered from the standard-normal blob. Also written: maximizing this is equivalent to minimizing \(-\mathbb{E}_{q(z|x)}[\log p(x|z)] + D_{\mathrm{KL}}(q(z|x)\,\Vert\,p(z))\) — a reconstruction loss plus a KL regularizer, the form most code actually optimizes.

The first term is the familiar reconstruction quality: decode the sampled \(z\) and check it matches \(x\). The second term, the KL divergence, measures how far the encoder’s distribution \(q(z|x)\) has drifted from a standard normal prior \(p(z) = \mathcal{N}(0, I)\), and penalizes the drift. Reconstruction wants each input to grab its own private corner of latent space; the KL term pushes every code back toward the same unit Gaussian. The balance is what packs the codes together smoothly with no holes.

There is one obstacle: you cannot backpropagate through a random sampling step (see optimization for why gradients need a differentiable path). The reparameterization trick sidesteps it by moving the randomness outside the gradient path. Instead of sampling \(z \sim \mathcal{N}(\mu, \sigma^2)\) directly, write

\[z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)\]

In words: build the random code as a fixed center plus a scaled dose of external noise, so the network parts (\(\mu\), \(\sigma\)) stay differentiable while the dice-roll lives in \(\epsilon\). Also written: \(z \sim \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))\) — the very same sample, but expressed as a draw from the distribution instead of as a deterministic function of noise.

Now \(\epsilon\) is just external noise, and \(\mu, \sigma\) are deterministic functions of the network — gradients flow through them cleanly.

flowchart LR
  X["x"] --> ENC["encoder"]
  ENC --> MU["μ"]
  ENC --> SIG["σ"]
  EPS["ε ~ N(0,I)"] --> ADD["z = μ + σ·ε"]
  MU --> ADD
  SIG --> ADD
  ADD --> DEC["decoder"] --> XH["x̂"]

import numpy as np
mu    = np.array([0.5, -1.0])         # encoder outputs
sigma = np.array([0.8,  0.4])
eps   = np.random.randn(2)            # external noise
z     = mu + sigma * eps              # reparameterized sample
# KL( N(mu,sigma^2) || N(0,I) ) in closed form, per dim:
kl = 0.5 * np.sum(sigma**2 + mu**2 - 1 - np.log(sigma**2))
print("z:", z.round(2), " KL:", round(kl, 3))

In a real framework the reparameterization and the closed-form KL are just a few lines of PyTorch:

import torch, torch.nn as nn, torch.nn.functional as F

def vae_loss(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl                      # the negative ELBO

mu, logvar = torch.zeros(8, 2), torch.zeros(8, 2)   # encoder outputs
std = torch.exp(0.5 * logvar)
z = mu + std * torch.randn_like(std)       # reparameterization trick

The KL has a clean closed form for two Gaussians, shown above — no sampling needed for that term. A famous side effect of the smooth latent space is interpolation: pick the codes of two real inputs, walk a straight line between them, and decode each step — you get a smooth morph (one face slowly becoming another), evidence that the space is genuinely continuous and meaningful. The cost is that VAEs tend to produce slightly blurry samples, because the Gaussian likelihood and the averaging over \(q(z|x)\) smear fine detail.
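
Interpolation itself is only a few lines. In this sketch the decoder is a hypothetical stand-in (random weights through a tanh), and `z_a` and `z_b` are made-up codes; with a real VAE they would be the encoder means \(\mu\) of two inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
Wd = rng.normal(size=(6, 2))                 # hypothetical decoder weights

def decode(z):
    return np.tanh(Wd @ z)                   # stand-in for a trained decoder

z_a = np.array([-1.0, 0.5])                  # code of input A (assumed)
z_b = np.array([1.5, -0.8])                  # code of input B (assumed)

steps = np.linspace(0.0, 1.0, 5)
# Walk a straight line in latent space and decode every stop along the way
path = [decode((1 - t) * z_a + t * z_b) for t in steps]
for t, x in zip(steps, path):
    print(round(float(t), 2), x.round(2))
```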

Where this is used. The smooth, navigable latent space is the selling point. In drug and material discovery, a VAE trained on known molecules turns chemistry into a continuous space you can optimize over — nudge a code toward “more soluble” and decode to a new candidate molecule. The same interpolation trick powers voice and face morphing, and the VAE’s encoder doubles as a feature extractor for downstream classifiers when labeled data is scarce.

Warning

If the KL term overwhelms reconstruction, the decoder learns to ignore \(z\) entirely (it can reconstruct an “average” output without it), a failure called posterior collapse. Warming up the KL weight from 0, or down-weighting it with a coefficient \(\beta < 1\) (the same weighted objective as the \(\beta\)-VAE, which originally used \(\beta > 1\) to encourage disentanglement), is the usual cure.
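
A KL warm-up is just a schedule on the weight. The helper names and the 1000-step linear ramp below are assumptions for illustration; other ramp shapes are used as well.

```python
def kl_weight(step, warmup_steps=1000, beta_max=1.0):
    """Linear KL warm-up: 0 at step 0, beta_max from warmup_steps onward."""
    return beta_max * min(1.0, step / warmup_steps)

def vae_objective(recon, kl, step):
    # Negative ELBO with a step-dependent weight on the KL term
    return recon + kl_weight(step) * kl

for step in [0, 250, 1000, 5000]:
    print(step, kl_weight(step), vae_objective(recon=10.0, kl=4.0, step=step))
```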

18.3 — Generative Adversarial Networks

A GAN (generative adversarial network) trains two networks locked in a contest, like a forger and a detective who improve by competing. The generator \(G\) takes random noise \(z\) and tries to produce fake samples that look real. The discriminator \(D\) takes a sample — real or fake — and outputs the probability that it is real. The generator wins when it fools the discriminator; the discriminator wins when it catches the fakes. Train them together and the forger gets so good its fakes are indistinguishable from reality.

This is a minimax game with a single value function:

\[\min_G \max_D \; \mathbb{E}_{x \sim p_{\text{data}}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]\]

In words: the detective tunes itself to call real things real and fakes fake, while the forger simultaneously tunes itself to make the detective call its fakes real — same scoreboard, opposite goals. Also written: equivalently \(\min_G \max_D \; \mathbb{E}_{x}[\log D(x)] + \mathbb{E}_{\hat{x}\sim p_G}[\log(1 - D(\hat{x}))]\) where \(\hat{x} = G(z)\) is just a sample from the generator’s distribution \(p_G\).

\(D\) maximizes it (assign high probability to real, low to fake); \(G\) minimizes it (make \(D(G(z))\) high). At the theoretical optimum the generator’s distribution matches the data and \(D\) is stuck guessing \(0.5\) everywhere. Crucially, there is no reconstruction loss and no explicit likelihood — the generator never sees a real image directly, it only ever gets gradient signal through the discriminator’s verdict. That indirectness is what makes GAN samples so sharp (no blur-inducing pixel-averaging) and also what makes them so hard to train.
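
The claim about the optimum can be verified on a toy discrete distribution. The sketch relies on the standard result that, for a fixed generator, the best discriminator is \(D^*(x) = p_{\text{data}}(x) / (p_{\text{data}}(x) + p_G(x))\); the four-outcome distributions are invented for the example.

```python
import numpy as np

p_data = np.array([0.1, 0.4, 0.3, 0.2])      # toy data distribution (assumed)

def optimal_D(p_data, p_g):
    return p_data / (p_data + p_g)           # best response of D to a fixed G

def value(p_data, p_g):
    # The minimax value function evaluated at the optimal discriminator
    D = optimal_D(p_data, p_g)
    return np.sum(p_data * np.log(D)) + np.sum(p_g * np.log(1 - D))

p_bad = np.array([0.7, 0.1, 0.1, 0.1])       # a poor generator
p_good = p_data.copy()                       # a perfect generator

print("D* for perfect G:", optimal_D(p_data, p_good))
print("value (bad G):", value(p_data, p_bad))
print("value (perfect G):", value(p_data, p_good), " -log 4 =", -np.log(4))
```

When the generator matches the data, \(D^*\) is 0.5 everywhere and the value bottoms out at \(-\log 4\); any mismatch raises it.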

flowchart LR
  Z["noise z"] --> G["generator G"]
  G --> F["fake sample"]
  R["real data"] --> D["discriminator D"]
  F --> D
  D --> P["real or fake?"]
  P -. "gradient updates both nets" .-> G

[Figure: the generator G (forger) produces a fake; the discriminator D (detective) answers “real!” (G wins) or “fake!” (D wins). Each improves by trying to beat the other, until fakes are indistinguishable.]

import numpy as np
def sigmoid(t): return 1/(1+np.exp(-t))
# one toy step: discriminator scores a real and a fake
D_real = sigmoid(2.0)    # 0.88 -> D thinks real is real (good)
D_fake = sigmoid(-1.0)   # 0.27 -> D thinks fake is fake (good)
d_loss = -(np.log(D_real) + np.log(1 - D_fake))   # D minimizes this
g_loss = -np.log(D_fake)                          # G wants D_fake high
print("D loss:", round(d_loss,3), " G loss:", round(g_loss,3))

The real training loop alternates the two updates. Here is the idiomatic PyTorch shape of one step, using the non-saturating generator loss discussed below:

import torch, torch.nn as nn
bce = nn.BCEWithLogitsLoss()

def gan_step(G, D, opt_G, opt_D, real, z):
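    # labels live on the same device as the data; D is assumed to return (N, 1) logits
    ones  = torch.ones(real.size(0), 1, device=real.device)
    zeros = torch.zeros(real.size(0), 1, device=real.device)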
    # --- train D: real -> 1, fake -> 0 ---
    fake = G(z).detach()                     # stop grads into G here
    d_loss = bce(D(real), ones) + bce(D(fake), zeros)
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()
    # --- train G: make D call the fake "real" (non-saturating) ---
    g_loss = bce(D(G(z)), ones)
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()
    return d_loss.item(), g_loss.item()

Two failure modes haunt GANs. Mode collapse is when the generator discovers one (or a few) outputs that reliably fool the discriminator and produces only those — a model trained on all ten digits that emits nothing but convincing 8s. It minimizes the loss while ignoring most of the data distribution. Training instability comes from the two-player dynamics: if the discriminator gets too good too fast, \(\log(1 - D(G(z)))\) saturates and the generator’s gradient vanishes; if it lags, the generator exploits it without really improving. Unlike a normal loss curve that slides downward, GAN training is a moving equilibrium that can oscillate or diverge. A common fix for the vanishing gradient is the non-saturating loss — train \(G\) to maximize \(\log D(G(z))\) instead of minimizing \(\log(1 - D(G(z)))\), which gives a strong gradient exactly when the generator is losing.
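
The difference between the two generator losses shows up directly in their gradients. Writing \(a\) for the discriminator's logit on a fake, so that \(D(G(z)) = \sigma(a)\), the sketch below compares the derivative of each loss with respect to \(a\). The specific logit values are arbitrary.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad_saturating(a):
    # d/da of log(1 - sigmoid(a)), the original minimax generator loss
    return -sigmoid(a)

def grad_non_saturating(a):
    # d/da of -log(sigmoid(a)), the non-saturating generator loss
    return -(1.0 - sigmoid(a))

# a = -6: D confidently rejects the fake; a = +6: D is fooled
for a in [-6.0, 0.0, 6.0]:
    print(a, round(grad_saturating(a), 4), round(grad_non_saturating(a), 4))
```

At \(a = -6\), where the generator is losing badly, the saturating gradient is close to zero while the non-saturating one is close to \(-1\).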

Note

Wasserstein GAN (WGAN). The single most influential stability fix replaces the discriminator’s “real-vs-fake probability” with a critic that scores how realistic a sample is on an unbounded scale, and trains it to estimate the Earth-Mover (Wasserstein) distance between the real and generated distributions. The intuition: instead of asking “is this fake?” (a yes/no that gives no gradient once the answer is obvious), ask “how much work would it take to reshape the fake pile of dirt into the real pile?” — a smooth, always-informative signal. WGAN-GP enforces the needed mathematical constraint with a gradient penalty and largely tamed the vanishing-gradient and mode-collapse problems that plagued early GANs.
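
The "always-informative" property is easy to see in one dimension, where the Earth-Mover distance between two equal-size samples is the mean gap between their sorted values. This sketch only illustrates the distance itself; it is not a WGAN critic, and the Gaussian toy samples are assumptions.

```python
import numpy as np

def w1_1d(a, b):
    """Earth-Mover (Wasserstein-1) distance between two equal-size 1-D samples."""
    return np.mean(np.abs(np.sort(a) - np.sort(b)))

rng = np.random.default_rng(0)
real = rng.normal(loc=0.0, scale=0.1, size=1000)
# Slide the "fake" pile toward the real one and watch the distance shrink
for shift in [4.0, 2.0, 1.0, 0.0]:
    fake = rng.normal(loc=shift, scale=0.1, size=1000)
    print(shift, round(float(w1_1d(real, fake)), 3))
```

Even when the two piles do not overlap at all, the distance still shrinks steadily as they approach, which is the kind of signal a saturated real-vs-fake probability cannot provide.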

Warning

A falling generator loss does not mean better samples — the two losses are coupled, so each can drop simply because the other network got worse. Judge GANs by looking at samples or a metric like FID (Fréchet Inception Distance), never by the loss alone.

18.3.1 — Evaluating generative models

Because there is no single “accuracy” for a model that invents data, generative work leans on a handful of purpose-built metrics. The most common for images is the Fréchet Inception Distance (FID): run both real and generated images through a fixed pretrained Inception network, take the feature activations, fit a Gaussian to each set, and measure the distance between those two Gaussians.

\[\text{FID} = \lVert \mu_r - \mu_g \rVert^2 + \operatorname{tr}\!\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)\]

In words: summarize real and fake images each as a bell-shaped cloud of deep features, then measure how far apart the two clouds sit in both their centers and their spreads — lower is better. Also written: it is the squared 2-Wasserstein distance between two Gaussians \(\mathcal{N}(\mu_r,\Sigma_r)\) and \(\mathcal{N}(\mu_g,\Sigma_g)\) fitted to Inception features.

A lower FID means the generated feature distribution is closer to the real one, capturing both quality (sharp, realistic) and diversity (covers the modes) in one number — which is exactly why it catches mode collapse that the loss curve hides. Two companions are worth knowing: the older Inception Score (IS), which rewards images that are individually confidently classified yet collectively varied, and precision/recall for generative models, which splits the single FID number into two — precision (are the samples realistic?) versus recall (do they cover the whole data distribution?), letting you see which of the two a model is failing.

# Practical FID with a maintained library, not from scratch
from torchmetrics.image.fid import FrechetInceptionDistance
fid = FrechetInceptionDistance(feature=2048)
fid.update(real_images, real=True)     # uint8 images, shape (N,3,H,W)
fid.update(fake_images, real=False)
print("FID:", fid.compute().item())    # lower is better
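
For intuition, the FID formula can also be evaluated directly on feature matrices. This is a sketch for understanding, not a substitute for a maintained implementation: the 4-dimensional random features stand in for Inception activations, and the function name is hypothetical. It uses the fact that \(\operatorname{tr}\big((\Sigma_r \Sigma_g)^{1/2}\big)\) equals the sum of the square roots of the eigenvalues of \(\Sigma_r \Sigma_g\).

```python
import numpy as np

def fid_from_features(f_real, f_fake):
    mu_r, mu_g = f_real.mean(axis=0), f_fake.mean(axis=0)
    S_r = np.cov(f_real, rowvar=False)
    S_g = np.cov(f_fake, rowvar=False)
    # Eigenvalues of a product of covariance matrices are real and non-negative;
    # clip tiny negatives caused by floating-point error
    eig = np.linalg.eigvals(S_r @ S_g).real
    tr_sqrt = np.sum(np.sqrt(np.clip(eig, 0.0, None)))
    return np.sum((mu_r - mu_g) ** 2) + np.trace(S_r) + np.trace(S_g) - 2 * tr_sqrt

rng = np.random.default_rng(0)
feats_real = rng.normal(size=(500, 4))       # stand-in for Inception features
feats_fake = feats_real + np.array([1.0, 2.0, 0.0, 0.0])   # same spread, shifted center
print("identical sets:", fid_from_features(feats_real, feats_real))
print("shifted set:   ", fid_from_features(feats_real, feats_fake))
```

With identical spreads, only the mean term survives, so the shifted set scores the squared length of the shift.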

18.4 — GAN Variants

The original GAN used fully-connected layers and was finicky on images. A cascade of architectural variants made GANs practical and gave them controllable, useful behavior. Three landmarks tell the story.

DCGAN (Deep Convolutional GAN) swapped the dense layers for convolutions and a few stabilizing recipes that became standard: strided convolutions instead of pooling, batch normalization in both networks, ReLU in the generator and LeakyReLU in the discriminator, and no fully-connected hidden layers. The generator builds an image by upsampling noise through transposed convolutions — from a vector, to a coarse 4×4 feature map, to a full image. DCGAN is the architecture most people mean by “a basic image GAN.”

CycleGAN solves unpaired image-to-image translation — turning horses into zebras or summer photos into winter ones without matched before/after pairs, which are usually impossible to collect. It trains two generators (\(G: A \to B\) and \(F: B \to A\)) and adds a cycle-consistency loss: if you translate an image to the other domain and back, you should recover the original, \(F(G(a)) \approx a\). That round-trip constraint is what keeps the translation faithful instead of producing an arbitrary realistic-looking image of the target domain.

\[\mathcal{L}_{\text{cyc}} = \mathbb{E}_a\big[\lVert F(G(a)) - a \rVert_1\big] + \mathbb{E}_b\big[\lVert G(F(b)) - b \rVert_1\big]\]

In words: translate to the other domain and back, and demand you land on the original photo — in both directions. Also written: \(\mathcal{L}_{\text{cyc}} = \mathbb{E}_a\big[\sum_i |F(G(a))_i - a_i|\big] + \mathbb{E}_b\big[\sum_i |G(F(b))_i - b_i|\big]\), the \(\ell_1\) (absolute-value) reconstruction error written as an explicit sum.

flowchart LR
  A["horse a"] --> G["G: A→B"] --> B["zebra G(a)"]
  B --> F["F: B→A"] --> A2["horse F(G(a))"]
  A2 -. "‖F(G(a)) − a‖₁ ≈ 0" .-> A
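
The cycle-consistency loss can be sketched with toy linear "generators". Everything here is an assumption made for illustration: `G` is a random linear map, `F_good` is its exact inverse, and `F_bad` ignores its input, which is the kind of unfaithful translator the loss is meant to punish.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4))                  # hypothetical A -> B map

def G(a):
    return a @ M.T                           # toy generator A -> B

def F_good(b):
    return b @ np.linalg.inv(M).T            # exact inverse: faithful round trip

def F_bad(b):
    return np.zeros_like(b)                  # ignores its input entirely

def cycle_loss(G, F, a, b):
    forward = np.mean(np.sum(np.abs(F(G(a)) - a), axis=1))    # a -> B -> A
    backward = np.mean(np.sum(np.abs(G(F(b)) - b), axis=1))   # b -> A -> B
    return forward + backward

a, b = rng.normal(size=(16, 4)), rng.normal(size=(16, 4))
print("faithful F:", cycle_loss(G, F_good, a, b))
print("lazy F:    ", cycle_loss(G, F_bad, a, b))
```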

StyleGAN rethought how the generator consumes noise to gain fine-grained control over the output. Instead of feeding \(z\) in at the bottom, it maps \(z\) through a small network to an intermediate style vector \(w\), then injects \(w\) at every resolution of the generator via adaptive normalization. Low-resolution injections control coarse attributes (pose, face shape); high-resolution ones control fine details (skin texture, hair strands). This disentangles factors of variation, enabling “style mixing” — take the pose from one face and the texture from another — and produced the photorealistic faces of “thispersondoesnotexist.”

A fourth variant worth knowing rounds out the picture. The conditional GAN (cGAN) feeds a label \(y\) into both the generator and discriminator, so \(G(z, y)\) generates a sample of the requested class and \(D(x, y)\) judges whether the sample is a real example of that class. This is what turns a GAN from “make some plausible digit” into “make a 7,” and it underlies paired translation models like pix2pix.
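
One simple way to feed the label in is to concatenate a one-hot vector onto each network's input. The sketch shows only that input construction; the dimensions and the random stand-in images are assumptions, and other conditioning schemes (learned embeddings, for instance) are also common.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, z_dim, x_dim = 10, 16, 784

def one_hot(y, n):
    out = np.zeros((len(y), n))
    out[np.arange(len(y)), y] = 1.0
    return out

z = rng.normal(size=(4, z_dim))              # noise batch
y = np.array([7, 7, 3, 0])                   # requested classes
g_in = np.concatenate([z, one_hot(y, n_classes)], axis=1)   # input to G(z, y)

x = rng.random(size=(4, x_dim))              # stand-in for real or generated images
d_in = np.concatenate([x, one_hot(y, n_classes)], axis=1)   # input to D(x, y)
print(g_in.shape, d_in.shape)
```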

| Variant | Problem it solves | Key idea |
|---|---|---|
| DCGAN | Stable GANs on images | Convolutional generator/discriminator + BN, strided convs |
| cGAN | Class-controlled generation | Condition \(G\) and \(D\) on a label \(y\) |
| CycleGAN | Unpaired translation | Two generators + cycle-consistency loss |
| StyleGAN | Controllable, high-fidelity faces | Style vector injected at every resolution |

Tip

The pattern across three of these variants is the same lever: where you inject extra information. cGAN injects a label at the input; StyleGAN injects a style at every resolution; CycleGAN injects a constraint (the round-trip) into the loss. DCGAN is the exception, since it changes the architecture rather than the conditioning. When you want a GAN to do something new, the first question is usually “what do I condition on, and where does it enter?”

18.5 — Diffusion Models

Diffusion models generate by learning to reverse a gradual destruction of data, and they are the reason GANs lost their crown for image synthesis (a flagship generative task in computer vision). The intuition: take a clean image and add a tiny bit of Gaussian noise, then repeat hundreds of times until it is pure static. That part is easy and needs no learning. Now train a network to undo one step of that process. Chain those learned denoising steps and you can start from pure noise and walk all the way back to a clean image.

A homely analogy: imagine slowly stirring a drop of ink into a glass of water until the water is uniformly gray — that mixing is the easy forward process. The astonishing claim of diffusion is that if you learn, at each instant, exactly which way the ink was drifting, you can run the film backward and watch the uniform gray un-mix itself back into the original sharp drop.

The forward (noising) process is fixed. Over \(T\) steps it adds noise according to a schedule, and thanks to a closed-form shortcut you can jump directly to any noise level \(t\) in one shot:

\[x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)\]

In words: any noisy version is just a faded copy of the clean image blended with a dose of pure noise, where the blend ratio is set by how far along the schedule you are. Also written: \(x_t \sim \mathcal{N}\!\big(\sqrt{\bar\alpha_t}\,x_0,\;(1-\bar\alpha_t)I\big)\) — the same statement phrased as “\(x_t\) is Gaussian centered on a shrunken copy of \(x_0\).”

where \(\bar\alpha_t\) shrinks from near 1 (barely any noise) to near 0 (almost pure noise) as \(t\) grows. The reverse (denoising) process is learned: a network \(\epsilon_\theta(x_t, t)\) is trained to predict the noise that was added, and the loss is dead simple — mean squared error between the true and predicted noise:

\[\mathcal{L} = \mathbb{E}_{x_0, t, \epsilon}\big[\lVert \epsilon - \epsilon_\theta(x_t, t)\rVert^2\big]\]

In words: show the network a noisy image and the timestep, and score it on how accurately it guesses the exact noise that was mixed in. Also written: \(\mathcal{L} = \mathbb{E}\big[\sum_i (\epsilon_i - \epsilon_\theta(x_t,t)_i)^2\big]\) — a plain per-pixel squared-error regression on the noise vector.

That stable regression objective, with no adversarial game and no balancing act, is exactly why diffusion training is so much more robust than GAN training. The denoiser is typically a U-Net (a convolutional encoder–decoder with skip connections), though transformer backbones are also used, with the timestep \(t\) fed in so the network knows how much noise to expect.
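
The quantity \(\bar\alpha_t\) comes from the noise schedule: with per-step noise variances \(\beta_t\) and \(\alpha_t = 1 - \beta_t\), it is the running product \(\bar\alpha_t = \prod_{s \le t} \alpha_s\). The sketch below assumes the linear schedule from the original DDPM setup (1000 steps, \(\beta\) from \(10^{-4}\) to \(0.02\)); other schedules follow the same pattern.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)           # per-step noise variances (assumed linear schedule)
alphas = 1.0 - betas
abar = np.cumprod(alphas)                    # alpha-bar_t: fraction of signal variance left

for t in [0, 250, 500, 999]:
    signal = np.sqrt(abar[t])                # weight on x0 in the closed-form jump
    noise = np.sqrt(1.0 - abar[t])           # weight on epsilon
    print(t, round(float(abar[t]), 4), round(float(signal), 3), round(float(noise), 3))
```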

flowchart LR
  X0["x₀ clean"] -->|"+ noise"| X1["x₁"]
  X1 -->|"+ noise"| Xdots["..."]
  Xdots -->|"+ noise"| XT["x_T pure noise"]
  XT -.->|"εθ denoise"| Ydots["..."]
  Ydots -.->|"εθ denoise"| Y1["x̂₁"]
  Y1 -.->|"εθ denoise"| Y0["x̂₀ generated"]

[Figure: forward process adds noise from x₀ (clean) to x_T (noise); reverse process removes noise from x_T back to x₀.]

import numpy as np
x0 = np.array([1.0, 0.5, -0.3])          # a tiny "clean" signal
abar = 0.2                               # noise level at some step t
eps  = np.random.randn(3)                # the true added noise
xt   = np.sqrt(abar)*x0 + np.sqrt(1-abar)*eps   # forward (closed form)
# model's job: from xt (and t) predict eps; loss is plain MSE:
eps_pred = eps + 0.1*np.random.randn(3)  # pretend prediction
loss = np.mean((eps - eps_pred)**2)
print("xt:", xt.round(2), " loss:", round(loss,3))

In practice you rarely write the noising loop yourself — Hugging Face diffusers packages the scheduler and a U-Net so the whole training step is a few lines:

import torch
from diffusers import UNet2DModel, DDPMScheduler

model = UNet2DModel(sample_size=32, in_channels=3, out_channels=3)
sched = DDPMScheduler(num_train_timesteps=1000)

x0 = torch.randn(8, 3, 32, 32)                       # a batch of clean images
noise = torch.randn_like(x0)
t = torch.randint(0, 1000, (8,))                     # random timestep per image
xt = sched.add_noise(x0, noise, t)                   # forward, closed form
noise_pred = model(xt, t).sample                     # U-Net predicts the noise
loss = torch.nn.functional.mse_loss(noise_pred, noise)
loss.backward()                                      # that is the whole objective
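
Training is only half of the story; generation chains the denoising steps in reverse. The sketch below uses the standard DDPM ancestral update with \(\sigma_t^2 = \beta_t\), but replaces the trained network with a hypothetical oracle: the toy "dataset" is a single point, so the ideal noise prediction has a closed form. The schedule and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
betas = np.linspace(1e-4, 0.05, T)           # assumed schedule
alphas = 1.0 - betas
abar = np.cumprod(alphas)

x0_star = np.array([1.0, 0.5, -0.3])         # the entire toy "dataset"

def eps_oracle(xt, t):
    # With a single data point, the ideal noise prediction is known exactly;
    # a real model would be a trained network eps_theta(xt, t)
    return (xt - np.sqrt(abar[t]) * x0_star) / np.sqrt(1.0 - abar[t])

x = rng.normal(size=3)                       # start from pure noise x_T
for t in reversed(range(T)):
    eps = eps_oracle(x, t)
    mean = (x - betas[t] / np.sqrt(1.0 - abar[t]) * eps) / np.sqrt(alphas[t])
    noise = rng.normal(size=3) if t > 0 else 0.0   # no noise on the final step
    x = mean + np.sqrt(betas[t]) * noise
print("sampled:", x.round(3), " target:", x0_star)
```

Because the oracle is exact, the chain lands on the single data point; with a learned denoiser and a real dataset, the same loop produces a fresh sample from the learned distribution.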

Two refinements made diffusion dominant. Classifier-free guidance is how you steer generation with a text prompt: during training the model sometimes sees the prompt and sometimes sees a blank, learning both a conditional and an unconditional predictor. At sampling time you extrapolate away from the unconditional toward the conditional, \(\epsilon = \epsilon_{\text{uncond}} + s\,(\epsilon_{\text{cond}} - \epsilon_{\text{uncond}})\), where the guidance scale \(s > 1\) trades diversity for stronger adherence to the prompt. Latent diffusion fixes the cost problem: running hundreds of denoising steps on full-resolution pixels is brutally slow, so you first use a VAE-style autoencoder to compress the image into a small latent grid, run the entire diffusion process there, and decode once at the end. This is the engine inside Stable Diffusion, and it is why text-to-image went from research curiosity to commodity.
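
The guidance formula is a one-line extrapolation between two noise predictions. In this sketch the two prediction vectors are made-up numbers standing in for the model's outputs with and without the prompt.

```python
import numpy as np

def guided_eps(eps_uncond, eps_cond, s):
    """Classifier-free guidance: extrapolate from unconditional toward conditional."""
    return eps_uncond + s * (eps_cond - eps_uncond)

eps_u = np.array([0.2, -0.1, 0.4])           # prediction with a blank prompt (made up)
eps_c = np.array([0.5, 0.3, 0.1])            # prediction with the prompt (made up)
for s in [0.0, 1.0, 7.5]:
    print(s, guided_eps(eps_u, eps_c, s))
```

Setting \(s = 0\) ignores the prompt, \(s = 1\) recovers the plain conditional prediction, and larger values push past it.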

| | GAN | Diffusion |
|---|---|---|
| Training | Adversarial, unstable | Simple MSE regression, stable |
| Sampling speed | One forward pass (fast) | Many steps (slow, improving) |
| Mode coverage | Prone to collapse | Covers the distribution well |
| Sample quality | Sharp | State-of-the-art, sharp & diverse |

Tip

The whole trick fits in one sentence: learn to predict the noise in a noisy image, then peel noise off one layer at a time. Quality scales gracefully with the number of denoising steps — fewer steps for a quick draft, more for a polished result.

18.6 — Autoregressive & Flow-Based Models

Two more families round out the generative landscape, each taking a fundamentally different route to modeling \(p(x)\) — and both, unlike GANs, give you an exact likelihood.

Autoregressive models make no attempt to model the whole image or sentence at once. They use the chain rule of probability to factor the joint distribution into a product of conditionals, generating one element at a time, each conditioned on everything before it:

\[p(x) = \prod_{i=1}^{n} p(x_i \mid x_1, \dots, x_{i-1})\]

In words: the probability of the whole thing is the chance of the first piece, times the chance of the second given the first, times the third given the first two, and so on to the end. Also written: \(\log p(x) = \sum_{i=1}^{n} \log p(x_i \mid x_{<i})\) — taking logs turns the product into a sum, which is the form actually optimized.

For text this means predicting the next token given the previous ones — exactly how large language models work. For images, PixelRNN and PixelCNN predict each pixel given the pixels above and to the left. The payoff is a tractable, exact likelihood and very high sample quality; the price is slow sequential sampling — an \(n\)-pixel image needs \(n\) forward passes, because each step depends on the last.

flowchart LR
  S["start"] --> X1["x₁"]
  X1 --> X2["x₂ | x₁"]
  X2 --> X3["x₃ | x₁,x₂"]
  X3 --> D["... | x₁..ₙ₋₁"]

import numpy as np
# autoregressive sampling from learned conditionals (toy, 3 binary pixels)
def p_next(prev):                      # pretend "model": prob next=1
    return 1/(1+np.exp(-(0.5*sum(prev) - 0.5)))
x = []
for i in range(3):
    p = p_next(x)                      # condition on everything so far
    x.append(int(np.random.rand() < p))
print("sampled:", x)                   # one pass per element
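The same toy conditionals also give an exact likelihood, which is the other half of the autoregressive bargain. The sketch below reuses the pretend model `p_next` from the sampling snippet and adds a hypothetical helper `log_prob` that sums the per-element log-conditionals; it then checks that the eight possible 3-pixel images have probabilities summing to one.

```python
import numpy as np
from itertools import product

def p_next(prev):
    # same toy "model" as in the sampling snippet: probability that the next pixel is 1
    return 1 / (1 + np.exp(-(0.5 * sum(prev) - 0.5)))

def log_prob(x):
    # chain rule: add up log p(x_i | x_<i) one element at a time
    total = 0.0
    for i, xi in enumerate(x):
        p1 = p_next(x[:i])
        total += np.log(p1 if xi == 1 else 1 - p1)
    return total

# The 8 possible 3-pixel images should carry total probability 1
total_mass = sum(np.exp(log_prob(list(bits))) for bits in product([0, 1], repeat=3))
print(log_prob([1, 0, 1]), total_mass)
```

Evaluating the likelihood of a given image needs no sampling at all; only generation is forced to be sequential.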

Normalizing flows take the opposite tack: they build an invertible network that transforms a simple base distribution (a Gaussian) into the complex data distribution. Because every layer is a bijection with a computable Jacobian, the change-of-variables formula gives the exact density of any data point:

\[\log p(x) = \log p(z) + \log\left| \det \frac{\partial z}{\partial x} \right|, \qquad z = f(x)\]

In words: the likelihood of a data point equals the likelihood of where it lands in the simple Gaussian, corrected by how much the invertible map locally stretches or shrinks space. Also written: \(p(x) = p(z)\,\big|\det J_f(x)\big|\) with \(J_f = \partial z/\partial x\), the non-log form of the change-of-variables formula.

You train by directly maximizing this exact log-likelihood — no lower bound like the VAE’s ELBO, no adversarial game. To sample, you run the network backwards: draw \(z\) from the Gaussian and apply \(f^{-1}\). The constraint that every transformation be invertible with a cheap Jacobian determinant (the trick behind RealNVP and Glow) is what limits flow expressiveness and keeps them less common than diffusion today, but they remain the cleanest example of exact-likelihood deep generation.
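A one-dimensional affine map is about the smallest flow there is, and it makes the change-of-variables formula easy to check by hand. The sketch below assumes a hypothetical flow \(z = f(x) = (x - \mu)/\sigma\) with fixed, made-up parameters; a real flow would stack many learned invertible layers.

```python
import numpy as np

mu, sigma = 2.0, 3.0          # hypothetical flow parameters (not learned here)

def f(x):                     # data -> latent
    return (x - mu) / sigma

def f_inv(z):                 # latent -> data (used for sampling)
    return mu + sigma * z

def log_prob(x):
    z = f(x)
    log_pz = -0.5 * z**2 - 0.5 * np.log(2 * np.pi)   # standard normal log-density at z
    log_det = -np.log(sigma)                          # log |dz/dx| = log(1/sigma)
    return log_pz + log_det

x = 4.0
# closed-form log-density of N(mu, sigma^2), for comparison
reference = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
print(log_prob(x), reference)

rng = np.random.default_rng(0)
samples = f_inv(rng.standard_normal(100_000))        # sampling = one inverse pass
print(samples.mean(), samples.std())
```

The two printed log-densities agree, and the samples have mean and standard deviation close to `mu` and `sigma`, as the formula predicts for this map.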

| Family | Likelihood | Sampling | Signature tradeoff |
|---|---|---|---|
| VAE | Lower bound (ELBO) | Fast (one pass) | Blurry samples |
| GAN | None (implicit) | Fast (one pass) | Sharp but unstable, mode collapse |
| Diffusion | Lower bound | Slow (many steps) | Top quality, slow sampling |
| Autoregressive | Exact | Slow (sequential) | Exact + high quality, but \(n\) passes |
| Normalizing flow | Exact | Fast (one inverse pass) | Exact density, but limited expressiveness |

Tip

Pick by what you need: exact likelihood and density estimation → flows or autoregressive; sharp images fast → GAN; best quality and stable training → diffusion; a smooth, useful latent space → VAE.

The same choice as a decision tree — start from what you need most and follow the branch:

flowchart TD
  Q["What matters most?"]
  Q -->|"exact likelihood /<br/>density estimation"| L["autoregressive<br/>(slow, top quality)<br/>or normalizing flow<br/>(fast inverse, less expressive)"]
  Q -->|"best sample quality,<br/>stable training"| DF["diffusion<br/>(slow sampling →<br/>distill for speed)"]
  Q -->|"one-pass speed<br/>above all"| G["GAN<br/>(sharp, watch for<br/>mode collapse)"]
  Q -->|"smooth, navigable<br/>latent space"| V["VAE<br/>(slightly blurry,<br/>great for interpolation)"]
  Q -->|"discrete tokens for a<br/>Transformer to model"| VQ["VQ-VAE / VQGAN<br/>(tokenize, then<br/>autoregress)"]

18.7 — Boltzmann Machines & Restricted Boltzmann Machines

Before VAEs and GANs, the dominant idea in generative modeling was energy-based: define an energy function over configurations of the data, declare that low-energy configurations are probable and high-energy ones rare, and learn the energy so that real data sits in the valleys. A Boltzmann machine is exactly this — an undirected probabilistic graphical model of binary units (some visible, some hidden) where every unit connects to every other, and the probability of a joint configuration follows the Boltzmann distribution:

\[p(v, h) = \frac{1}{Z} e^{-E(v, h)}, \qquad E(v,h) = -\sum_{i<j} w_{ij}\, s_i s_j - \sum_i b_i s_i\]

In words: a configuration’s probability falls off exponentially with its energy, so low-energy states are exponentially more likely, and \(Z\) just rescales everything so the probabilities sum to one. Here \(s_i\) denotes the binary state of unit \(i\), whether visible or hidden. Also written: \(p(v,h) \propto \exp\!\big(\sum_{i<j} w_{ij} s_i s_j + \sum_i b_i s_i\big)\), dropping the normalizer \(Z\) and folding the minus sign into the exponent.

The trouble is the partition function \(Z\), the sum over all possible configurations needed to normalize the probability. For \(n\) binary units that sum has \(2^n\) terms, so it is intractable for all but tiny networks, and the dense all-to-all connectivity makes inference hopelessly slow. The Boltzmann machine was elegant but barely usable.
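To make the cost concrete, here is a brute-force sketch that computes \(Z\) by enumerating every configuration. It assumes units take values in \(\{0,1\}\) and stores the pairwise weights in a strictly upper-triangular matrix; the function name and the random parameters are illustrative only.

```python
import numpy as np
from itertools import product

def partition_function(W, b):
    """Brute-force Z for a Boltzmann machine over binary units s in {0,1}^n."""
    n = len(b)
    Z = 0.0
    for bits in product([0, 1], repeat=n):          # 2**n configurations
        s = np.array(bits, dtype=float)
        # E(s) = -sum_{i<j} w_ij s_i s_j - sum_i b_i s_i  (W is strictly upper triangular)
        energy = -(s @ W @ s) - b @ s
        Z += np.exp(-energy)
    return Z

rng = np.random.default_rng(0)
n = 10
W = np.triu(rng.normal(scale=0.1, size=(n, n)), k=1)   # keep only the i < j weights
b = rng.normal(scale=0.1, size=n)
print("configurations:", 2**n, " Z =", partition_function(W, b))
```

Ten units already mean 1,024 terms; every additional unit doubles the loop, which is why this approach stops being usable long before realistic model sizes.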

The Restricted Boltzmann Machine (RBM) added one restriction that changed everything: drop all visible–visible and hidden–hidden connections, keeping only the bipartite visible↔︎hidden links. Now, given the visible units, all hidden units are conditionally independent of each other (and vice versa), so you can sample an entire layer in parallel with a single sigmoid:

\[p(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_i w_{ij} v_i\Big)\]

In words: with no within-layer wires, each hidden unit turns on independently with a probability set by a plain weighted sum of the visible units, squashed through a sigmoid. Also written: in vector form \(p(h = 1 \mid v) = \sigma(b + W^\top v)\), computing the whole hidden layer’s on-probabilities in one matrix multiply.

Figure: RBM as a bipartite graph of hidden units h and visible units v. No within-layer edges, so a whole layer samples in parallel.

Training still can’t compute \(Z\), but contrastive divergence (CD) gives a cheap approximation. Start from a real data vector on the visible units, sample the hidden units, sample the visible units back (“reconstruction”), and nudge the weights to make the real data more probable than its reconstruction — usually after just one round-trip (CD-1). The update is the difference between the data correlations and the reconstruction correlations:

\[\Delta w_{ij} \propto \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}}\]

In words: strengthen a connection if a visible and a hidden unit fire together more on real data than on the model’s own reconstruction, and weaken it otherwise. Also written: \(\Delta W \propto \langle v h^\top \rangle_{\text{data}} - \langle v h^\top \rangle_{\text{recon}}\) — the same update as a difference of two outer-product (correlation) matrices.

import numpy as np
def sigmoid(t): return 1/(1+np.exp(-t))
v0 = np.array([1.,0.,1.])             # a data vector
W  = np.random.randn(2,3)*0.1
h0 = (sigmoid(W @ v0) > np.random.rand(2)).astype(float)   # sample hidden
v1 = (sigmoid(W.T @ h0) > np.random.rand(3)).astype(float) # reconstruct
h1 = sigmoid(W @ v1)
dW = np.outer(h0, v0) - np.outer(h1, v1)   # CD-1 weight update
print("ΔW:\n", dW.round(2))

RBMs matter mostly for their historical role. Around 2006, stacking RBMs and training them greedily one layer at a time — layer-wise pretraining to form a Deep Belief Network — was the trick that first let researchers train deep networks before good initialization, ReLUs, and big GPUs made it unnecessary. RBMs are rarely used in production today, but they seeded the deep-learning revival and remain the cleanest introduction to energy-based generative modeling.

Tip

Energy-based thinking is making a comeback: modern score-based diffusion models are, under the hood, learning the gradient of an energy landscape — the same valleys-are-data idea, just made tractable by predicting scores instead of normalizing \(Z\).

18.8 — Energy-Based Models and the Partition-Function Problem

Most generative models we have seen so far buy their tractability with a structural promise. A VAE promises a clean latent prior and a decoder; an autoregressive model promises a fixed factorization order; a normalizing flow promises invertibility. Energy-based models (EBMs) refuse all of these promises. They make the loosest possible commitment: assign every possible configuration \(x\) a single scalar called its energy, \(E_\theta(x)\), where low energy means “plausible” and high energy means “implausible.” That is the entire modeling assumption. Any neural network that maps \(x\) to a real number is a valid energy function.

The intuition is a landscape. Picture the space of all images as a vast terrain, and \(E_\theta(x)\) as its elevation. Real images sit in deep valleys; noise sits on high ridges. To turn this landscape into a probability distribution we say that depth should mean likelihood, via the Boltzmann (Gibbs) form:

\[ p_\theta(x) = \frac{e^{-E_\theta(x)}}{Z(\theta)}, \qquad Z(\theta) = \int e^{-E_\theta(x)}\, dx. \]

In words: an image’s probability is set by how deep its energy valley is, divided by a grand total over every possible image so the whole thing integrates to one. Also written: \(p_\theta(x) = \operatorname{softmax}(-E_\theta)\) in the continuous limit — the energies are turned into probabilities by exponentiating and normalizing, exactly like a softmax over an infinite set of outcomes.

The numerator is easy: one forward pass through a network. The trouble is entirely in the denominator. \(Z(\theta)\), the partition function, is the integral of \(e^{-E_\theta(x)}\) over the entire input space — every possible image, molecule, or sentence. For anything beyond a few dimensions this integral is hopeless to compute exactly. And \(Z\) is not a harmless constant we can ignore, because it depends on \(\theta\): every time we nudge the weights, the whole normalizer shifts.

Tip

The freedom of EBMs and their difficulty are the same fact. Because \(E_\theta\) has no structural constraints, it can represent distributions a flow or autoregressive model cannot — but precisely because it has no structure, there is no shortcut for \(Z\).

Here is why \(Z\) blocks naive training. The log-likelihood of a data point is

\[ \log p_\theta(x) = -E_\theta(x) - \log Z(\theta). \]

To maximize it we take the gradient with respect to \(\theta\). The first term is trivial. The second expands into something remarkable:

\[ \nabla_\theta \log Z(\theta) = -\,\mathbb{E}_{x \sim p_\theta}\!\big[\nabla_\theta E_\theta(x)\big]. \]

So the full gradient of the log-likelihood is

\[ \nabla_\theta \log p_\theta(x) = -\nabla_\theta E_\theta(x_{\text{data}}) + \mathbb{E}_{x \sim p_\theta}\!\big[\nabla_\theta E_\theta(x)\big]. \]

In words: to improve, dig the energy down at the real data point, and at the same time push the energy up wherever the model currently thinks data lives — and stop when those two pulls cancel. Also written: \(\nabla_\theta \log p_\theta(x) = -\big(\nabla_\theta E_\theta(x_{\text{data}}) - \mathbb{E}_{x\sim p_\theta}[\nabla_\theta E_\theta(x)]\big)\) — the “positive phase minus negative phase” form, a single expectation gap.

Read this as two opposing forces. The first term pushes energy down at observed data points: dig the valleys deeper. The second term pushes energy up at samples drawn from the model itself: flatten the hallucinated valleys the model currently believes in. At convergence the two cancel: the model’s own samples look statistically like the data, so it stops digging. This is the celebrated “positive phase minus negative phase” structure, and it is exactly what makes EBMs hard. The negative phase requires samples from \(p_\theta\), and with \(Z\) intractable there is no direct way to draw them, only approximate and often slow samplers such as Markov chains.

A worked one-dimensional example. Let the energy be a simple quadratic \(E_\theta(x) = \tfrac{1}{2}\theta x^2\) with \(\theta > 0\). Then \(p_\theta(x) \propto e^{-\theta x^2/2}\), which is a Gaussian with variance \(1/\theta\). Here \(Z\) is one of the rare cases we can do by hand:

\[ Z(\theta) = \int_{-\infty}^{\infty} e^{-\theta x^2/2}\, dx = \sqrt{\frac{2\pi}{\theta}}. \]

So \(\log p_\theta(x) = -\tfrac{1}{2}\theta x^2 - \tfrac{1}{2}\log(2\pi/\theta)\). Differentiating, \(\nabla_\theta \log p_\theta(x) = -\tfrac{1}{2}x^2 + \tfrac{1}{2\theta}\). Setting the expected gradient over data to zero gives \(\tfrac{1}{2}\mathbb{E}[x^2] = \tfrac{1}{2\theta}\), i.e. \(\theta = 1/\mathbb{E}[x^2]\). Since \(\theta\) is the precision, this says the model variance \(1/\theta\) equals \(\mathbb{E}[x^2]\), which is exactly the maximum-likelihood variance estimate for a zero-mean Gaussian. The two phases are visible: \(-\tfrac12 x^2\) is the data term pulling energy down where data lives, and \(+\tfrac{1}{2\theta} = \tfrac12\mathbb{E}_{p_\theta}[x^2]\) is the model term, the negative phase, computed here in closed form only because \(Z\) was tractable. In real models that second expectation is the wall we keep hitting.
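The quadratic example is small enough to run. The sketch below draws synthetic data (an assumption for illustration, with true variance 4), performs gradient ascent using the positive-phase and negative-phase terms derived above, and compares the result with the closed-form answer \(1/\mathbb{E}[x^2]\).

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(scale=2.0, size=200_000)   # synthetic data, true variance 4

def mean_grad(theta):
    positive = -0.5 * np.mean(data**2)       # data term:  E_data[-x^2 / 2]
    negative = 0.5 / theta                   # model term: E_model[x^2] / 2 = 1 / (2 theta)
    return positive + negative

theta = 1.0
for _ in range(2000):                        # plain gradient ascent on the log-likelihood
    theta += 0.05 * mean_grad(theta)

theta_closed_form = 1.0 / np.mean(data**2)
print(theta, theta_closed_form)              # both close to 1/4
```

The negative phase here is a one-line formula only because this model is a Gaussian in disguise; for a neural-network energy that line would have to be replaced by samples from the model.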

Figure: Energy landscape, energy E(x) plotted against configuration x. Data carves valleys (positive phase: push down); the model fills phantom valleys where there is no data (negative phase: push up).

The deeper lesson is that EBMs convert modeling difficulty into sampling difficulty. They are maximally expressive but you pay for that expressiveness every time you need a gradient, because the negative phase is an expectation over a distribution you can only reach through approximate sampling. The next sections are essentially three different strategies for paying that bill: approximate the negative-phase samples cheaply (contrastive divergence), sidestep \(Z\) entirely by matching gradients instead of densities (score matching), and use a physics-inspired sampler to walk the landscape (Langevin dynamics).

18.9 — Contrastive Divergence: Cheap Negative Samples

If the only obstacle is that the negative phase needs samples from \(p_\theta\), the obvious question is: how bad would it be to get those samples approximately? Contrastive divergence (CD), introduced by Hinton for training restricted Boltzmann machines, is the answer that made EBMs practical for years. Its bet is brazenly pragmatic: you do not need samples from the true equilibrium distribution, you just need samples that are worse than the data in a direction that tells the gradient which way to move.

The principled way to sample \(p_\theta\) is to run a Markov chain — say Langevin or Gibbs steps — from some starting point and let it mix for a very long time until it forgets where it started and settles into \(p_\theta\). That mixing can take thousands of steps. CD’s shortcut is to start the chain at a real data point and run it for only \(k\) steps, often just one. The data is already close to where the model’s high-probability region should be, so even a single step nudges it toward wherever the model currently disagrees with the data.

flowchart LR
  D["data point x⁺"] -->|"k Gibbs/Langevin steps"| N["negative sample x⁻"]
  D --> P["positive phase:<br/>lower E(x⁺)"]
  N --> M["negative phase:<br/>raise E(x⁻)"]
  P --> U["θ ← θ − η(∇E(x⁺) − ∇E(x⁻))"]
  M --> U

The CD-\(k\) update for a single example is

\[ \Delta\theta \;\propto\; -\nabla_\theta E_\theta(x^{+}) \;+\; \nabla_\theta E_\theta(x^{-}), \]

In words: lower the energy at the real example and raise it at the model’s short-chain imitation, by exactly the gap between their energy gradients. Also written: \(\Delta\theta \propto -\big(\nabla_\theta E_\theta(x^{+}) - \nabla_\theta E_\theta(x^{-})\big)\) — the exact EBM gradient of §18.8 with the true negative sample swapped for a \(k\)-step approximation \(x^-\).

where \(x^{+}\) is the data point and \(x^{-}\) is the result of \(k\) chain steps started from \(x^{+}\). Compare this to the exact gradient from the previous section: it is the same expression, except the negative-phase sample comes from a short, biased chain rather than the true \(p_\theta\). CD is therefore a biased gradient estimator. It does not exactly follow the likelihood gradient, but it follows something close enough to learn good models in practice, and it is cheap.

Worked example on a tiny RBM. Consider a restricted Boltzmann machine with visible units \(v\) and hidden units \(h\) and energy \(E(v,h) = -v^\top W h - b^\top v - c^\top h\). The conditionals factorize, which is what makes Gibbs sampling easy: \(p(h_j=1\mid v) = \sigma(c_j + \sum_i v_i W_{ij})\) and symmetrically \(p(v_i=1\mid h) = \sigma(b_i + \sum_j W_{ij} h_j)\). CD-1 then takes only a few lines.

import numpy as np
sig = lambda z: 1/(1+np.exp(-z))

def cd1_update(v_data, W, b, c, lr=0.1):
    # positive phase: hidden probs given real data
    ph_pos = sig(c + v_data @ W)              # p(h=1 | v_data)
    h_pos  = (np.random.rand(*ph_pos.shape) < ph_pos).astype(float)
    # one Gibbs step down-and-up to get the negative sample
    pv_neg = sig(b + h_pos @ W.T)             # reconstruct visibles
    v_neg  = (np.random.rand(*pv_neg.shape) < pv_neg).astype(float)
    ph_neg = sig(c + v_neg @ W)               # hidden probs given recon
    # gradients = positive correlations minus negative correlations
    W += lr * (v_data.T @ ph_pos - v_neg.T @ ph_neg) / len(v_data)
    b += lr * (v_data - v_neg).mean(0)
    c += lr * (ph_pos - ph_neg).mean(0)
    return W, b, c

The two outer-product terms v_data.T @ ph_pos and v_neg.T @ ph_neg are the positive and negative phases made concrete: correlations measured on real data versus on a one-step reconstruction. Learning stops when the model reconstructs its data so well that the one-step sample is statistically indistinguishable from the input, so the two correlation terms cancel.

Warning

CD optimizes a surrogate, not the true likelihood, and the bias does not vanish as you collect more data — it comes from truncating the chain, not from sampling noise. Symptoms include energy landscapes with spurious low-energy regions far from any data (the short chain never travels far enough to discover and penalize them). Persistent CD (PCD) helps: instead of restarting the chain at each data point, you keep a persistent set of “fantasy particles” that continue evolving across parameter updates, so over many updates they explore more of the landscape and catch those distant phantom valleys.
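A minimal sketch of persistent CD, under the same RBM energy as the worked example. The only change from CD-1 is that the negative sample comes from a set of fantasy particles that is carried from one update to the next instead of being restarted at the data. The helper names, toy data, sizes, and learning rate are all illustrative assumptions.

```python
import numpy as np

sig = lambda z: 1 / (1 + np.exp(-z))
rng = np.random.default_rng(0)

def gibbs_step(v, W, b, c):
    # one full v -> h -> v sweep of block Gibbs sampling
    h = (rng.random((v.shape[0], W.shape[1])) < sig(c + v @ W)).astype(float)
    return (rng.random(v.shape) < sig(b + h @ W.T)).astype(float)

def pcd_update(v_data, fantasy, W, b, c, lr=0.1):
    ph_pos = sig(c + v_data @ W)                   # positive phase on real data
    fantasy = gibbs_step(fantasy, W, b, c)         # chains continue from where they left off
    ph_neg = sig(c + fantasy @ W)                  # negative phase on fantasy particles
    # parameters are updated in place
    W += lr * (v_data.T @ ph_pos / len(v_data) - fantasy.T @ ph_neg / len(fantasy))
    b += lr * (v_data.mean(0) - fantasy.mean(0))
    c += lr * (ph_pos.mean(0) - ph_neg.mean(0))
    return fantasy                                 # carry the chain state to the next update

# toy data: two repeating 4-bit patterns (hypothetical)
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 50, dtype=float)
W = rng.normal(scale=0.01, size=(4, 3))
b = np.zeros(4)
c = np.zeros(3)
fantasy = (rng.random((20, 4)) < 0.5).astype(float)   # persistent chains, random start
for _ in range(500):
    fantasy = pcd_update(data, fantasy, W, b, c)
print(fantasy.mean(0))
```

Because the chains never reset, they are free to drift into regions far from the data, which is what lets the negative phase find and raise spurious low-energy regions.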

The honest summary is that CD trades a correctness guarantee for tractability. For years it was the workhorse that let Boltzmann machines and deep belief networks train at all. Modern EBM training has largely moved to the score-based and Langevin-based methods in the next sections, which avoid the partition function by a different and often cleaner route. CD nonetheless remains the cleanest illustration of the core idea: learn by contrasting real data against the model’s own approximate fantasies.

18.10 — Score Matching: Learning the Gradient of Log-Density

Contrastive divergence accepts a biased gradient to dodge \(Z\). Score matching asks a sharper question: is there a quantity we could learn that does not contain \(Z\) at all? Remarkably, yes. The trick is to stop modeling the density and start modeling its gradient.

Define the score of a distribution as the gradient of its log-density with respect to the input — not the parameters, the input:

\[ s_\theta(x) \;=\; \nabla_x \log p_\theta(x). \]

In words: at each point in input space, the score is the arrow pointing in the direction that most increases log-probability — “which way is more data-like from here.” Also written: \(s_\theta(x) = \big(\partial \log p_\theta/\partial x_1,\dots,\partial \log p_\theta/\partial x_d\big)\) — the gradient written out as its vector of partial derivatives.

Now substitute the Boltzmann form. Since \(\log p_\theta(x) = -E_\theta(x) - \log Z(\theta)\), and \(\log Z(\theta)\) does not depend on \(x\), its gradient with respect to \(x\) is exactly zero:

\[ \nabla_x \log p_\theta(x) \;=\; -\nabla_x E_\theta(x). \]

In words: differentiating with respect to the input (not the parameters) wipes out the constant \(Z\) entirely, leaving the score as just the downhill direction of the energy. Also written: \(s_\theta(x) = -\nabla_x E_\theta(x)\) — the score is literally the negative force field of the energy landscape.

The partition function vanishes. This is the whole idea in one line. The score is a vector field over input space — at every point it tells you which direction increases log-probability fastest, i.e. which way is “downhill” in energy. It carries no information about the absolute normalization, only about relative shape, and that is precisely the part we can learn cheaply.

Figure: The score is a vector field pointing toward high density. Shown: data density p(x) and score s(x) = ∇ₓ log p(x).

So the new goal is: learn a network \(s_\theta(x)\) whose output vector field matches the true data score \(\nabla_x \log p_{\text{data}}(x)\). The natural objective is the expected squared difference, called the (explicit) Fisher divergence:

\[ J(\theta) \;=\; \tfrac{1}{2}\,\mathbb{E}_{x\sim p_{\text{data}}}\big\|\, s_\theta(x) - \nabla_x \log p_{\text{data}}(x)\,\big\|^2. \]

In words: average, over real data, how badly the model’s arrow disagrees with the true data arrow at each point. Also written: \(J(\theta) = \tfrac12\,\mathbb{E}_{p_{\text{data}}}\big[\sum_i (s_\theta(x)_i - \partial_{x_i}\log p_{\text{data}}(x))^2\big]\) — the squared norm spelled out per coordinate.

But this looks circular — we do not know the true data score; that is the whole thing we are trying to learn. Hyvärinen’s key result (2005) is that under mild boundary conditions you can integrate by parts to eliminate the unknown true score entirely, leaving an objective that depends only on the model:

\[ J(\theta) \;=\; \mathbb{E}_{x\sim p_{\text{data}}}\!\left[\, \tfrac{1}{2}\,\|s_\theta(x)\|^2 \;+\; \operatorname{tr}\!\big(\nabla_x s_\theta(x)\big) \,\right] + \text{const}. \]

In words: keep the model’s arrows modest in length, but reward them for spreading inward toward the data — and the unknown true score has dropped out completely. Also written: \(J(\theta) = \mathbb{E}\big[\tfrac12\|s_\theta(x)\|^2 + \sum_i \partial^2 \log p_\theta/\partial x_i^2\big] + \text{const}\) — the trace of the score’s Jacobian is the sum of the log-density’s second derivatives (its Laplacian).

Read the two terms as a tug-of-war. The first, \(\tfrac12\|s_\theta\|^2\), is a “don’t exaggerate” penalty: it keeps the arrows short, so the model can’t claim steep gradients for free. The second, the Jacobian trace (equivalently \(\sum_i \partial^2 \log p_\theta / \partial x_i^2\), the sum of second derivatives), is a “point inward” reward: it pays the model for arrows that converge, all aiming toward a common center the way water flows to a drain. Big inward-pointing arrows are exactly what a density peak looks like — so the trade-off forces the arrows to converge precisely where the data piles up, and the unknown true score never appears. No \(Z\) anywhere.

Worked example, one dimension. Take a Gaussian model \(p_\theta(x) \propto e^{-\theta x^2/2}\) again, so \(\log p_\theta(x) = -\tfrac\theta2 x^2 + \text{const}\) and the model score is \(s_\theta(x) = \partial_x \log p_\theta = -\theta x\). Plug into the integrated objective: \(\tfrac12 s_\theta^2 = \tfrac12\theta^2 x^2\) and \(\partial_x s_\theta = -\theta\), so

\[ J(\theta) = \mathbb{E}_{x}\!\left[\tfrac12\theta^2 x^2 - \theta\right] = \tfrac12\theta^2\,\mathbb{E}[x^2] - \theta. \]

Set \(dJ/d\theta = \theta\,\mathbb{E}[x^2] - 1 = 0\), giving \(\theta = 1/\mathbb{E}[x^2]\) — the same maximum-likelihood answer as in §18.8, but obtained with no partition function anywhere in sight. That is the payoff: an estimator that is consistent (it recovers the right parameters) yet never normalizes.
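The same calculation can be checked numerically. This sketch evaluates the integrated objective on synthetic data (assumed here to have variance 4) over a grid of \(\theta\) values and compares the minimizer with \(1/\mathbb{E}[x^2]\). No normalizer is computed anywhere.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(scale=2.0, size=50_000)      # synthetic data, true variance 4

def J(theta):
    s = -theta * data                          # model score at each data point
    ds_dx = -theta                             # its derivative w.r.t. x (the 1-D "trace")
    return np.mean(0.5 * s**2 + ds_dx)

thetas = np.linspace(0.05, 1.0, 1901)          # grid spacing 0.0005
best = thetas[np.argmin([J(t) for t in thetas])]
print(best, 1.0 / np.mean(data**2))
```

The grid minimizer matches the closed-form value to within the grid spacing.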

Warning

The Jacobian-trace term is the catch. For a \(d\)-dimensional input, computing \(\operatorname{tr}(\nabla_x s_\theta)\) exactly costs \(d\) backward passes — fatal for images where \(d\) is in the millions. This is why plain score matching does not scale, and why two workarounds dominate practice: sliced score matching, which estimates the trace with random projections (Hutchinson’s trick) at the cost of one extra pass, and denoising score matching, which we meet next as the bridge to diffusion.

A second, subtler pitfall: score matching on clean data learns the score accurately only where data lives. In the vast empty regions between data clusters the data density is near zero, there are no training samples, and the learned score points in essentially random directions. A sampler that starts in those empty regions gets no useful guidance. This single failure mode is what motivates adding noise — and noise is exactly what links score matching to diffusion.
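Returning to the trace term flagged in the warning: Hutchinson's trick replaces the exact trace with an average of \(v^\top J v\) over random probe vectors \(v\). The sketch below uses a hypothetical linear score \(s(x) = -Ax\), so the Jacobian is simply \(-A\) and the exact trace is available for comparison. With a neural network, each Jacobian-vector product would be one autodiff pass instead of a matrix multiply.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
A = rng.normal(size=(d, d))
A = A @ A.T / d                      # a symmetric positive semi-definite matrix (toy model)

# Linear score s(x) = -A x, so its Jacobian is -A and the exact trace is -tr(A).
exact_trace = -np.trace(A)

def jacobian_vector_product(v):
    # For this linear score the Jacobian-vector product is just -A v.
    return -A @ v

n_probes = 2000
V = rng.choice([-1.0, 1.0], size=(n_probes, d))          # Rademacher probe vectors
estimates = [v @ jacobian_vector_product(v) for v in V]  # each v^T J v is unbiased for tr(J)
print(exact_trace, np.mean(estimates))
```

In sliced score matching a single probe per data point is typical, with the averaging done across the minibatch; the many probes here are only to make the agreement visible.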

18.11 — Langevin Dynamics: Sampling by Following the Score

We now have a way to learn the score \(s_\theta(x) \approx \nabla_x \log p(x)\) without ever computing \(Z\). But a generative model has to produce samples, and a vector field is not a sample. Langevin dynamics is the bridge: it is a procedure that turns a score function into a sampler by treating the score as a force and letting a particle drift down the energy landscape while being kicked by noise.

The physical picture is a speck of pollen in water. It feels a deterministic drift (gravity, or here the score pulling it toward high-density regions) plus relentless random molecular buffeting. Left alone, it does not fall to the single lowest point and freeze; instead it wanders, spending more time in deep regions and less in shallow ones, and the long-run fraction of time it spends at each location is exactly proportional to the probability density. That is the magic: pure gradient descent on energy collapses to a single nearby mode, but gradient descent plus calibrated noise visits the whole distribution.

The update rule, Langevin Monte Carlo, makes this concrete. Starting from any point \(x_0\) (even pure noise), iterate

\[ x_{t+1} \;=\; x_t \;+\; \tfrac{\epsilon}{2}\,\nabla_x \log p(x_t) \;+\; \sqrt{\epsilon}\;z_t, \qquad z_t \sim \mathcal{N}(0, I). \]

In words: take a small step in the score’s direction (toward higher density), then add a calibrated random jolt — repeat, and the parade of points traces out the target distribution. Also written: \(x_{t+1} = x_t - \tfrac{\epsilon}{2}\,\nabla_x E(x_t) + \sqrt{\epsilon}\,z_t\), since the score equals the negative energy gradient — a noisy gradient descent on energy.

The middle term is the score: a step uphill in log-density, i.e. downhill in energy. The last term is the random kick, scaled so that as the step size \(\epsilon \to 0\) and the number of steps \(\to \infty\), the distribution of \(x_t\) converges to \(p(x)\) exactly. Drop the noise term and you get ordinary gradient ascent that converges to the nearest mode and stops — a single image, not a sample from the distribution. The noise is what makes it a sampler rather than an optimizer.

Figure: Langevin sampling. Starting from noise, drift toward density plus random kicks carries a particle into the high-density region of p(x), yielding a sample.

flowchart LR
  X0["x₀ ~ noise"] --> S["score step:<br/>+ (ε/2)∇ₓ log p(x)"]
  S --> N["noise kick:<br/>+ √ε · z"]
  N --> X1["xₜ₊₁"]
  X1 -->|"repeat T steps"| S
  X1 --> OUT["sample ≈ p(x)"]

Worked example. Sample a standard Gaussian, whose score is \(\nabla_x \log p(x) = -x\). The update becomes \(x_{t+1} = x_t - \tfrac\epsilon2 x_t + \sqrt\epsilon\, z_t\). Watch how it self-corrects: when \(x_t\) is large and positive the drift \(-\tfrac\epsilon2 x_t\) pulls it back toward zero; when it is near zero the drift nearly vanishes and the noise dominates, scattering it back out. The balance of pull-back and kick reproduces the bell curve.

import numpy as np
def langevin(score, x, eps=0.01, steps=1000):
    for _ in range(steps):
        x = x + 0.5*eps*score(x) + np.sqrt(eps)*np.random.randn(*x.shape)
    return x

score = lambda x: -x                       # score of N(0,1)
samples = langevin(score, np.zeros(20000)) # start all at 0
print(samples.mean(), samples.std())       # ≈ 0.0, ≈ 1.0

Starting every particle at zero and running the chain, the empirical mean and standard deviation come out near \(0\) and \(1\): the sampler has reconstructed the target distribution using nothing but its score.

Warning

Plain Langevin mixes painfully slowly when the distribution has well-separated modes. Between two clusters the density is near zero, and on either side of the gap the score points back toward the nearer mode, so the noise has to do all the work to ferry a particle across, which can take astronomically many steps. Combined with the §18.10 observation that the learned score is unreliable in exactly those low-density gaps, naive “learn the score, then run Langevin” simply does not work on real data. The fix is the central trick of the next section: add noise at many scales, which both fills in the empty regions with usable gradient signal and smooths the landscape so the chain can travel.
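The mixing failure is easy to reproduce even with a perfect score. The sketch below assumes a target of two unit-variance Gaussians at \(\pm 5\) with equal weight, whose exact score simplifies to \(\mu\tanh(\mu x) - x\), and starts every particle in the left mode.

```python
import numpy as np

def mixture_score(x, mu=5.0):
    # exact score of 0.5*N(-mu, 1) + 0.5*N(+mu, 1); the mode responsibilities
    # collapse into a tanh, which is also numerically safe for large |x|
    return mu * np.tanh(mu * x) - x

rng = np.random.default_rng(0)
x = np.full(5000, -5.0)                    # start every particle in the left mode
eps = 0.01
for _ in range(1000):
    x = x + 0.5 * eps * mixture_score(x) + np.sqrt(eps) * rng.standard_normal(x.shape)

frac_right = np.mean(x > 0)
print(frac_right)                          # the target puts half its mass on the right
```

Under these settings essentially no particle reaches the right-hand mode, even though the target assigns it half the probability mass. The chain has sampled one mode well and the distribution badly.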

The thing to carry forward is that Langevin dynamics decouples learning from sampling. You learn a single object, the score field, by whatever means (score matching, denoising). Then sampling is a fixed, model-free recipe: drift along the score, add noise, repeat. This clean separation is exactly the structure that diffusion models exploit, and the next section shows the two are the same theory written in different notation.

18.12 — The Score-SDE View: One Theory Behind Diffusion and Score Matching

§18.5 introduced diffusion models as a discrete chain: add Gaussian noise step by step until an image becomes static, then train a network to undo each step. §18.10–18.11 introduced score matching and Langevin sampling from the entirely separate world of energy-based models. The score-SDE framework of Song and colleagues (2021) revealed that these are not two ideas but one, viewed at two resolutions. The unification is one of the most clarifying results in modern generative modeling, and it falls out of taking the noise schedule to its continuous limit.

Start with the diffusion forward process but let the steps shrink to infinitesimals. The accumulation of tiny independent Gaussian perturbations is, in the continuum, a stochastic differential equation (SDE):

\[ dx \;=\; f(x,t)\,dt \;+\; g(t)\,dw, \]

In words: in each instant the data drifts a little (the \(f\) term) and gets jostled a little (the \(g\,dw\) noise term) — pile up enough instants and a clean image dissolves into static. Also written: the discrete one-step version is \(x_{t+\Delta t} = x_t + f(x_t,t)\Delta t + g(t)\sqrt{\Delta t}\,z\), \(z\sim\mathcal N(0,I)\) — the SDE is just this Euler step in the \(\Delta t \to 0\) limit.

where \(f\) is a deterministic drift, \(g\) scales the noise, and \(dw\) is the increment of a Wiener process (Brownian motion). This forward SDE just describes data dissolving into noise as time \(t\) runs from \(0\) to \(1\) — nothing is learned here. The remarkable part is a classical result by Anderson (1982): every such forward SDE has a corresponding reverse-time SDE that runs the clock backward, turning noise back into data, and it has an explicit form:

\[ dx \;=\; \big[\,f(x,t) - g(t)^2\,\nabla_x \log p_t(x)\,\big]\,dt \;+\; g(t)\,d\bar w. \]

In words: to run the noising film backward, follow the original drift but bend it toward higher density using the score, and keep a noise term — that single correction is all it takes to turn static back into images. Also written: the only data-dependent unknown is \(\nabla_x \log p_t(x)\), the time-\(t\) score; everything else (\(f\), \(g\)) is copied from the fixed forward process.

Look at what appears in the reverse drift: \(\nabla_x \log p_t(x)\) — the score of the noised data distribution at time \(t\). The only unknown needed to reverse diffusion is the score function. Diffusion is not a different mechanism from score-based generation; training a diffusion model is estimating the time-dependent score, and sampling from it is integrating the reverse SDE — which, discretized, is Langevin dynamics with a schedule.
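As a sanity check on the two SDEs above, here is a minimal NumPy sketch under stated assumptions: the data are one-dimensional Gaussian, \(\mathcal N(\mu, s^2)\), and the forward process uses \(f(x,t) = -\tfrac12\beta x\) and \(g(t)=\sqrt{\beta}\) with a constant \(\beta\). Those choices, and every name and constant in the code, are illustrative and not taken from the text. They are picked because \(p_t\) then stays Gaussian, so the exact score can stand in for a trained network.

```python
import numpy as np

MU, S = 2.0, 0.5   # toy data distribution N(MU, S^2) (assumed)
BETA = 8.0         # constant noise rate (assumed): f = -0.5*BETA*x, g = sqrt(BETA)

def marginal(t):
    """Mean and variance of p_t when N(MU, S^2) data follow the forward SDE."""
    decay = np.exp(-BETA * t)
    return MU * np.sqrt(decay), S**2 * decay + (1.0 - decay)

def score(x, t):
    """Exact score of the Gaussian p_t; a trained s_theta(x, t) would replace this."""
    mean, var = marginal(t)
    return -(x - mean) / var

def forward_sde(x0, steps, rng):
    """Euler-Maruyama on dx = f dt + g dw, from t=0 to t=1."""
    x, dt = x0.copy(), 1.0 / steps
    for _ in range(steps):
        x = x - 0.5 * BETA * x * dt + np.sqrt(BETA * dt) * rng.standard_normal(x.shape)
    return x

def reverse_sde(x1, steps, rng):
    """Euler-Maruyama on the reverse-time SDE, stepping from t=1 back to t=0."""
    x, dt = x1.copy(), 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        drift = -0.5 * BETA * x - BETA * score(x, t)   # f - g^2 * score
        x = x - drift * dt + np.sqrt(BETA * dt) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
data = MU + S * rng.standard_normal(20_000)
noise = forward_sde(data, 1_000, rng)                           # data dissolved into noise
samples = reverse_sde(rng.standard_normal(20_000), 1_000, rng)  # fresh noise turned into data
print("forward end:", noise.mean(), noise.std())
print("reverse end:", samples.mean(), samples.std())
```

The forward run should end near a standard normal and the reverse run near the original mean and spread. The only place the data distribution enters the reverse loop is the call to `score`, which is the point of the paragraph above.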

flowchart LR
  subgraph Forward["forward SDE — fixed, no learning"]
    A["x₀ data"] -->|"dx = f dt + g dw"| B["x₁ pure noise"]
  end
  subgraph Reverse["reverse SDE — needs the score"]
    C["x₁ noise"] -->|"dx = [f − g²∇ₓlog pₜ] dt + g dw̄"| D["x₀ sample"]
  end
  S["score network sθ(x,t)<br/>trained by denoising score matching"] -.->|"supplies ∇ₓ log pₜ(x)"| Reverse

This is where the §18.10 and §18.11 pitfalls dissolve. Recall the twin problems: the learned score is garbage in low-density regions, and Langevin cannot cross between modes. The forward SDE fixes both at once by indexing the score on a continuum of noise levels \(t\). At high noise (large \(t\)) the distribution \(p_t\) is broad and nearly Gaussian — its score is well-defined everywhere, including the empty gaps, and the landscape has one big basin so mixing is trivial. At low noise (small \(t\)) the distribution is sharp and detailed but the sampler is already near the right region. Sampling runs \(t\) from high to low: start in the easy smooth regime, follow the score, and gradually anneal toward the sharp data distribution. This is annealed Langevin dynamics, and it is exactly what the reverse SDE prescribes.
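The mode-crossing argument can be checked on a toy problem. The sketch below assumes a one-dimensional mixture of two narrow Gaussians with unequal weights, for which the noised score is available in closed form; the weights, centres, noise schedule, and step sizes are all illustrative choices, not values from the text. Plain Langevin on the clean density and annealed Langevin over a decreasing noise schedule start from the same broad initial points.

```python
import numpy as np

W = np.array([0.2, 0.8])          # mode weights (assumed toy values)
CENTERS = np.array([-4.0, 4.0])   # mode centres (assumed)
S = 0.5                           # width of each mode (assumed)

def noised_score(x, sigma):
    """Exact score of the mixture after adding N(0, sigma^2) noise."""
    var = S**2 + sigma**2
    logits = np.log(W) - (x[:, None] - CENTERS) ** 2 / (2.0 * var)
    logits -= logits.max(axis=1, keepdims=True)   # stabilise the softmax
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)       # responsibility of each mode
    return (resp * (CENTERS - x[:, None])).sum(axis=1) / var

def langevin(x, sigma, step, n_steps, rng):
    """Unadjusted Langevin updates against the score at one fixed noise level."""
    for _ in range(n_steps):
        x = x + 0.5 * step * noised_score(x, sigma) + np.sqrt(step) * rng.standard_normal(x.shape)
    return x

def plain_langevin(x0, rng):
    """Langevin on the clean (sigma = 0) density only."""
    return langevin(x0, sigma=0.0, step=0.02, n_steps=1_000, rng=rng)

def annealed_langevin(x0, rng, sigmas=np.geomspace(10.0, 0.05, 12)):
    """Run Langevin at each noise level, from broad to sharp, step size ~ sigma^2."""
    x = x0
    for sigma in sigmas:
        x = langevin(x, sigma, step=0.1 * sigma**2, n_steps=100, rng=rng)
    return x

rng = np.random.default_rng(0)
x0 = 10.0 * rng.standard_normal(4_000)            # broad initialisation
frac_plain = np.mean(plain_langevin(x0, rng) > 0)
frac_annealed = np.mean(annealed_langevin(x0, rng) > 0)
print("mass in the right-hand mode, plain   :", frac_plain)
print("mass in the right-hand mode, annealed:", frac_annealed)
```

The true mass of the right-hand mode is 0.8. Plain Langevin tends to leave each point in whichever basin it started in, so its split reflects the initialisation, while the annealed run can rebalance mass at the high-noise levels where the two modes still overlap.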

Training uses denoising score matching, the scalable cousin promised in §18.10. The insight (Vincent, 2011) is that for data corrupted by Gaussian noise of variance \(\sigma^2\), the score of the corruption kernel \(p_\sigma(\tilde x \mid x)\) has a closed form that points from the noisy sample back toward the clean one, and regressing onto it recovers, in expectation, the score of the noised marginal \(p_\sigma(\tilde x)\). Concretely, if \(\tilde x = x + \sigma\,\epsilon\) with \(\epsilon \sim \mathcal N(0,I)\), then

\[ \nabla_{\tilde x} \log p_\sigma(\tilde x \mid x) \;=\; -\,\frac{\tilde x - x}{\sigma^2} \;=\; -\,\frac{\epsilon}{\sigma}. \]

In words: the best direction to denoise a corrupted sample is simply “head back toward the clean original,” scaled by the noise level — which is exactly the negative of the noise that was added. Also written: equivalently \(s_\theta(\tilde x) \approx -\epsilon/\sigma\), so learning the score is the same task as predicting the added noise \(\epsilon\) — the link to the diffusion noise-prediction objective.

So matching the score reduces to predicting the noise that was added — no Jacobian trace, no partition function, just a regression. The training objective becomes the familiar denoising loss

\[ \mathcal L(\theta) \;=\; \mathbb{E}_{t,\,x,\,\epsilon}\Big[\,\lambda(t)\,\big\| s_\theta(\tilde x, t) + \tfrac{\epsilon}{\sigma_t} \big\|^2\,\Big], \]

In words: across all noise levels, push the model’s score to equal “minus the added noise over sigma,” weighted so each noise level contributes its fair share. Also written: reparametrizing \(s_\theta = -\epsilon_\theta/\sigma_t\) turns this into \(\mathbb{E}[\lambda(t)\sigma_t^{-2}\|\epsilon - \epsilon_\theta(\tilde x,t)\|^2]\) — exactly the diffusion noise-prediction MSE of §18.5, up to the weighting.

which is, up to the weighting \(\lambda(t)\) and a reparametrization, identical to the noise-prediction loss that §18.5 derived from a completely different starting point. Two derivations, one objective.
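A small numerical check makes the "just a regression" claim concrete. The sketch assumes one-dimensional Gaussian data at a single noise level, so the true noised score is known, and it uses a hypothetical one-parameter model \(s_\theta(\tilde x) = a\,\tilde x\) fitted by least squares in place of a network. The function name and default values are illustrative.

```python
import numpy as np

def dsm_linear_fit(s_data=1.5, sigma=0.8, n=200_000, seed=0):
    """Fit s(x) = a*x to denoising targets and compare with the true marginal score."""
    rng = np.random.default_rng(seed)
    x = s_data * rng.standard_normal(n)        # clean data ~ N(0, s_data^2)
    eps = rng.standard_normal(n)
    x_tilde = x + sigma * eps                  # corrupted sample
    target = -eps / sigma                      # score of the corruption kernel
    # Least-squares slope of the denoising score matching regression.
    a = (x_tilde @ target) / (x_tilde @ x_tilde)
    # The noised marginal is N(0, s_data^2 + sigma^2), whose score is a_true * x.
    a_true = -1.0 / (s_data**2 + sigma**2)
    # Same data, noise-prediction parametrisation eps_theta(x) = b*x.
    b = (x_tilde @ eps) / (x_tilde @ x_tilde)
    return a, a_true, b

a, a_true, b = dsm_linear_fit()
print("fitted score slope:", a, " true slope:", a_true)
print("noise-prediction slope / -sigma:", -b / 0.8)
```

The regression never sees the marginal density, yet its slope lands on the marginal score, and the noise-prediction fit is the same number rescaled by \(-1/\sigma\).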

| View | Object learned | Training signal | Sampling procedure |
| --- | --- | --- | --- |
| EBM / score matching | score \(\nabla_x \log p(x)\) | Fisher divergence (or denoising) | Langevin dynamics |
| Diffusion (discrete) | noise \(\epsilon\) to remove per step | denoising MSE per timestep | iterative denoising chain |
| Score-SDE (continuous) | time-indexed score \(s_\theta(x,t)\) | denoising score matching | integrate reverse SDE |

Tip

The score-SDE picture also hands you a bonus: every reverse SDE has a deterministic twin, the probability-flow ODE \(dx = \big[f(x,t) - \tfrac12 g(t)^2\,\nabla_x \log p_t(x)\big]\,dt\), which shares the same time-marginals \(p_t\) but has no noise term. Integrating that ODE instead of the SDE gives a deterministic sampler that typically needs fewer steps. Crucially, it is an invertible map between data and noise, so you can compute likelihoods through the change-of-variables formula and interpolate cleanly in the latent space. This is the bridge from diffusion back to normalizing flows: the probability-flow ODE is a continuous-time flow whose velocity field is built from the learned score.
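The invertibility claim can be illustrated with a short sketch. It assumes the same kind of toy setup used for analysis only: one-dimensional Gaussian data, a forward drift \(f = -\tfrac12\beta x\) with \(g = \sqrt\beta\), and the exact score in place of a network, so the ODE drift is \(f - \tfrac12 g^2\,\nabla_x \log p_t\). The names `encode` and `decode` and all constants are hypothetical.

```python
import numpy as np

MU, S, BETA = 2.0, 0.5, 8.0   # toy data N(MU, S^2) and constant noise rate (assumed)

def marginal(t):
    """Mean and variance of the Gaussian p_t under the assumed forward SDE."""
    decay = np.exp(-BETA * t)
    return MU * np.sqrt(decay), S**2 * decay + (1.0 - decay)

def velocity(x, t):
    """Probability-flow drift f - 0.5 * g^2 * score, using the exact Gaussian score."""
    mean, var = marginal(t)
    score = -(x - mean) / var
    return -0.5 * BETA * x - 0.5 * BETA * score

def encode(x0, steps=2_000):
    """Data -> noise: Euler integration of the ODE from t=0 to t=1."""
    x, dt = np.asarray(x0, dtype=float), 1.0 / steps
    for i in range(steps):
        x = x + velocity(x, i * dt) * dt
    return x

def decode(x1, steps=2_000):
    """Noise -> data: the same ODE integrated from t=1 back to t=0."""
    x, dt = np.asarray(x1, dtype=float), 1.0 / steps
    for i in range(steps):
        x = x - velocity(x, 1.0 - i * dt) * dt
    return x

x0 = np.array([1.0, 1.5, 2.0, 2.5, 3.0])
latent = encode(x0)          # deterministic latent code for each data point
roundtrip = decode(latent)   # should land back near x0
print("latent   :", latent)
print("roundtrip:", roundtrip)
```

No random numbers appear anywhere, so the same latent always decodes to the same sample, and the round trip returns to the starting points up to the Euler discretization error.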

The conceptual takeaway closes the loop on the whole family. Energy-based models posed the problem — model any density, but pay at the partition function. Score matching removed \(Z\) by learning gradients instead of densities. Langevin dynamics turned those gradients into samples. Diffusion discovered, seemingly independently, that adding and removing noise at many scales makes the whole scheme work on real high-dimensional data. The score-SDE framework shows these were always the same idea: learn the score of data smoothed across all noise levels, then integrate backward from noise to data. What looked like four separate techniques is one continuous theory seen through different windows.

18.13 — Vector-Quantized Models: Discrete Latent Codes

Every latent space so far has been continuous — a VAE’s \(z\) is a vector of real numbers, an autoencoder’s bottleneck is a smooth cloud. But much of the data we care about is more naturally described as a sequence of discrete symbols: a sentence is a string of words, and it turns out images and audio compress beautifully into a small alphabet of reusable visual or acoustic “words” too. The VQ-VAE (Vector-Quantized VAE) builds exactly that — an autoencoder whose latent code is forced to come from a fixed, finite codebook of vectors rather than the whole continuous space.

The intuition: imagine the decoder can only paint using a palette of, say, 512 named colors. The encoder looks at each patch of the image, finds the nearest palette entry, and writes down its index. The reconstruction is built entirely from palette colors. Because the codebook is small and shared, the model is pushed to discover a compact vocabulary of recurring parts — eyes, edges, textures — and the image becomes a grid of integer indices, exactly the form a language model knows how to predict.

Concretely, the encoder produces a continuous vector \(z_e(x)\), and quantization snaps it to its nearest neighbor among the \(K\) codebook vectors \(\{e_1, \dots, e_K\}\):

\[ z_q(x) = e_k, \qquad k = \arg\min_j \lVert z_e(x) - e_j \rVert_2 \]

In words: replace the encoder’s raw vector with whichever codebook entry sits closest to it, and remember that entry’s index. Also written: \(z_q(x) = e_{k}\) with \(k = \operatorname{argmin}_j \lVert z_e(x) - e_j \rVert_2^2\) — squaring the distance changes nothing about which entry wins, since \(\arg\min\) ignores monotone transforms.
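A tiny worked example of the lookup, as a NumPy sketch: the four-entry codebook and the 2×2 grid of encoder outputs are made-up numbers chosen so the nearest entries can be read off by eye, and `quantize` is a hypothetical helper name.

```python
import numpy as np

# Hypothetical codebook with K = 4 entries of dimension 2.
codebook = np.array([[0.0, 0.0],
                     [1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])

def quantize(z_e, codebook):
    """Snap every vector in a (H, W, dim) grid to its nearest codebook entry."""
    # Squared distance from each grid vector to each code: shape (H, W, K).
    d2 = ((z_e[..., None, :] - codebook) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=-1)        # integer index grid, shape (H, W)
    return idx, codebook[idx]       # indices and the quantized vectors z_q

# Made-up encoder outputs for a 2x2 grid of patches.
z_e = np.array([[[0.1, -0.2], [0.9, 0.2]],
                [[0.2,  0.8], [0.7, 0.9]]])

idx, z_q = quantize(z_e, codebook)
print(idx)   # the "image" as a grid of code indices
```

Each patch vector sits closest to a different corner of the unit square, so the index grid comes out as `[[0, 1], [2, 3]]`, and `z_q` holds the corresponding codebook rows.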

That \(\arg\min\) has zero gradient almost everywhere, which would block training. The fix is the straight-through estimator: in the backward pass, pretend the quantizer was the identity and copy the decoder’s gradient straight back to the encoder, as if \(z_q = z_e\). The full loss has three parts — reconstruction, plus two terms that pull the codebook and the encoder toward each other:

\[ \mathcal{L} = \underbrace{\lVert x - g(z_q) \rVert^2}_{\text{reconstruction}} + \underbrace{\lVert \operatorname{sg}[z_e] - e \rVert^2}_{\text{codebook}} + \beta \underbrace{\lVert z_e - \operatorname{sg}[e] \rVert^2}_{\text{commitment}} \]

In words: rebuild the input well, drag the chosen codebook vector toward the encoder’s output, and (more gently) drag the encoder’s output toward its chosen codebook vector so it commits to a code instead of drifting. Also written: with \(\operatorname{sg}[\cdot]\) the stop-gradient operator, the codebook term moves only \(e\) and the commitment term moves only \(z_e\) — the same squared distance, but each copy lets gradient flow to just one side.

Here \(\operatorname{sg}[\cdot]\) is the stop-gradient operator — it passes its value forward but blocks gradient in the backward pass, which is how the two near-identical squared-distance terms end up updating different parameters. The commitment weight \(\beta\) (often \(\approx 0.25\)) keeps the encoder from hopping between codes.

flowchart LR
  X["x"] --> E["encoder"] --> ZE["z_e (continuous)"]
  ZE --> Q["nearest-neighbor<br/>lookup in codebook"]
  CB["codebook<br/>e₁ … e_K"] --> Q
  Q --> ZQ["z_q = e_k (discrete)"]
  ZQ --> D["decoder"] --> XH["x̂"]

import torch, torch.nn as nn, torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, K=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(K, dim)
        self.codebook.weight.data.uniform_(-1/K, 1/K)
        self.beta = beta
    def forward(self, z_e):                       # z_e: (B, dim)
        d = torch.cdist(z_e, self.codebook.weight) # distance to every code
        k = d.argmin(dim=1)                        # nearest code index
        z_q = self.codebook(k)                     # quantized vector
        codebook_loss   = F.mse_loss(z_q, z_e.detach())
        commitment_loss = F.mse_loss(z_q.detach(), z_e)
        loss = codebook_loss + self.beta * commitment_loss
        z_q = z_e + (z_q - z_e).detach()           # straight-through estimator
        return z_q, k, loss

The line z_q = z_e + (z_q - z_e).detach() is the straight-through trick in one expression: the forward value equals z_q, but the gradient flows as if it were z_e.

The real power shows up in two-stage generation. Stage one trains the VQ-VAE to turn images into grids of codebook indices and back. Stage two throws away the pixels and trains a powerful autoregressive model (a PixelCNN or a Transformer, §18.6) over those index grids — learning \(p(\text{indices})\) exactly the way a language model learns \(p(\text{tokens})\). To generate, you sample a fresh grid of indices from the prior, look them up in the codebook, and decode to a full image. This factorization — a learned discrete tokenizer plus a sequence model over the tokens — is the blueprint behind VQ-VAE-2, VQGAN (which adds an adversarial loss for crisper textures), DALL·E (a Transformer over image tokens conditioned on text), and the neural audio codecs (SoundStream, EnCodec) that tokenize sound for speech and music models.

| Model | Latent | What it adds over VQ-VAE |
| --- | --- | --- |
| VQ-VAE | Discrete codes + PixelCNN prior | The base recipe |
| VQ-VAE-2 | Hierarchical (coarse + fine) codes | Multi-scale codebooks for sharper, larger images |
| VQGAN | Discrete codes + adversarial + perceptual loss | Crisper textures; pairs with a Transformer prior |
| DALL·E (v1) | Image tokens, text-conditioned Transformer | Text-to-image via next-token prediction |
| EnCodec / SoundStream | Discrete audio codes (residual VQ) | Tokenizes waveforms for audio language models |

Warning

The classic VQ-VAE failure is codebook collapse: a handful of codes get used for everything while the rest go dead, shrinking the effective vocabulary. Common cures are re-initializing unused codes to recent encoder outputs, updating the codebook with an exponential moving average instead of gradients, or replacing hard nearest-neighbor with a soft/Gumbel assignment during early training.
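Collapse is easy to monitor from the index grids alone. The sketch below is one possible diagnostic and one possible cure, not a prescribed recipe: it counts how often each code is used, reports the usage perplexity (the effective number of codes in use, a common diagnostic that is an assumption here), and re-initializes dead codes to randomly chosen recent encoder outputs. Both function names are hypothetical.

```python
import numpy as np

def codebook_usage(indices, K):
    """Return per-code counts, usage perplexity, and the indices of unused codes."""
    counts = np.bincount(np.ravel(indices), minlength=K)
    p = counts / counts.sum()
    nz = p[p > 0]                                    # skip zero entries in the entropy
    perplexity = float(np.exp(-(nz * np.log(nz)).sum()))
    dead = np.flatnonzero(counts == 0)
    return counts, perplexity, dead

def reinit_dead_codes(codebook, dead, z_e, rng):
    """Copy the codebook, replacing dead rows with randomly picked encoder outputs."""
    new = codebook.copy()
    if len(dead):
        picks = rng.choice(len(z_e), size=len(dead), replace=len(dead) > len(z_e))
        new[dead] = z_e[picks]
    return new

# Toy batch: only codes 0 and 1 out of K = 4 are ever selected.
indices = np.array([0, 0, 1, 1])
counts, perplexity, dead = codebook_usage(indices, K=4)

rng = np.random.default_rng(0)
codebook = np.zeros((4, 2))
z_e = rng.standard_normal((4, 2))                    # stand-in for recent encoder outputs
revived = reinit_dead_codes(codebook, dead, z_e, rng)
print(counts, perplexity, dead)
```

A perplexity far below \(K\) is the warning sign: here two of four codes carry all the traffic, so the perplexity is 2 and codes 2 and 3 are flagged for revival.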

18.14 — Fast Sampling: Distillation and Consistency Models

Diffusion’s one glaring weakness is speed. A GAN paints an image in a single forward pass; a diffusion model traditionally needs tens to hundreds of denoising steps, each a full U-Net evaluation. For interactive tools and on-device generation that is far too slow. The last few years produced a clean line of attack: keep diffusion’s stable training and excellent quality, but collapse the long sampling chain into a handful of steps — or even one.

The first lever is a better ODE solver. Recall the probability-flow ODE from §18.12: sampling is really just integrating a deterministic differential equation from noise to data, and integrating an ODE is a well-studied numerical problem. Generic samplers like DDIM, DPM-Solver, and higher-order exponential integrators take much larger, smarter steps than naive Euler, cutting a 1000-step process down to 20–50 steps with no retraining at all — you just swap the sampler. This is the easy, free win and the first thing to reach for.
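To see why solver order matters at small step counts, the sketch below integrates a toy probability-flow ODE with first-order Euler and second-order Heun steps and compares both against the exact answer. Heun's method is used here only as a simple stand-in for a higher-order solver; it is not DDIM or DPM-Solver. The setup is an assumption made so that the exact solution is known: one-dimensional Gaussian data, forward drift \(-\tfrac12\beta x\), \(g=\sqrt\beta\), and the exact score.

```python
import numpy as np

MU, S, BETA = 2.0, 0.5, 8.0   # toy data N(MU, S^2) and constant noise rate (assumed)

def marginal(t):
    """Mean and variance of the Gaussian p_t under the assumed forward SDE."""
    decay = np.exp(-BETA * t)
    return MU * np.sqrt(decay), S**2 * decay + (1.0 - decay)

def velocity(x, t):
    """Probability-flow drift f - 0.5 * g^2 * score with the exact Gaussian score."""
    mean, var = marginal(t)
    return -0.5 * BETA * x + 0.5 * BETA * (x - mean) / var

def exact_decode(x1):
    """Closed-form noise -> data map for this Gaussian toy problem."""
    m1, v1 = marginal(1.0)
    return MU + S * (x1 - m1) / np.sqrt(v1)

def euler_decode(x1, steps):
    """First-order integration from t=1 to t=0: one score evaluation per step."""
    x, dt = np.asarray(x1, dtype=float), 1.0 / steps
    for i in range(steps):
        x = x - velocity(x, 1.0 - i * dt) * dt
    return x

def heun_decode(x1, steps):
    """Second-order integration: average the slopes at both ends of each step."""
    x, dt = np.asarray(x1, dtype=float), 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        v_start = velocity(x, t)
        x_pred = x - v_start * dt                 # Euler predictor
        v_end = velocity(x_pred, t - dt)
        x = x - 0.5 * (v_start + v_end) * dt      # trapezoidal corrector
    return x

x1 = np.linspace(-3.0, 3.0, 13)
errors = {}
for steps in (20, 50):
    errors[steps] = (np.max(np.abs(euler_decode(x1, steps) - exact_decode(x1))),
                     np.max(np.abs(heun_decode(x1, steps) - exact_decode(x1))))
    print(steps, "steps  Euler error:", errors[steps][0], " Heun error:", errors[steps][1])
```

Heun spends two score evaluations per step, so when comparing cost rather than step count, set its error against Euler run with twice as many steps.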

The second lever is distillation: train a fast student network to reproduce, in few steps, what the slow teacher produces in many. Progressive distillation halves the step count repeatedly — a student learns to take one step that matches two teacher steps, then becomes the teacher for the next halving — walking 1024 steps down to 8, then 4, then 2.
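The halving arithmetic is worth spelling out, since each halving is a full round of student training. A minimal sketch, with a hypothetical helper name:

```python
def halving_schedule(start_steps, target_steps):
    """List the sampler step counts visited when halving down to target_steps."""
    schedule = [start_steps]
    while schedule[-1] > target_steps:
        schedule.append(schedule[-1] // 2)
    return schedule

schedule = halving_schedule(1024, 4)
rounds = len(schedule) - 1      # one distillation round per halving
print(schedule, "->", rounds, "rounds")
```

Going from 1024 steps to 4 takes eight rounds, because the step count shrinks geometrically while the number of training rounds grows only logarithmically.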

The most elegant idea in this family is the consistency model. Its insight: along the entire probability-flow trajectory from noise to a given clean image, every point should map back to the same clean image. So train a single network \(f_\theta(x_t, t)\) to predict that common endpoint directly, from any noise level, and enforce that its predictions are self-consistent across adjacent points on the trajectory:

\[ f_\theta(x_t, t) \approx f_\theta(x_{t'}, t') \quad \text{for points on the same trajectory}, \qquad f_\theta(x, 0) = x \]

In words: wherever you are along the path from static to a particular image, the network should jump straight to that same final image — and at zero noise it must return the image untouched. Also written: the boundary condition \(f_\theta(x,0)=x\) plus trajectory-invariance is what makes a single evaluation a valid one-step generator; more steps just refine it.

Because the network is trained to leap straight to the answer, generation can be single-step — draw noise, evaluate \(f_\theta\) once, done — yet you can optionally take a few more steps to trade compute for quality, the same graceful dial diffusion offers. Consistency models can be distilled from a pretrained diffusion teacher or trained standalone from scratch.
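The self-consistency property can be checked on a toy problem where the ideal consistency function is known in closed form. The sketch assumes one-dimensional Gaussian data, forward drift \(-\tfrac12\beta x\), \(g=\sqrt\beta\), and the exact score; under those assumptions the probability-flow trajectories are available analytically and `consistency_fn` below is the exact endpoint map that a trained \(f_\theta\) would approximate. All names and constants are illustrative.

```python
import numpy as np

MU, S, BETA = 2.0, 0.5, 8.0   # toy data N(MU, S^2) and constant noise rate (assumed)

def marginal(t):
    """Mean and variance of the Gaussian p_t under the assumed forward SDE."""
    decay = np.exp(-BETA * t)
    return MU * np.sqrt(decay), S**2 * decay + (1.0 - decay)

def velocity(x, t):
    """Probability-flow drift with the exact Gaussian score."""
    mean, var = marginal(t)
    return -0.5 * BETA * x + 0.5 * BETA * (x - mean) / var

def consistency_fn(x, t):
    """Ideal consistency function for this toy case: jump from (x, t) to the t=0 endpoint."""
    mean, var = marginal(t)
    return MU + S * (x - mean) / np.sqrt(var)

def trajectory(x1, steps=2_000, keep_every=400):
    """Integrate the ODE from t=1 to t=0 with Euler, recording a few (t, x) points."""
    x, dt = float(x1), 1.0 / steps
    points = [(1.0, x)]
    for i in range(steps):
        x = x - velocity(x, 1.0 - i * dt) * dt
        if (i + 1) % keep_every == 0:
            points.append((1.0 - (i + 1) * dt, x))
    return points

points = trajectory(x1=-1.0)
endpoints = [consistency_fn(x, t) for t, x in points]
spread = max(endpoints) - min(endpoints)
print("predicted endpoint from each point on the path:", endpoints)
print("spread:", spread)
```

Every recorded point on the path maps to nearly the same clean value, with the small spread coming from the Euler integration, and at \(t=0\) the function returns its input unchanged, which is the boundary condition in the formula above.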

flowchart LR
  N["x_T noise"] -.->|"diffusion: T small steps"| I1["image"]
  N2["x_T noise"] -->|"consistency: 1 big jump f_θ"| I2["image"]
  XT["any xₜ on the path"] --> F["f_θ(xₜ,t)"] --> SAME["same clean x₀"]

In practice you pull a distilled few-step or one-step model straight off the shelf rather than training one:

# A distilled few-step diffusion model: high quality in ~4 steps
from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16
).to("cuda")
# guidance off, a handful of steps instead of 50
img = pipe("a red fox in snow, photorealistic",
           num_inference_steps=4, guidance_scale=0.0).images[0]

Figure: many small denoising steps vs. one consistency leap. Diffusion: noise → … → image (T steps). Consistency: noise → image (1 step).

| Approach | Steps | Retraining? | Idea |
| --- | --- | --- | --- |
| Better ODE solver (DDIM, DPM-Solver) | 20–50 | None | Larger, smarter integration steps |
| Progressive distillation | 4–8 | Distill from teacher | Halve step count repeatedly |
| Consistency model | 1–4 | Distill or from scratch | Map any noise level to the same endpoint |

Tip

The speed ladder, cheapest first: swap in a better ODE sampler (free, no training) → distill to a few-step student → distill or train a one-step consistency model. Each rung buys speed, but the later rungs cost training compute and usually a little quality, so stop at the first one fast enough for your use case.

18.15 — Quick reference

| Model / term | What it is | When / why to reach for it |
| --- | --- | --- |
| Autoencoder | Encoder–bottleneck–decoder trained to reconstruct its input | Compression, denoising, anomaly detection; baseline for learned representations |
| Denoising / sparse AE | Autoencoder fed corrupted input or penalized for dense codes | Force structure-aware or interpretable features instead of a memorized copy |
| VAE | Encodes a distribution per input; trained on the ELBO | Need a smooth latent space to sample, interpolate, or optimize over |
| Reparameterization trick | Write \(z=\mu+\sigma\odot\epsilon\) so gradients flow past sampling | The reason a VAE can be trained by backprop at all |
| GAN | Generator vs. discriminator in a minimax game | Sharp images in one forward pass when you don’t need a likelihood |
| DCGAN | GAN with convolutional generator/discriminator | The default “basic image GAN” architecture |
| cGAN | GAN conditioned on a label in \(G\) and \(D\) | Class-controlled generation (“make a 7”); basis of pix2pix |
| CycleGAN | Two generators + cycle-consistency loss | Unpaired image-to-image translation (horse↔zebra) |
| StyleGAN | Style vector injected at every resolution | Controllable, photorealistic faces; style mixing |
| Diffusion model | Learn to reverse a fixed noising process via MSE on a U-Net | Best quality, stable training; today’s state of the art for images |
| Classifier-free guidance | Extrapolate from unconditional toward conditional score | Steer diffusion with a prompt; trade diversity for prompt adherence |
| Latent diffusion | Run diffusion in a compressed latent grid | Make text-to-image fast enough to be practical (Stable Diffusion) |
| Autoregressive model | Factor \(p(x)=\prod_i p(x_i\mid x_{<i})\), one element at a time | Exact likelihood + top quality; accept slow sequential sampling |
| Normalizing flow | Invertible net + change-of-variables | Exact density and fast sampling; limited expressiveness |
| RBM | Bipartite energy-based model trained with contrastive divergence | Historical: layer-wise pretraining; cleanest energy-based example |
| Score matching | Learn \(\nabla_x\log p(x)\), which has no partition function | Train energy-based models without computing \(Z\) |
| Langevin dynamics | Drift along the score + calibrated noise | Turn a learned score field into a sampler |
| VQ-VAE / VQGAN | Autoencoder with a discrete codebook latent | Tokenize images/audio so a Transformer can generate them |
| Consistency model | Maps any noise level straight to the clean sample | One- to few-step generation when diffusion is too slow |

18.16 — Key takeaways

  • Generative models learn the data distribution so they can synthesize new samples, rather than mapping inputs to labels.
  • Autoencoders compress through a bottleneck and reconstruct; denoising and sparse variants force more useful codes, but the plain latent space has holes you can’t sample from.
  • VAEs make the latent space smooth by encoding distributions, optimizing the ELBO (reconstruction − KL), with the reparameterization trick to allow backprop through sampling; samples are smooth but slightly blurry.
  • GANs pit a generator against a discriminator in a minimax game — sharp samples, no likelihood, but plagued by mode collapse and training instability; judge by samples, not loss (and stabilize with tricks like the Wasserstein critic).
  • Evaluation of generative models needs purpose-built metrics — FID (feature-distribution distance) is the standard for images, with precision/recall splitting quality from coverage.
  • GAN variants added capability: DCGAN (convolutional stability), cGAN (class-conditional generation), CycleGAN (unpaired translation via cycle-consistency), StyleGAN (controllable, photorealistic faces).
  • Diffusion models learn to reverse a fixed noising process with a stable MSE objective on a U-Net; classifier-free guidance steers them with prompts and latent diffusion makes them efficient — now the state of the art.
  • Autoregressive and flow-based models give exact likelihoods: autoregressive factorizes \(p(x)\) via the chain rule (slow sequential sampling), flows use invertible transforms and change-of-variables (fast, but limited expressiveness).
  • Boltzmann machines / RBMs are energy-based models trained with contrastive divergence; historically pivotal for deep-learning pretraining and conceptually echoed in today’s score-based diffusion.
  • Energy-based models, score matching, Langevin dynamics, and diffusion are one theory — learn the score of data smoothed across noise levels, then integrate backward from noise to data; the partition function is dodged by modeling gradients, not densities.
  • Vector-quantized models (VQ-VAE/VQGAN) turn data into grids of discrete codebook tokens, so a Transformer can generate by next-token prediction — the recipe behind DALL·E and neural audio codecs.
  • Fast sampling closes diffusion’s speed gap: better ODE solvers need no retraining, distillation collapses the chain to a few steps, and consistency models map any noise level straight to the final sample for one-step generation.

18.17 — See also

  • Neural Networks (Core) — the encoder, decoder, generator, and discriminator are all standard nets; backprop and activations underpin everything here.
  • Convolutional Neural Networks — the backbone of DCGAN, StyleGAN, and the U-Net inside diffusion models.
  • Dimensionality Reduction — autoencoders are nonlinear cousins of PCA; the latent space is a learned low-dimensional manifold.
  • Probability & Statistics — KL divergence, Gaussian priors, the ELBO, the change-of-variables formula, and the Boltzmann distribution all live here.
  • Large Language Models — autoregressive text generation, the other great branch of generative modeling; the same next-token machinery powers VQ-VAE’s discrete-token priors.
  • Computer Vision — the primary proving ground for GANs and diffusion (image synthesis, translation, super-resolution).
  • Probabilistic Graphical Models — Boltzmann machines are undirected graphical models; the energy/partition-function machinery originates there.

↪ The thread continues → Chapter 19 · 👁️ Computer Vision

We now have the architectures. The next chapters put them to work on the real world, beginning with the sense that consumes most of AI’s compute — vision.



© Kader Mohideen