Chapter 13 — 🕸️ Probabilistic Graphical Models

📖 All chapters | ← 12 · 🎯 Model Evaluation & Tuning | 14 · 🧠 Neural Networks (Core) →

📚 Jump to any chapter

🧮 Mathematical Foundations

🧭 The ML Workflow

🧩 Classical Machine Learning

🎲 Probabilistic Models

13 · 🕸️ Probabilistic Graphical Models

🧠 Deep Learning

🗣️ Applied AI: Vision, Language, Audio & Time

🕹️ Reinforcement Learning

25 · 🕹️ Reinforcement Learning

🛠️ Applied ML Systems & Industries

🚀 Production, Tooling & Infrastructure

📚 Classical & Symbolic AI

⚖️ Responsible AI & Frontier

🎓 Advanced & Specialized Topics

🎚️ Post-Training & Fine-Tuning

🚢 Model Serving & Deployment

47 · 🚢 Model Serving & Deployment in Production

Probabilistic graphical models (PGMs) are the bridge between probability theory and graph theory: they let you draw a picture of how random variables depend on one another, and then read off — and compute — the answers to questions like “given what I observed, what’s likely true about what I didn’t?” They matter because real systems are full of uncertainty and structure at the same time, and a PGM captures both: the graph encodes which variables interact, and the probabilities quantify how strongly. This chapter sits in the Probabilistic Models part of the roadmap, downstream of probability and statistics and upstream of the deep generative models that inherit its vocabulary (latent variables, inference, the ELBO).

🧭 In context: Probabilistic Models · used for reasoning under uncertainty, structured prediction, and modeling hidden causes · the one key idea: a graph factorizes a giant joint distribution into small, local pieces you can actually compute with.

💡 Remember this: the graph’s missing edges are the model — they declare which variables are conditionally independent, and those independencies are exactly what shrink an intractable joint distribution into small local factors you can compute and infer with.

13.1 — Bayesian Networks

A Bayesian network (or belief network) is a directed acyclic graph (DAG) in which each node is a random variable and each arrow points from a cause to an effect (more precisely, from a variable to one that depends directly on it). The whole point is factorization: instead of storing one enormous joint probability table over all variables, you store one small conditional probability table (CPT) per node, giving the probability of that node given its parents. The joint is then the product of these local tables:

\[P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \text{parents}(X_i))\]

In words: the probability of the whole world is the product of “each variable given just the variables that point at it” — every node only has to know about its own parents. Also written: \(\log P(X_1,\dots,X_n) = \sum_{i=1}^{n} \log P(X_i \mid \text{parents}(X_i))\) — turning the product into a sum of local log-tables, which is how engines actually accumulate it.

Think of it like a company org chart: to know what any one person does, you only need to know what their direct manager told them, not the entire company’s state. The arrows carry the “orders,” and each person keeps a small lookup card (“if my manager says X, I do Y with probability p”) instead of one company-wide rulebook.

The savings are dramatic. Five binary variables have a joint table of \(2^5 = 32\) entries. If the graph says each variable has at most two parents, the CPTs together hold far fewer numbers — and each number is something a human or a dataset can actually estimate.

The classic teaching example is the sprinkler network. It rained or it didn’t; the sprinkler was on or off; either can make the grass wet. Cloudy weather makes rain more likely and the sprinkler less likely.

graph TD
    C[Cloudy] --> R[Rain]
    C --> S[Sprinkler]
    S --> W[Grass Wet]
    R --> W

This DAG asserts a specific factorization:

\[P(C, S, R, W) = P(C)\,P(S \mid C)\,P(R \mid C)\,P(W \mid S, R)\]

In words: weather happens on its own; sprinkler and rain each depend only on weather; wet grass depends only on sprinkler and rain. Also written: as one chain-rule product with the irrelevant conditioning dropped thanks to independence — \(P(C)\,P(S\mid C)\,P(R\mid C,\cancel{S})\,P(W\mid S,R,\cancel{C})\).

Count the savings concretely. The full joint over four binary variables needs \(2^4 - 1 = 15\) free numbers. The factored form needs only \(1\) (for \(C\)) \(+ 2\) (for \(S \mid C\)) \(+ 2\) (for \(R \mid C\)) \(+ 4\) (for \(W \mid S,R\)) \(= 9\). With four variables the gap is small; with fifty sparsely-connected variables it is the difference between a table that fits in memory and one with more entries than there are atoms in the room.

Conditional independence is what the missing arrows encode. There is no arrow from Cloudy to Grass Wet, which says: once you know whether it Rained and whether the Sprinkler was on, learning that it was Cloudy tells you nothing more about wet grass. Formally \(W \perp\!\!\!\perp C \mid \{S, R\}\). These independencies are exactly what make the factorization valid and inference cheap.

d-separation (“directed separation”) is the graphical rule that reads every such independence straight off the graph, without touching a single number. Walk along any path between two nodes; the path is blocked (carries no dependence) according to three local patterns:

Pattern	Shape	Blocked when…
Chain	\(A \to B \to C\)	\(B\) is observed
Fork (common cause)	\(A \leftarrow B \to C\)	\(B\) is observed
Collider (common effect)	\(A \to B \leftarrow C\)	\(B\) and all its descendants are unobserved

Two variables are d-separated (hence independent) given a set of evidence if every path between them is blocked. The collider is the famous trap: conditioning on a common effect creates dependence between its causes. This is explaining away — if the grass is wet and you learn the sprinkler was on, rain becomes less likely, even though rain and sprinkler are independent a priori.

The three patterns drawn side by side make the logic visible — the dot marks the node you condition on, and the slash marks where the flow of information is cut:

Chain: A→B→C A B C observe B → A⊥C

Fork: A←B→C A B C observe B → A⊥C

Collider: A→B←C A B C observe B → A,C coupled!

Tip

Arrows are about direct dependence, not literal causation — though if you draw them causally, the d-separation rules line up with how interventions behave (see Causal Inference). Rule of thumb: a fork or chain leaks information until you observe the middle; a collider is sealed until you observe the bottom.

In practice you rarely hand-roll a Bayes net. The pgmpy library lets you declare the structure and CPTs, then query and even d-separation-check for you:

from pgmpy.models import DiscreteBayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

m = DiscreteBayesianNetwork([("C","R"),("C","S"),("R","W"),("S","W")])
m.add_cpds(
    TabularCPD("C", 2, [[0.5],[0.5]]),
    TabularCPD("R", 2, [[0.2,0.8],[0.8,0.2]], evidence=["C"], evidence_card=[2]),
    TabularCPD("S", 2, [[0.5,0.9],[0.5,0.1]], evidence=["C"], evidence_card=[2]),
    TabularCPD("W", 2, [[1,0.1,0.1,0.01],[0,0.9,0.9,0.99]],
               evidence=["S","R"], evidence_card=[2,2]),
)
print(m.is_dconnected("C","W", observed=["S","R"]))   # False -> W ⟂ C | {S,R}
q = VariableElimination(m).query(["R"], evidence={"W":1})
print(q.values)                                        # P(R | W=T) ≈ [0.292, 0.708]

13.2 — Worked Example: Inference in the Sprinkler Network

Let’s actually compute something. Question: the grass is wet — what’s the probability it rained? i.e. \(P(R = T \mid W = T)\).

Use these CPTs (T = true):

\(P(C{=}T) = 0.5\)
\(P(R{=}T \mid C{=}T) = 0.8\), \(\quad P(R{=}T \mid C{=}F) = 0.2\)
\(P(S{=}T \mid C{=}T) = 0.1\), \(\quad P(S{=}T \mid C{=}F) = 0.5\)
\(P(W{=}T \mid S, R)\): \(\;(F,F){=}0.0,\;(T,F){=}0.9,\;(F,T){=}0.9,\;(T,T){=}0.99\)

Inference by enumeration: sum the full joint over the variables we neither query nor observe (here \(C\) and \(S\)), then normalize. The recipe is mechanical — fix the query and evidence, loop over every combination of the rest, multiply the CPTs, add up the matching rows, divide:

import itertools
B = [True, False]
def pC(c):   return 0.5
def pR(r,c): p=0.8 if c else 0.2;   return p if r else 1-p
def pS(s,c): p=0.1 if c else 0.5;   return p if s else 1-p
def pW(w,s,r):
    p={(0,0):0.0,(1,0):0.9,(0,1):0.9,(1,1):0.99}[(int(s),int(r))]
    return p if w else 1-p

def joint(c,s,r,w): return pC(c)*pR(r,c)*pS(s,c)*pW(w,s,r)

# P(R, W=T) for each R, summing out C and S
num = {}
for r in B:
    num[r] = sum(joint(c,s,r,True) for c in B for s in B)
Z = num[True] + num[False]                 # P(W=T)
print(round(num[True]/Z, 3))               # P(R=T | W=T) -> 0.708

The answer is \(P(R{=}T \mid W{=}T) \approx 0.708\). The prior on rain (averaged over weather) was only \(0.5\); observing wet grass pushed it up to \(0.71\). That upward jump is Bayesian inference — evidence at an effect flows backward to update a cause.

The animation makes the direction of flow vivid: the arrows point down (cause → effect), but the moment you clamp wet grass to “true,” information travels back up the dashed channel to revise your belief about rain.

graph LR
    P["prior P(R=T) = 0.50"] -->|"observe W=T"| Q["posterior P(R=T | W=T) = 0.71"]

Enumeration is exact but costs \(O(2^n)\) — fine for four variables, hopeless for four hundred. Real engines use variable elimination (push sums inward so partial results are reused instead of recomputed) or belief propagation (pass “messages” — partial summaries — along the edges of the graph). When even those are intractable, you fall back on the approximate sampling and variational methods of §13.7.

Warning

Don’t forget to normalize. The unnormalized number \(\sum_{C,S} P(C,S,R{=}T,W{=}T)\) is not a probability — it’s the joint \(P(R{=}T, W{=}T)\). You must divide by \(P(W{=}T)\) to get a conditional. Skipping \(Z\) is the single most common inference bug.

13.3 — Naive Bayes: The Simplest Useful Bayesian Network

Before leaving directed models, meet the one Bayesian network you have almost certainly used already, perhaps without knowing it: the Naive Bayes classifier. It is the simplest non-trivial DAG — a single hidden “class” node fanning out to all the features — and despite a wildly unrealistic assumption it remains a strong, fast baseline for text classification to this day.

The intuition: to decide whether an email is spam, imagine each suspicious word (“free”, “winner”, “viagra”) casting an independent vote, given the class. Naive Bayes assumes that once you know the class, the features are mutually independent — every feature points out from the class, and there are no arrows between features.

graph TD
    Y["Class y (spam? topic?)"] --> X1["feature x₁"]
    Y --> X2["feature x₂"]
    Y --> X3["feature x₃"]
    Y --> Xd["feature x_d"]

That fan-out shape forces the factorization

\[P(y, x_1, \dots, x_d) = P(y)\prod_{i=1}^{d} P(x_i \mid y),\]

so classification is just picking the class that maximizes the posterior:

\[\hat{y} = \arg\max_{y}\; P(y)\prod_{i=1}^{d} P(x_i \mid y).\]

In words: start with how common each class is (the prior), multiply in how well each feature fits that class, and choose the class with the biggest product. Also written: in log form to avoid underflow, \(\hat{y} = \arg\max_{y}\big[\log P(y) + \sum_{i} \log P(x_i\mid y)\big]\) — a sum of log-scores, which (for two classes) is exactly a linear decision rule.

The “naive” assumption — features independent given the class — is usually false (in real email, “free” and “money” co-occur), yet the classifier often still ranks the right class on top because it only needs the argmax to come out right, not the probabilities to be calibrated.

A tiny worked example. Two classes, prior \(P(\text{spam})=0.4\). Suppose for the words in a message, \(\prod_i P(x_i\mid\text{spam}) = 0.010\) and \(\prod_i P(x_i\mid\text{ham}) = 0.002\). Then the unnormalized scores are \(0.4\times0.010 = 0.0040\) for spam and \(0.6\times0.002 = 0.0012\) for ham; normalizing, \(P(\text{spam}\mid\text{words}) = 0.0040/(0.0040+0.0012) \approx 0.77\) — flag it.

In scikit-learn this is two lines, and MultinomialNB is still a go-to baseline for bag-of-words text:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts  = ["win free money now", "team meeting at noon", "free prize claim now"]
labels = ["spam", "ham", "spam"]
X = CountVectorizer().fit_transform(texts)
clf = MultinomialNB().fit(X, labels)              # learns P(y) and P(word|y)
print(clf.classes_, clf.predict(X))               # -> ['ham' 'spam'] correctly

Tip

Naive Bayes is the generative sibling of logistic regression (which models \(P(y\mid\mathbf x)\) directly). Same fan-out picture, opposite arrow philosophy — and it foreshadows the generative-vs-discriminative split you just saw between HMMs and CRFs. With infinite data, the discriminative model usually wins; with little data, the generative one’s built-in assumptions can win.

13.4 — Markov Random Fields

Bayesian networks use directed arrows, which carry a built-in sense of cause and order. But many systems have no natural direction — the pixels of an image, the atoms in a material, the votes of neighboring districts all just influence each other mutually. For these, the right tool is an undirected graphical model: a Markov Random Field (MRF), also called a Markov network.

In an MRF, an edge between two nodes simply means “these two directly interact,” with no arrow and no parent/child roles. Because there is no DAG to multiply CPTs along, the joint is built from potential functions (also called factors) \(\psi_C\), one per clique \(C\) (a fully-connected subset of nodes). A potential is any non-negative function scoring how compatible a configuration of its nodes is — bigger means “this combination is more agreeable.” The joint multiplies them and normalizes:

\[P(\mathbf{x}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C), \qquad Z = \sum_{\mathbf{x}} \prod_{C} \psi_C(\mathbf{x}_C)\]

In words: score every configuration by multiplying its cliques’ “compatibility” numbers, then divide by the total of all scores so the probabilities add up to 1. Also written: in exponential (Gibbs/energy) form, \(P(\mathbf{x}) = \frac{1}{Z}\exp\!\big(-E(\mathbf{x})\big)\) with energy \(E(\mathbf{x}) = -\sum_C \log \psi_C(\mathbf{x}_C)\) — low energy means high probability, exactly like a physical system settling into its most comfortable state.

Picture a roomful of opinionated neighbors deciding how to vote: each pair has a “comfort score” for agreeing or disagreeing, and the configuration the whole room settles into is the one that keeps the most pairs comfortable at once. \(Z\) is the bookkeeping that converts those raw comfort scores into honest probabilities.

\(Z\), the partition function, is the sum of the product over every joint configuration. It is what makes MRFs hard: unlike a Bayesian network (where the factorization is automatically normalized to 1), an MRF’s potentials are unnormalized, so computing \(Z\) — and hence any probability — generally means summing over exponentially many states.

The independence rule is gloriously simple here. A node is conditionally independent of the rest of the graph given its immediate neighbors — its Markov blanket. To check whether two sets of nodes are independent given evidence, just delete the observed nodes and ask: is there still a path between them? No path means independence. No colliders, no arrows, no special cases.

The canonical example is the Ising model / image grid: each pixel is a node, edges connect neighbors, and a pairwise potential rewards neighbors for agreeing. This is exactly the prior behind classic image denoising — “true images are mostly smooth, so penalize neighboring pixels that disagree.”

Here is the contrast that organizes the whole chapter — directed versus undirected:

	Bayesian network (directed)	Markov random field (undirected)
Edge meaning	\(A \to B\): \(A\) directly influences \(B\)	\(A - B\): \(A\), \(B\) mutually interact
Building block	conditional table \(P(X \mid \text{parents})\)	potential \(\psi_C(\mathbf{x}_C)\) over a clique
Normalization	automatic (each CPT sums to 1)	needs global \(Z\) (hard)
Independence test	d-separation (colliders are tricky)	graph separation given neighbors
Natural fit	causal / temporal structure	spatial / relational / symmetric structure

A tiny two-node MRF makes \(Z\) concrete. Suppose two binary nodes \(A, B\) with one potential \(\psi(A,B)\) that rewards agreement: \(\psi(0,0)=\psi(1,1)=2\), \(\psi(0,1)=\psi(1,0)=1\).

psi = {(0,0):2,(1,1):2,(0,1):1,(1,0):1}
Z = sum(psi.values())                       # 2+2+1+1 = 6
P = {k: v/Z for k,v in psi.items()}         # normalized joint
print(P[(1,1)], P[(0,1)])                   # 0.333 vs 0.167 — agreeing is 2x likelier

Warning

You cannot always convert a directed model to an undirected one (or vice versa) without losing independence information. Turning a Bayesian network into an MRF requires moralization — “marrying” the unconnected parents of every collider with an edge — which destroys the very explaining-away independence the collider encoded. The two families overlap but neither contains the other.

13.5 — Markov Models & Hidden Markov Models

A Markov chain models a sequence of states where the future depends on the present but not the past: \(P(s_t \mid s_{t-1}, \dots, s_1) = P(s_t \mid s_{t-1})\). This Markov property (“the present screens off the past”) is captured by a transition matrix \(A\) where \(A_{ij} = P(s_t{=}j \mid s_{t-1}{=}i)\). Weather modeled as “Sunny/Rainy with sticky days” is a Markov chain you can observe directly.

A Hidden Markov Model (HMM) adds a twist: you cannot see the states. You see only noisy emissions they produce. The state is hidden; the observation is a clue. Three ingredients define it:

States \(s_t\), hidden, evolving by transition matrix \(A\).
Emissions \(o_t\), observed, generated from the current state by emission matrix \(B\), where \(B_{jk} = P(o_t{=}k \mid s_t{=}j)\).
Initial distribution \(\pi\) over the first state.

graph LR
    S1((s1)) --> S2((s2)) --> S3((s3))
    S1 --> O1[o1]
    S2 --> O2[o2]
    S3 --> O3[o3]

A canonical example: a friend’s mood (Happy/Sad — hidden) drives what they post online (emoji vs. rant — observed). You never see the mood directly; you infer it from the posts. HMMs powered classical speech recognition (hidden phonemes, observed audio), part-of-speech tagging, and gene-finding before neural sequence models took over.

There are three classic questions you ask of an HMM, each with its own dynamic-programming algorithm:

Question	Algorithm	What it gives
How likely is this observation sequence?	Forward	\(P(o_{1:T})\)
What hidden states explain it? (per step)	Forward–Backward	\(P(s_t \mid o_{1:T})\) for each \(t\)
What single best state path?	Viterbi	\(\arg\max_{s_{1:T}} P(s_{1:T} \mid o_{1:T})\)

The forward algorithm is dynamic programming over a quantity it calls \(\alpha_t(j) = P(o_{1:t}, s_t{=}j)\), the probability of the evidence seen so far and of being in state \(j\) right now. It builds left to right: to be in state \(j\) at time \(t\), you must have been in some state \(i\) at \(t-1\) (sum over those, weighted by the transition \(A_{ij}\)), then emit the current observation:

\[\alpha_t(j) = \Big[\sum_i \alpha_{t-1}(i)\,A_{ij}\Big]\,B_j(o_t)\]

In words: the chance of “everything so far and being in state \(j\) now” = (sum over wherever you could have come from) × (chance this state emits today’s observation). Also written: vectorized, \(\boldsymbol{\alpha}_t = \big(A^{\top}\boldsymbol{\alpha}_{t-1}\big) \odot \mathbf{b}(o_t)\), where \(\odot\) is elementwise product and \(\mathbf{b}(o_t)\) is the column of emission probabilities for today’s symbol.

The trellis below shows the recursion sweeping left to right: each column lights up only after the one before it is filled, because \(\alpha_t\) is built entirely from \(\alpha_{t-1}\).

Forward–backward adds a symmetric backward pass \(\beta_t(i) = P(o_{t+1:T} \mid s_t{=}i)\) that summarizes the future, then multiplies the two: \(P(s_t{=}i \mid o_{1:T}) \propto \alpha_t(i)\,\beta_t(i)\). This is smoothing — judging each step using the whole sequence, future included, not just the past.

Viterbi is the same forward recursion with \(\max\) replacing \(\sum\), plus backpointers so you can retrace the winning path:

import numpy as np
A = np.array([[0.7,0.3],[0.4,0.6]])   # transitions: rows=from, cols=to
B = np.array([[0.9,0.1],[0.2,0.8]])   # emissions: P(obs | state)
pi = np.array([0.6,0.4])
obs = [0,1,1]                          # observed emission indices

T,N = len(obs), 2
delta = np.zeros((T,N)); ptr = np.zeros((T,N),int)
delta[0] = pi * B[:,obs[0]]                       # init
for t in range(1,T):
    for j in range(N):
        scores = delta[t-1]*A[:,j]                # best way into j
        ptr[t,j] = scores.argmax()
        delta[t,j] = scores.max()*B[j,obs[t]]
path = [int(delta[-1].argmax())]                  # backtrack
for t in range(T-1,0,-1): path.insert(0, ptr[t,path[0]])
print(path)                                       # most likely hidden states -> [0, 1, 1]

The Viterbi path is globally optimal as a sequence — which is not the same as stringing together the per-step most-likely states from forward–backward, because Viterbi respects which transitions are even possible. (Glue together independently-best steps and you can produce a path that includes a forbidden jump.) Learning the matrices \(A\), \(B\), \(\pi\) from unlabeled emissions is done by Baum–Welch, which is just the EM algorithm of §13.8 specialized to HMMs.

Tip

Always run these recursions in log space. Multiplying hundreds of probabilities underflows to zero in floating point; adding logs (and using logsumexp in place of the forward sum) keeps you numerically safe with essentially no extra code.

In real code you reach for hmmlearn, which packages all three algorithms plus Baum–Welch fitting behind a scikit-learn-style API:

import numpy as np
from hmmlearn import hmm

model = hmm.CategoricalHMM(n_components=2)          # 2 hidden states
model.startprob_     = np.array([0.6, 0.4])
model.transmat_      = np.array([[0.7,0.3],[0.4,0.6]])
model.emissionprob_  = np.array([[0.9,0.1],[0.2,0.8]])

obs = np.array([[0],[1],[1]])                       # column of emission indices
logprob, states = model.decode(obs, algorithm="viterbi")
print(states)                                       # most likely path -> [0 1 1]
print(model.predict_proba(obs).round(3))           # forward–backward smoothing per step
# model.fit(long_sequence)  # Baum–Welch / EM learns A, B, π from unlabeled data

13.6 — Conditional Random Fields

An HMM is generative: it models the full joint \(P(\text{states}, \text{observations})\), including a story for how observations are produced. But for labeling tasks — tag each word in a sentence with its part of speech — you only ever care about \(P(\text{labels} \mid \text{observations})\). Modeling how the words themselves arose is wasted effort, and worse, it forces an awkward independence assumption: each observation must depend only on its own state, so you can’t easily use overlapping clues like “is this word capitalized and preceded by ‘Mr.’”.

A Conditional Random Field (CRF) is the discriminative answer, and it is exactly the undirected, conditional cousin of the HMM — a Markov random field (§13.4) over the labels, conditioned on the input. It models \(P(\mathbf{y} \mid \mathbf{x})\) directly, where \(\mathbf{x}\) is the whole input sequence and \(\mathbf{y}\) the label sequence. A linear-chain CRF scores label sequences with feature functions \(f_k\) and learned weights \(w_k\):

\[P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\!\Big(\sum_t \sum_k w_k\, f_k(y_{t-1}, y_t, \mathbf{x}, t)\Big)\]

In words: add up weighted “votes” from features that each inspect a label-pair-plus-context, exponentiate the total into a score, then normalize over all possible label sequences. Also written: as a softmax over label paths, \(P(\mathbf{y}\mid\mathbf{x}) = \mathrm{softmax}_{\mathbf y}\big(\mathbf{w}^{\top}\mathbf{F}(\mathbf{x},\mathbf{y})\big)\) with \(\mathbf{F} = \sum_t \mathbf{f}(y_{t-1},y_t,\mathbf{x},t)\) the summed feature vector — a logistic-regression-over-sequences.

Think of each feature as a hiring panelist: one shouts “+3, this word ends in -ing so it’s probably a verb,” another “+5, previous tag was a determiner so a noun fits here.” The CRF tallies all the panelists’ weighted votes for every candidate labeling of the sentence and picks the labeling with the highest total — but, crucially, panelists may look at the whole résumé (\(\mathbf{x}\)), not just the one word.

The liberating part is that each feature can look at the entire input \(\mathbf{x}\) — the word’s suffix, whether it’s capitalized, the previous and next words, anything — precisely because we never have to explain how \(\mathbf{x}\) arose. The normalizer \(Z(\mathbf{x})\) sums over all label sequences and is computed by the very same forward algorithm as the HMM. Note that \(Z(\mathbf{x})\) here depends on the input, so it is recomputed per example — cheaper than an MRF’s global \(Z\) because it only sums over label paths for one sentence.

graph TD
    subgraph "HMM (generative): state generates obs"
    HA((y_t-1)) --> HB((y_t))
    HB --> HX[x_t]
    end
    subgraph "CRF (discriminative): label conditions on all x"
    CA((y_t-1)) --- CB((y_t))
    CX[x: full sequence] -.-> CA
    CX -.-> CB
    end

The contrast in one table:

	HMM	Linear-chain CRF
Models	joint \(P(\mathbf{x}, \mathbf{y})\)	conditional \(P(\mathbf{y} \mid \mathbf{x})\)
Type	generative	discriminative
Features	one emission per state	arbitrary, overlapping, whole-\(\mathbf{x}\)
Trained to	explain the data	get the labels right
Cost	cheap, closed-form counts	needs gradient optimization

For sequence labeling — named-entity recognition, POS tagging — CRFs historically beat HMMs because they exploit rich, overlapping features. Today a CRF layer often sits on top of a neural network (the BiLSTM-CRF, built on the recurrent sequence models that succeeded HMMs) to enforce valid label transitions (for example, an I-PER tag cannot follow a B-LOC tag); the network proposes per-word scores and the CRF picks the best globally-consistent label sequence with Viterbi.

A linear-chain CRF for sequence labeling is a few lines with sklearn-crfsuite, where features are just dictionaries of whatever clues you can dream up per token:

from sklearn_crfsuite import CRF

def feats(sent, i):
    w = sent[i]
    return {"lower": w.lower(), "is_cap": w[0].isupper(),
            "suffix3": w[-3:], "prev": "" if i==0 else sent[i-1].lower()}

sent = ["Mr", "Smith", "lives", "in", "Paris"]
X = [[feats(sent, i) for i in range(len(sent))]]
y = [["B-PER","I-PER","O","O","B-LOC"]]            # one labeled example

crf = CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)                                       # learns weights w_k
print(crf.predict(X))                               # Viterbi-decoded best label path

The overlapping features (is_cap, prev, suffix3) are exactly what an HMM cannot use — each one peeks at arbitrary parts of the input.

Warning

A CRF cannot tell you \(P(\mathbf{x})\) — it never modeled the inputs, so it can’t say whether a new input is unusual, and it can’t generate synthetic data. If you need anomaly detection on the inputs or want to sample new sequences, you need a generative model. Discriminative sharpness is bought at the price of generative ability.

13.7 — Inference: Exact and Approximate

Everything above hinges on one operation: given some evidence, compute a posterior or a marginal. This section names the toolbox. The split is between exact methods (right answer, but only feasible when the graph is small or tree-like) and approximate methods (the practical fallback when exactness is hopeless).

On the exact side, the workhorse is variable elimination: to marginalize out a variable, gather every factor that mentions it, multiply them, sum the variable out, and replace them all with the single resulting factor. Do this in a good order and you reuse partial sums instead of recomputing the whole joint — turning the \(O(2^n)\) of naive enumeration into something tractable for sparse graphs. Belief propagation (the sum-product algorithm) systematizes this as messages passed along edges: each node tells its neighbor “here is my summarized belief.” On a tree this is exact and efficient; on a graph with loops the same updates, run anyway, give loopy belief propagation — an approximation that often works surprisingly well.

When the graph is too tangled even for those, you sample. The idea of Markov Chain Monte Carlo (MCMC) is to build a random walk over configurations whose long-run visiting frequency equals the posterior you can’t compute directly — then just count how often the walk visits each state. Gibbs sampling is the simplest version: cycle through the variables, and resample each one from its conditional given the current values of all the others (which, by the Markov-blanket property, only needs its neighbors). Here is Gibbs sampling estimating \(P(R\mid W{=}T)\) in the sprinkler network, reusing the CPTs from §13.2:

import random
def pR(r,c): p=0.8 if c else 0.2; return p if r else 1-p
def pS(s,c): p=0.1 if c else 0.5; return p if s else 1-p
def pC(c):   return 0.5
def pW(w,s,r): p={(0,0):0.,(1,0):.9,(0,1):.9,(1,1):.99}[(int(s),int(r))]; return p if w else 1-p

C,S,R = False,False,True            # init; W is clamped to True (evidence)
counts = 0; N = 200000
for i in range(N):
    # resample each hidden var from its conditional ∝ product of factors touching it
    C = random.random() < (pC(True)*pR(R,True)*pS(S,True)) / \
        (pC(True)*pR(R,True)*pS(S,True) + pC(False)*pR(R,False)*pS(S,False))
    R = random.random() < (pR(True,C)*pW(True,S,True)) / \
        (pR(True,C)*pW(True,S,True) + pR(False,C)*pW(True,S,False))
    S = random.random() < (pS(True,C)*pW(True,True,R)) / \
        (pS(True,C)*pW(True,True,R) + pS(False,C)*pW(True,False,R))
    if i > 1000: counts += R         # discard burn-in, then tally
print(round(counts/(N-1001), 3))     # ≈ 0.708, matching exact inference

The sampler lands on the same \(0.708\) we computed exactly — but it would keep working on a graph with thousands of variables where enumeration is impossible. The other major approximate family, variational inference, replaces sampling with optimization; because it shares its machinery with EM, it gets its own section next.

Warning

MCMC needs babysitting. Two practical pitfalls hide in that loop. (1) Burn-in: the walk starts at a possibly-silly state and needs time to “forget” it, so you discard the first samples (the i > 1000 guard above). (2) Convergence: the only honest way to know the chain has mixed is to check — run several chains from different starts and confirm they agree. The standard diagnostic is the Gelman–Rubin statistic \(\hat{R}\), which compares within-chain to between-chain variance; \(\hat{R} \approx 1.0\) (say, below 1.01) means the chains have converged, while \(\hat{R} \gg 1\) means keep sampling. Libraries like PyMC and NumPyro compute \(\hat{R}\) and effective sample size for you.

graph TD
    Q[need a posterior / marginal] --> Tree{graph small or tree-like?}
    Tree -->|yes| Exact[exact: variable elimination / belief propagation]
    Tree -->|no| Approx{prefer samples or optimization?}
    Approx -->|samples| MCMC[MCMC: Gibbs / Metropolis-Hastings]
    Approx -->|optimization| VI[variational inference: maximize ELBO]

Tip

A quick way to pick: if the graph is a chain or a tree, use exact message passing and stop thinking. If it has loops but you want guarantees as samples accumulate, reach for MCMC. If you need speed and scale and can tolerate a biased-but-fast answer, use variational inference. Most deep probabilistic models live in that last box.

13.8 — Variational Inference & EM

For most interesting models, the posterior \(P(\text{hidden} \mid \text{data})\) is intractable — its normalizing constant involves a sum or integral you cannot compute. Two workhorses tackle this: EM, when you have latent variables but only want point-estimate parameters, and variational inference, when you want an approximate posterior distribution.

Expectation–Maximization (EM) finds maximum-likelihood parameters in the presence of hidden variables by alternating two steps, each of which can only improve the fit:

E-step: with the current parameters held fixed, compute the posterior over the hidden variables — soft guesses of the missing information (e.g. which cluster each point probably came from).
M-step: with those soft assignments held fixed, update the parameters to maximize the expected log-likelihood (re-estimate each cluster’s mean and variance).

graph LR
    Init[initialize params] --> E["E-step: P(hidden | data, params)"]
    E --> M["M-step: update params"]
    M --> Conv{converged?}
    Conv -->|no| E
    Conv -->|yes| Done[done]

EM is exactly how you fit a Gaussian mixture (§13.10); it is soft k-means. It never decreases the likelihood, but it can get stuck in local optima — so restart it from several initializations and keep the best. In scikit-learn the whole E/M loop is hidden behind .fit:

import numpy as np
from sklearn.mixture import GaussianMixture
X = np.r_[np.random.randn(100,2), np.random.randn(100,2)+5]   # two blobs
gm = GaussianMixture(n_components=2, n_init=10).fit(X)        # EM with 10 restarts
print(gm.means_.round(1))                                     # recovered blob centers
print(gm.predict_proba(X[:1]).round(3))                       # soft assignment = E-step

Variational inference (VI) is more ambitious: it approximates the whole intractable posterior \(p(z \mid x)\) with a simpler, tractable distribution \(q(z)\) (say, a fully-factorized Gaussian), then tunes \(q\) to be as close as possible to the true posterior. “Close” means small KL divergence \(\mathrm{KL}(q \,\|\, p)\). We can’t minimize that directly (it hides the same intractable term), so we instead maximize the ELBO — the Evidence Lower BOund:

\[\log p(x) = \underbrace{\mathbb{E}_{q}\!\big[\log p(x, z)\big] - \mathbb{E}_q\!\big[\log q(z)\big]}_{\text{ELBO}} + \underbrace{\mathrm{KL}\!\big(q(z)\,\|\,p(z\mid x)\big)}_{\ge 0}\]

In words: the (fixed) evidence equals a bound we can compute plus a leftover gap measuring how wrong our approximation \(q\) is; shrink the gap and the bound rises to meet the evidence. Also written: rearranged, \(\text{ELBO} = \log p(x) - \mathrm{KL}(q\|p(\cdot\mid x))\), which makes the bound explicit — the ELBO is always below \(\log p(x)\) by exactly the KL gap.

Because the KL term is non-negative, the ELBO is a lower bound on the log-evidence \(\log p(x)\), tight exactly when \(q\) equals the true posterior. So pushing the ELBO up simultaneously drags \(q\) toward the posterior and tightens our estimate of the evidence.

Picture a fixed ceiling — the evidence \(\log p(x)\), which never moves — and the ELBO rising toward it. Whatever space is left between them is the KL gap, your approximation error. Optimizing \(q\) raises the floor; the ceiling stays put.

\[\text{ELBO} = \underbrace{\mathbb{E}_q[\log p(x \mid z)]}_{\text{reconstruction: fit the data}} - \underbrace{\mathrm{KL}(q(z)\,\|\,p(z))}_{\text{stay near the prior}}\]

“Explain the data, but don’t stray from the prior” — this is the literal objective inside the variational autoencoder (see the Generative Models chapter). VI turns inference into optimization, which is why it scales to deep models where sampling-based methods crawl.

Tip

EM is the special case of VI in which you let \(q\) be the exact posterior in the E-step. Both are coordinate ascent on the same ELBO: the E-step (or “tune \(q\)”) raises the bound by moving \(q\); the M-step raises it by moving the parameters. Same staircase, climbed two legs at a time.

13.9 — Gaussian Processes

Everything so far put distributions over discrete states or parameters. A Gaussian Process (GP) does something stranger and lovelier: it puts a distribution over entire functions. Instead of fitting one curve to your data, a GP maintains a probability over all plausible curves at once — and reports its own uncertainty, which widens far from data and pinches tight where you have observations.

The defining property: any finite set of function values \([f(x_1), \dots, f(x_n)]\) is jointly multivariate Gaussian. A GP is fully specified by a mean function \(m(x)\) (usually taken as 0) and a covariance/kernel function \(k(x, x')\):

\[f(x) \sim \mathcal{GP}\big(m(x),\, k(x, x')\big)\]

The kernel is the soul of the GP — it encodes how similar two inputs’ outputs ought to be. The common RBF (squared-exponential) kernel,

\[k(x, x') = \sigma^2 \exp\!\Big(-\frac{(x-x')^2}{2\ell^2}\Big),\]

In words: two outputs are correlated by how close their inputs are; the correlation falls off smoothly the farther apart the inputs sit, vanishing for distant points. Also written: for vector inputs, \(k(\mathbf x,\mathbf x') = \sigma^2\exp\!\big(-\tfrac{1}{2}\|\mathbf x-\mathbf x'\|^2/\ell^2\big)\) — the squared Euclidean distance in place of the scalar gap.

says “nearby inputs give similar outputs,” with the lengthscale \(\ell\) tuning how fast that correlation decays (small \(\ell\) = wiggly functions, large \(\ell\) = smooth ones). Conditioning the joint Gaussian on the observed points yields a closed-form posterior mean and variance at any new \(x_*\) — no iterative training, just linear algebra.

A concrete feel for \(\ell\): with \(\sigma^2 = 1\) and \(\ell = 1\), two inputs one unit apart have correlation \(\exp(-1^2 / 2) \approx 0.61\); two units apart, \(\exp(-4/2) \approx 0.14\); three units apart, \(\exp(-9/2) \approx 0.011\) — effectively nothing. So at \(\ell = 1\) the GP “forgets” a point’s influence within about three units. Shrink \(\ell\) to \(0.5\) and that same three-unit gap drops to \(\exp(-9/0.5) \approx 10^{-8}\): the curve is now free to wiggle wildly between points that close. This is exactly why GPs are the engine behind hyperparameter tuning services and A/B-test sample-efficient experimentation — places where each data point is expensive and you must squeeze a calibrated guess out of a handful of them.

Here is what a GP posterior looks like: a mean line threading the data, wrapped in an uncertainty band that balloons between and beyond observations.

The band is the selling point. A GP doesn’t merely predict — it tells you how much to trust each prediction. That makes GPs the engine of Bayesian optimization (where should I sample next? where uncertainty is high and the mean looks promising), and the gold standard for small-data regression with calibrated error bars.

import numpy as np
def rbf(a,b,l=1.0,s=1.0):
    d = a[:,None]-b[None,:]
    return s*np.exp(-d**2/(2*l**2))
X = np.array([-2.,0.,2.]); y = np.sin(X)      # 3 noise-free observations
Xs = np.linspace(-3,3,50)
K   = rbf(X,X) + 1e-8*np.eye(3)               # train-train covariance
Ks  = rbf(Xs,X)                               # test-train covariance
mu  = Ks @ np.linalg.solve(K, y)              # posterior mean at Xs
var = rbf(Xs,Xs).diagonal() - (Ks@np.linalg.solve(K,Ks.T)).diagonal()
# mu = predictions, sqrt(var) = the uncertainty band half-width

scikit-learn wraps the same algebra plus automatic kernel-hyperparameter tuning (it learns \(\ell\) and \(\sigma\) by maximizing the marginal likelihood):

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C
import numpy as np

X = np.array([[-2.],[0.],[2.]]); y = np.sin(X).ravel()
gp = GaussianProcessRegressor(kernel=C()*RBF(length_scale=1.0)).fit(X, y)
mu, std = gp.predict(np.linspace(-3,3,50).reshape(-1,1), return_std=True)
# std is the calibrated error bar — wide between/beyond the 3 training points

Warning

The catch is cost: GP inference inverts an \(n \times n\) kernel matrix, which is \(O(n^3)\) time and \(O(n^2)\) memory. Past a few thousand points, exact GPs become impractical and you need sparse / inducing-point approximations. GPs shine on small-to-medium data where honest uncertainty matters — not on millions of rows.

13.10 — Latent Variable Models

A latent variable is a quantity you never observe but that explains the data you do — a hidden cause. Positing one often turns a tangled distribution into a simple one: messy data becomes simple-given-the-hidden-cause. This is the unifying idea behind a whole family of models; they differ only in what the hidden variable is.

The cleanest example is the Gaussian Mixture Model (GMM). The data looks like several overlapping blobs; the latent variable is which blob (component) each point secretly belongs to. The generative story: pick a component \(z\) with probability \(\pi_z\), then draw the point from that component’s Gaussian. You never see \(z\) — you fit it with EM (§13.8), which softly assigns each point to components and re-estimates the Gaussians, round and round.

graph TD
    Z["z: hidden component (which blob?)"] --> X["x: observed point"]
    pi["mixing weights π"] --> Z
    theta["component means/covariances"] --> X

The same template, with a different choice of latent variable, gives a whole zoo of models:

Model	Latent variable (hidden cause)	Observed data
Gaussian Mixture	which cluster a point belongs to	continuous points
Topic model (LDA)	which topics a document mixes	words in documents
Factor analysis / PPCA	a few continuous latent factors	many correlated features
Hidden Markov Model	the hidden state at each step	the emissions
Variational autoencoder	a latent code vector	images, etc.

Latent Dirichlet Allocation (LDA) treats each document as a mixture of a few hidden topics, and each topic as a distribution over words — a workhorse of classical natural language processing. The latents — the document’s topic proportions, and each word’s topic assignment — are never observed; you infer them (by EM-like or variational methods) and out fall interpretable themes (“sports,” “finance”) that nobody labeled. Factor analysis assumes a handful of unseen continuous factors drive many correlated measurements — the statistical ancestor of PCA, differing in that it is an explicit probabilistic generative model with a noise term.

Plate notation: drawing models with repetition

Real latent-variable models have many data points, each with its own copy of a latent variable — drawing 10,000 identical \(z \to x\) subgraphs is hopeless. Plate notation is the textbook shorthand: wrap the repeated nodes in a rectangle (a “plate”) labeled with how many times it repeats. Everything inside the box is duplicated; everything outside is shared.

graph TD
    pi["π (shared)"] --> z["z_n"]
    theta["θ (shared)"] --> x["x_n"]
    z --> x
    subgraph plate["repeat for each of N data points"]
      z
      x
    end

The intuition is a cookie cutter: the plate is the cutter stamped \(N\) times, and only the shared parameters (\(\pi\), \(\theta\)) live outside it. Reading the GMM off this picture: the global mixing weights \(\pi\) and component parameters \(\theta\) sit outside the plate (one set for everyone), while each point’s hidden component \(z_n\) and observation \(x_n\) sit inside (one per point). The same convention draws LDA (documents are an outer plate, words an inner plate nested inside) and every Bayesian hierarchical model — it is the lingua franca of papers in this field.

Here is a single GMM E-step “responsibility” — the soft assignment that is the latent posterior. It is just Bayes’ rule over the hidden component:

\[r_{nk} = P(z_n{=}k \mid x_n) = \frac{\pi_k\,\mathcal{N}(x_n \mid \mu_k, \sigma_k)}{\sum_{j}\pi_j\,\mathcal{N}(x_n \mid \mu_j, \sigma_j)}.\]

In words: how much point \(n\) “belongs” to component \(k\) is that component’s prior-weighted likelihood, divided by the same quantity summed over all components. Also written: the softmax of log-scores, \(r_{nk} = \mathrm{softmax}_k\big(\log\pi_k + \log\mathcal{N}(x_n\mid\mu_k,\sigma_k)\big)\).

import numpy as np
def gauss(x,m,s): return np.exp(-(x-m)**2/(2*s**2))/(s*np.sqrt(2*np.pi))
x = 1.3
pi  = [0.5, 0.5]                       # mixing weights
mu  = [0.0, 2.0]; sd = [1.0, 1.0]      # two components
r = [pi[k]*gauss(x,mu[k],sd[k]) for k in range(2)]
r = np.array(r)/sum(r)                 # responsibilities (posterior over hidden z)
print(r.round(3))                      # e.g. [0.166 0.834] -> point likely from comp 1

And the topic-model cousin, LDA, is a one-liner in scikit-learn — feed it a word-count matrix and out come latent topics nobody labeled:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["stocks bonds market market", "goal match team team goal",
        "market bonds yield", "team coach match goal"]
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X).round(2))       # each doc's topic mixture (the latent z)

That responsibility vector — the posterior over the hidden cause — is the thread tying this section back through the whole chapter: it is a graphical-model inference (§13.2, §13.7), computed in the E-step of EM (§13.8), over a latent variable. The same machinery, applied to sequences, gives HMMs; applied to functions, GPs; applied to deep nets, VAEs.

Warning

Latent variable models suffer from identifiability and label-switching issues: the components of a mixture can be permuted without changing the likelihood, so “component 1” has no fixed meaning across runs. Don’t over-interpret a single fit, and don’t trust latent indices to be stable — interpret the partition of the data, not the label number attached to it.

13.11 — Variable Elimination: Inference as Sums and Products

Imagine you have a joint distribution over many variables, factored into small pieces, and you want one marginal — say \(P(D)\). The brute-force route is to write the full joint as a giant table and sum out everything else. That table has \(2^n\) rows for \(n\) binary variables, so it dies fast. Variable elimination is the simple observation that you don’t have to build the table at all: you can push the summations inside the product, summing out one variable at a time, because most factors don’t mention most variables.

The intuition is the distributive law. To compute \(\sum_b \sum_c f(a,b)\,g(b,c)\) you do not enumerate all \((b,c)\) pairs against \(a\); you note that \(a\) only appears in \(f\), so

\[\sum_b \sum_c f(a,b)\,g(b,c) = \sum_b f(a,b)\Big(\sum_c g(b,c)\Big).\]

In words: don’t drag \(f\) through the inner sum it has nothing to do with — pull it outside and sum the small piece first. Also written: the factored-out inner sum is itself a new factor, \(g'(b) = \sum_c g(b,c)\), leaving \(\sum_b f(a,b)\,g'(b)\) — one fewer variable.

The inner sum over \(c\) produces a smaller factor over \(b\) alone, which you then fold into the next sum. Each elimination replaces “multiply everything mentioning this variable, then sum it out” with one new factor over the neighbors of the eliminated variable.

The algorithm, stated as factors. A graphical model gives a set of factors \(\{\phi_1,\dots,\phi_m\}\) (conditional probability tables in a Bayes net, clique potentials in an MRF), and the joint is their product (times a normalizer). To eliminate variable \(X\):

Collect every factor that mentions \(X\) into a bucket.
Multiply them together into one product factor \(\psi\).
Sum \(\psi\) over \(X\) to get a new factor \(\tau\) over \(X\)’s remaining neighbors.
Remove the bucketed factors, add \(\tau\).

Repeat for every variable you want gone; multiply the survivors and normalize.

flowchart LR
  A["factors mentioning X"] --> B["multiply → ψ(X, neighbors)"]
  B --> C["sum out X → τ(neighbors)"]
  C --> D["replace bucket with τ"]
  D --> E{"more vars<br/>to eliminate?"}
  E -- yes --> A
  E -- no --> F["multiply survivors,<br/>normalize"]

Tip

The whole method is just the distributive law applied repeatedly: \(\sum (ab) = a \sum b\) whenever \(a\) doesn’t depend on the summation index. “Push sums in as far as they go.”

13.12 — A Worked Elimination on a Tiny Chain

Take a four-node chain \(A \to B \to C \to D\), all binary. The joint factors as

\[P(A,B,C,D) = P(A)\,P(B\mid A)\,P(C\mid B)\,P(D\mid C).\]

We want \(P(D)\), which means eliminating \(A\), then \(B\), then \(C\). Use these tiny numbers:

factor	values
\(P(A)\)	\(P(a_0)=0.6,\ P(a_1)=0.4\)
\(P(B\mid A)\)	\(P(b_0\mid a_0)=0.7,\ P(b_0\mid a_1)=0.2\)
\(P(C\mid B)\)	\(P(c_0\mid b_0)=0.8,\ P(c_0\mid b_1)=0.3\)
\(P(D\mid C)\)	\(P(d_0\mid c_0)=0.9,\ P(d_0\mid c_1)=0.5\)

Eliminate \(A\). Only \(P(A)\) and \(P(B\mid A)\) mention \(A\). Multiply and sum:

\[\tau_1(B) = \sum_A P(A)\,P(B\mid A).\]

\[\tau_1(b_0) = 0.6(0.7) + 0.4(0.2) = 0.42 + 0.08 = 0.50,\qquad \tau_1(b_1) = 0.50.\]

This \(\tau_1(B)\) is just the prior marginal \(P(B)\) — here uniform.

Eliminate \(B\). Factors mentioning \(B\) are now \(\tau_1(B)\) and \(P(C\mid B)\):

\[\tau_2(C) = \sum_B \tau_1(B)\,P(C\mid B).\]

\[\tau_2(c_0) = 0.50(0.8) + 0.50(0.3) = 0.55,\qquad \tau_2(c_1) = 0.50(0.2)+0.50(0.7) = 0.45.\]

Eliminate \(C\). Factors mentioning \(C\) are \(\tau_2(C)\) and \(P(D\mid C)\):

\[P(D) = \sum_C \tau_2(C)\,P(D\mid C).\]

\[P(d_0) = 0.55(0.9) + 0.45(0.5) = 0.495 + 0.225 = 0.72,\qquad P(d_1) = 0.28.\]

import numpy as np
# CPTs as arrays: row = parent value, col = child value (index 0/1)
PA  = np.array([0.6, 0.4])
PBA = np.array([[0.7, 0.3], [0.2, 0.8]])   # P(B|A)
PCB = np.array([[0.8, 0.2], [0.3, 0.7]])   # P(C|B)
PDC = np.array([[0.9, 0.1], [0.5, 0.5]])   # P(D|C)

t1 = PA @ PBA      # sum out A  -> P(B)
t2 = t1 @ PCB      # sum out B  -> factor over C
PD = t2 @ PDC      # sum out C  -> P(D)
print(PD)          # [0.72 0.28]

Each elimination is one matrix–vector product because the chain is so thin. The cost is set by the largest factor ever built, not by the number of variables — on a chain that factor never exceeds two variables, so inference is linear in chain length.

13.13 — Elimination Order and Treewidth: Why Some Graphs Are Hard

The chain was cheap because every intermediate factor touched at most two variables. On a denser graph the order in which you eliminate matters enormously, because summing out a variable connects all its neighbors together — they now share a new factor \(\tau\) even if they were never directly linked before. These added edges are called fill-in edges, and they make later factors bigger.

Picture eliminating the center of a star versus eliminating a leaf. Knock out a leaf and nothing happens to the rest. Knock out the hub first and every former neighbor becomes mutually entangled in one giant factor over all of them — exponential in the number of spokes.

eliminate leaf first small factors

eliminate hub first one big factor (fill-in)

The cost of an elimination order is governed by the width: one less than the size of the largest factor (largest clique) ever created. The treewidth of a graph is the width of its best possible order — the minimum over all orderings of that largest-clique size, minus one. Inference by elimination runs in time and space exponential in the treewidth:

\[\text{cost} \;\approx\; O\!\big(m \cdot k^{\,w+1}\big),\]

where \(m\) is the number of variables, \(k\) the number of states each, and \(w\) the treewidth. A chain or tree has treewidth \(1\), so inference is linear. An \(n\times n\) grid has treewidth \(n\), so exact inference becomes exponential — which is exactly why image-grid MRFs fall back on the approximate methods covered earlier.

Finding the optimal order is itself NP-hard, so in practice people use cheap greedy heuristics: min-degree (next eliminate the variable with the fewest current neighbors) and min-fill (next eliminate the variable that adds the fewest fill-in edges). Neither is optimal, but both usually find orders close to the treewidth on real graphs.

Warning

Treewidth is a property of the undirected graph structure, not of the numbers in the tables. No clever choice of potentials rescues a high-treewidth graph; if the structure is a dense grid, exact elimination is hopeless regardless of how the probabilities are set. Moralize a Bayes net (marry co-parents, drop directions) before reading off its treewidth.

13.14 — The Junction Tree: Reusing Work Across Many Queries

Plain variable elimination answers one query. Ask for \(P(D)\) and then \(P(B)\) and you redo most of the sums. The junction tree (also called the clique tree or join tree) is the data structure that caches the shared work, so that after one inward-and-outward sweep you can read off the marginal of every variable. It is variable elimination organized so the intermediate factors become permanent nodes you can pass messages along in both directions.

A junction tree is built in four steps from the model graph:

flowchart TB
  A["Bayes net / MRF"] --> B["Moralize:<br/>marry co-parents,<br/>drop arrow directions"]
  B --> C["Triangulate:<br/>add fill-in edges so every<br/>cycle >3 has a chord"]
  C --> D["Find maximal cliques<br/>of the triangulated graph"]
  D --> E["Connect cliques into a tree<br/>(max-weight spanning tree<br/>on shared-variable counts)"]

The cliques become the tree’s nodes; each edge carries a separator, the set of variables shared by the two cliques it joins. The construction guarantees the running intersection property: if a variable appears in two cliques, it appears in every clique on the path between them. This property is what makes local message passing globally correct — a variable’s evidence can never “skip” a clique and contradict itself elsewhere.

For our chain \(A\to B\to C\to D\), moralizing changes nothing (no co-parents), and it is already triangulated. The maximal cliques are the edges \(\{A,B\}\), \(\{B,C\}\), \(\{C,D\}\), joined through separators \(\{B\}\) and \(\{C\}\):

The largest clique has two variables, matching the chain’s treewidth of \(1\) (width = clique size \(-1\)). A junction tree’s cliques are exactly the factors elimination would have built under a good order — the tree just stores them so they can be reused.

Warning

The junction tree does not dodge the treewidth wall. Triangulation can only add edges, so the largest clique it produces is at least the treewidth — on a dense grid the cliques are huge and the tree is just as exponential as plain elimination. The junction tree saves you redundant work across many queries; it never makes a hard graph easy.

13.15 — Belief Propagation on the Junction Tree

Once the tree exists, inference is a two-phase message-passing schedule, sometimes called the Hugin or Shafer–Shenoy algorithm and structurally identical to belief propagation. Each clique starts holding the product of the factors assigned to it (its initial potential). Pick any clique as root. Then:

Collect (inward): every leaf sends a message toward the root; each clique waits until it has heard from all its children, then sends upward. A message from clique \(C_i\) to neighbor \(C_j\) across separator \(S\) is the clique’s potential, multiplied by all incoming messages except \(C_j\)’s, then marginalized down to the separator variables:

\[m_{i\to j}(S) = \sum_{C_i \setminus S} \psi_i \prod_{k \neq j} m_{k\to i}.\]

In words: to tell neighbor \(j\) what you believe, combine your own evidence with everything every other neighbor told you, then squeeze the result down to just the shared separator variables. Also written: \(m_{i\to j} = \big(\sum_{C_i\setminus S}\text{belief}(C_i)\big) / m_{j\to i}\). Plainly: take your full current belief, divide out what \(j\) already told you (so you don’t echo \(j\)’s own evidence back at it), and shrink to the separator. This is the Hugin “absorb” trick — it skips rebuilding the product from scratch.

Distribute (outward): once the root has everything, messages flow back down the same edges by the same rule.

After both sweeps, each clique’s potential times all its incoming messages equals the (unnormalized) joint marginal over that clique’s variables. Normalize, and you have every clique marginal at once — and any single-variable marginal by summing within any clique that contains it.

flowchart LR
  subgraph Collect["Collect — inward to root [CD]"]
    AB1["AB"] -- "m(B)" --> BC1["BC"]
    BC1 -- "m(C)" --> CD1["CD = root"]
  end
  subgraph Distribute["Distribute — outward from root"]
    CD2["CD"] -- "m'(C)" --> BC2["BC"]
    BC2 -- "m'(B)" --> AB2["AB"]
  end

Worked sweep on the chain. Assign \(P(A)P(B\mid A)\) to clique \(\{A,B\}\), \(P(C\mid B)\) to \(\{B,C\}\), \(P(D\mid C)\) to \(\{C,D\}\). Root at \(\{C,D\}\).

Collect, from \(\{A,B\}\) upward — marginalize \(A\) out of its potential:

\[m_{AB\to BC}(B) = \sum_A P(A)P(B\mid A) = \tau_1(B) = (0.5,\ 0.5).\]

Then \(\{B,C\}\) multiplies in that message and marginalizes \(B\):

\[m_{BC\to CD}(C) = \sum_B \tau_1(B)\,P(C\mid B) = \tau_2(C) = (0.55,\ 0.45).\]

The root \(\{C,D\}\) multiplies its own potential by this and sums out \(C\) to get \(P(D) = (0.72, 0.28)\) — identical to the elimination answer, because the collect phase is variable elimination toward the root. The difference is the distribute phase: pushing messages back out gives \(P(C)\), \(P(B)\), and \(P(A)\) from the same tree without restarting. One tree, two sweeps, every marginal.

Tip

On a tree-structured model the junction tree’s two sweeps are exact and cost \(O(m\,k^{w+1})\) — the same bound as one elimination, but amortized across all marginals. Run the identical message rule on a graph with loops and you get loopy belief propagation, which is no longer exact (messages double-count evidence around cycles) but is often a good, cheap approximation — the bridge to the approximate-inference methods covered earlier in this chapter.

13.16 — Quick reference

Term / formula	Meaning in one line	When / why it matters
Factorization \(P(\mathbf x)=\prod_i P(X_i\mid\text{pa}_i)\)	Joint = product of local conditional tables	The whole point of a Bayes net: turns \(2^n\) entries into a few small CPTs
d-separation	Graphical test for conditional independence	Read independencies off the DAG without touching any number
Collider / explaining away	Observing a common effect couples its causes	The one trap: conditioning on a child opens a path, not closes it
Markov blanket	A node’s neighbors (parents, children, co-parents)	The only nodes you need to resample a variable in Gibbs / MRFs
Partition function \(Z\)	Normalizer summing the product over all states	Why undirected (MRF) inference is hard; Bayes nets avoid it
Naive Bayes \(\hat y=\arg\max_y P(y)\prod_i P(x_i\mid y)\)	Class node fanning to independent features	Fast, strong baseline for bag-of-words text
Markov property \(P(s_t\mid s_{1:t-1})=P(s_t\mid s_{t-1})\)	Present screens off the past	Foundation of Markov chains, HMMs, MCMC
Forward \(\alpha_t(j)\)	\(P(o_{1:t},s_t{=}j)\) by left-to-right DP	Likelihood of an HMM observation sequence
Viterbi	Forward recursion with \(\max\) + backpointers	Single globally-best hidden path (not per-step argmax)
Baum–Welch	EM specialized to HMMs	Learn \(A,B,\pi\) from unlabeled emissions
CRF \(P(\mathbf y\mid\mathbf x)\)	Discriminative, undirected counterpart to HMM	Sequence labeling with rich overlapping features
Variable elimination	Push sums inside the product, one var at a time	Exact inference; cost set by largest factor built
Treewidth \(w\)	Min over orders of (largest clique \(-1\))	Exact inference is \(O(m\,k^{w+1})\) — high \(w\) = hopeless
Junction tree	Clique tree caching elimination’s factors	Two sweeps give every marginal at once
MCMC / Gibbs	Random walk whose stationary dist = posterior	Approximate inference when graph is too tangled
\(\hat R\) (Gelman–Rubin)	Within- vs between-chain variance ratio	\(\approx 1.0\) = chains converged; \(\gg 1\) = keep sampling
ELBO \(\le\log p(x)\)	Tractable lower bound, gap = \(\mathrm{KL}(q\\|p)\)	EM and VI both climb it; tight when \(q\) = true posterior
EM (E-step / M-step)	Soft-assign hidden vars, then re-fit params	MLE with latent variables; how GMMs are fit
GP kernel \(k(x,x')\)	Encodes output similarity vs input distance	Distribution over functions with calibrated uncertainty, \(O(n^3)\)
Plate notation	Box = “repeat \(N\) times” for shared structure	Compact drawing of GMM, LDA, hierarchical models

13.17 — Key takeaways

A graph factorizes a giant joint distribution into small local pieces: directed (Bayesian networks, parent→child CPTs) or undirected (Markov random fields, clique potentials).
Conditional independence is read off the graph: by d-separation in directed models (the collider / “explaining away” pattern is the trap — observing a common effect couples its causes) and by simple graph separation given neighbors in undirected ones.
Bayes-net inference = sum out the unknowns, then normalize; undirected models pay an extra price for the global partition function \(Z\).
Naive Bayes is the simplest useful Bayes net — a class node fanning out to conditionally-independent features — and still a strong, fast text-classification baseline; plate notation is the shorthand for drawing such repeated structure compactly.
MCMC needs burn-in and convergence checks (\(\hat{R}\approx 1\)); never trust a single un-diagnosed chain.
HMMs model hidden state sequences: forward scores evidence, forward–backward smooths per-step beliefs, Viterbi finds the single best path — all dynamic programming, run in log space; Baum–Welch (EM) learns the parameters.
CRFs are the discriminative, undirected counterpart to HMMs: they model \(P(\mathbf{y}\mid\mathbf{x})\) directly, allowing rich overlapping features — better for labeling, but unable to generate inputs.
Inference divides into exact (variable elimination, belief propagation — for trees) and approximate (MCMC / Gibbs sampling, variational inference — for everything else).
EM and variational inference both climb the ELBO, a lower bound on the evidence; EM does exact-posterior E-steps, VI approximates the posterior with a tractable \(q\).
Gaussian Processes are distributions over functions defined by a kernel; their gift is calibrated uncertainty that widens away from data, at \(O(n^3)\) cost.
Latent variable models (GMM, LDA, factor analysis, HMM, VAE) all share one template — posit a hidden cause that makes the data simple, then infer it.

13.18 — See also

Probability & Statistics — the Bayes’ rule, distributions, and KL divergence this chapter builds on.
Dimensionality Reduction — PCA and factor analysis as latent-factor models.
Clustering & Unsupervised Learning — k-means as the hard-assignment cousin of the GMM/EM here.
Recurrent & Sequence Models — the neural successors to HMMs and CRFs for sequence labeling.
Generative Models — variational autoencoders, where the ELBO and latent variables reappear in deep form.
Natural Language Processing — where HMMs, CRFs, and topic models were workhorses for tagging and document modeling.
Causal Inference — directed graphs reinterpreted as causal structure, with interventions and d-separation.
Time Series & Forecasting — Gaussian processes and state-space models for sequential data with uncertainty.

↪ The thread continues → Chapter 14 · 🧠 Neural Networks (Core)

Graphical models reason with hand-designed structure; the next leap lets the machine learn its own representations from raw data, through layers of artificial neurons.

📖 All chapters | ← 12 · 🎯 Model Evaluation & Tuning | 14 · 🧠 Neural Networks (Core) →