Chapter 36 — 🔍 Explainable AI & Interpretability

Explainable AI (XAI) is the study of how to make a model’s behaviour understandable to humans — not just what it predicts, but why. As models grow from a tidy linear regression to a billion-parameter transformer, the gap between accuracy and understanding widens, and that gap is where trust, debugging, regulation, and fairness all live. This chapter sits at the entrance to the Responsible AI part of the roadmap: interpretability is the lens through which every later concern (fairness, safety, causality) is actually inspected.

🧭 In context: Responsible AI & Frontier · used to inspect, debug, audit, and justify model predictions · the one key idea — an explanation is a separate model of the model, useful only as far as it is faithful.

💡 Remember this: An explanation is a separate, approximate model of the model — only trust it as far as you have tested that it faithfully reflects what the model actually does.

36.1 — Why explainability matters (trust, debugging, regulation, fairness)

Before any technique, ask the prior question: why bother explaining at all? A model that scores 0.99 AUC on a held-out set has already “proven” itself — hasn’t it? Not quite. A test score tells you the model is right on average on data that looks like the past. It tells you nothing about why it is right, whether it will stay right when the world shifts, or whether it is right for a reason you would endorse.

Four concrete pressures drive the field.

Trust and adoption. A radiologist will not act on a model that outputs “malignant: 0.87” with no reason. Show them the suspicious region and they can agree or overrule. Explanation is the interface that lets a human keep authority over an automated decision.

Debugging. Models learn shortcuts. The famous case: a pneumonia classifier that keyed off the hospital’s portable X-ray marker — sicker patients were imaged with portable machines, so the metal token in the corner correlated with disease. The model was accurate on the test set and useless in deployment. Only an explanation (the saliency map lit up the corner token, not the lungs) revealed the bug.

Regulation. Laws increasingly require a justifiable decision. The EU’s GDPR is widely read as implying a “right to an explanation” for automated decisions, although its exact legal scope is debated; the EU AI Act mandates transparency for high-risk systems; the US Equal Credit Opportunity Act requires adverse action notices, meaning a lender must tell a rejected applicant the principal reasons. A black box with no account of its decisions is very hard to defend in these domains.

Fairness. A model can be accurate overall yet discriminatory. Explanation is how you catch a model that leans on a proxy for a protected attribute — e.g. ZIP code standing in for race (AI Ethics, Fairness & Safety makes this auditing concern central). Detecting it requires looking inside the decision, not just at the aggregate score.

flowchart LR
  M[Trained model] --> E[Explanation layer]
  E --> T[Trust / human oversight]
  E --> D[Debugging shortcuts & bugs]
  E --> R[Regulatory compliance]
  E --> F[Fairness auditing]

Tip

A good rule of thumb: the stakes of a wrong decision set the bar for explanation. A movie recommendation needs none; a loan denial, a cancer diagnosis, or a parole decision needs a faithful, human-legible reason.

36.2 — Intrinsic (interpretable models) vs post-hoc explanations

There are two roads to an explanation. Either build a model that is transparent by construction, or take an opaque model and explain it after the fact.

Intrinsic interpretability means the model’s structure is itself the explanation. A linear regression $\hat{y} = w_0 + \sum_j w_j x_j$ tells you directly that increasing $x_j$ by one unit moves the prediction by $w_j$. A short decision tree is a flowchart you can read aloud. A small rule list (“if age > 60 and prior_default then deny”) is its own justification. The model is the explanation — nothing is reconstructed, so nothing can be unfaithful.

Post-hoc explanation means you train whatever is most accurate (a gradient-boosted forest, a deep net) and then fit a second procedure that approximates its behaviour in human terms — LIME, SHAP, saliency maps, all covered below. The danger is baked in: the explanation is an approximation of the model, and an approximation can be wrong.

Put side by side: the intrinsic road keeps the glass clear all the way through; the post-hoc road bolts a viewing window onto a sealed box.

A worked contrast. Suppose a bank predicts default. An intrinsic logistic-regression model gives coefficients: income: −0.8, prior_defaults: +1.5, credit_age: −0.3. The reason for any denial is read straight off these weights times the applicant’s values. A post-hoc model — say XGBoost plus SHAP — might be 3 points more accurate, but each denial’s reason is now an estimate of the boosted forest’s local behaviour.

Property	Intrinsic	Post-hoc
Faithfulness	Exact (model = explanation)	Approximate
Model choice	Restricted to simple families	Any model
Typical accuracy	Sometimes lower	Often higher
Examples	Linear/logistic, short trees, rule lists, GAMs	LIME, SHAP, Grad-CAM, counterfactuals
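
To make the intrinsic side of the contrast concrete, here is a minimal sketch of reading denial reasons straight off a logistic regression. The coefficients are the ones quoted above; the intercept and the applicant's standardised feature values are hypothetical, chosen only for illustration.

```python
import math

# Coefficients quoted in the text. The intercept and the applicant's
# standardised feature values are hypothetical, for illustration only.
COEFFICIENTS = {"income": -0.8, "prior_defaults": 1.5, "credit_age": -0.3}
INTERCEPT = -1.0
applicant = {"income": -0.5, "prior_defaults": 2.0, "credit_age": 1.0}

def explain(applicant):
    """Return P(default), the logit, and per-feature pushes on the log-odds."""
    contributions = {name: COEFFICIENTS[name] * applicant[name]
                     for name in COEFFICIENTS}
    logit = INTERCEPT + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    # Largest positive push toward "default" first: these are the reasons.
    reasons = sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)
    return probability, logit, reasons

probability, logit, reasons = explain(applicant)
print(f"P(default) = {probability:.3f}")
for name, push in reasons:
    print(f"{name:<15} {push:+.2f} on the log-odds")
```

Nothing is estimated here: each reason is a weight times a value, so the explanation cannot drift from the model.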

Warning

A frequent error is treating a post-hoc explanation as ground truth about the model. It is a hypothesis about the model’s reasoning. Two different methods (LIME and SHAP) can disagree on the same prediction, so neither can be taken at face value until you have checked how faithfully it tracks the model.

36.3 — Global vs local explanations

Explanations answer two different questions. Global explanations describe the model’s behaviour overall — “across all applicants, income matters most.” Local explanations describe one prediction — “this applicant was denied chiefly because of two recent missed payments.”

The distinction matters because the two rarely agree. A feature can be globally weak yet locally decisive: country might barely register across a million users, but for one fraud case a transaction from a high-risk country was the whole story. Conversely a globally dominant feature can be irrelevant to a specific case where it sits at a neutral value.

A practical workflow uses both: a global view to understand the model’s overall logic and catch systemic bias, and a local view to justify and audit individual decisions. Methods split along this axis too — permutation importance and partial dependence are global; LIME is local; SHAP can do both (local per-instance values that aggregate into a faithful global picture).
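
A small sketch of how a feature can be globally weak yet locally decisive. Everything here is a toy assumption: a linear scoring model with made-up weights, and a feature's contribution measured as its weight times its deviation from the dataset mean.

```python
# Hypothetical linear fraud score: weights and data are illustrative only.
WEIGHTS = {"income": -0.8, "country_risk": 3.0}

# 100 rows: income cycles through 0.0 .. 0.9; only the last row has country_risk = 1.
rows = [{"income": (i % 10) / 10, "country_risk": 1.0 if i == 99 else 0.0}
        for i in range(100)]

means = {k: sum(r[k] for r in rows) / len(rows) for k in WEIGHTS}

def local_contributions(row):
    """Contribution of each feature relative to the average row."""
    return {k: WEIGHTS[k] * (row[k] - means[k]) for k in WEIGHTS}

# Global view: average magnitude of the contribution across all rows.
global_importance = {
    k: sum(abs(local_contributions(r)[k]) for r in rows) / len(rows)
    for k in WEIGHTS
}
# Local view: the single flagged transaction.
local = local_contributions(rows[99])

print("global:", {k: round(v, 3) for k, v in global_importance.items()})
print("local :", {k: round(v, 3) for k, v in local.items()})
```

Globally `income` dominates because it varies on every row; for the one flagged row, `country_risk` carries almost the whole score.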

36.4 — Feature importance and permutation importance

The simplest global question is: which features does the model actually rely on? The most reliable model-agnostic answer is permutation importance: take a feature, shuffle its values across the dataset to destroy its relationship with the target, and measure how much the model’s performance drops. A big drop means the model leaned heavily on that feature; no drop means the model ignored it. (Held-out model evaluation is what makes the measured drop trustworthy.)

The intuition is a stress test. If I scramble income and the model’s error barely changes, income was not doing real work. If I scramble it and accuracy collapses, income was load-bearing.

\[\text{Imp}(j) = \text{Error}\big(\text{model on } X_{\text{shuffled } j}\big) - \text{Error}\big(\text{model on } X\big)\]

In words: a feature’s importance is how much worse the model gets once you scramble that one feature, compared with leaving everything intact. Also written: $\text{Imp}(j) = e_{\text{perm},j} - e_{\text{orig}}$, or as a ratio $e_{\text{perm},j}\,/\,e_{\text{orig}}$ when you want a scale-free “fold increase in error.”

A tiny worked example with real numbers. A model predicts default with baseline error (1 − accuracy) of $0.10$. We shuffle each feature in turn and re-measure:

Feature shuffled	New error	Importance (error increase)
`prior_defaults`	0.34	+0.24
`income`	0.19	+0.09
`credit_age`	0.12	+0.02
`favourite_colour`	0.10	0.00

prior_defaults is load-bearing — scrambling it more than triples the error. favourite_colour is dead weight — the model never used it. Now the same computation from scratch:

import numpy as np
def permutation_importance(model, X, y, metric):
    base = metric(y, model.predict(X))          # baseline error
    imps = np.zeros(X.shape[1])
    rng = np.random.default_rng(0)
    for j in range(X.shape[1]):
        Xp = X.copy()
        rng.shuffle(Xp[:, j])                    # break feature j only
        imps[j] = metric(y, model.predict(Xp)) - base
    return imps                                  # large => feature matters

In practice you reach for the battle-tested implementation, which repeats the shuffle several times and reports a mean and spread so you can tell a real effect from sampling noise:

from sklearn.inspection import permutation_importance
r = permutation_importance(model, X_test, y_test,
                           n_repeats=30, random_state=0, scoring="accuracy")
for j in r.importances_mean.argsort()[::-1]:
    print(f"{feature_names[j]:<16} {r.importances_mean[j]:.3f} +/- {r.importances_std[j]:.3f}")

Two caveats decide whether you can trust the numbers. First, correlated features split the credit: if height_cm and height_in both encode the same thing, shuffling one leaves the other to carry the signal, so both look unimportant. Second, prefer to compute importance on a held-out set — importance on the training set can reward overfitting. Note also what permutation importance is not: it measures importance to the model, not a causal effect in the world (see Causal Inference).
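
The correlated-feature caveat can be seen in a few lines. This is a toy sketch under stated assumptions: the target equals a standardised height, one model reads a single height column, and a second model averages two identical copies of it (standing in for `height_cm` and `height_in`). The helper below is a pure-Python stand-in, not the scikit-learn function.

```python
import random

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, seed=0):
    """Increase in MSE when each column is shuffled, one column at a time."""
    rng = random.Random(seed)
    base = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        rng.shuffle(column)                       # break feature j only
        X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
        importances.append(mse(y, [predict(row) for row in X_perm]) - base)
    return importances

data_rng = random.Random(42)
height = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
y = height[:]                                     # target is the height itself

# Model A sees one copy of the feature.
imp_single = permutation_importance(lambda r: r[0], [[h] for h in height], y)

# Model B sees two identical copies and splits its weight between them.
imp_dup = permutation_importance(lambda r: 0.5 * r[0] + 0.5 * r[1],
                                 [[h, h] for h in height], y)

print("single copy :", [round(v, 3) for v in imp_single])
print("two copies  :", [round(v, 3) for v in imp_dup])
```

Each duplicate scores only about a quarter of what the single column scores, even though the underlying signal is just as essential to the prediction.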

Warning

Tree-based “impurity” importance (the default feature_importances_ in scikit-learn) is biased toward high-cardinality features (many unique values) — a random ID column can rank near the top. Permutation importance on held-out data does not suffer this and should be preferred.

36.5 — Partial dependence

Knowing a feature is important does not tell you which way it pushes the prediction, or whether the relationship is a straight line, a curve, or a U-shape. Partial dependence plots (PDP) answer this: they show the average predicted output as one feature is swept across its range, holding the rest as-is.

The mechanism is direct. To get the partial dependence on feature $j$ at value $v$, set every row’s feature $j$ to $v$, predict, and average:

\[\text{PD}_j(v) = \frac{1}{n}\sum_{i=1}^{n} f\big(x^{(i)} \text{ with } x_j \!=\! v\big)\]

In words: pretend everyone in your dataset had the same value $v$ for this one feature, ask the model for all those predictions, and report their average. Also written: $\text{PD}_j(v) = \mathbb{E}_{x_{-j}}\big[\,f(v, x_{-j})\,\big]$, the expectation of the model output over the distribution of all the other features $x_{-j}$.

Sweep $v$ over a grid and plot. The curve is the model’s average response to that feature.

A worked example with three rows. The model is $f(x) = 0.5\,x_{\text{size}} + 2\,x_{\text{bedrooms}}$ (price in $100k), and we want the partial dependence on size at $v = 2$ (thousand sq ft):

rows (size, bedrooms):  (1, 2)  (3, 1)  (2, 3)
force size=2:           (2, 2)  (2, 1)  (2, 3)
predict:                 5.0     3.0     7.0
average  -> PD_size(2) = (5.0 + 3.0 + 7.0) / 3 = 5.0

Repeat for $v = 1, 2, 3, \dots$ and the sequence of averages traces the curve. The same loop in code:

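import numpy as np

def partial_dependence(model, X, j, grid):
    """Average prediction with feature j forced to each grid value in turn."""
    averages = []
    for v in grid:
        Xv = X.copy()
        Xv[:, j] = v                                  # force feature j to v everywhere
        averages.append(model.predict(Xv).mean())     # average prediction over all rows
    return np.array(averages)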

The library version draws the curve (and the ICE lines below) in one call:

from sklearn.inspection import PartialDependenceDisplay
PartialDependenceDisplay.from_estimator(
    model, X, features=["square_footage"], kind="both")   # "both" overlays ICE + PDP

In a real model a PDP of predicted house price against square_footage might rise steeply then flatten — the model paid a premium for size up to a point, then stopped caring. That shape is invisible in a single importance number.

The honest limitation: a PDP averages over interactions and assumes feature independence. If square_footage and num_bedrooms are correlated, forcing footage to a huge value while leaving bedrooms small creates impossible houses, and the average can mislead. ICE plots (Individual Conditional Expectation) — one line per row instead of the average — expose this by showing whether individuals behave differently from the mean. When correlation is the worry, Accumulated Local Effects (ALE) plots are the principled fix: instead of forcing impossible combinations, ALE looks only at small local changes within realistic regions of the data and accumulates them, so it stays honest even when features are tangled together.
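
A minimal sketch of why ICE lines matter. The model below is a hypothetical pure interaction, $f(x) = x_0 \cdot x_1$, with $x_1$ balanced between $-1$ and $+1$; it is chosen so that the averaging in a PDP cancels the effect completely.

```python
def model(row):
    """Hypothetical model with a pure interaction between the two features."""
    return row[0] * row[1]

# Toy data: the second feature is balanced between -1 and +1.
X = [[0.0, -1.0], [1.0, 1.0], [2.0, -1.0], [3.0, 1.0]]
grid = [0.0, 1.0, 2.0, 3.0]

def ice_curves(model, X, j, grid):
    """One curve per row: the prediction as feature j sweeps the grid."""
    curves = []
    for row in X:
        curve = []
        for v in grid:
            forced = row[:]          # copy, then force feature j to v
            forced[j] = v
            curve.append(model(forced))
        curves.append(curve)
    return curves

ice = ice_curves(model, X, 0, grid)
# The PDP is the pointwise average of the ICE curves.
pdp = [sum(curve[k] for curve in ice) / len(ice) for k in range(len(grid))]

print("PDP:", pdp)
for curve in ice:
    print("ICE:", curve)
```

The PDP is flat at zero, which suggests the feature does nothing, while the ICE lines fan out with slopes of $+1$ and $-1$: the feature matters for every row, just in opposite directions.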

36.6 — LIME

LIME (Local Interpretable Model-agnostic Explanations) explains one prediction by fitting a simple, transparent model that mimics the black box in the neighbourhood of that one point. The intuition: any curve, however complicated, looks like a straight line if you zoom in far enough. So zoom in on the instance you care about, and fit a line there.

The recipe for a single instance $x$:

1. Perturb $x$ many times to make nearby samples (mask words, jitter features, toggle pixels).
2. Label each perturbation with the black box’s prediction.
3. Weight each perturbation by how close it is to $x$ (near samples matter more).
4. Fit a sparse linear model on the perturbations, weighted. Its coefficients are the explanation.

Underneath, LIME minimizes a fidelity-plus-simplicity objective:

\[\xi(x) = \arg\min_{g \in G} \; \mathcal{L}\big(f, g, \pi_x\big) + \Omega(g)\]

In words: pick the simple model $g$ that best copies the black box $f$ on points near $x$ (that is the loss $\mathcal{L}$ weighted by the closeness kernel $\pi_x$), while staying as simple as possible (the penalty $\Omega$, e.g. few non-zero coefficients). Also written: $\xi(x) = \arg\min_{g}\;\sum_{z} \pi_x(z)\,\big(f(z) - g(z)\big)^2 + \Omega(g)$ — a weighted least-squares fit over perturbed samples $z$ with a sparsity penalty.

flowchart LR
  X[instance x] --> P[perturb -> neighbours]
  P --> B[black-box labels each]
  B --> W[weight by closeness to x]
  W --> L[fit weighted sparse linear model]
  L --> C[coefficients = local explanation]

For a text classifier that flags an email as spam, LIME might report +0.4 "free", +0.3 "winner", −0.1 "meeting" — the local weights that reconstruct this email’s score. For an image, it highlights the superpixels that pushed the class.

from lime.lime_text import LimeTextExplainer
explainer = LimeTextExplainer(class_names=["ham", "spam"])
exp = explainer.explain_instance(email_text, clf.predict_proba, num_features=6)
print(exp.as_list())     # e.g. [('free', 0.41), ('winner', 0.29), ('meeting', -0.11), ...]

LIME’s strengths are that it is model-agnostic and intuitive. Its well-known weakness is instability: because perturbation and weighting involve random sampling and a kernel-width choice, running LIME twice on the same instance can give different explanations. Treat its outputs as indicative, and check robustness before relying on them.
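
The recipe can be sketched from scratch for tabular inputs. This is an illustration under stated assumptions, not the `lime` package's implementation: the black box is a made-up function, perturbations are Gaussian jitter, the kernel is an exponential of squared distance, and the sparsity penalty $\Omega$ is left out so the fit is plain weighted least squares.

```python
import math
import random

def black_box(z):
    """Stand-in opaque model (hypothetical): curved in z[0], linear in z[1]."""
    return z[0] ** 2 + 3.0 * z[1]

def solve(A, b):
    """Solve A @ beta = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def lime_tabular(f, x, n_samples=2000, sigma=1.0, width=0.75, seed=0):
    """Local linear coefficients of f around x (no sparsity penalty)."""
    rng = random.Random(seed)
    d = len(x)
    XtWX = [[0.0] * (d + 1) for _ in range(d + 1)]
    XtWy = [0.0] * (d + 1)
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, sigma) for xi in x]        # 1. perturb
        target = f(z)                                       # 2. label with the black box
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x))
        w = math.exp(-dist2 / width ** 2)                   # 3. weight by closeness
        row = [1.0] + [zi - xi for zi, xi in zip(z, x)]     # intercept + centred features
        for a in range(d + 1):
            XtWy[a] += w * row[a] * target
            for b in range(d + 1):
                XtWX[a][b] += w * row[a] * row[b]
    beta = solve(XtWX, XtWy)                                # 4. weighted least squares
    return beta[1:]                                         # drop the intercept

x = [3.0, 1.0]
coef_a = lime_tabular(black_box, x, seed=0)
coef_b = lime_tabular(black_box, x, seed=1)
print("seed 0:", [round(c, 3) for c in coef_a])
print("seed 1:", [round(c, 3) for c in coef_b])
```

Both runs should land near the true local slopes at $x = (3, 1)$, which are $2 \cdot 3 = 6$ for the first feature and $3$ for the second, yet the two seeds do not agree exactly. That small gap is the instability described above, and it grows with fewer samples or a narrower kernel.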

36.7 — SHAP (Shapley values)

SHAP (SHapley Additive exPlanations) brings a guarantee that LIME lacks: it builds on the Shapley value from cooperative game theory, which is the unique way of dividing a payout that satisfies a small set of fairness axioms (efficiency, symmetry, null player, and additivity). Picture the features as players in a game and the prediction as the payout. How should the payout be fairly divided among the players? The Shapley value answers this: a feature’s contribution is its average marginal contribution over every possible order in which features could be added to the model.

The intuition is a fair split. Some features only help in the presence of others; to be fair you must average a feature’s added value across all orderings of who arrived before it. Think of three people building a project together: to pay each fairly you ask, for every order in which they could have joined, how much the project improved the moment that person walked in — then average those gains.

\[\phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!\,(|F|-|S|-1)!}{|F|!}\big[f(S \cup \{j\}) - f(S)\big]\]

In words: a feature’s SHAP value is the average, over every coalition of other features that might already be “in the room,” of how much adding this feature changes the prediction. Also written: $\phi_j = \frac{1}{|F|!}\sum_{\pi}\big[f(\text{before}_\pi(j)\cup\{j\}) - f(\text{before}_\pi(j))\big]$ — the plain average of the marginal gain of $j$ taken over all $|F|!$ feature orderings $\pi$.

Don’t let the factorials scare you — they are just bookkeeping. The bracket $f(S \cup \{j\}) - f(S)$ is the one thing that matters: how much the prediction changes the moment feature $j$ joins a group $S$ of features already present. The messy fraction in front is only there to weight each group so that, all told, every possible order of arrival counts equally. Net effect: $\phi_j$ is just “the average bump feature $j$ gives, no matter who got there first.”

This yields the prized additivity property: the SHAP values plus the baseline reconstruct the prediction exactly.

\[f(x) = \mathbb{E}[f] + \sum_{j} \phi_j\]

In words: start from the model’s average output, then add up every feature’s SHAP contribution, and you land exactly on this instance’s prediction — nothing is left over. Also written: $\hat{y}(x) = \phi_0 + \sum_{j=1}^{M}\phi_j$, where $\phi_0 = \mathbb{E}[f]$ is the base value.

That additivity is exactly what a SHAP waterfall plot draws: it starts at the base value and stacks each feature’s push (up or down) until it lands on the final prediction.

A fully worked two-feature example by hand. Features are defaults ($D$) and income ($I$); the baseline (predict-nothing) value is $f(\varnothing) = 0.10$. The model’s value for each feature subset:

Subset $S$	$f(S)$
$\varnothing$	0.10
$\{D\}$	0.30
$\{I\}$	0.05
$\{D, I\}$	0.32

With two features there are two orderings. Take feature $D$. In order $D$-then-$I$, $D$ is added to $\varnothing$: marginal $= 0.30 - 0.10 = 0.20$. In order $I$-then-$D$, $D$ is added to $\{I\}$: marginal $= 0.32 - 0.05 = 0.27$. Average: $\phi_D = (0.20 + 0.27)/2 = 0.235$. By the same logic $\phi_I = (\,(0.05-0.10) + (0.32-0.30)\,)/2 = (-0.05 + 0.02)/2 = -0.015$. Check additivity: $0.10 + 0.235 - 0.015 = 0.32 = f(\{D,I\})$. The decision is fully and exactly accounted for — defaults pushed up strongly, income nudged down slightly.

# Brute-force Shapley for one feature j (toy: few features only).
from itertools import permutations
def shapley(f, x, baseline, j):
    feats = list(range(len(x)))
    contrib, n = 0.0, 0
    for order in permutations(feats):
        present = []                            # features added before j
        for k in order:
            if k == j: break
            present.append(k)
        xa = baseline.copy()                    # only 'present' set to real values
        for k in present: xa[k] = x[k]
        xb = xa.copy(); xb[j] = x[j]            # now add j
        contrib += f(xb) - f(xa); n += 1        # marginal gain of j
    return contrib / n                          # average over orderings
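
As a cross-check on the hand calculation, the subset-weighted form of the formula can be evaluated directly on the two-feature table. This sketch takes the coalition values $f(S)$ as given in a dictionary; the helper name `shapley_value` is illustrative.

```python
from itertools import combinations
from math import factorial

# Coalition values f(S) from the worked example (D = defaults, I = income).
value = {
    frozenset(): 0.10,
    frozenset({"D"}): 0.30,
    frozenset({"I"}): 0.05,
    frozenset({"D", "I"}): 0.32,
}

def shapley_value(value, players, j):
    """Weighted sum over all subsets S that exclude j of f(S + j) - f(S)."""
    others = [p for p in players if p != j]
    n = len(players)
    phi = 0.0
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            S = frozenset(subset)
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            phi += weight * (value[S | {j}] - value[S])
    return phi

players = ["D", "I"]
phi = {p: shapley_value(value, players, p) for p in players}
reconstructed = value[frozenset()] + sum(phi.values())

print({p: round(v, 3) for p, v in phi.items()})   # {'D': 0.235, 'I': -0.015}
print(round(reconstructed, 3))                    # 0.32
```

The subset form and the ordering form give the same numbers; the factorial weights simply count how many orderings put exactly the members of $S$ ahead of $j$.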

The exact sum is exponential in the number of features, so practical SHAP uses fast approximations: KernelSHAP (a weighted-linear sampling scheme) for any model, and TreeSHAP (exact and polynomial-time) for tree ensembles. The library wraps both and produces the standard plots:

import shap
explainer = shap.TreeExplainer(model)          # exact + fast for tree ensembles
shap_values = explainer(X_test)
shap.plots.beeswarm(shap_values)               # global view: every dot a local SHAP value
shap.plots.waterfall(shap_values[0])           # local view: one prediction, feature by feature

Aggregating local SHAP values across the dataset gives a faithful global importance ranking — the bridge between local and global from §36.3.

Tip

Prefer SHAP when you need consistency and a guarantee that explanations add up to the prediction. Its summary (beeswarm) plot — each dot a SHAP value, coloured by feature value — is the single most informative one-glance view of a tabular model.

36.8 — Anchors and rule-based local explanations

LIME and SHAP hand you numbers — a weight per feature. But a loan officer or a regulator often wants a rule they can state out loud and trust: “for cases like this one, the model says deny — period.” Anchors (from the same group that built LIME) provide exactly that: an IF-THEN rule that “anchors” the prediction so firmly that, as long as the rule’s conditions hold, the other features almost never change the outcome.

The intuition is a sufficient condition. LIME tells you the slope at a point; an anchor tells you a box around the point inside which the prediction is locked. For a denied applicant an anchor might read:

IF prior_defaults ≥ 2 AND income < $40k THEN predict deny (precision 0.97, coverage 0.12)

Two numbers make the rule trustworthy and honest about its reach:

\[\text{precision}(A) = P\big(f(z) = f(x) \mid z \in A\big), \qquad \text{coverage}(A) = P\big(z \in A\big)\]

In words: precision is how often the model gives the same answer for perturbed inputs that still satisfy the rule (how reliable the rule is), and coverage is what fraction of all inputs the rule even applies to (how broadly useful it is). Also written: $\text{precision}(A) = \mathbb{E}_{z\sim \mathcal{D}(\cdot|A)}[\mathbb{1}\{f(z)=f(x)\}]$ and $\text{coverage}(A) = \mathbb{E}_{z\sim\mathcal{D}}[\mathbb{1}\{z\in A\}]$.

Anchors searches for the shortest rule whose precision clears a threshold (say 0.95), greedily adding conditions and using a bandit-style sampling test to confirm precision without enumerating everything.

from anchor.anchor_tabular import AnchorTabularExplainer
explainer = AnchorTabularExplainer(class_names, feature_names, X_train)
exp = explainer.explain_instance(x, model.predict, threshold=0.95)
print("ANCHOR:", " AND ".join(exp.names()))
print("precision %.2f  coverage %.2f" % (exp.precision(), exp.coverage()))

Where SHAP gives a smooth, additive attribution and LIME gives a local slope, an anchor gives a crisp, high-precision if-then — the form humans audit and act on most easily. The tradeoff is coverage: a very reliable anchor may apply to only a thin slice of cases, and complex boundaries may admit no short high-precision rule at all.
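
Precision and coverage are easy to compute once a candidate rule is in hand. The sketch below is a simplification: the model is a hypothetical hand-written rule set, and a small enumerated population stands in for the perturbation distribution that Anchors samples from.

```python
def model(prior_defaults, income_k):
    """Hypothetical black box: returns 'deny' or 'approve'."""
    deny = (prior_defaults >= 2 and income_k < 50) or prior_defaults >= 4
    return "deny" if deny else "approve"

# Toy population: every combination of 0-4 prior defaults and $20k-$95k income.
population = [(d, inc) for d in range(5) for inc in range(20, 100, 5)]
x = (3, 30)                                   # the instance being explained

def precision_and_coverage(rule, x, population):
    covered = [z for z in population if rule(*z)]
    same = [z for z in covered if model(*z) == model(*x)]
    coverage = len(covered) / len(population)
    precision = len(same) / len(covered) if covered else 0.0
    return precision, coverage

tight = precision_and_coverage(lambda d, inc: d >= 2 and inc < 40, x, population)
loose = precision_and_coverage(lambda d, inc: d >= 2, x, population)

print("prior_defaults >= 2 AND income < 40k -> precision %.2f, coverage %.2f" % tight)
print("prior_defaults >= 2                  -> precision %.2f, coverage %.2f" % loose)
```

The tighter rule locks the prediction (precision 1.00) but applies to only 15% of cases; dropping the income condition raises coverage to 60% while precision falls to about 0.58. That is the tradeoff an anchor search navigates.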

36.9 — Saliency maps / Grad-CAM for vision

For images, the natural question is where in the picture did the model look? This is the core toolkit of Computer Vision interpretability. Saliency maps answer it by computing the gradient of the predicted class score with respect to each input pixel: $\big|\partial\, s_c / \partial\, x_{ij}\big|$. A large gradient means nudging that pixel would most change the score — so that pixel mattered. Raw pixel-gradient saliency is fast but noisy and scattered.

In words: the saliency at a pixel is how sharply the class score would move if you tweaked that pixel a hair — steep sensitivity means the pixel was influential. Also written: $M_{ij} = \left\lvert \nabla_{x_{ij}} s_c(x)\right\rvert$, the absolute value of the input-gradient of the class-$c$ logit.

Grad-CAM (Gradient-weighted Class Activation Mapping) gives a cleaner, more localised heatmap by working at the last convolutional layer rather than raw pixels — the layer where spatial position still survives but features are semantic (“ear”, “wheel”). The steps:

1. Forward-pass the image; pick the class $c$ to explain.
2. Backprop $s_c$ to the last conv layer’s feature maps $A^k$, getting gradients.
3. Average each map’s gradients into a weight $\alpha_k^c = \frac{1}{Z}\sum_{i,j}\frac{\partial s_c}{\partial A^k_{ij}}$, which says how important channel $k$ is for class $c$.
4. Combine and keep positives: $L^c = \text{ReLU}\!\big(\sum_k \alpha_k^c A^k\big)$, then upsample to image size.

For step 4: In words: weight each feature map by how much it helps class $c$, add them up, and throw away the negative parts so only regions that support the class light up. Also written: $L^c_{\text{Grad-CAM}} = \max\!\big(0,\ \sum_k \alpha_k^c A^k\big)$.
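
The weighting and combining steps are plain arithmetic, shown here on two tiny $2 \times 2$ feature maps. The activations and gradients are made-up numbers, and the final upsampling step is omitted.

```python
def grad_cam(feature_maps, gradients):
    """Channel weights alpha_k and the ReLU of the weighted sum of feature maps."""
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    # alpha_k: global average of the gradient over each map's spatial positions.
    alphas = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    cam = [[0.0] * width for _ in range(height)]
    for alpha, A in zip(alphas, feature_maps):
        for i in range(height):
            for j in range(width):
                cam[i][j] += alpha * A[i][j]
    # Keep only regions that support the class.
    cam = [[max(0.0, v) for v in row] for row in cam]
    return alphas, cam

# Hypothetical activations A^k and gradients of the class score w.r.t. A^k.
A1 = [[1.0, 0.0], [0.0, 2.0]]
A2 = [[0.0, 3.0], [1.0, 0.0]]
G1 = [[1.0, 0.0], [0.0, 1.0]]       # mean +0.5: channel 1 supports the class
G2 = [[-1.0, -1.0], [0.0, 0.0]]     # mean -0.5: channel 2 argues against it

alphas, cam = grad_cam([A1, A2], [G1, G2])
print(alphas)   # [0.5, -0.5]
print(cam)      # [[0.5, 0.0], [0.0, 1.0]]
```

Before the ReLU the weighted sum is `[[0.5, -1.5], [-0.5, 1.0]]`; the two negative cells are where channel 2 fires, and clipping them to zero leaves only the evidence for the class.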

A Grad-CAM heatmap is a “hot spot” laid over the photo. For a correctly classified dog, a healthy result has the heat sitting on the animal (for example its face), not on the background.

flowchart LR
  I[input image] --> CNN[conv layers]
  CNN --> A[last conv maps A_k]
  A --> S[class score s_c]
  S -. backprop .-> G[grad-based weights alpha_k]
  A --> Combine[weighted sum + ReLU]
  G --> Combine
  Combine --> H[upsample -> heatmap overlay]

A minimal PyTorch sketch hooks the last conv layer, backprops the chosen class, and forms the weighted map:

import torch, torch.nn.functional as F
feats, grads = {}, {}
layer = model.layer4[-1]                                   # last conv block (ResNet)
layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

scores = model(img)                                        # img: (1,3,H,W)
c = scores.argmax(1).item()                                # predicted class index as a Python int
model.zero_grad(); scores[0, c].backward()                 # backprop the chosen class

alpha = grads["g"].mean(dim=(2, 3), keepdim=True)          # global-avg-pool the gradients
cam = F.relu((alpha * feats["a"]).sum(1))                  # weighted sum + ReLU
cam = F.interpolate(cam[None], size=img.shape[-2:], mode="bilinear")  # upsample to image
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0,1] heatmap

The result overlays a “hot” region on the original image. This is exactly the tool that exposes shortcut learning: if a “husky vs wolf” classifier’s Grad-CAM lights up the snow in the background instead of the animal, you have caught a model that learned “snow ⇒ wolf.” A vivid debugging win that a confusion matrix would never reveal.

Warning

A pretty heatmap over the right object is necessary but not sufficient evidence the model is correct. Saliency shows where attention fell, not what the model concluded from it — and several saliency methods have failed sanity checks (producing similar maps even when the model’s weights are randomised). Use them as a debugging signal, not proof of soundness.

36.10 — Attention as (imperfect) explanation

Transformers compute attention weights — for each token, a distribution over other tokens saying “how much I read from each.” It is tempting to read these as explanations: the token a head attends to most is “the reason.” Attention is convenient because it is produced for free during the forward pass, and it often looks meaningful (a pronoun attending to its antecedent).

But the field’s consensus, after the “Attention is not Explanation” / “Attention is not not Explanation” debate, is cautious. The core problems:

Attention is not unique. You can often find different attention distributions that yield the same prediction. If many weightings produce the same output, no single one is “the” explanation.
Attention ≠ contribution. A token can receive high attention yet have little effect on the output, because information also flows through value vectors, residual connections, and many layers. High weight does not equal high influence.
Many heads, many layers. Which of the dozens of attention maps is the explanation? Aggregating them (e.g. attention rollout) is itself a modelling choice.

Picture a single attention row: the head spreads its weight over many tokens, and no single cell is clearly “the” reason.

The pragmatic stance: attention is a useful hint about information routing and a fine visualization, but for a faithful per-feature attribution prefer SHAP-style or gradient-based methods that measure actual effect on the output. Attention shows what was looked at, not necessarily what was used.
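
The gap between attention and contribution shows up even in a one-head toy. In this sketch the tokens, scores, and scalar "value vectors" are all made up, and a token's effect is measured by zeroing its value while holding the attention weights fixed.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical single attention head with scalar values per token.
tokens = ["the", "not", "movie"]
scores = [2.0, 0.5, 0.0]          # raw attention scores
values = [0.01, 5.0, 0.0]         # what each token actually carries

weights = softmax(scores)
output = sum(w * v for w, v in zip(weights, values))

def output_without(k):
    """Ablate token k by zeroing its value, keeping attention weights fixed."""
    return sum(w * (0.0 if i == k else v)
               for i, (w, v) in enumerate(zip(weights, values)))

effects = [output - output_without(k) for k in range(len(tokens))]

for tok, w, e in zip(tokens, weights, effects):
    print(f"{tok:<6} attention {w:.2f}   effect on output {e:+.3f}")
```

"the" receives about 74% of the attention but moves the output by less than 0.01, while "not" gets about 16% and accounts for nearly all of it. The attention weight says where the head looked; the ablation says what was used.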

36.11 — Counterfactual explanations

The other methods answer “why this prediction?” A counterfactual explanation answers a more actionable question: “what would have to change to flip the decision?” It is the most human form of explanation — we naturally reason in “if only” terms.

Formally, given an instance $x$ that the model classifies undesirably (loan denied), find the closest point $x'$ that the model classifies the desired way (approved):

\[x' = \arg\min_{x'} \; d(x, x') \quad \text{subject to}\quad f(x') = \text{desired}\]

In words: find the nearest tweaked version of this case that the model would have decided the other way — the smallest realistic change that flips the outcome. Also written: as a single soft objective, $x' = \arg\min_{x'}\;\big(f(x') - y_{\text{target}}\big)^2 + \lambda\,d(x, x')$, trading off “reach the desired class” against “stay close to the original.”

The distance $d$ keeps the change small and realistic — change as few features as little as possible.

A counterfactual is the smallest nudge that carries a point across the decision boundary: the denied point moves just far enough to land in “approved”.

A worked example. An applicant is denied; the model approves when a score crosses $0.5$. Their current feature values and the closest approving point:

Feature	Current	Counterfactual	Change
annual income	$42,000	$48,000	+$6,000
existing loans	2	2	—
credit age (yr)	4	4	—

The output reads like advice: “Your loan was denied. Had your annual income been $6,000 higher, it would have been approved.” That single sentence is often more useful to a rejected applicant than any importance bar chart, and it fits naturally alongside “adverse action notice” style regulation, which asks for the principal reasons behind a denial.

# DiCE: diverse counterfactuals for a tabular model.
import dice_ml
d = dice_ml.Data(dataframe=df, continuous_features=["income", "credit_age"], outcome_name="approved")
m = dice_ml.Model(model=clf, backend="sklearn")
cf = dice_ml.Dice(d, m).generate_counterfactuals(x_query, total_CFs=3, desired_class=1)
cf.visualize_as_dataframe()     # 3 nearby "what to change" recipes that flip the decision

Good counterfactuals respect three properties: proximity (close to the original), actionability (don’t tell someone to lower their age or change their race — only mutable features), and plausibility (stay on the data manifold; “income $1M, age 19” is not a credible recommendation). They also connect to fairness: if flipping only a protected attribute flips the decision, that is direct evidence of discrimination.
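
A minimal sketch of a proximity-plus-actionability search, set up to reproduce the worked example. The scoring model and its weights are hypothetical, and the search is deliberately naive: it raises only the one mutable feature (income) in $1,000 steps until the decision flips.

```python
import math

# Hypothetical approval model: weights and bias chosen for illustration only.
WEIGHTS = {"income_k": 0.10, "existing_loans": -0.30, "credit_age": 0.05}
BIAS = -4.35

def approval_score(applicant):
    logit = BIAS + sum(WEIGHTS[k] * applicant[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-logit))

def nearest_income_counterfactual(applicant, step_k=1, max_raise_k=100, threshold=0.5):
    """Smallest income increase (in $1k steps) that flips the decision.

    Only income is changed; loans and credit age are held fixed, which is
    the actionability constraint in its simplest form.
    """
    for raise_k in range(0, max_raise_k + 1, step_k):
        candidate = dict(applicant, income_k=applicant["income_k"] + raise_k)
        if approval_score(candidate) >= threshold:
            return candidate, raise_k
    return None, None

applicant = {"income_k": 42, "existing_loans": 2, "credit_age": 4}
counterfactual, raise_k = nearest_income_counterfactual(applicant)

print(f"current score: {approval_score(applicant):.3f}")
print(f"approved at income ${counterfactual['income_k']}k (+${raise_k}k)")
```

Libraries such as DiCE search over several features at once and return diverse options; the one-dimensional loop here only shows the core idea of walking outward from $x$ until the prediction changes.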

Tip

Counterfactuals and feature attribution are complementary. Attribution (SHAP) says why the decision was made; the counterfactual says what to do about it. For a person on the receiving end of a decision, the counterfactual is usually what they actually want.

36.12 — Concept-based explanations (TCAV)

Feature attributions speak the model’s language — “pixel (143, 88),” “token 12.” Humans think in concepts — “stripes,” “wrinkles,” “the word urgent.” Concept-based explanations bridge that gap by asking: how much does a human-named concept influence the model’s prediction? The flagship method is TCAV (Testing with Concept Activation Vectors).

The intuition: you teach the probe, not the model, what a concept looks like, then measure whether the model’s decisions move along that concept’s direction. You collect example images of the concept (photos with stripes) and not the concept (random photos), look at their activations in some hidden layer, and find the direction in activation space that separates them. That direction is the Concept Activation Vector $v_C^\ell$ — a learned arrow pointing toward “more striped.”

Then TCAV asks: for a target class (say “zebra”), does nudging activations along $v_C^\ell$ tend to increase the class score? The TCAV score is simply the fraction of class examples for which it does:

\[\text{TCAV}_{C,k}^{\ell} = \frac{\big|\{x \in X_k : \nabla h_{\ell,k}(f_\ell(x)) \cdot v_C^\ell > 0\}\big|}{|X_k|}\]

In words: out of all images of class $k$, what fraction have a class score that goes up when you push their hidden activations in the concept’s direction — that fraction is how much the concept matters to the class. Also written: $\text{TCAV}_{C,k}^{\ell} = \mathbb{E}_{x\in X_k}\big[\mathbb{1}\{\,\partial_{v_C^\ell} s_k(x) > 0\,\}\big]$, the expected indicator of a positive directional derivative along $v_C^\ell$.

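A TCAV score near $1.0$ for “stripes” on the “zebra” class means that nudging activations toward “more striped” raises the zebra score for almost every zebra image; near $0.5$ means the concept is irrelevant (its direction is no better than random); near $0$ means the concept consistently pushes the score down. The score counts signs, not magnitudes, so it measures how consistently a concept helps the class rather than how strongly. The win is that the explanation is stated in a vocabulary a domain expert chose: a dermatologist can ask “does the model use irregular borders to flag melanoma?” and get a number, rather than squinting at a pixel heatmap.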

flowchart LR
  Pos[concept examples: striped] --> Act[hidden-layer activations]
  Neg[random examples] --> Act
  Act --> CAV[fit linear probe -> concept vector v_C]
  CAV --> Dir[directional derivative of class score along v_C]
  Dir --> Score[TCAV score = fraction with positive derivative]
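
The pipeline in the diagram can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not a reference implementation: the activations and class-score gradients are synthetic, the function names are hypothetical, and the concept vector is taken as a normalised difference of means instead of the weights of a trained linear classifier. In a real model the gradients would come from autodiff at the chosen layer.

```python
import numpy as np

def concept_activation_vector(concept_acts, random_acts):
    """Unit vector from the random-example mean toward the concept-example mean.

    Simplification: difference of means, not a fitted linear probe.
    """
    v = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def tcav_score(class_grads, cav):
    """Fraction of class examples whose score gradient points along the concept vector."""
    return float(np.mean(class_grads @ cav > 0))

rng = np.random.default_rng(0)
dim = 8
stripes = np.zeros(dim)
stripes[0] = 1.0                                   # assumed "true" concept direction

# Synthetic layer activations for concept examples vs random examples.
concept_acts = rng.normal(size=(200, dim)) + 3.0 * stripes
random_acts = rng.normal(size=(200, dim))
cav = concept_activation_vector(concept_acts, random_acts)

# Synthetic gradients of the class score with respect to the layer activations.
zebra_grads = stripes + 0.3 * rng.normal(size=(500, dim))   # class that uses the concept
other_grads = rng.normal(size=(500, dim))                   # class that ignores it

score_zebra = tcav_score(zebra_grads, cav)
score_other = tcav_score(other_grads, cav)
print(f"zebra: {score_zebra:.2f}   unrelated class: {score_other:.2f}")
```

The class built to depend on the concept scores close to 1, while the unrelated class hovers around 0.5, which is the "no better than random" reading described above.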

The same cautions apply as for probing (§36.13): a concept needs enough clean examples, and you should run a statistical test against random concept sets to confirm a high TCAV score is not noise. Used well, TCAV catches a model relying on a concept it shouldn’t — e.g. a skin-cancer model keying on the ruler marks dermatologists place beside malignant lesions.

36.13 — Mechanistic interpretability and probing for deep nets/LLMs

Most of the methods so far treat the network as a black box and study its inputs and outputs; even gradient-based saliency and TCAV only read off gradients or activations without asking what computation produced them. Mechanistic interpretability takes the opposite, more ambitious stance: open the box and reverse-engineer the internal algorithm, meaning the specific neurons, attention heads, and weight pathways (circuits) that implement a behaviour. This is an active research frontier for Large Language Models. The aspiration is to understand a network the way you understand a decompiled program.

A milder, widely-used cousin is probing. The idea in one line: if you can read a fact off a layer with a simple ruler, the layer must already be storing it. So you freeze the model, grab its hidden activations for a batch of inputs, and train a tiny classifier (the probe) to predict some concept — part of speech, sentiment, whether a chess position is winning — from those activations alone. If the probe scores high, the information was sitting there, ready to be read; if it scores at chance, the layer isn’t holding that concept (at least not in a simple, linear form).

# Probing: can a linear classifier read 'concept' from layer activations?
# acts: (n_examples, hidden_dim) frozen activations; labels: concept tags
from sklearn.linear_model import LogisticRegression
probe = LogisticRegression(max_iter=1000).fit(acts_train, y_train)
print("probe accuracy:", probe.score(acts_test, y_test))
# high => the concept is linearly encoded at this layer; low => it isn't (here)

The field has produced striking concrete results, such as induction heads in transformers that implement in-context copying (“…A B … A → predict B”). It has also exposed a real obstacle: polysemanticity, where a single neuron fires for several unrelated concepts because the network packs more features than it has neurons (superposition). Sparse autoencoders are the current tool for disentangling these into monosemantic features. Probing has its own trap: a powerful probe might learn the concept itself rather than find it already there, so control with a randomized-label baseline to confirm that the model, not the probe, holds the information.
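
A randomized-label control might look like the sketch below. It is self-contained, so it uses synthetic activations and swaps in a least-squares linear probe in place of logistic regression; the names `fit_linear_probe` and `probe_accuracy` are hypothetical. The same probe is fitted twice, once on the real concept labels and once on shuffled labels, and held-out accuracy is compared.

```python
import numpy as np

def fit_linear_probe(acts, labels):
    """Least-squares linear probe on +/-1 targets (a stand-in for logistic regression)."""
    design = np.hstack([acts, np.ones((len(acts), 1))])      # append a bias column
    weights, *_ = np.linalg.lstsq(design, 2.0 * labels - 1.0, rcond=None)
    return weights

def probe_accuracy(weights, acts, labels):
    """Held-out accuracy of the probe's sign prediction."""
    design = np.hstack([acts, np.ones((len(acts), 1))])
    return float(np.mean((design @ weights > 0) == (labels == 1)))

rng = np.random.default_rng(0)
n_examples, hidden_dim = 800, 16
acts = rng.normal(size=(n_examples, hidden_dim))             # synthetic "frozen activations"
concept_direction = rng.normal(size=hidden_dim)
labels = (acts @ concept_direction > 0).astype(int)          # linearly encoded by construction
shuffled = rng.permutation(labels)                           # control: labels decoupled from acts

train, test = slice(0, 600), slice(600, None)
real_acc = probe_accuracy(
    fit_linear_probe(acts[train], labels[train]), acts[test], labels[test])
control_acc = probe_accuracy(
    fit_linear_probe(acts[train], shuffled[train]), acts[test], shuffled[test])
print(f"real labels: {real_acc:.2f}   shuffled labels: {control_acc:.2f}")
```

The gap between the two numbers is the evidence: high accuracy on real labels alongside near-chance accuracy on shuffled labels suggests the probe is reading structure that is in the activations. If the shuffled run also scores well, suspect leakage or an over-powerful probe before crediting the layer.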

A complementary, more causal tool is activation patching (a.k.a. causal tracing): run the model on a clean input, run it again on a corrupted one, then copy a specific activation from the clean run into the corrupted run and see if the correct answer is restored. If patching one head’s output flips the prediction back, that head causally carries the relevant information — much stronger evidence than a correlational probe.

flowchart LR
  Clean["clean run: 'Paris is in ___' → France"] --> Cache[cache activations]
  Corrupt["corrupted run: 'Rome is in ___' → Italy"] --> Patch[patch in one cached head]
  Cache --> Patch
  Patch --> Test{answer flips back to France?}
  Test -->|yes| Causal[that head carries the fact]
  Test -->|no| Skip[head not responsible]
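
The patching loop in the diagram can be tried on a toy network small enough to check by hand. Everything below is an assumption made for illustration: a two-input, four-unit ReLU layer whose units are grouped into two hypothetical "heads", with weights chosen so that only `head_0` reads the input feature that differs between the clean and corrupted runs. Patching a real transformer would use framework hooks, not an argument to `forward`.

```python
import numpy as np

# Toy weights: units 0-1 ("head_0") read input feature 0, units 2-3 ("head_1") read feature 1.
W_IN = np.array([[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.5]])
W_OUT = np.array([1.0, 1.0, 0.5, -0.5])
HEADS = {"head_0": slice(0, 2), "head_1": slice(2, 4)}

def forward(x, patch=None):
    """Run the toy model; optionally overwrite one head with cached activations."""
    hidden = np.maximum(W_IN @ x, 0.0)
    if patch is not None:
        head, cached_hidden = patch
        hidden = hidden.copy()
        hidden[HEADS[head]] = cached_hidden[HEADS[head]]
    return float(W_OUT @ hidden), hidden

x_clean = np.array([1.0, 1.0])
x_corrupt = np.array([0.0, 1.0])        # differs from the clean input only in feature 0

clean_out, clean_hidden = forward(x_clean)       # cache the clean activations
corrupt_out, _ = forward(x_corrupt)

# Fraction of the clean-vs-corrupt output gap restored by patching each head in turn.
restored = {}
for head in HEADS:
    patched_out, _ = forward(x_corrupt, patch=(head, clean_hidden))
    restored[head] = (patched_out - corrupt_out) / (clean_out - corrupt_out)
print(restored)
```

In this toy, patching `head_0` restores the whole gap and patching `head_1` restores none of it, which is the yes/no branch at the end of the flowchart expressed as a number.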

Tip

Where this shows up in practice. Anthropic’s interpretability team used sparse autoencoders on a production LLM to pull out millions of human-nameable features — a “Golden Gate Bridge” feature, a “code with a security bug” feature — and showed that clamping a feature up or down steers the model’s behaviour. That is mechanistic interpretability leaving the lab: not just reading a circuit, but using it as a dial.

flowchart TB
  subgraph blackbox[Treat model as black box]
    A[input/output methods: LIME, SHAP, PDP, saliency]
  end
  subgraph internals[Open the box]
    B[Probing: read concepts from activations]
    C[Mechanistic: circuits, heads, neurons]
    D[Sparse autoencoders: disentangle superposition]
  end
  A -. less faithful, more general .-> internals

36.14 — Evaluating explanations: faithfulness and stability

Every method above produces an explanation — but is the explanation any good? A plausible-looking heatmap or a tidy feature ranking can still be wrong about the model. Because there is rarely a ground-truth “correct explanation,” the field measures explanations by proxy properties you can test directly. Two matter most.

Faithfulness asks whether the explanation reflects what the model actually does. The standard test is deletion / insertion: rank features by the explanation’s importance, then remove (or add) them one by one and watch the prediction. A faithful explanation should make the score drop fast as you delete its top-ranked features.

\[\text{AOPC} = \frac{1}{K+1}\sum_{k=0}^{K}\Big(f(x) - f\big(x_{\setminus \text{top-}k}\big)\Big)\]

In words: the average amount the prediction falls as you progressively delete the features the explanation called most important — a bigger area means a more faithful explanation. Also written: $\text{AOPC} = \langle\, f(x) - f(x_{\setminus \text{top-}k})\,\rangle_k$, the mean prediction drop over the deletion curve.

Stability (robustness) asks whether a tiny, prediction-preserving change to the input produces a similarly tiny change in the explanation. If two near-identical inputs get wildly different explanations, the method is too noisy to trust — this is precisely LIME’s known weakness.

\[\text{instab}(x) = \max_{\;x' :\, \lVert x'-x\rVert \le \epsilon}\; \lVert g(x') - g(x)\rVert\]

In words: the worst-case change in the explanation over all inputs within a small neighbourhood of $x$; smaller is more stable. A closely related, scale-normalised variant is the local Lipschitz estimate $\sup_{0 < \lVert x'-x\rVert \le \epsilon}\frac{\lVert g(x') - g(x)\rVert}{\lVert x' - x\rVert}$, which divides the change in the explanation by the size of the input change.
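
The maximum over the $\epsilon$-ball is rarely computable exactly, so in practice it is usually estimated by sampling. The sketch below does that for three hypothetical explanation functions, each defined as the analytic gradient of a toy model; the function names, the sampling scheme, and the sample count are assumptions. Because sampling can miss the worst point, the result is a lower bound on $\text{instab}(x)$.

```python
import numpy as np

def estimate_instability(explain_fn, x, epsilon=0.05, n_samples=200, seed=0):
    """Sampled lower bound on max ||g(x') - g(x)|| over the epsilon-ball around x."""
    rng = np.random.default_rng(seed)
    g_x = explain_fn(x)
    worst = 0.0
    for _ in range(n_samples):
        direction = rng.normal(size=x.shape)
        direction /= np.linalg.norm(direction)
        radius = epsilon * rng.uniform() ** (1.0 / x.size)   # uniform inside the ball
        x_prime = x + radius * direction
        worst = max(worst, float(np.linalg.norm(explain_fn(x_prime) - g_x)))
    return worst

w = np.array([1.0, -2.0, 0.5])
x = np.array([0.2, 0.1, -0.3])

def linear_expl(z):
    """Gradient of f(z) = w.z: the same explanation everywhere."""
    return w

def smooth_expl(z):
    """Gradient of f(z) = sin(w.z): changes slowly with z."""
    return np.cos(z @ w) * w

def sharp_expl(z):
    """Gradient of f(z) = sin(50 w.z): changes violently with z."""
    return 50.0 * np.cos(50.0 * (z @ w)) * w

results = {fn.__name__: estimate_instability(fn, x)
           for fn in (linear_expl, smooth_expl, sharp_expl)}
print(results)
```

Using the same seed for every method means all three are judged on identical perturbations, so differences in the score come from the explanation method and not from the sampling.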

A small worked check for faithfulness in code — delete top-ranked features and confirm the score collapses:

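import numpy as np

def deletion_auc(model, x, importances, baseline=0.0):
    """Mean prediction drop as features are deleted, most important first (AOPC)."""
    # Assumes model.predict returns a continuous score (e.g. a regressor);
    # for a classifier, score with predict_proba(...)[:, k] rather than hard labels.
    x = np.asarray(x, dtype=float)
    order = np.argsort(importances)[::-1]       # most important first
    xc, scores = x.copy(), [model.predict([x])[0]]
    for j in order:
        xc[j] = baseline                        # ablate feature j (0.0 is an assumed baseline)
        scores.append(model.predict([xc])[0])
    return float(np.mean(scores[0] - np.array(scores)))  # larger => more faithful explanation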
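
Insertion is the mirror image of the deletion test: start from a baseline input, add features back in the order the explanation ranks them, and check that the score recovers quickly. The sketch below is a self-contained illustration on a hypothetical linear toy model; `insertion_score`, `toy_model`, and the all-zeros baseline are assumptions, not a standard API.

```python
import numpy as np

def insertion_score(score_fn, x, importances, baseline=0.0):
    """Mean score gain over the baseline as features are restored, most important first."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(importances)[::-1]        # most important first
    current = np.full_like(x, baseline)          # start from the all-baseline input
    scores = [score_fn(current)]
    for j in order:
        current[j] = x[j]                        # restore feature j to its real value
        scores.append(score_fn(current))
    scores = np.asarray(scores)
    return float(np.mean(scores - scores[0]))    # larger => score recovers sooner

weights = np.array([4.0, 3.0, 2.0, 1.0])

def toy_model(z):
    """Hypothetical linear scorer, so the true importances are known exactly."""
    return float(weights @ z)

x = np.ones(4)
faithful = insertion_score(toy_model, x, importances=weights * x)
misleading = insertion_score(toy_model, x, importances=-(weights * x))   # reversed ranking
print(faithful, misleading)
```

The correct ranking earns the higher score because the largest contributions are restored first. Comparing a candidate explanation against a reversed or random ranking gives the number a reference point.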

The lesson loops back to the chapter’s thesis: do not adopt an explanation method on looks alone. Run a deletion test for faithfulness and a perturbation test for stability — an explanation that fails both is decoration, not insight.

36.15 — The accuracy-vs-interpretability tradeoff and the limits of explanations

A folk belief holds that there is an iron law: the more accurate the model, the less interpretable, so you must trade one for the other. The picture is often true — a deep ensemble usually beats a 4-leaf tree — but it is not a law.

The exception matters (it is the red point in the accompanying tradeoff figure): interpretable-by-design, high-accuracy models exist. GAMs / Explainable Boosting Machines fit a flexible shape function per feature ($\hat y = \sum_j f_j(x_j)$) and often rival boosted trees on tabular data while staying readable, because each $f_j$ is a plottable curve.

In words: a GAM predicts by adding up a separate learned curve for each feature, so you can literally plot and read off how every feature bends the prediction. Also written: $g(\mathbb{E}[y]) = \beta_0 + \sum_j f_j(x_j)\,(+\sum_{j<k} f_{jk}(x_j,x_k))$ — a link function $g$ on a sum of per-feature shape functions, optionally plus a few pairwise interaction terms (the EBM extension).

# Explainable Boosting Machine: glass-box accuracy on tabular data.
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier
ebm = ExplainableBoostingClassifier().fit(X_train, y_train)
show(ebm.explain_global())     # per-feature shape curves you can read directly

For many tabular problems the famous tradeoff is a few decimal points wide, and a leading argument (Rudin’s) is that for high-stakes decisions you should reach for an inherently interpretable model first, not explain a black box you didn’t have to use.

Finally, the limits of explanation — the disclaimer the whole chapter has been building toward:

Post-hoc explanations can be unfaithful. They approximate the model; LIME and SHAP can disagree, and saliency methods have failed sanity checks.
Explanations don’t establish causation. “Feature $j$ raised the prediction” describes the model, not the world (Causal Inference handles real causality).
Plausible ≠ correct. A convincing explanation can rationalize a wrong or biased decision, lending false confidence.
They can be gamed. A model can be tuned to produce innocent-looking explanations while behaving badly — “fairwashing.”

An explanation is a tool for a human to reason about a model, not a certificate of its correctness. Use it to form hypotheses and catch bugs; verify those hypotheses with controlled tests.

36.16 — Quick reference

Term / method	What it means	When / why to use
Intrinsic interpretability	Model structure is the explanation (linear, short tree, rule list, GAM)	High-stakes decisions where exact faithfulness beats a few accuracy points
Post-hoc explanation	A second procedure approximates a trained black box	You need the accuracy of a complex model but still owe a reason
Global vs local	Whole-model behaviour vs one prediction	Global to catch systemic bias; local to justify/audit a single decision
Permutation importance	Shuffle a feature, measure error rise: $\text{Imp}(j)=e_{\text{perm},j}-e_{\text{orig}}$	Model-agnostic global ranking; compute on held-out data
PDP / ICE	Average (PDP) or per-row (ICE) predicted output as one feature is swept	See the shape of a feature’s effect, not just its magnitude
ALE	Accumulated local effects within realistic data regions	Correlation-robust replacement for PDP when features are tangled
LIME	Local sparse linear fit to perturbed neighbours of $x$	Quick, intuitive local explanation; check stability (it is noisy)
SHAP	Shapley values: average marginal contribution; $f(x)=\mathbb{E}[f]+\sum_j\phi_j$	When you need consistency and attributions that sum to the prediction
Anchors	High-precision IF-THEN rule that locks the prediction	A rule a regulator/officer can state aloud; report precision + coverage
Saliency / Grad-CAM	Gradient-based heatmap of where a vision model looked	Debug shortcut learning (e.g. snow ⇒ wolf); not proof of correctness
Attention weights	Per-token read distribution from a transformer	A routing hint only — high weight ≠ high influence
Counterfactual	Closest input $x'$ with $f(x')=\text{desired}$	Actionable “what to change” advice; doubles as a fairness probe
TCAV	Fraction of class examples whose score rises along a concept vector	Explain in a human-chosen vocabulary (stripes, irregular borders)
Probing / activation patching	Read a concept off a layer; causally restore a cached activation	Open the box: probing is correlational, patching is causal evidence
Faithfulness (AOPC)	Mean score drop as top-ranked features are deleted	Test whether the explanation matches what the model actually does
Stability	Worst-case explanation change over a small input neighbourhood	Reject methods that flip explanations for near-identical inputs
GAM / EBM	Additive per-feature shape functions: $g(\mathbb{E}[y])=\beta_0+\sum_j f_j(x_j)$	Glass-box accuracy on tabular data — often no real tradeoff

36.17 — Key takeaways

Explainability serves four needs: trust, debugging (catching shortcut learning), regulation, and fairness auditing; the stakes of a wrong decision set how much explanation you need.
Choose intrinsic (model = explanation, exactly faithful) over post-hoc (approximate) when stakes are high and accuracy permits.
Global methods (permutation importance, PDP) describe the whole model; local methods (LIME, anchors, counterfactuals) explain one prediction; SHAP bridges both.
Permutation importance stress-tests features by shuffling; PDP/ICE show the shape of a feature’s effect; both assume feature independence and can mislead under correlation (ALE is the correlation-robust fix).
SHAP attributions are guaranteed to sum to the gap between the prediction and the average prediction ($f(x)-\mathbb{E}[f]$); LIME is intuitive but unstable; anchors give a crisp high-precision IF-THEN rule.
For vision, Grad-CAM shows where the model looked; TCAV measures whether a human-named concept drove the decision; for transformers, attention is a hint, not a faithful attribution.
Counterfactuals give actionable “what to change” advice and double as fairness probes.
Mechanistic interpretability, probing, and activation patching open the box to find circuits and concepts inside deep nets and LLMs.
Always evaluate an explanation: test faithfulness (deletion/insertion) and stability (robustness to small input changes) before trusting it.
The accuracy–interpretability tradeoff is real but often small; GAMs/EBMs can be both. No explanation proves correctness — verify, don’t trust.

36.18 — See also

Regression — linear/logistic models as the canonical intrinsically interpretable family.
Ensemble Methods — the gradient-boosted trees that post-hoc methods most often explain (and TreeSHAP targets).
Attention & Transformers — the mechanics behind attention-as-explanation.
Convolutional Neural Networks — the feature maps Grad-CAM operates on.
Large Language Models — the target of probing and mechanistic interpretability.
Causal Inference — why model attribution is not real-world causation.
AI Ethics, Fairness & Safety — where interpretability feeds fairness auditing and “fairwashing” risks.
Model Evaluation & Tuning — held-out evaluation underpinning trustworthy permutation importance.

↪ The thread continues → Chapter 37 · 🧷 Causal Inference

Explaining a prediction tells you what the model used; it doesn’t tell you what would actually cause a different outcome. For that you need the harder science of causation.


Subset \(S\)	\(f(S)\)
\(\varnothing\)	0.10
\(\{D\}\)	0.30
\(\{I\}\)	0.05
\(\{D, I\}\)	0.32