Kader Mohideen

In this chapter

  • 37.1 — Why ML patterns are not causal
  • 37.2 — Confounding
  • 37.3 — Simpson’s paradox
  • 37.4 — Potential outcomes and counterfactuals
  • 37.5 — Randomized experiments (A/B tests) as the gold standard
  • 37.6 — Pearl’s causal graphs (DAGs) and the do-operator
  • 37.7 — Backdoor and frontdoor criteria
  • 37.8 — Observational methods
  • 37.9 — Sensitivity analysis: how wrong can the assumption be?
  • 37.10 — Mediation: direct vs indirect effects
  • 37.11 — Uplift and treatment-effect modeling
  • 37.12 — Causal discovery: learning the graph from data
  • 37.13 — Why causal thinking matters for robust, fair, actionable AI
  • 37.14 — Quick reference
  • 37.15 — Key takeaways
  • 37.16 — See also

Chapter 37 — 🧷 Causal Inference

Most of machine learning is a magnificent engine for finding patterns — but a pattern is a statement about what goes together, not about what makes things happen. Causal inference is the discipline of climbing from correlation to cause: figuring out what would happen if you reached in and changed something. It sits at the responsible-and-frontier edge of AI because almost every decision a model is asked to inform — raise this price, give this patient that drug, show this user that ad — is a causal question wearing a predictive costume.

🧭 In context: Responsible AI & Frontier · used to estimate the effect of acting, not just predict outcomes · the one key idea: \(P(Y\mid X)\) (seeing) is not \(P(Y\mid do(X))\) (doing).

💡 Remember this: Predicting an outcome (\(P(Y\mid X)\)) and estimating what an action does to it (\(P(Y\mid do(X))\)) are different problems — to act safely you must adjust for confounders, not just fit the data.

A useful mental picture before we start: think of a doctor versus a detective. A predictor is a detective who reads the clues already on the table and guesses what happened. A causal model is a doctor who must decide what to do next — and a treatment that merely correlates with recovery can still kill the patient if a hidden cause is pulling the strings. This chapter is about earning the right to act.

The ladder of causation (Pearl), in one breath.
① Seeing ("what is?"): \(P(Y\mid X)\). Every supervised model lives here.
② Doing ("what if I act?"): \(P(Y\mid do(X))\). Interventions, policy, pricing.
③ Imagining ("what if I had acted differently?"): counterfactuals, fairness, recourse.
Higher rungs answer questions lower rungs literally cannot. Most of ML is stuck on rung ①.

37.1 — Why ML patterns are not causal

A standard supervised model learns \(P(Y \mid X)\): the distribution of the target given the features as they naturally occur. That is exactly the right object when you only want to observe and predict — you see someone’s features and guess their outcome. It is the wrong object the moment you want to act, because acting changes the world in a way that the observed correlation may not survive.

The canonical analogy: a model trained on a city’s data finds that ice-cream sales strongly predict drownings. It is a great predictor: on days with more ice-cream sales, more people drown. But banning ice cream will not save a single swimmer, because a hidden common cause, summer heat, drives both. The arrow the model “sees” between ice cream and drowning is real as a correlation and fictitious as a mechanism.

The formal way to say this: prediction asks “given that I observe \(X=x\), what is \(Y\)?” Causation asks “given that I set \(X=x\), what is \(Y\)?” These coincide only under special conditions (no confounding). When they differ, a high-accuracy model can recommend a catastrophic intervention.

flowchart LR
  H["Summer heat<br/>(hidden cause)"] --> I["Ice-cream sales"]
  H --> D["Drownings"]
  I -. "spurious correlation<br/>the model learns" .-> D

To make the danger concrete, imagine two worlds. In the observed world, ice cream and drownings rise and fall together because heat moves both. In the intervention world, you personally fix ice-cream sales at zero (you ban them) while letting heat do whatever it likes. The drownings barely move, because you cut the link from heat to ice cream but not the link from heat to drownings. A predictor trained on the first world confidently, and wrongly, promises a payoff in the second.

[Figure: same graph, two questions. SEE, P(Y|X): heat still feeds ice cream. DO, P(Y|do X): the heat → ice cream arrow is cut.]

Warning

“The model is 95% accurate” tells you nothing about whether acting on its inputs is safe. Accuracy measures fit to the observed world; interventions create a new world the training data never saw.

37.2 — Confounding

A confounder is a variable that influences both the treatment (the thing you might change) and the outcome (the thing you care about), creating a non-causal association between them. It is the central villain of causal inference. In the ice-cream story, heat is the confounder. In medicine, disease severity is the eternal confounder: sicker patients get more aggressive treatment and have worse outcomes, so naively the aggressive treatment “looks” harmful.

Plain-language version: a confounder is a puppeteer standing behind both the lever you pull and the result you watch. When you tug the lever and the result moves, you can’t tell whether you moved it or the puppeteer did. Causal inference is the art of cutting the puppeteer’s strings — either physically (randomization) or on paper (adjustment).

[Figure: a hidden cause C points at both treatment and outcome, producing the fake link we observe.]

Worked example. Suppose a drug is given more often to severe cases. Look at recovery split by severity:

| Severity | Drug group | No-drug group |
|---|---|---|
| Mild (100 patients) | 18 / 20 = 90% | 72 / 80 = 90% |
| Severe (100 patients) | 56 / 80 = 70% | 14 / 20 = 70% |

Within each severity the drug looks neutral — 90% vs 90% for mild, 70% vs 70% for severe. But now pool the columns. The drug group is \(18+56 = 74\) recoveries out of \(20+80 = 100\), i.e. 74%, while the no-drug group is \(72+14 = 86\) out of \(80+20 = 100\), i.e. 86%. Pooled, the drug looks 12 points worse — purely because the drug group is dominated by severe patients who recover less often regardless. Severity confounds the drug–recovery link. The fix is to adjust (condition) on the confounder: compare drug vs no-drug within each severity stratum, then average over how common each stratum is. This is the seed of the backdoor adjustment (37.7).

Tip

Rule of thumb: a variable is a confounder candidate if you can draw arrows from it into both treatment and outcome. Things caused by the treatment are never confounders, and adjusting for them is a mistake (see mediators and colliders in 37.6 and the no-descendants rule in 37.7).

37.3 — Simpson’s paradox

Simpson’s paradox is confounding made dramatic: a trend that holds in every subgroup reverses when the subgroups are combined. It is not a statistical glitch — both the aggregated and disaggregated numbers are arithmetically correct. The paradox is that they tell opposite stories, and only causal reasoning picks the right one.

Worked example. Two treatments for kidney stones:

| Stone size | Treatment A | Treatment B |
|---|---|---|
| Small | 81/87 = 93% | 234/270 = 87% |
| Large | 192/263 = 73% | 55/80 = 69% |
| Combined | 273/350 = 78% | 289/350 = 83% |

A wins on small stones and on large stones, yet B wins overall. Why? Doctors gave A to the hard (large-stone) cases — 263 of A’s 350 patients had large stones — and gave B mostly to the easy small-stone cases. So B’s combined number is inflated by a flood of easy patients, while A’s is dragged down by hard ones. Stone size is the confounder; the subgroup numbers are the causally honest ones, because they hold the confounder fixed.
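The reversal can be checked directly from the counts in the table. The following is a minimal sketch; the dictionary layout and helper names are assumptions made for illustration.

```python
# (successes, patients) per treatment and stone size, copied from the table above
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes, patients):
    return successes / patients

# Success rate inside each stone-size stratum
by_stratum = {
    t: {size: rate(*cell) for size, cell in strata.items()}
    for t, strata in counts.items()
}

# Combined success rate: pool successes and patients across strata
combined = {
    t: rate(sum(s for s, _ in strata.values()), sum(n for _, n in strata.values()))
    for t, strata in counts.items()
}

for size in ("small", "large"):
    print(size, round(by_stratum["A"][size], 3), round(by_stratum["B"][size], 3))
print("combined", round(combined["A"], 3), round(combined["B"], 3))
```

Both sets of numbers come from the same counts; only the grouping differs, which is why arithmetic alone cannot settle which one to trust.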

[Figure: success rates by stone size. A is above B in both the small-stone and large-stone groups, yet B’s easy-case mix wins overall.]

Warning

You cannot resolve a Simpson reversal by looking at the numbers alone — both are correct. You need the causal graph to decide which table answers your question. Disaggregate when the subgroup variable is a confounder of the effect you want; aggregate when it is not.

37.4 — Potential outcomes and counterfactuals

The potential outcomes framework (Neyman–Rubin) gives causation a crisp definition. For each unit \(i\) and a binary treatment \(T \in \{0,1\}\), imagine two parallel outcomes:

  • \(Y_i(1)\) — the outcome if treated,
  • \(Y_i(0)\) — the outcome if not treated.

The individual causal effect is \(\tau_i = Y_i(1) - Y_i(0)\). The trouble — the fundamental problem of causal inference — is that you only ever observe one of the two. The other is the counterfactual: what would have happened under the road not taken. You can never see both for the same person, so individual effects are unobservable; we estimate averages instead.

Intuition: it’s like wanting to know whether a particular umbrella kept you dry today. You either carried it (and stayed dry) or didn’t (and got wet) — you can’t replay this exact afternoon both ways. The single thing you most want to measure is the one thing physics forbids you to observe. Causal inference’s whole trick is borrowing the missing half from other, comparable people.

[Figure: patient i with Y(1) observed and Y(0) counterfactual. The effect τ = Y(1) − Y(0) always has one half missing.]

Worked example. Picture five patients, with both potential outcomes filled in by an oracle (in reality you see only the column matching what they actually got):

| Patient | \(Y(0)\) | \(Y(1)\) | \(\tau_i = Y(1)-Y(0)\) |
|---|---|---|---|
| 1 | 0 | 1 | +1 |
| 2 | 1 | 1 | 0 |
| 3 | 0 | 0 | 0 |
| 4 | 0 | 1 | +1 |
| 5 | 1 | 1 | 0 |

The true ATE is the average of the last column: \((1+0+0+1+0)/5 = 0.4\). In real life half these cells are hidden — each patient shows you only \(Y(0)\) or \(Y(1)\) — so you reconstruct that 0.4 from group averages, which only works if the groups are comparable.

The headline quantity is the Average Treatment Effect:

\[\text{ATE} = \mathbb{E}[Y(1) - Y(0)].\]

In words: on average across the whole population, how much better (or worse) is the outcome when everyone is treated versus when no one is — the gap between two whole worlds.

Also written: \(\text{ATE} = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)]\) (linearity of expectation splits the difference of averages into a difference of two world-averages).

A naive estimate subtracts the average outcome of the untreated from the average outcome of the treated. That equals the ATE only if the two groups are exchangeable (same distribution of confounders). Decompose it:

\[\underbrace{\mathbb{E}[Y\mid T{=}1]-\mathbb{E}[Y\mid T{=}0]}_{\text{naive difference}} = \underbrace{\text{ATE}}_{\text{want}} + \underbrace{\text{selection bias}}_{\text{confounding}}.\]

In words: the difference you actually measure between the treated and untreated equals the true effect you want plus a contamination term — how different the two groups already were before any treatment.

Also written: \(\text{bias} = \big(\mathbb{E}[Y\mid T{=}1]-\mathbb{E}[Y\mid T{=}0]\big) - \text{ATE}\); the naive comparison is unbiased exactly when this term is zero, which randomization guarantees.
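To see the decomposition with numbers, the sketch below reuses the five-patient oracle table and adds one hypothetical treatment assignment. The assignment vector is an assumption chosen for illustration; it is not part of the worked example.

```python
import numpy as np

# Oracle table from the worked example: both potential outcomes for five patients
y0 = np.array([0, 1, 0, 0, 1])
y1 = np.array([1, 1, 0, 1, 1])
true_ate = np.mean(y1 - y0)          # average of the individual effects

# Hypothetical assignment (an assumption): patients 2 and 5, whose untreated
# outcome is already 1, happen to be the ones left untreated.
t = np.array([1, 0, 1, 1, 0])
y_obs = np.where(t == 1, y1, y0)     # only one potential outcome is ever seen

naive = y_obs[t == 1].mean() - y_obs[t == 0].mean()
selection_bias = naive - true_ate    # the contamination term in the decomposition

print(true_ate, round(naive, 3), round(selection_bias, 3))
```

Under this assignment the naive difference even has the wrong sign, because the untreated group started out with better outcomes.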

flowchart TB
  U["Unit i"] --> Y1["Y_i(1): if treated"]
  U --> Y0["Y_i(0): if untreated"]
  Y1 --> O["Observe only ONE"]
  Y0 --> O
  O --> C["The other = counterfactual<br/>(never observed)"]

The conditions that license this comparison once you work within levels of the covariates \(X\) are ignorability, \(\{Y(0),Y(1)\}\perp T \mid X\) (treatment is as-good-as-random once you condition on \(X\)), and positivity, \(0 < P(T{=}1\mid X) < 1\) (every unit could have gone either way). Counterfactuals are also how modern AI defines fairness (“would the loan decision change if only the applicant’s race flipped?”) and explanation (“the smallest change to the input that flips the prediction”).

Note

ATE vs ATT vs CATE: three flavors of “the effect.” ATE averages over everyone. The Average Treatment effect on the Treated (ATT), \(\mathbb{E}[Y(1)-Y(0)\mid T{=}1]\), averages only over those who actually got the treatment. It is the right target when you’re asking “did the program help the people we enrolled?” (it’s what difference-in-differences in 37.8 estimates). The CATE (37.11) drills all the way down to a single covariate profile \(x\). Same ladder, different altitude: population → the treated subgroup → one individual-like cell.

37.5 — Randomized experiments (A/B tests) as the gold standard

If confounding comes from who gets treated, the cleanest fix is to remove choice entirely: assign treatment by a coin flip. Randomization makes the treatment independent of every covariate — measured or unmeasured — so the treated and control groups are exchangeable by construction. Selection bias vanishes, and the naive difference in means becomes an unbiased estimate of the ATE. This is why the randomized controlled trial (RCT), and its product-world cousin the A/B test, is the gold standard.

Why does a coin flip work where careful adjustment struggles? Because the coin doesn’t know anything about the patient — not their severity, not their genes, not the variables you forgot to record. It scatters every lurking confounder, named or unnamed, evenly across both arms. That is the one thing no amount of clever statistics on observational data can buy you.

[Figure: a pool of mixed units is split by a fair coin (½ each) into two balanced arms, A and B.]

Worked example. Show variant B of a checkout page to a random 50% of visitors. Conversions: control 1,000/20,000 = 5.00%, variant 1,120/20,000 = 5.60%. Lift = 0.60 percentage points. Because assignment was random, no confounder can explain the gap — heavy buyers, mobile users, time-of-day all split evenly across arms in expectation. The only remaining question is whether 0.6 points is signal or noise, which a significance test answers:

import numpy as np
# tiny A/B test: is the 0.6pp lift real or noise?
nA = nB = 20000; cA, cB = 1000, 1120
pA, pB = cA/nA, cB/nB
p  = (cA+cB)/(nA+nB)                       # pooled rate under H0 (no effect)
se = np.sqrt(p*(1-p)*(1/nA+1/nB))          # std error of the difference
z  = (pB-pA)/se                            # two-proportion z-test
print(round(pB-pA, 4), round(z, 2))        # 0.006  ~2.68  -> significant

A \(z\) of about 2.7 sits beyond the usual \(\pm1.96\) cutoff, so the lift is unlikely to be chance — and because the arms were randomized, “real lift” here genuinely means “the change causes more conversions,” not merely “correlates with.”

The same logic, expressed with the statsmodels library:

from statsmodels.stats.proportion import proportions_ztest
import numpy as np
# successes and trials per arm
count = np.array([1000, 1120]); nobs = np.array([20000, 20000])
z, pval = proportions_ztest(count, nobs)   # two-proportion z-test
print(round(z, 2), round(pval, 4))         # ~ -2.68  ~0.0074  -> reject H0

Tip

Randomization buys you freedom from unmeasured confounders — its superpower. Observational methods (37.8) can only adjust for confounders you measured and named. When an experiment is feasible and ethical, run it.

Warning

RCTs fail quietly: broken randomization (users self-selecting the variant), interference (one user’s treatment affects another, common in social networks and marketplaces), and peeking at results repeatedly until significance appears. Each reintroduces the bias you paid randomization to remove.
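The cost of peeking can be made visible with a small simulation. The sketch below runs A/A tests (both arms share the same true rate, so any "significant" result is a false positive) and compares testing once at the end with stopping at the first significant look. The number of looks, sample sizes, and conversion rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sims, looks, n_per_look, rate = 2000, 10, 500, 0.05   # illustrative settings

# Cumulative conversions per arm at each look; both arms have the same true rate
conv_a = rng.binomial(n_per_look, rate, size=(sims, looks)).cumsum(axis=1)
conv_b = rng.binomial(n_per_look, rate, size=(sims, looks)).cumsum(axis=1)
n = n_per_look * np.arange(1, looks + 1)               # visitors per arm at each look

pooled = (conv_a + conv_b) / (2 * n)
se = np.sqrt(pooled * (1 - pooled) * (2 / n))
z = (conv_b - conv_a) / n / se                         # two-proportion z at every look

single_look = np.mean(np.abs(z[:, -1]) > 1.96)         # test once, at the end
peeking = np.mean((np.abs(z) > 1.96).any(axis=1))      # declare a win at any significant look
print(single_look, peeking)
```

The single-look rate should sit near the nominal 5%, while the peeking rate comes out several times higher. Sequential testing methods exist to correct for this, but they are outside the scope of this sketch.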

37.6 — Pearl’s causal graphs (DAGs) and the do-operator

Judea Pearl’s framework draws assumptions as a directed acyclic graph (DAG) — the same graph machinery behind Bayesian networks: nodes are variables, an arrow \(X \to Y\) means \(X\) is a direct cause of \(Y\), and “acyclic” forbids a variable causing itself through a loop. The graph is not learned from data — it encodes your causal assumptions so they can be inspected and argued with.

The graph defines three elementary wirings that govern how information flows:

  • Chain \(X \to M \to Y\): \(M\) is a mediator; \(X\) and \(Y\) are associated, and conditioning on \(M\) blocks the path.
  • Fork \(X \leftarrow C \to Y\): \(C\) is a confounder; it creates association, and conditioning on \(C\) blocks it.
  • Collider \(X \to K \leftarrow Y\): \(K\) is a collider; \(X\) and \(Y\) start independent, and conditioning on \(K\) opens a spurious path (this is why adjusting for the wrong variable can create bias).

flowchart LR
  subgraph Fork["Fork (confounder C)"]
    C --> X1[X]
    C --> Y1[Y]
  end
  subgraph Chain["Chain (mediator M)"]
    X2[X] --> M --> Y2[Y]
  end
  subgraph Collider["Collider K"]
    X3[X] --> K
    Y3[Y] --> K
  end

A handy way to remember which way conditioning flips the switch: a fork or chain is a pipe — information flows until you close the valve by conditioning. A collider is a door that starts shut — conditioning props it open. So “control for it” helps at forks and chains, and backfires at colliders.

[Figure: a fork or chain is a pipe (open; conditioning closes it). A collider is a door (shut; conditioning opens it).]

A quick way to feel the collider effect: let talent and luck both independently cause a startup to get funded (\(K\)). In the whole population talent and luck are unrelated. But among funded startups only — conditioning on \(K\) — a low-talent founder must have been lucky, and a lucky one needn’t be talented, so the two become negatively correlated out of thin air. That manufactured correlation is the danger of “controlling for everything.” (This same mechanism is collider / selection bias: studying only hospital patients, only funded startups, or only users who clicked silently conditions on a collider and can flip the sign of an effect.)
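The talent-and-luck story is easy to simulate. In this minimal sketch the normal distributions and the funding threshold of 1.5 are arbitrary assumptions; only the qualitative pattern matters.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
talent = rng.normal(size=n)
luck = rng.normal(size=n)                 # drawn independently of talent
funded = (talent + luck) > 1.5            # collider K; the threshold is an assumption

corr_all = np.corrcoef(talent, luck)[0, 1]
corr_funded = np.corrcoef(talent[funded], luck[funded])[0, 1]
print(round(corr_all, 3), round(corr_funded, 3))
```

The first correlation should be close to zero and the second clearly negative, even though nothing in the simulation links talent to luck.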

The do-operator \(do(X{=}x)\) is Pearl’s notation for intervention: reach into the system, set \(X\) to \(x\), and — crucially — delete every arrow pointing into \(X\) (you overrode its natural causes). The interventional distribution \(P(Y \mid do(X{=}x))\) is what an action produces, and it generally differs from the observational \(P(Y\mid X{=}x)\). The entire game of causal inference from data is: re-express the do-quantity you want in terms of do-free observational quantities you can estimate. When that is possible, the effect is identifiable.

The “delete the incoming arrows” rule has a vivid name — graph surgery. Picture the DAG, then take scissors to every arrow feeding into \(X\), because you are now the only cause of \(X\). This is the formal difference between the two worlds of 37.1: observing leaves the graph intact; doing cuts it.

flowchart LR
  subgraph Obs["Observe: P(Y | X=x)"]
    Co["C"] --> Xo["X"]
    Co --> Yo["Y"]
    Xo --> Yo
  end
  subgraph Do["Do: P(Y | do(X=x)) — arrows into X cut"]
    Cd["C"] --> Yd["Y"]
    Xd["X=x"] --> Yd
  end
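Graph surgery can be sketched with a toy linear system that has the same shape as the diagram above. The coefficients (2.0 for X → Y, 3.0 for C → Y) and the noise distributions are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def simulate(do_x=None):
    """Sample from C -> X, C -> Y, X -> Y; if do_x is given, X ignores C."""
    c = rng.normal(size=n)
    x = c + rng.normal(size=n) if do_x is None else do_x
    y = 2.0 * x + 3.0 * c + rng.normal(size=n)   # true causal effect of X on Y is 2.0
    return x, y

def slope(x, y):
    """Least-squares slope of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Seeing: regress Y on X in the undisturbed system
x_obs, y_obs = simulate()
# Doing: set X from outside, which cuts the C -> X arrow
x_do, y_do = simulate(do_x=rng.normal(size=n))

print(round(slope(x_obs, y_obs), 2), round(slope(x_do, y_do), 2))
```

In this setup the observational slope lands near 3.5 (the causal 2.0 plus confounding through C), while the interventional slope recovers roughly 2.0.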

Warning

The collider trap is the subtlest mistake in the field. “Controlling for more variables” is not always safer — conditioning on a collider (or a descendant of one) manufactures a correlation that isn’t causal. More adjustment can mean more bias.

37.7 — Backdoor and frontdoor criteria

The DAG turns “which variables must I adjust for?” into a graph-reading exercise. A backdoor path from \(T\) to \(Y\) is any path that starts with an arrow into \(T\) (i.e., \(T \leftarrow \cdots Y\)) — these are the confounding routes that fake an effect. The backdoor criterion says: a set \(Z\) identifies the causal effect if \(Z\) blocks every backdoor path and contains no descendant of \(T\). Then:

\[P(Y\mid do(T{=}t)) = \sum_{z} P(Y\mid T{=}t, Z{=}z)\,P(Z{=}z).\]

In words: to find the effect of setting \(T\), look at the outcome within each group of people who share the same confounder values, then average those group-results weighted by how common each group is — never letting the treatment itself change the group mix.

Also written: \(P(Y\mid do(T{=}t)) = \mathbb{E}_{Z}\big[\,P(Y\mid T{=}t, Z)\,\big]\) — an expectation over the confounder distribution; for continuous \(Z\) swap the sum for \(\int P(Y\mid t,z)\,p(z)\,dz\).

This adjustment formula is exactly the stratify-then-average move from 37.2: estimate the effect within each level of the confounders, then weight by how common each level is. Re-using the drug numbers, \(Z=\{\)severity\(\}\) blocks the only backdoor. The drug recovery rate, adjusted, is \(0.90 \times P(\text{mild}) + 0.70 \times P(\text{severe})\); with an even 50/50 split of severity in the population that is \(0.90(0.5)+0.70(0.5) = 0.80\), identical to the no-drug adjusted rate — recovering the truth that the drug is neutral, which the pooled 74%-vs-86% comparison hid.
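The adjustment formula is short enough to write out against the drug counts from 37.2. This is a sketch only; the data layout and function names are assumptions.

```python
# (recoveries, patients) per treatment arm and severity stratum, from the table in 37.2
counts = {
    "drug":    {"mild": (18, 20), "severe": (56, 80)},
    "no_drug": {"mild": (72, 80), "severe": (14, 20)},
}

# P(Z = z): how common each severity level is in the whole population
stratum_sizes = {
    z: sum(counts[arm][z][1] for arm in counts) for z in ("mild", "severe")
}
total = sum(stratum_sizes.values())
p_z = {z: size / total for z, size in stratum_sizes.items()}

def backdoor_adjusted(arm):
    """sum_z P(Y=1 | T=arm, Z=z) * P(Z=z)"""
    return sum((counts[arm][z][0] / counts[arm][z][1]) * p_z[z] for z in p_z)

def pooled(arm):
    """P(Y=1 | T=arm), ignoring severity"""
    return sum(r for r, _ in counts[arm].values()) / sum(n for _, n in counts[arm].values())

for arm in counts:
    print(arm, round(pooled(arm), 2), round(backdoor_adjusted(arm), 2))
```

The key design point is that `p_z` is computed from the whole population, not from each arm separately, so the treatment never gets to change the severity mix.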

flowchart LR
  Z["Z (confounder)"] --> T["T (treatment)"]
  Z --> Y["Y (outcome)"]
  T -->|"effect we want"| Y
  T -. "backdoor T←Z→Y<br/>block by adjusting Z" .- Y

Sometimes you cannot measure the confounder — there is no valid backdoor set. The frontdoor criterion rescues you if the effect flows entirely through a measured mediator \(M\) that the confounder doesn’t touch.

The plain idea first: you can’t compare the treatment to the outcome directly (a hidden cause poisons that link), so you go around it through a clean middle step. The effect travels \(T \to M \to Y\). Each leg is uncontaminated: \(T \to M\) has no hidden confounder, and \(M \to Y\) can be cleaned by adjusting for \(T\). Measure each leg honestly, then multiply them and sum over the values of \(M\): a strong first leg times a strong second leg gives the full effect, even though you never measured the puppeteer. Formally, picture \(T \to M \to Y\) with an unmeasured \(U\) pointing at both \(T\) and \(Y\):

\[P(Y\mid do(T{=}t)) = \sum_{m} P(M{=}m\mid T{=}t)\sum_{t'} P(Y\mid M{=}m,T{=}t')\,P(T{=}t').\]

In words: push the effect through the middleman — first how strongly \(T\) moves the mediator \(M\), then how strongly \(M\) moves the outcome \(Y\) (averaged over treatment), and multiply the two legs together.

Also written: as a product of two clean interventional steps, \(P(Y\mid do(t)) = \sum_m P\big(m\mid do(t)\big)\,P\big(Y\mid do(m)\big)\), each leg itself identified by a backdoor adjustment.

The classic illustration: smoking \(\to\) tar in lungs \(\to\) cancer, with a hidden genetic confounder behind smoking and cancer. Even unable to measure the gene, you can identify smoking’s effect through tar.
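The frontdoor formula can be tested on simulated data where the truth is known. In the sketch below every probability is a made-up assumption, and the hidden confounder `u` is generated but never passed to the estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
# Simulated world; all probabilities below are assumptions for illustration
u = rng.random(n) < 0.5                        # hidden confounder, unused by the estimator
t = rng.random(n) < 0.2 + 0.6 * u              # U -> T
m = rng.random(n) < 0.1 + 0.7 * t              # T -> M (U does not touch M)
y = rng.random(n) < 0.1 + 0.5 * m + 0.3 * u    # M -> Y and U -> Y

def frontdoor(t_val):
    """Estimate P(Y=1 | do(T=t_val)) from observed T, M, Y only."""
    total = 0.0
    for m_val in (0, 1):
        p_m_given_t = np.mean(m[t == t_val] == m_val)
        inner = sum(
            np.mean(y[(m == m_val) & (t == tp)]) * np.mean(t == tp)
            for tp in (0, 1)
        )
        total += p_m_given_t * inner
    return total

naive = y[t].mean() - y[~t].mean()
fd_effect = frontdoor(1) - frontdoor(0)
print(round(naive, 3), round(fd_effect, 3))    # truth by construction: 0.5 * 0.7 = 0.35
```

The naive difference overshoots because `u` raises both treatment and outcome, while the frontdoor estimate should land close to the constructed effect of 0.35.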

Modern practice rarely does this graph-reading by hand. The DoWhy library lets you state the DAG, auto-discover the identifying formula, estimate it, and then refute it:

from dowhy import CausalModel
# df has columns: T (treatment), Y (outcome), Z (confounder)
model = CausalModel(data=df, treatment="T", outcome="Y", common_causes=["Z"])
estimand = model.identify_effect()                       # finds backdoor/frontdoor formula
est = model.estimate_effect(estimand,
        method_name="backdoor.propensity_score_weighting")
print(est.value)                                         # estimated ATE
# stress-test the assumptions, not just the point estimate:
model.refute_estimate(estimand, est, method_name="placebo_treatment_refuter")

Tip

Workflow: draw the DAG, list backdoor paths, find a blocking set with no descendants of \(T\). No valid set? Look for a frontdoor mediator. Neither? You need an instrument or a natural experiment (37.8).

37.8 — Observational methods

When you cannot randomize, you exploit structure in observational data to approximate an experiment. Each method below is a different bargain — a different assumption you trade for identification.

Matching. Pair each treated unit with an untreated unit that has near-identical covariates, so within a pair the only difference is treatment. Average the within-pair outcome gaps to estimate the effect. Intuition: build a synthetic control group that looks like the treated group on everything you measured. For example, match a 45-year-old male smoker who took the drug to a 45-year-old male smoker who did not, and the difference in their outcomes is an estimate stripped of age, sex, and smoking confounding.

Propensity scores. Matching on many covariates is hard (the curse of dimensionality). The propensity score \(e(x)=P(T{=}1\mid X{=}x)\) — the modeled probability of being treated — collapses all confounders into one number. Rosenbaum and Rubin proved that adjusting for \(e(x)\) alone removes the same bias as adjusting for all of \(X\). You then match, stratify, or inverse-propensity weight (re-weight each unit by \(1/e(x)\) for treated, \(1/(1-e(x))\) for control) to build a pseudo-population where treatment is independent of confounders.

The IPW estimator written out:

\[\widehat{\text{ATE}}_{\text{IPW}} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{T_i\,Y_i}{e(x_i)} - \frac{(1-T_i)\,Y_i}{1-e(x_i)}\right].\]

In words: up-weight the rare-but-informative units (a treated person who looked unlikely to be treated, or vice versa) so that, after re-weighting, the treated and control arms have the same confounder mix — then just take a difference of weighted averages.

Also written: \(\widehat{\text{ATE}}_{\text{IPW}} = \frac{1}{n}\sum_i \frac{(T_i-e(x_i))\,Y_i}{e(x_i)\,(1-e(x_i))}\) — a single fraction that reduces to the two-term form above.

import numpy as np
from sklearn.linear_model import LogisticRegression
# X: confounders, T: treatment, Y: outcome
e = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]   # propensity e(x)
w = np.where(T==1, 1/e, 1/(1-e))                            # IPW weights
# np.average divides by the sum of weights: the self-normalized (Hajek) form of IPW
ate = np.average(Y[T==1], weights=w[T==1]) \
    - np.average(Y[T==0], weights=w[T==0])                 # weighted ATE

A unit with a low chance of treatment (\(e=0.1\)) who was treated is rare and informative, so weighting by \(1/0.1 = 10\) lets it stand in for the many similar units who went untreated — re-balancing the arms.

[Figure: in the treated arm, a rare treated unit with e = 0.1 gets weight 1/e = 10, up-weighting it to balance against the control arm.]

Note

Trim and stabilize. IPW blows up when a propensity nears 0 or 1 — one unit with \(e=0.001\) gets weight 1000 and hijacks the estimate (a positivity violation in disguise). In practice: clip propensities to, say, \([0.01,0.99]\), use stabilized weights, or prefer a doubly-robust estimator (AIPW), which stays consistent if either the propensity model or the outcome model is right — two shots at correctness instead of one.
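A minimal sketch of clipping and stabilized weights on simulated data follows. To keep the example self-contained the true propensity is treated as known, which is an assumption; in practice it would come from a fitted model such as the logistic regression above. The data-generating numbers are also assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)                          # single confounder
e_true = 1 / (1 + np.exp(-x))                   # propensity, treated as known here (an assumption)
t = (rng.random(n) < e_true).astype(int)
y = 1.0 * t + 2.0 * x + rng.normal(size=n)      # true ATE is 1.0 by construction

e = np.clip(e_true, 0.01, 0.99)                 # trim extreme propensities
p_t = t.mean()
# Stabilized weights: scale by the marginal treatment probability so weights average about 1
sw = np.where(t == 1, p_t / e, (1 - p_t) / (1 - e))

ate = np.average(y[t == 1], weights=sw[t == 1]) - np.average(y[t == 0], weights=sw[t == 0])
naive = y[t == 1].mean() - y[t == 0].mean()
print(round(naive, 2), round(ate, 2), round(sw.max(), 1))
```

Because `np.average` normalizes within each arm, the stabilizing constant cancels in this particular estimator. Stabilization matters more when the weights feed a weighted regression or a variance estimate. Clipping trades a little bias for a large reduction in variance.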

Instrumental variables (IV). When an unmeasured confounder leaves a backdoor path you cannot block, find an instrument \(Z\) that (1) affects \(T\), (2) affects \(Y\) only through \(T\), and (3) shares no confounder with \(Y\). \(Z\) injects “as-if-random” variation into \(T\). The classic estimate (Wald) is the ratio of \(Z\)’s effect on \(Y\) to \(Z\)’s effect on \(T\):

\[\hat\tau = \frac{\text{Cov}(Z,Y)}{\text{Cov}(Z,T)}.\]

In words: see how much the nudge \(Z\) moves the outcome, divide by how much the same nudge moves the treatment, and the leftover ratio is the effect of treatment per unit of treatment — because \(Z\) touches \(Y\) only by way of \(T\).

Also written: \(\hat\tau = \dfrac{\mathbb{E}[Y\mid Z{=}1]-\mathbb{E}[Y\mid Z{=}0]}{\mathbb{E}[T\mid Z{=}1]-\mathbb{E}[T\mid Z{=}0]}\) for a binary instrument — the “reduced-form effect ÷ first-stage effect” form, equivalent to two-stage least squares (2SLS).

Example: to study schooling’s effect on wages, use distance to college as an instrument — it nudges enrollment but plausibly doesn’t touch wages except through schooling. If living near a college raises years of schooling by 0.3 and raises wages by $300, the implied return is \(300/0.3 = \$1{,}000\) per year of schooling.
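The covariance-ratio estimate can be checked on simulated data with a known effect. In this sketch the coefficients and distributions are illustrative assumptions, and the confounder `u` is generated but never used by the estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
z = rng.integers(0, 2, size=n).astype(float)    # binary instrument, as good as random
u = rng.normal(size=n)                          # unmeasured confounder
t = 0.5 * z + u + rng.normal(size=n)            # instrument nudges treatment
y = 2.0 * t + 3.0 * u + rng.normal(size=n)      # true effect of T on Y is 2.0; Z enters only via T

ols = np.cov(t, y)[0, 1] / np.var(t, ddof=1)    # confounded regression slope
iv = np.cov(z, y)[0, 1] / np.cov(z, t)[0, 1]    # covariance-ratio (Wald) estimate

# Same estimate in "reduced form / first stage" form for a binary instrument
wald = (y[z == 1].mean() - y[z == 0].mean()) / (t[z == 1].mean() - t[z == 0].mean())
print(round(ols, 2), round(iv, 2), round(wald, 2))
```

The regression slope is pulled well above 2.0 by the confounder, while the two IV forms agree with each other and sit close to the constructed effect. A weak first stage (a small denominator) would make the estimate much noisier.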

Difference-in-differences (DiD). When a treatment hits one group at a known time, compare the change in the treated group to the change in an untreated control group over the same period. Subtracting the control’s trend removes anything common to both — under the parallel trends assumption (absent treatment, both groups would have moved together).

\[\widehat{\text{ATT}} = \big(\bar Y^{\text{treat}}_{\text{after}}-\bar Y^{\text{treat}}_{\text{before}}\big) - \big(\bar Y^{\text{ctrl}}_{\text{after}}-\bar Y^{\text{ctrl}}_{\text{before}}\big).\]

In words: measure how much the treated group changed, subtract how much the control group changed over the very same window, and what’s left is the part of the change only the treatment can explain.

Also written: as the interaction coefficient \(\beta_3\) in the regression \(Y = \beta_0 + \beta_1\,\text{Treat} + \beta_2\,\text{Post} + \beta_3\,(\text{Treat}\times\text{Post}) + \varepsilon\) — the “double difference” is exactly that interaction term.

Concretely, if a state raises its minimum wage and employment there moves from 20 to 19 (down 1) while a neighboring control state moves from 18 to 16 (down 2), DiD estimates the policy raised employment by \((-1) - (-2) = +1\) relative to the trend both states shared.
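The double difference and the interaction-coefficient view give the same number, which a few lines can confirm on the four cell means from the example. The variable names are assumptions.

```python
import numpy as np

# Cell means from the minimum-wage example: (treat, post) -> employment
cells = {(1, 0): 20.0, (1, 1): 19.0, (0, 0): 18.0, (0, 1): 16.0}

# Double difference computed directly
did = (cells[(1, 1)] - cells[(1, 0)]) - (cells[(0, 1)] - cells[(0, 0)])

# Same number as the interaction coefficient in
# Y = b0 + b1*Treat + b2*Post + b3*(Treat*Post)
X = np.array([[1, treat, post, treat * post] for (treat, post) in cells])
Y = np.array(list(cells.values()))
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(did, np.round(beta, 6))
```

With real data each cell would hold many units rather than one mean, and the regression form makes it easy to add covariates and standard errors.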

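[Figure: outcome over time for the control and treated groups around the treatment time. The gap between the treated line and its parallel-trend counterfactual is the DiD effect.]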

Regression discontinuity (RD). When treatment is assigned by a sharp cutoff on a running variable — a test score above 60 gets the scholarship, below 60 doesn’t — units just above and just below the threshold are essentially identical except for treatment. Comparing outcomes in a narrow window around the cutoff gives a local randomized experiment for free. A student scoring 59 and one scoring 61 are indistinguishable in talent, so any later gap in their graduation rates is attributable to the scholarship, not the score.
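One common way to estimate the jump is to fit a separate line on each side of the cutoff and compare the two fits at the threshold. The sketch below does this on simulated data; the bandwidth, the outcome model, and the jump size of 0.2 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, cutoff, bandwidth = 20_000, 60.0, 10.0       # illustrative settings
score = rng.uniform(0, 100, size=n)             # running variable
treated = score >= cutoff                       # sharp assignment rule
# Outcome rises smoothly with score, plus a jump of 0.2 at the cutoff (true local effect)
outcome = 0.01 * score + 0.2 * treated + rng.normal(scale=0.1, size=n)

def value_at_cutoff(mask):
    """Fit a line on one side of the cutoff and read off its value at the cutoff."""
    slope, intercept = np.polyfit(score[mask] - cutoff, outcome[mask], deg=1)
    return intercept

near = np.abs(score - cutoff) <= bandwidth
jump = value_at_cutoff(near & treated) - value_at_cutoff(near & ~treated)
print(round(jump, 3))
```

Fitting a line on each side, rather than comparing raw means in the window, keeps the smooth trend in the running variable from leaking into the estimated jump.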

[Figure: outcome against the running variable (e.g. test score). The jump at the cutoff is the local effect, since just-below ≈ just-above except for treatment.]

| Method | Key assumption | When to reach for it |
|---|---|---|
| Matching / propensity | All confounders measured (ignorability) | Rich covariates, no obvious unmeasured confounder |
| Instrumental variables | Valid instrument exists | Unmeasured confounding, but a natural nudge available |
| Difference-in-differences | Parallel trends | Policy rolled out to one group at a known time |
| Regression discontinuity | Continuity at the cutoff | Treatment decided by a threshold rule |

Warning

Every observational method rests on an untestable assumption (ignorability, instrument validity, parallel trends, continuity). The data cannot confirm it — only domain knowledge and the DAG can defend it. State the assumption explicitly and stress-test it; a “causal estimate” with an unstated assumption is just a correlation in a lab coat.

37.9 — Sensitivity analysis: how wrong can the assumption be?

Every observational estimate (37.8) leans on an untestable assumption — usually “no unmeasured confounding.” You cannot prove it. But you can ask the next-best question: how strong would a hidden confounder have to be to overturn my conclusion? That is sensitivity analysis, and it converts an all-or-nothing leap of faith into a measurable margin of safety.

Plain-language version: instead of swearing there’s no puppeteer, you ask “if there were a hidden puppeteer, how hard would it have to pull to flip my answer from ‘the drug helps’ to ‘the drug does nothing’?” A finding that survives only a feather-light hidden confounder is fragile; one that needs a confounder stronger than any measured variable is robust.

A widely used, model-light summary is the E-value: the minimum strength of association (on the risk-ratio scale) that an unmeasured confounder would need with both treatment and outcome to fully explain away the observed effect.

\[\text{E-value} = \text{RR} + \sqrt{\text{RR}\,(\text{RR}-1)},\]

for an observed risk ratio \(\text{RR} \ge 1\) (for \(\text{RR}<1\), apply the formula to \(1/\text{RR}\)).

In words: take the effect you measured, and report the smallest “hidden-cause strength” — how strongly some unmeasured variable must move both the treatment and the outcome — that could be hiding the truth that there’s really no effect.

Also written: equivalently \(\text{E-value} = \text{RR} + \sqrt{\text{RR}^2 - \text{RR}}\) — the same quantity, just factoring \(\text{RR}(\text{RR}-1)\) under the root.

Worked example. You observe a risk ratio of \(\text{RR}=2.0\). Then \(\text{E-value} = 2 + \sqrt{2(2-1)} = 2 + \sqrt{2} \approx 3.41\). Reading: a hidden confounder would have to be associated with both treatment and outcome by a risk ratio of at least 3.41 (about 1.7 times the observed risk ratio itself) to fully explain away the observed effect, that is, to make the data consistent with a true risk ratio of 1. If none of your measured covariates come anywhere near a 3.41 association, an unmeasured one that strong is implausible, and the finding is robust. If your strongest measured confounder already sits at 3, an unmeasured peer could plausibly erase it.

import numpy as np
def e_value(rr):
    rr = rr if rr >= 1 else 1/rr        # work on the >=1 side
    return rr + np.sqrt(rr*(rr-1))
print(round(e_value(2.0), 2))           # 3.41
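
To see how the E-value behaves away from the single worked case, the same function can be evaluated over a few risk ratios. This is a small sketch that restates `e_value` so it runs on its own; the grid of risk ratios is arbitrary and chosen only for illustration.

```python
import numpy as np

def e_value(rr):
    # Protective effects (RR < 1) are flipped onto the >= 1 side first.
    rr = rr if rr >= 1 else 1 / rr
    return rr + np.sqrt(rr * (rr - 1))

for rr in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(rr, round(float(e_value(rr)), 2))
# 0.5 3.41
# 1.0 1.0
# 1.5 2.37
# 2.0 3.41
# 3.0 5.45
```

Two things stand out: a risk ratio of 0.5 gets the same E-value as 2.0, because the formula treats halving and doubling symmetrically, and a null result (RR = 1) has an E-value of 1, meaning no confounding at all is needed to explain it.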

The classic companion technique is Rosenbaum bounds for matched studies, which report the hidden-bias magnitude \(\Gamma\) (how unequal two matched units’ treatment odds could secretly be) at which the result stops being significant. Same spirit, different scale.

Tip

Report a sensitivity number alongside every observational point estimate. “RR = 2.0, E-value 3.41” is a far more honest claim than “RR = 2.0,” because it tells the reader exactly how much unmeasured confounding the conclusion can absorb before it breaks.

37.10 — Mediation: direct vs indirect effects

Backdoor adjustment answers “does \(T\) affect \(Y\), and by how much?” Mediation analysis answers the next question managers and scientists actually care about: “how does it work — through which channel?” It splits the total effect into the part that flows through a mediator \(M\) and the part that goes around it.

Intuition: a wellness program (\(T\)) lowers sick days (\(Y\)). Some of that works because the program improves sleep (\(M\)) — the indirect path \(T\to M\to Y\). The rest works through everything else — better diet, stress, morale — the direct path \(T\to Y\). Knowing the split tells you whether to double down on the sleep component or look elsewhere.

The total effect decomposes (in the simple linear, no-interaction case) as a sum:

\[\underbrace{\text{TE}}_{\text{total}} = \underbrace{\text{NDE}}_{\text{direct, around }M} + \underbrace{\text{NIE}}_{\text{indirect, through }M},\]

where NDE is the natural direct effect and NIE the natural indirect effect.

In words: the whole effect of the treatment equals the bit that reaches the outcome bypassing the mediator, plus the bit that reaches it by moving the mediator first.

Also written: in a linear model \(M = \alpha\,T + \dots\) and \(Y = \beta\,T + \gamma\,M + \dots\), the indirect effect is the product \(\alpha\gamma\) and the direct effect is \(\beta\), so \(\text{TE} = \beta + \alpha\gamma\) — the celebrated “product-of-coefficients” (Baron–Kenny) form.

flowchart LR
  T["T (treatment)"] -->|"direct β (NDE)"| Y["Y (outcome)"]
  T -->|"α"| M["M (mediator)"]
  M -->|"γ"| Y
  T -. "indirect = α·γ (NIE)" .- Y

Worked example. Fit \(M = 0.5\,T + \varepsilon\) and \(Y = 0.2\,T + 0.6\,M + \varepsilon\). Indirect effect \(=\alpha\gamma = 0.5\times0.6 = 0.30\); direct effect \(=\beta = 0.20\); total \(=0.50\). So 60% of the treatment’s benefit is mediated through \(M\) — a quantitative answer to “how much of the program’s win comes through sleep?”
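The product-of-coefficients recipe can be sketched on simulated data. Everything below is an assumption made for illustration: the data are generated from the worked example's coefficients with a randomized binary treatment, independent Gaussian noise, and no mediator-outcome confounding, which is exactly the clean setting the linear decomposition requires. The `ols` helper is a hypothetical convenience function, not a library API.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50000
alpha, beta, gamma = 0.5, 0.2, 0.6   # coefficients from the worked example

T = rng.integers(0, 2, n).astype(float)          # randomized treatment
M = alpha * T + rng.normal(0, 1, n)              # mediator model
Y = beta * T + gamma * M + rng.normal(0, 1, n)   # outcome model

def ols(columns, target):
    """Least-squares slopes of target on the given columns (intercept dropped)."""
    design = np.column_stack([np.ones(len(target))] + columns)
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef[1:]

(alpha_hat,) = ols([T], M)            # T -> M
beta_hat, gamma_hat = ols([T, M], Y)  # direct T -> Y, and M -> Y
(total_hat,) = ols([T], Y)            # total effect from Y on T alone

nde = beta_hat
nie = alpha_hat * gamma_hat
print(round(float(nde), 2), round(float(nie), 2), round(float(total_hat), 2))
print("share mediated:", round(float(nie / (nde + nie)), 2))
```

In ordinary least squares the total-effect slope equals the direct slope plus the product of the two path slopes exactly, so the decomposition adds up in-sample; whether the pieces deserve a causal reading still depends on the assumptions listed above.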

Warning

Mediation is treacherous: the mediator \(M\) is itself an outcome of \(T\), so any confounder of the \(M\to Y\) link (even one unrelated to \(T\)) biases the split — and you must not simply “adjust for \(M\)” the way you adjust for a confounder. Modern causal mediation (Pearl, Imai) defines NDE/NIE through nested counterfactuals and needs stronger assumptions than a total-effect estimate. Treat a clean 60/40 split as a hypothesis, not a verdict.

37.11 — Uplift and treatment-effect modeling

The ATE is one number for the whole population, but interventions are usually targeted: you want to act on the units the action actually helps. Uplift modeling (a.k.a. treatment-effect or heterogeneous-effect modeling) estimates the Conditional Average Treatment Effect:

\[\text{CATE}(x) = \mathbb{E}[Y(1)-Y(0)\mid X{=}x],\]

the effect for a unit with features \(x\).

In words: zoom the average effect down to one kind of person — among everyone who looks like \(x\), how much does the treatment change their outcome.

Also written: \(\tau(x) = \mu_1(x) - \mu_0(x)\) where \(\mu_t(x)=\mathbb{E}[Y\mid X{=}x, T{=}t]\) under ignorability — the gap between two outcome-response surfaces evaluated at \(x\).

In marketing this sorts customers into four types:

  • Persuadables — buy only if contacted (positive uplift; target these).
  • Sure things — buy either way (zero uplift; wasted spend).
  • Lost causes — never buy (zero uplift; wasted spend).
  • Sleeping dogs — buy unless contacted; the message annoys them (negative uplift; actively avoid).

A predictive (propensity-to-buy) model would happily target sure things, who convert anyway, and sleeping dogs, who you then drive away. Only an uplift model finds the persuadables. The simplest estimator is the T-learner: fit one model \(\hat\mu_1(x)\) on the treated, another \(\hat\mu_0(x)\) on the control, and predict \(\widehat{\text{CATE}}(x)=\hat\mu_1(x)-\hat\mu_0(x)\).

[Figure: uplift = P(buy | emailed) − P(buy | left alone). Customer types along the uplift axis: persuadable (+, target), sure thing (0, skip), lost cause (0, skip), sleeping dog (−, avoid).]

Worked example. Suppose for a given customer the treated-outcome model predicts a 0.30 purchase probability if emailed and the control model predicts 0.45 if left alone. The uplift is \(0.30 - 0.45 = -0.15\), which is negative. This customer is a sleeping dog: emailing them lowers their purchase chance by 15 percentage points. A plain propensity model sees only a purchase probability (here the 0.30), never the comparison with what happens if the customer is left alone, so it might still target them; the uplift model correctly says leave them alone.
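The four customer types can be read off the two predicted probabilities directly. The sketch below is illustrative only: the function names, the 0.05 margin for calling an uplift "zero", and the 0.5 level separating sure things from lost causes are assumptions, not values from the text.

```python
def uplift(p_if_contacted, p_if_left_alone):
    """Predicted change in purchase probability caused by contacting."""
    return p_if_contacted - p_if_left_alone

def customer_type(p_if_contacted, p_if_left_alone, margin=0.05, buy_level=0.5):
    """Label a customer from the two outcome-model predictions.

    margin and buy_level are hypothetical thresholds chosen for illustration.
    """
    u = uplift(p_if_contacted, p_if_left_alone)
    if u > margin:
        return "persuadable"
    if u < -margin:
        return "sleeping dog"
    # Uplift is roughly zero: the label depends on the baseline purchase chance.
    return "sure thing" if p_if_left_alone >= buy_level else "lost cause"

print(round(uplift(0.30, 0.45), 2))   # -0.15
print(customer_type(0.30, 0.45))      # sleeping dog
```

In a real campaign the two probabilities would come from fitted outcome models, and the thresholds would be set from the cost of a contact and the value of a purchase.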

# T-learner uplift: two models, subtract their predictions
# Assumes NumPy arrays already exist: X (n x d features), T (0/1 treatment), Y (outcome)
from sklearn.ensemble import GradientBoostingRegressor as GBR
m1 = GBR().fit(X[T==1], Y[T==1])         # outcome model fitted on treated units
m0 = GBR().fit(X[T==0], Y[T==0])         # outcome model fitted on control units
uplift = m1.predict(X) - m0.predict(X)   # per-unit CATE; target uplift > 0

The T-learner is the warm-up, not the state of the art. The S-learner (one model with \(T\) as a feature), X-learner (better under treatment-imbalance), and doubly-robust / DR-learner are stronger meta-learners; and the Causal Forest estimates \(\tau(x)\) with honest, tree-based splitting plus confidence intervals. The EconML and causalml libraries package these:

from econml.dml import CausalForestDML
from sklearn.ensemble import GradientBoostingRegressor as GBR
from sklearn.ensemble import GradientBoostingClassifier as GBC
# Y outcome, T binary treatment, X effect-modifiers, W confounders to adjust for
# With discrete_treatment=True the treatment model must be a classifier
est = CausalForestDML(model_y=GBR(), model_t=GBC(), discrete_treatment=True)
est.fit(Y, T, X=X, W=W)
tau_hat = est.effect(X)                  # per-unit CATE estimates
lb, ub  = est.effect_interval(X)         # with confidence intervals

quadrantChart
  title Who to target (uplift = effect of contacting)
  x-axis "Does not buy if NOT contacted" --> "Buys if NOT contacted"
  y-axis "Does not buy if contacted" --> "Buys if contacted"
  quadrant-1 Sure things (skip)
  quadrant-2 Persuadables (target)
  quadrant-3 Lost causes (skip)
  quadrant-4 Sleeping dogs (avoid)

Tip

Evaluate uplift models with a Qini curve (cumulative incremental gain as you treat from highest predicted uplift downward), not accuracy. Accuracy rewards predicting who buys; Qini rewards predicting whom your action changes — the only thing that pays for the campaign.
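A Qini curve can be computed with cumulative sums. The sketch below uses one common definition (treated conversions minus control conversions rescaled to the treated count, among the top-k units by predicted uplift) and synthetic data in which the true uplift is 0.2·x; the function name, data-generating values, and the use of the curve's mean as a rough area summary are all assumptions for illustration.

```python
import numpy as np

def qini_curve(score, y, t):
    """Cumulative incremental conversions when treating from highest score down."""
    order = np.argsort(-score)
    y, t = y[order].astype(float), t[order].astype(float)
    n_t, n_c = np.cumsum(t), np.cumsum(1 - t)             # treated / control counts
    y_t, y_c = np.cumsum(y * t), np.cumsum(y * (1 - t))   # conversions in each arm
    # Rescale control conversions to the treated-group size; 0 where no controls yet.
    ratio = np.divide(n_t, n_c, out=np.zeros_like(n_t), where=n_c > 0)
    return y_t - y_c * ratio

rng = np.random.default_rng(0)
n = 20000
x = rng.uniform(-1, 1, n)                 # single feature driving the uplift
t = rng.integers(0, 2, n)                 # randomized contact
p_buy = 0.3 + t * 0.2 * x                 # true uplift is 0.2 * x
y = (rng.uniform(size=n) < p_buy).astype(int)

good = qini_curve(x, y, t).mean()         # ranking by the true uplift driver
bad = qini_curve(-x, y, t).mean()         # deliberately reversed ranking
print(good > bad)
```

A ranking that puts persuadables first accumulates incremental conversions early, so its curve sits above a reversed or random ranking even though both curves end at the same overall incremental total.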

37.12 — Causal discovery: learning the graph from data

So far the DAG has fallen from the sky — drawn by hand from domain knowledge. But what if you don’t know the arrows? Causal discovery (a.k.a. structure learning) tries to recover the graph, or at least the parts of it the data can pin down, from observational (and sometimes interventional) data.

The honest headline first: from observational data alone you can generally recover the graph only up to its Markov equivalence class — the set of DAGs that imply the exact same conditional independencies. Plain version: the data can often tell you that two variables are connected, but not always which way the arrow points, because \(X\to Y\) and \(X\leftarrow Y\) can fit the same correlations. Extra leverage — interventions, time order, or assumptions about the noise — is what breaks the tie.
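The tie between \(X\to Y\) and \(X\leftarrow Y\) can be made concrete in the linear-Gaussian case, where both directions reach exactly the same maximized likelihood. The sketch below is an illustration under that specific assumption; the helper names and simulated coefficients are hypothetical.

```python
import numpy as np

def gaussian_loglik(values, mean, var):
    """Log-likelihood of values under a normal distribution."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (values - mean) ** 2 / var)

def loglik_direction(cause, effect):
    """Maximized log-likelihood of the linear-Gaussian model cause -> effect."""
    slope, intercept = np.polyfit(cause, effect, 1)
    resid = effect - (intercept + slope * cause)
    ll_cause = gaussian_loglik(cause, cause.mean(), cause.var())
    ll_effect = gaussian_loglik(resid, 0.0, resid.var())
    return ll_cause + ll_effect

rng = np.random.default_rng(0)
x = rng.normal(size=5000)
y = 0.8 * x + rng.normal(scale=0.6, size=5000)   # data truly generated as X -> Y

ll_xy = loglik_direction(x, y)   # fit assuming X -> Y
ll_yx = loglik_direction(y, x)   # fit assuming Y -> X
print(np.isclose(ll_xy, ll_yx))
```

Because the two fits are indistinguishable, a score-based search has no reason to prefer either arrow here; it is departures from this setting, such as non-Gaussian noise, a known time order, or an intervention, that give the extra leverage.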

Two broad families:

  • Constraint-based (e.g. PC, FCI): run a battery of conditional-independence tests and keep only the edges the data refuses to make independent, then orient what you can using collider patterns. FCI even tolerates hidden confounders.
  • Score-based (e.g. GES, and modern continuous-optimization methods like NOTEARS): search over graphs for the one that best scores the data (fit minus a complexity penalty), turning structure search into something closer to ordinary optimization.
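
The conditional-independence tests behind the constraint-based family, and the collider pattern used to orient edges, can be sketched with a partial correlation. This is a simplified illustration on synthetic linear-Gaussian data, not the PC algorithm itself; `partial_corr` is a hypothetical helper and the simulated graph \(X\to Z\leftarrow Y\) is an assumption.

```python
import numpy as np

def partial_corr(a, b, given):
    """Correlation of a and b after regressing `given` out of both."""
    design = np.column_stack([np.ones(len(given)), given])
    res_a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
    res_b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return np.corrcoef(res_a, res_b)[0, 1]

rng = np.random.default_rng(0)
n = 20000
x = rng.normal(size=n)
y = rng.normal(size=n)                      # independent of x by construction
z = x + y + 0.5 * rng.normal(size=n)        # collider: X -> Z <- Y

r_marginal = np.corrcoef(x, y)[0, 1]        # near zero: X and Y are independent
r_given_z = partial_corr(x, y, z)           # strongly negative: dependent given Z
print(round(float(r_marginal), 2), round(float(r_given_z), 2))
```

Independent marginally but dependent once you condition on \(Z\) is the signature of a collider, which is why this pattern lets an algorithm orient both arrows into \(Z\) without any interventional data.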

# constraint-based discovery with causal-learn
# Assumes df is a pandas DataFrame of numeric columns, one column per variable
from causallearn.search.ConstraintBased.PC import pc
import numpy as np
data = np.asarray(df)                    # rows = samples, cols = variables
cg = pc(data)                            # returns a CPDAG (equivalence class)
cg.draw_pydot_graph(labels=list(df.columns))

Warning

Discovered graphs are hypotheses, not facts. They rest on assumptions (faithfulness, often no hidden confounders, correct independence tests) that are as untestable as the ones in 37.8. Treat the output as a candidate DAG to argue with and validate against domain knowledge — never as ground truth handed down by the algorithm.

37.13 — Why causal thinking matters for robust, fair, actionable AI

Causal reasoning is not an academic luxury bolted onto ML — it is what makes a model survive contact with the real world.

Robustness / generalization. Correlations are tied to the environment that produced them; causal mechanisms are stable across environments. A model leaning on a spurious correlation (cows appear on grass, so “grass” predicts “cow”) shatters under distribution shift — show it a cow on a beach and it fails. Learning causal features is the path to out-of-distribution robustness — a deep link to invariance and domain generalization.

Fairness. Many fairness questions are inherently counterfactual: “would this decision have changed had the applicant been a different gender, all else equal?” Counterfactual fairness (AI ethics & fairness) answers it with the causal graph, separating a protected attribute’s legitimate pathways from its discriminatory ones — something no correlation-based metric can do.

Actionability. A loan model that says “denied because your balance is low” is only useful if changing the stated cause actually changes the outcome. Recourse (telling someone what to change to flip the decision, a staple of explainable AI) is a do-query, not a prediction. Likewise, every policy, pricing, and treatment decision is a \(do(\cdot)\) question; answering it with \(P(Y\mid X)\) is the original sin this chapter exists to prevent.
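
The gap between a prediction and a do-query can be simulated. In the sketch below a hidden variable Z drives both the feature X and the outcome Y, while X itself has no effect on Y; every number and name is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100000
z = rng.integers(0, 2, n)                      # hidden confounder

def draw_outcome(x, z):
    # Y depends on Z only; the coefficient on x is zero by construction.
    p = 0.2 + 0.5 * z + 0.0 * x
    return (rng.uniform(size=len(z)) < p).astype(int)

# Seeing: X is chosen by the world and depends on Z.
x_obs = (rng.uniform(size=n) < np.where(z == 1, 0.8, 0.2)).astype(int)
y_obs = draw_outcome(x_obs, z)
seeing = y_obs[x_obs == 1].mean() - y_obs[x_obs == 0].mean()

# Doing: X is set by a coin flip, cutting the arrow from Z into X.
x_do = rng.integers(0, 2, n)
y_do = draw_outcome(x_do, z)
doing = y_do[x_do == 1].mean() - y_do[x_do == 0].mean()

print(round(float(seeing), 2), round(float(doing), 2))
```

The observed gap is large even though changing X does nothing, which is the situation in which recourse advice built on \(P(Y\mid X)\) would send someone to change a feature that cannot move their outcome.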

flowchart LR
  C["Causal thinking"] --> R["Robust:<br/>stable under shift"]
  C --> F["Fair:<br/>counterfactual fairness"]
  C --> A["Actionable:<br/>recourse & policy = do()"]

Tip

The one-line test before deploying any model to drive a decision: “Am I asking what is, or what would happen if I act?” If it’s the second, you need a causal estimand and the assumptions to identify it — predictive accuracy alone will not save you.

37.14 — Quick reference

| Term / formula | Meaning in one line | When / why it matters |
|---|---|---|
| \(P(Y\mid X)\) vs \(P(Y\mid do(X))\) | Seeing vs doing: observe a value vs set it | The whole chapter: predictions are not interventions |
| Confounder | Variable causing both treatment and outcome | Fakes a non-causal association; must be adjusted away |
| Simpson’s paradox | Subgroup trend reverses on aggregation | Warns you to disaggregate by the confounder |
| Potential outcomes \(Y(1),Y(0)\) | The two parallel outcomes per unit | Defines the effect \(\tau_i = Y(1)-Y(0)\); only one is seen |
| \(\text{ATE}=\mathbb{E}[Y(1)-Y(0)]\) | Average effect over the whole population | The headline target of most causal questions |
| ATT / CATE | Effect on the treated / for profile \(x\) | Targeting subgroups or individuals, not the average |
| Ignorability + positivity | Treatment as-good-as-random given \(X\); every unit could go either way | The two assumptions that license adjustment |
| Randomization (RCT / A/B) | Coin-flip assignment | Gold standard: neutralizes unmeasured confounders |
| DAG + \(do\)-operator | Graph of assumptions; intervention cuts arrows into \(X\) | Turns identification into graph-reading (graph surgery) |
| Fork / chain / collider | Confounder / mediator / collision wiring | Tells you what to adjust, and what not to (colliders) |
| Backdoor adjustment | \(\sum_z P(Y\mid t,z)P(z)\) | Identify the effect when confounders are measured |
| Frontdoor criterion | Route effect through a clean mediator \(M\) | Identify even with an unmeasured confounder |
| Propensity score \(e(x)\) | \(P(T{=}1\mid X)\): confounders in one number | Match / stratify / IPW without high-dim matching |
| IPW / doubly-robust (AIPW) | Re-weight by \(1/e(x)\); two-model safety net | Observational ATE; AIPW needs only one model right |
| Instrumental variable | Nudge \(Z\) affecting \(Y\) only through \(T\) | Unmeasured confounding but a natural experiment exists |
| Difference-in-differences | Treated change minus control change | Policy hits one group at a known time (parallel trends) |
| Regression discontinuity | Compare units just above/below a cutoff | Treatment assigned by a sharp threshold rule |
| E-value \(=\text{RR}+\sqrt{\text{RR}(\text{RR}-1)}\) | Confounder strength needed to erase the effect | Sensitivity: how robust an observational finding is |
| Mediation NDE + NIE | Split total effect into direct + through-\(M\) | Answers how a treatment works, not just whether |
| Uplift / CATE (T/S/X-learner) | Per-unit effect estimate | Target persuadables, avoid sleeping dogs; evaluate via Qini |

37.15 — Key takeaways

  • \(P(Y\mid X)\) (observing) and \(P(Y\mid do(X))\) (acting) are different objects; ML learns the first, decisions need the second. Pearl’s ladder — seeing, doing, imagining — names the climb.
  • A confounder drives both treatment and outcome and fakes an effect; Simpson’s paradox is confounding so strong the trend reverses on aggregation.
  • Potential outcomes define the effect as \(Y(1)-Y(0)\); you never see both, so individual effects are unobservable and we estimate averages (ATE, ATT) under ignorability and positivity.
  • Randomized experiments / A/B tests are the gold standard because randomization neutralizes even unmeasured confounders.
  • DAGs + the do-operator turn identification into graph-reading (graph surgery cuts arrows into \(X\)); the backdoor criterion adjusts for confounders, the frontdoor routes the effect through a mediator, and colliders warn that adjusting for the wrong variable creates bias.
  • Observational methods (matching, propensity scores / IPW, IV, difference-in-differences, regression discontinuity) each approximate an experiment by trading data for one untestable assumption — state it; prefer doubly-robust estimators when you can.
  • Sensitivity analysis (E-value, Rosenbaum bounds) quantifies how strong an unmeasured confounder would have to be to overturn a finding — report it alongside every observational estimate.
  • Mediation analysis splits a total effect into direct and indirect (through-the-mediator) paths, answering how a treatment works — but needs stronger assumptions than estimating the total effect.
  • Uplift / CATE models estimate per-unit effects to target persuadables and avoid sleeping dogs; meta-learners and causal forests scale this, and you evaluate with Qini, not accuracy.
  • Causal discovery can propose a graph from data, but only up to a Markov equivalence class — treat the output as a hypothesis.
  • Causal thinking is what makes AI robust (stable under shift), fair (counterfactual fairness), and actionable (recourse and policy are \(do\)-queries).

37.16 — See also

  • Probability & Statistics — distributions, expectation, hypothesis testing, the foundation for ATE estimation and A/B significance.
  • Probabilistic Graphical Models — Bayesian networks and d-separation, the machinery behind DAGs and the backdoor criterion.
  • Model Evaluation & Tuning — experimental design, cross-validation, and the testing logic that A/B tests build on.
  • Explainable AI & Interpretability — counterfactual explanations and recourse, the applied cousins of causal queries.
  • AI Ethics, Fairness & Safety — counterfactual fairness and the harms of acting on spurious correlation.
  • Frontier & Emerging Directions — causal representation learning and invariance for out-of-distribution robustness.

↪ The thread continues → Chapter 38 · ⚖️ AI Ethics, Fairness & Safety

Understanding and causation feed straight into responsibility — fairness, bias, privacy, and safety: the ethics of systems that act on real people.



© Kader Mohideen