flowchart LR
A[Raw market & account data] --> B[Feature engineering<br/>lags, ratios, rolling stats]
B --> C[Walk-forward training<br/>past only]
C --> D[Model: scoring / signal / risk]
D --> E[Decision: trade / approve / flag]
E --> F[Regulator & audit log<br/>explainability required]
D -.->|monitor drift| C
E -.->|labels arrive late| B
Chapter 28 — 🏦 ML Across Industries
Most of this encyclopedia teaches algorithms in the abstract. This chapter is about what happens when those algorithms meet payrolls, patients, and regulators — the domains where ML stops being a benchmark score and starts deciding who gets a loan, which tumor gets flagged, and how a trade is priced. The striking thing about applied ML is that the same handful of failure modes — bad data, shifting distributions, evaluation that misses the real goal — recur across wildly different industries. Learn the patterns once and you can read any new domain quickly.
🧭 In context: Applied ML Systems & Industries · turning models into deployed decisions in finance, healthcare, and beyond · the one key idea — the hard part is rarely the model, it is the data, the drift, and matching the metric to the business.
💡 Remember this: In real-world ML the model is the easy part — projects live or die on data quality, distribution shift, and whether your evaluation metric actually matches the business goal.
28.1 — Finance & Trading
Finance was an early adopter of ML because it is drowning in structured, timestamped, labeled data and every basis point of accuracy maps directly to money. But it is also one of the hardest domains, for one deep reason: the data fights back. In most fields the world does not change because you modeled it; in markets, the moment a profitable pattern is discovered and traded, it gets arbitraged away. The four canonical finance applications each wrestle with this in their own way.
Fraud detection asks: is this transaction legitimate? It is a classic imbalanced binary classification problem — perhaps 1 fraudulent transaction in 10,000. The cost structure is asymmetric: missing a $5,000 fraud (a false negative — a real bad event the model let through) costs far more than wrongly declining a legitimate $40 coffee (a false positive — a benign event the model flagged), but too many false positives drive customers away. Because fraud is a cross-industry workhorse, the mechanics live in Chapter 27; here the finance-specific twist is that fraudsters adapt — your training labels describe last month’s scams, not next month’s.
Credit scoring predicts the probability a borrower defaults. Logistic regression — a model that outputs a probability as a weighted sum of features passed through an S-curve — still dominates here, not because it is the most accurate, but because regulators require that you can explain every decision (“your application was declined because of X”). A gradient-boosted tree might score higher on AUC (area under the ROC curve, a single number summarizing how well a model ranks positives above negatives) yet be unusable if it cannot produce a legally defensible adverse-action reason.
Algorithmic trading uses models to decide what and when to buy or sell. This is where non-stationarity bites hardest: a signal that backtested beautifully (performed well when simulated on historical data) over 2015–2019 can be pure noise by 2021.
Risk modeling estimates potential losses to size positions and hold capital. A standard measure is Value-at-Risk (VaR): the loss your portfolio will not exceed on, say, 99% of days. If a one-day 99% VaR is $1M, you expect to lose more than that on only about 1 trading day in 100.
Formally, VaR is a quantile of the loss distribution:
\[ \text{VaR}_{\alpha} = \inf\{\ell : P(L > \ell) \le 1-\alpha\} \]
In words: the smallest loss amount such that the chance of losing more than it is at most \(1-\alpha\) (e.g. 1%). It is the cutoff where the bad tail begins. Also written: \(\text{VaR}_{\alpha} = F_L^{-1}(\alpha)\) — the \(\alpha\)-quantile (inverse CDF) of the loss \(L\); e.g. the 99th percentile loss for \(\alpha=0.99\).
VaR has a famous blind spot: it says where the tail starts but nothing about how bad it gets beyond. Expected Shortfall (ES), also called Conditional VaR, fixes that by averaging the losses past the VaR cutoff:
\[ \text{ES}_{\alpha} = \mathbb{E}\!\left[L \mid L \ge \text{VaR}_{\alpha}\right] \]
In words: given that we are already in the worst \(1-\alpha\) of days, how much do we lose on average? It is the average size of the disaster, not just its doorstep. Also written: \(\text{ES}_{\alpha} = \frac{1}{1-\alpha}\int_{\alpha}^{1} \text{VaR}_{u}\,du\) — the average of all VaRs deeper in the tail.
Post-2008 regulation (Basel) shifted capital rules toward ES precisely because two portfolios can share a VaR while one hides far uglier tail losses.
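To make the two definitions above concrete, here is a minimal sketch that estimates VaR and ES empirically from a sample of losses. The helper name `var_es`, the simulated loss samples, and the rescaling trick are all illustrative assumptions, not a production risk engine; real desks use far more careful estimators.

```python
import numpy as np

def var_es(losses, alpha=0.99):
    """Empirical VaR and ES from a sample of losses (positive = money lost)."""
    losses = np.asarray(losses, dtype=float)
    var = np.quantile(losses, alpha)   # alpha-quantile of the loss distribution
    tail = losses[losses >= var]       # outcomes at or beyond the VaR cutoff
    es = tail.mean()                   # average loss once we are in the tail
    return var, es

rng = np.random.default_rng(0)
thin = rng.normal(0.0, 1.0, 100_000)        # thin-tailed hypothetical portfolio
fat = rng.standard_t(df=3, size=100_000)    # fat-tailed hypothetical portfolio

var_thin, es_thin = var_es(thin)
# Rescale the fat-tailed losses so both portfolios report the same 99% VaR.
scale = var_thin / np.quantile(fat, 0.99)
var_fat, es_fat = var_es(fat * scale)

print(f"thin tails: VaR={var_thin:.2f}  ES={es_thin:.2f}")
print(f"fat tails:  VaR={var_fat:.2f}  ES={es_fat:.2f}")
```

Both portfolios report the same VaR by construction, yet the fat-tailed one should show a visibly larger ES, which is the gap the text describes.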
Here is the core difficulty made concrete. Non-stationarity means the statistical relationship between features and target changes over time. Suppose a credit model learns that a debt-to-income ratio above 0.4 signals high default risk, calibrated in a low-interest-rate era. Rates spike, and now 0.4 is normal and 0.55 is the new danger line. The model is silently wrong — its inputs still look familiar, but the rule it learned no longer holds.
The visual below shows the trap: the model’s learned threshold (dashed line) stays put while the real danger line drifts away from it.
import numpy as np
# Why a naive random train/test split LIES in finance: it leaks the future.
np.random.seed(0)
n = 1000
t = np.arange(n)
# a "signal" whose sign flips halfway through (a regime change)
signal = np.where(t < 500, 1.0, -1.0)
y = signal + np.random.randn(n)*0.5 # noisy returns
# WRONG: shuffle then split -> test set contains BOTH regimes, model looks great
idx = np.random.permutation(n)
# RIGHT: walk-forward -> train on the past, test on the genuinely unseen future
train, test = t[:500], t[500:]
# A model fit on regime 1 (signal=+1) keeps predicting +1 for the future...
pred_sign = +1.0
realized = np.sign(y[test]).mean() # but the future regime is actually negative
print(round(realized, 3)) # roughly -0.95: the past was no guide

The lesson: in finance you must evaluate with a time-ordered, walk-forward split. Train only on data from before the test period, then roll forward. A shuffled split tells you a comforting lie because it lets the model peek at the future.
Think of walk-forward validation like grading a weather forecaster: you only ever judge tomorrow’s forecast against tomorrow’s weather, and you never let them peek at the actual outcome first. scikit-learn ships exactly this as TimeSeriesSplit, so you rarely need to hand-roll the index bookkeeping:
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LogisticRegression
import numpy as np
X = np.random.randn(1000, 5)
y = (X[:, 0] + np.random.randn(1000) > 0).astype(int)
tscv = TimeSeriesSplit(n_splits=5) # 5 expanding train windows, each tested on the next block
for fold, (tr, te) in enumerate(tscv.split(X)):
    # tr is always entirely BEFORE te in time, so no future leaks backward
    model = LogisticRegression().fit(X[tr], y[tr])
    acc = model.score(X[te], y[te])
    print(f"fold {fold}: train={len(tr):4d} test={len(te):3d} acc={acc:.3f}")

Two more finance-specific hazards round out the picture. The first is regulation, which is not an afterthought: frameworks like the Equal Credit Opportunity Act, Basel capital rules, and model-risk-management guidance (SR 11-7) require documented validation, bias testing, and human accountability, which is why uninterpretable black boxes face an uphill battle in lending. The second is label latency: you often do not learn whether a loan defaults for months or years, so your “labels” for recent data are incomplete, making fresh evaluation genuinely hard.
The deadliest bug in financial ML is lookahead bias — letting information from the future sneak into training. It creeps in through shuffled splits, features computed over the whole series (a global mean that secretly includes future values), or labels timestamped before they were actually knowable. A backtest with lookahead bias always looks brilliant and always loses money live.
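One way to see the "features computed over the whole series" leak described above is to change only the last observation and check whether earlier feature values move. The sketch below does that; the function names `leaky_feature` and `causal_feature` and the simulated price path are assumptions made for illustration.

```python
import numpy as np

def leaky_feature(x):
    # Centers every point on the mean of the WHOLE series, future included.
    return x - x.mean()

def causal_feature(x):
    # Centers each point on the mean of values observed up to and including time t.
    expanding_mean = np.cumsum(x) / np.arange(1, len(x) + 1)
    return x - expanding_mean

rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(0, 1, 500))   # hypothetical price path

shocked = prices.copy()
shocked[-1] += 50.0          # alter only the final, "future-most" observation

past = slice(0, 499)         # every time step before the altered one
leak_changed = not np.allclose(leaky_feature(prices)[past],
                               leaky_feature(shocked)[past])
causal_changed = not np.allclose(causal_feature(prices)[past],
                                 causal_feature(shocked)[past])
print(leak_changed, causal_changed)   # True False
```

If a feature's past values change when only the future changes, that feature has seen the future. The same perturbation test can be pointed at any feature pipeline as a cheap leak detector.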
Rule of thumb: if your finance model’s backtested Sharpe ratio (return divided by its volatility — reward per unit of risk) looks too good to be true, you have a data leak, not an edge. Hunt the leak before celebrating.
28.1.1 — A worked Sharpe-ratio sanity check
The Sharpe ratio is the most quoted number in quantitative finance, so it pays to read its formula directly. Picture two strategies that both earn 10% a year: one drifts up smoothly, the other lurches violently to get there. The smooth one is “better” risk-adjusted, and Sharpe is how we put a number on that.
\[ \text{Sharpe} = \frac{\mathbb{E}[R_p - R_f]}{\sigma_p} \;\times\; \sqrt{T} \]
In words: average excess return (your return minus the risk-free rate) divided by how bumpy that return is, scaled up to an annual figure by \(\sqrt{T}\), where \(T\) is the number of return periods per year (e.g. \(T = 252\) trading days for daily returns). Also written: \(\text{Sharpe} = \dfrac{\bar{r} - r_f}{\text{std}(r)}\sqrt{252}\), the same ratio with the sample mean and standard deviation of daily returns.
import numpy as np
daily = np.random.randn(252) * 0.01 + 0.0005 # toy daily returns, slight positive drift
rf = 0.0 # assume ~0 daily risk-free for the demo
sharpe = (daily.mean() - rf) / daily.std() * np.sqrt(252)
print("annualized Sharpe:", round(sharpe, 2))
# Reality check: live equity strategies rarely sustain Sharpe > 2.
# A backtest showing 6+ almost always means lookahead bias or overfitting.
assert daily.std() > 0 # Sharpe is undefined for zero-volatility returns

28.2 — Healthcare & Bioinformatics
Healthcare is the mirror image of finance: data is scarcer, messier, and far less standardized, but the stakes are measured in lives, and a single confident mistake is catastrophic in a way a bad trade never is. The four flagship applications show ML at its most promising and most perilous.
Medical imaging — classifying X-rays, CT, MRI, and pathology slides — is deep learning’s biggest healthcare success. A convolutional network (a vision model that scans an image with small learned filters; Chapter 15) can flag diabetic retinopathy or detect lung nodules at radiologist-level accuracy on benchmark sets. Diagnosis from structured records and labs predicts disease risk and triages patients. Drug discovery uses models to screen millions of candidate molecules and predict binding affinity or toxicity, collapsing years of wet-lab work into a ranked shortlist. Genomics predicts which gene variants are pathogenic and how DNA sequence maps to biological function.
Let a tiny worked example expose why accuracy is a trap in healthcare. Imagine a screening test for a disease with 1% prevalence in 10,000 patients. A confusion matrix lays out the four outcomes — what the model said versus the truth:
| | Has disease (100) | Healthy (9,900) |
|---|---|---|
| Model says “sick” | 90 (true positive) | 495 (false positive) |
| Model says “healthy” | 10 (false negative) | 9,405 (true negative) |
This model catches 90% of sick patients (sensitivity = true positives ÷ all truly sick = 0.90), which is clinically decent. Yet of everyone it flags as sick, only \(90/(90+495) \approx 15\%\) actually are; the rest get needless anxiety and follow-up tests. Meanwhile a useless model that says “healthy” to everyone scores \(9{,}900/10{,}000 = 99.0\%\) accuracy while catching zero patients, beating the \(9{,}495/10{,}000 = 94.95\%\) accuracy of the model that actually finds disease. Accuracy rewards the lazy model; the metrics that matter are sensitivity (of the truly sick, what fraction did we catch?), specificity (of the truly healthy, what fraction did we correctly clear?), and positive predictive value (of those we flagged, what fraction are truly sick?). Which to optimize depends on the cost of a missed case versus a false alarm.
The relationship that makes this swing so violently is Bayes’ rule for PPV — it shows precisely how the answer depends on prevalence, not just on how good the test is:
\[ \text{PPV} = \frac{\text{sens}\cdot p}{\text{sens}\cdot p + (1-\text{spec})\cdot(1-p)} \]
In words: of everyone the test flags, the share who are truly sick equals the true alarms (sensitivity times how common the disease is) divided by all alarms (true plus false). When the disease is rare (\(p\) tiny), the false alarms from the huge healthy pool dominate and PPV crashes. Also written: \(\text{PPV} = \dfrac{\text{TP}}{\text{TP}+\text{FP}}\) — the same quantity counted directly from the confusion-matrix cells.
TP, FP, FN, TN = 90, 495, 10, 9405
sensitivity = TP/(TP+FN) # 0.90 fraction of the truly sick we caught
specificity = TN/(TN+FP) # 0.95 fraction of the healthy we cleared
ppv = TP/(TP+FP) # 0.15 of those flagged, fraction really sick
print(round(sensitivity,2), round(specificity,2), round(ppv,2))

The same sensitivity and specificity, applied to a rarer disease, give an even worse PPV, because the few true cases are swamped by false positives drawn from a huge healthy pool. The chart below shows how PPV collapses as prevalence falls, even with a strong (90%/95%) test.
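As a numeric companion to the Bayes' rule formula for PPV, this short sketch holds the test fixed at 90% sensitivity and 95% specificity and varies only the prevalence. The function name `ppv_from_rates` and the three prevalence values are illustrative choices.

```python
def ppv_from_rates(sens, spec, prevalence):
    """Positive predictive value via Bayes' rule."""
    true_alarms = sens * prevalence                  # sick AND flagged
    false_alarms = (1 - spec) * (1 - prevalence)     # healthy AND flagged
    return true_alarms / (true_alarms + false_alarms)

for p in (0.10, 0.01, 0.001):
    print(f"prevalence={p:.3f}  PPV={ppv_from_rates(0.90, 0.95, p):.3f}")
# prevalence=0.100  PPV=0.667
# prevalence=0.010  PPV=0.154
# prevalence=0.001  PPV=0.018
```

The 1% row reproduces the 15% figure from the confusion matrix, and the 0.1% row shows that fewer than 2 in 100 flagged patients would be truly sick with the very same test.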
flowchart TD
A[Hospital A data] --> M[Train model]
M --> V1[Internal validation<br/>same hospital — looks great]
M --> V2[External validation<br/>Hospital B, new scanner]
V2 --> R{Accuracy holds?}
R -->|No| S[Distribution shift:<br/>different machine, population, protocol]
R -->|Yes| D[Cautious clinical pilot<br/>human-in-the-loop]
The defining healthcare pitfall is validation that does not generalize. A model trained on one hospital’s scanners routinely collapses at the next hospital because it secretly learned the scanner, not the disease — a notorious case had a pneumonia model keying on a portable-X-ray metadata token that appeared mostly on sicker patients. This is why external validation on a genuinely different site is the gold standard, not internal cross-validation, which only ever tests on data from the same source.
Bias is the second peril, and it is not abstract. If a dermatology model is trained mostly on light skin, it underperforms on dark skin; if a risk algorithm uses healthcare spending as a proxy for need, it will under-flag populations that historically received less care, baking the inequity into the model. Interpretability is the third: a clinician will not — and ethically should not — act on a black-box “cancer: 0.83” without knowing what drove it (covered in depth in Chapter 36).
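A simple first check for the subgroup problem described above is to compute the same metric separately for each group rather than once over the pooled data. The arrays below are made-up toy labels and the group names "A" and "B" are placeholders; this is a sketch of the bookkeeping, not a full fairness audit.

```python
import numpy as np

# Hypothetical labels, predictions, and group membership for 12 patients.
y_true = np.array([1, 1, 1, 1, 0, 0,   1, 1, 1, 1, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0,   1, 0, 0, 0, 0, 1])
group = np.array(["A"] * 6 + ["B"] * 6)

def sensitivity(y_true, y_pred):
    # Of the truly positive cases, what fraction did the model flag?
    positives = y_true == 1
    return float((y_pred[positives] == 1).mean())

per_group = {g: sensitivity(y_true[group == g], y_pred[group == g])
             for g in np.unique(group)}
overall = sensitivity(y_true, y_pred)
print("overall:", overall, "per group:", per_group)
# overall: 0.5 per group: {'A': 0.75, 'B': 0.25}
```

The pooled number hides the fact that group B's sick patients are caught only a third as often as group A's, which is why per-group reporting matters before deployment.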
28.2.1 — Computing the metrics with scikit-learn
In practice nobody divides confusion-matrix cells by hand; scikit-learn reports every clinical metric and, crucially, lets you move the decision threshold to trade sensitivity against specificity — the single most important knob in a screening tool.
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
y_true = np.array([0]*9900 + [1]*100) # 1% prevalence
# pretend scores: healthy cluster low, sick cluster higher but overlapping
y_score = np.concatenate([np.random.beta(2, 8, 9900),
                          np.random.beta(5, 3, 100)])
for thresh in (0.5, 0.3): # lower threshold => catch more, flag more
    y_pred = (y_score >= thresh).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sens, spec = tp/(tp+fn), tn/(tn+fp)
    ppv = tp/(tp+fp) if tp+fp else 0
    print(f"thr={thresh}: sens={sens:.2f} spec={spec:.2f} PPV={ppv:.2f}")
print("AUROC:", round(roc_auc_score(y_true, y_score), 3)) # threshold-free ranking quality

Lowering the threshold from 0.5 to 0.3 catches more disease (higher sensitivity) but floods the clinic with false alarms (lower PPV). The curve, not a single number, is the real deliverable.
Reporting accuracy on an imbalanced medical dataset is the single most common way healthcare ML papers mislead. Always demand sensitivity, specificity, PPV, and a confusion matrix, plus the disease prevalence in the test set — as the curve above shows, the same model’s PPV swings wildly with prevalence.
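Since the whole sensitivity/specificity trade-off matters more than any one operating point, a natural next step is to trace the full curve and then pick the threshold that meets a clinical requirement. The sketch below uses scikit-learn's `roc_curve` on simulated scores; the 90% sensitivity target and the beta-distributed scores are assumptions for the demo.

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y_true = np.array([0] * 9900 + [1] * 100)              # 1% prevalence
y_score = np.concatenate([rng.beta(2, 8, 9900),        # healthy: low scores
                          rng.beta(5, 3, 100)])        # sick: higher, overlapping

# roc_curve returns one (fpr, tpr) point per candidate threshold,
# with thresholds sorted from strictest to most lenient.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

target_sens = 0.90                                     # hypothetical clinical requirement
i = int(np.argmax(tpr >= target_sens))                 # strictest threshold meeting it
print(f"threshold={thresholds[i]:.3f}  "
      f"sensitivity={tpr[i]:.2f}  specificity={1 - fpr[i]:.2f}")
```

Choosing the operating point this way makes the clinical constraint explicit instead of inheriting an arbitrary 0.5 cutoff.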
28.3 — Recommender Systems & Security/Fraud as Cross-Industry Workhorses
Some ML applications are not tied to one industry at all — they show up everywhere, from streaming to banking to e-commerce to enterprise security. Two stand out as cross-industry workhorses, and recognizing them as reusable patterns rather than bespoke projects saves enormous effort.
Recommender systems answer “what should this user see next?” — products on a retailer, films on a streaming service, posts in a feed, securities in a robo-advisor, even job postings. The same core machinery — collaborative filtering (recommend what similar users liked), matrix factorization (compress the user-item history into dense embeddings, learned numeric vectors that place similar users and items near each other), and ranking models — is retargeted per domain. The full treatment is in Chapter 26; the cross-industry lesson is that any business with users, items, and a history of interactions has a recommendation problem hiding in it, and the engineering patterns transfer almost wholesale.
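To give the matrix-factorization idea a concrete shape, here is a minimal sketch that factorizes a tiny interaction matrix with a truncated SVD and recommends the best-scoring unseen item. The matrix, the rank `k = 2`, and the helper `recommend` are all illustrative assumptions; production systems typically fit factors only on observed entries (for example with alternating least squares) rather than treating missing interactions as zeros.

```python
import numpy as np

# Hypothetical 5 users x 6 items implicit-feedback matrix (1 = interacted).
R = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

k = 2                                          # embedding dimension
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_emb = U[:, :k] * s[:k]                    # one k-dim vector per user
item_emb = Vt[:k, :].T                         # one k-dim vector per item
scores = user_emb @ item_emb.T                 # predicted affinity for every pair

def recommend(user, n=1):
    """Return the n unseen items with the highest predicted score."""
    unseen = np.where(R[user] == 0)[0]
    ranked = unseen[np.argsort(scores[user, unseen])[::-1]]
    return ranked[:n]

print(recommend(1), recommend(4))              # [2] [5]
```

User 1 resembles users 0 and 2, who also interacted with item 2, so item 2 is surfaced; the same logic gives user 4 item 5. Swapping products for films or job postings changes the data, not the code.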
Security and fraud detection answer “is this event anomalous or malicious?” — fraudulent payments, account takeovers, network intrusions, insurance fraud, fake reviews. Architecturally this is anomaly detection (learn what normal looks like, flag departures from it) plus imbalanced classification, and the same toolkit serves a bank’s transaction monitor and a security team’s intrusion-detection system. Chapter 27 owns the mechanics; the recurring shape is shown below.
flowchart LR
subgraph Recommenders
U[Users] --> I1[Items: products / films / posts]
I1 --> RANK[Rank by predicted relevance]
end
subgraph Security_Fraud
E[Events: payments / logins / packets] --> SC[Score: anomalous?]
SC --> ACT[Block / flag / review]
end
RANK -.same embedding & ranking toolkit.-> SC
The unifying insight: both workhorses are ranking-and-scoring problems over a stream of events, both are brutally imbalanced (most items are irrelevant, most events are benign), both suffer feedback loops (recommending an item makes it more clicked, which makes the model recommend it more; blocking a fraud pattern makes the next attack mutate to evade it), and both are judged by business outcomes — engagement and revenue for recommenders, dollars-saved-minus-friction for fraud — not by raw accuracy. If you have built one well, you already understand the spine of the other.
Here is the shared skeleton both reduce to — score every candidate, then act on the top of the ranking:
import numpy as np
# Both workhorses: score a stream, act on the extreme tail of the ranking.
scores = np.array([0.02, 0.91, 0.10, 0.77, 0.05]) # model's relevance/anomaly score
# recommender: surface the top-k most relevant items
topk = scores.argsort()[::-1][:2] # -> indices [1, 3]
# fraud: flag everything above a risk threshold for review
flagged = np.where(scores > 0.7)[0] # -> indices [1, 3]
print(topk, flagged) # same ranking machinery, different action on the tail

The same “learn normal, flag the outliers” idea has a one-liner home in scikit-learn: IsolationForest is the workhorse anomaly detector for the fraud half of this pattern, no labels required:
import numpy as np
from sklearn.ensemble import IsolationForest
# 1000 "normal" transactions, then 5 injected anomalies far from the cloud
normal = np.random.randn(1000, 3)
anomaly = np.random.randn(5, 3) * 6 + 12
X = np.vstack([normal, anomaly])
iso = IsolationForest(contamination=0.005, random_state=0).fit(X)
pred = iso.predict(X) # -1 = anomaly, +1 = normal
print("flagged anomalies:", int((pred == -1).sum())) # ~5, the injected outliers

Where this shows up, concretely. A streaming service ranks 50,000 candidate titles per user and surfaces the top 20: that is the recommender’s argsort-then-top-k from above, run a billion times a day. A card network scores every swipe in ~10 ms and routes the riskiest 0.1% to a reviewer: that is the fraud half, the same ranking with a threshold instead of a top-k. Both teams obsess over the same two numbers: how many real positives sit in the tiny slice they act on, and how much friction (skipped-but-good titles, frozen-but-honest cards) that slice costs. Different industries, identical scoreboard.
When scoping a new project in any industry, ask two diagnostic questions: “Is there a user being matched to items here?” (a recommender problem) and “Am I hunting rare bad events in a flood of normal ones?” (an anomaly/fraud problem). A surprising fraction of “novel” requests are one of these two in disguise.
28.4 — Cross-Cutting Lessons
Strip away the domain vocabulary and the same four lessons explain most of why real ML projects succeed or fail. They are the portable wisdom of this chapter.
Data quality beats model cleverness. A logistic regression on clean, well-labeled, leakage-free data routinely outperforms a transformer on a corrupted dataset. The unglamorous work — fixing label noise, reconciling units, handling missing values honestly, removing leaked features — is where most accuracy actually comes from. The diagram below contrasts where effort really goes with where beginners think it goes.
Distribution shift is the rule, not the exception. A model is trained on a snapshot of the world; the world then moves. Three flavors degrade a deployed model silently: covariate shift (the inputs change — a new customer demographic), label shift (the base rates change — fraud rises in a recession), and concept drift (the input-output relationship itself changes — what counts as risky is redefined). The fix is not a one-time event but monitoring: track input distributions and live performance, and retrain on a schedule or a trigger.
Watch the live distribution (pink) slide away from the training distribution (indigo) it was fit on — that growing gap is exactly what a drift alarm like PSI measures:
The cheap, industry-standard drift alarm is the Population Stability Index (PSI), which measures how far a live feature distribution has wandered from its training distribution:
\[ \text{PSI} = \sum_{i} (a_i - e_i)\,\ln\!\frac{a_i}{e_i} \]
In words: chop the feature into bins (say, deciles). In each bin, take how much the live share differs from the training share, and weight it by the log of their ratio so lopsided bins count more. Add up all the bins: 0 means the two distributions are identical, and a bigger number means they have drifted further apart. Also written: it equals \(\text{KL}(e\Vert a) + \text{KL}(a\Vert e)\) — the two KL-divergences added in both directions (the Jeffreys divergence), which is just a symmetric way to score how different the training histogram \(e\) and live histogram \(a\) are.
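The equivalence with the two KL-divergences follows from splitting the PSI sum into its two terms. A short derivation, using the same bin shares \(a_i\) (live) and \(e_i\) (training) as above:

\[ \sum_{i} (a_i - e_i)\,\ln\!\frac{a_i}{e_i} \;=\; \sum_{i} a_i \ln\!\frac{a_i}{e_i} \;-\; \sum_{i} e_i \ln\!\frac{a_i}{e_i} \;=\; \sum_{i} a_i \ln\!\frac{a_i}{e_i} \;+\; \sum_{i} e_i \ln\!\frac{e_i}{a_i} \;=\; \text{KL}(a\Vert e) + \text{KL}(e\Vert a) \]

Each KL term is non-negative and zero only when the two histograms match, which is why PSI is 0 for identical distributions and can only grow as they diverge.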
import numpy as np
# Cheap drift alarm: is this week's feature distribution unlike training's?
def psi(expected, actual, bins=10): # Population Stability Index
    q = np.quantile(expected, np.linspace(0,1,bins+1)) # bin edges = training deciles
    q[0], q[-1] = -np.inf, np.inf # open-ended outer bins catch out-of-range live values
    e = np.histogram(expected, q)[0]/len(expected) + 1e-6 # training share per bin
    a = np.histogram(actual, q)[0]/len(actual) + 1e-6 # live share per bin (epsilon avoids log 0)
    return np.sum((a-e)*np.log(a/e)) # 0 = identical; grows as they diverge
train = np.random.randn(10000)
live = np.random.randn(10000)+0.7 # a shifted mean
print(round(psi(train, live), 3)) # >0.25 => significant shift, investigate
# rule of thumb: PSI <0.1 stable, 0.1-0.25 watch, >0.25 act
assert psi(train, train) < 0.01 # sanity check: no shift against itself

Human-in-the-loop is a design choice, not a failure. In high-stakes domains the model rarely makes the final call alone; it triages, ranks, and explains so a human decides faster and better. A fraud model that flags 200 transactions for a reviewer, a radiology model that pre-reads and prioritizes the worklist, a credit model that auto-approves the clear cases and routes the borderline ones to an officer: each pairs the model’s throughput with human judgment and accountability.
Evaluate against the business metric. This is the lesson that quietly decides whether a project was worth doing. A model can win on AUC and lose on the thing the business cares about, because offline ML metrics are only proxies for real-world value. The table makes the gap explicit.
| Domain | Tempting ML metric | The metric that actually matters |
|---|---|---|
| Fraud | Accuracy / AUC | Net dollars saved minus customer friction |
| Credit | AUC | Approved-loan profit under regulatory & fairness constraints |
| Medical screening | Accuracy | Sensitivity at an acceptable false-alarm / cost trade-off |
| Recommender | Offline ranking score | Long-term engagement, revenue, retention (measured via A/B test) |
| Trading | Backtest return | Live risk-adjusted return after costs & slippage |
flowchart LR
A[Offline metric<br/>AUC / accuracy] -->|necessary but not sufficient| B[Online proxy<br/>CTR / flag rate]
B -->|A/B test| C[Business KPI<br/>profit / lives / retention]
C -->|feeds back into| A
The chain offline-metric → online-proxy → business-KPI is the spine of responsible applied ML. Each arrow is a place the value can leak out, which is why the gold standard for proving a model helps is a live, controlled A/B test (randomly send some traffic to the new model, the rest to the old, and compare the real business outcome) — not a leaderboard.
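For a sense of what "compare the real business outcome" can look like in code, here is a minimal sketch of a two-proportion z-test on conversion counts from the two arms of an A/B test. The traffic and conversion numbers are invented, the function name `two_proportion_z` is an assumption, and the normal approximation is only one of several reasonable analysis choices.

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Lift, z-statistic, and two-sided p-value for B versus A (normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)            # rate under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))          # two-sided tail probability
    return p_b - p_a, z, p_value

# Hypothetical experiment: old model (A) vs new model (B), 10,000 users each.
lift, z, p_value = two_proportion_z(conv_a=500, n_a=10_000,
                                    conv_b=560, n_b=10_000)
print(f"lift={lift:.4f}  z={z:.2f}  p={p_value:.3f}")
```

With these made-up counts the new model converts 0.6 percentage points better, yet the p-value sits just above the conventional 0.05 line. An offline win does not guarantee a clear live win.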
The most expensive mistake in applied ML is optimizing the proxy until it diverges from the goal — a classic case of Goodhart’s law (“when a measure becomes a target, it ceases to be a good measure”). A recommender tuned purely for clicks learns to surface clickbait; a fraud model tuned purely for catch-rate freezes legitimate customers. Always keep the true business metric in view.
Before building anything, write down the single business number the project must move and how you will measure it live. If you cannot name that number, you are not ready to train a model — you are ready to talk to a stakeholder.
28.4.1 — Cost-sensitive decisions: where the threshold really comes from
Intuition first: a smoke detector that screams at burnt toast is annoying; one that stays silent during a real fire is deadly. You tune its sensitivity by deciding how much a missed fire costs versus a false alarm. ML thresholds work identically — the “right” cutoff is not 0.5, it is wherever the expected cost is lowest.
Given a class probability \(p\) and the cost of a false positive \(C_{FP}\) versus a false negative \(C_{FN}\), the cost-minimizing decision is to flag when:
\[ p \;>\; \frac{C_{FP}}{C_{FP} + C_{FN}} \]
In words: act on the positive class only when its probability clears a threshold set by how much worse a miss is than a false alarm. If a miss is far costlier, the threshold drops and you flag aggressively. Also written: flag when the expected cost of acting, \(C_{FP}(1-p)\), is below the expected cost of not acting, \(C_{FN}\,p\) — the same break-even condition rearranged.
# Fraud: missing fraud (FN) costs ~$500, a false alarm (FP) costs ~$5 in review time.
C_FP, C_FN = 5, 500
threshold = C_FP / (C_FP + C_FN)
print("flag transactions with fraud-prob >", round(threshold, 4)) # ~0.0099, NOT 0.5
# A 100:1 cost asymmetry pulls the threshold far below the naive 0.5 default.

This single formula is why the naive predict() default of 0.5 is almost always wrong in finance and healthcare: it silently assumes false positives and false negatives cost the same, which they essentially never do.
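The break-even condition can also be checked directly by comparing the two expected costs for a few probabilities. This sketch reuses the same illustrative costs as above; the helper `expected_costs` and the sample probabilities are assumptions for the demo.

```python
C_FP, C_FN = 5.0, 500.0                 # same illustrative costs as above
threshold = C_FP / (C_FP + C_FN)

def expected_costs(p):
    flag = C_FP * (1 - p)               # we act: we pay only if the case was benign
    ignore = C_FN * p                   # we do nothing: we pay only if it was fraud
    return flag, ignore

decisions = {}
for p in (0.001, 0.005, 0.02, 0.50):
    flag, ignore = expected_costs(p)
    decisions[p] = "flag" if flag < ignore else "pass"
    # The cheaper action always agrees with the closed-form threshold rule.
    assert (p > threshold) == (flag < ignore)
    print(f"p={p:<5}  cost(flag)={flag:6.2f}  cost(ignore)={ignore:7.2f}  -> {decisions[p]}")
```

A transaction with only a 2% fraud probability is already worth flagging here, because ignoring it carries an expected cost of $10 against $4.90 for a review.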
28.5 — Genomics and Computational Biology: Deep Learning on the Sequence of Life
A genome is text. Four letters — A, C, G, T — strung into a sequence three billion characters long in humans, with a grammar we only partially understand. Some stretches code for proteins; far more do not, yet still control when and where genes switch on. The central problem of computational genomics is reading that grammar: given a stretch of DNA, what does it do? Deep learning earns its place here for the same reason it does in language modeling — the input is a long sequence with local motifs and long-range dependencies, and we have lots of it.
The intuition to hold onto: a convolutional or attention model sliding over DNA is doing exactly what it does over text or images. It learns short patterns (a transcription-factor binding site is like a misspelled keyword), composes them, and maps the composition to an outcome (this region is open chromatin; this variant disrupts a regulatory site). The biology supplies the labels; the architecture is mostly borrowed.
flowchart LR
A["DNA sequence<br/>(one-hot: 4 x L)"] --> B["Conv layers<br/>learn motifs"]
B --> C["Pooling /<br/>dilated conv or attention<br/>long-range context"]
C --> D["Output head"]
D --> E1["TF binding (DeepBind)"]
D --> E2["Chromatin / epigenetic marks (DeepSEA)"]
D --> E3["Gene expression tracks (Enformer)"]
One-hot encoding: turning letters into tensors
Before any convolution, DNA becomes numbers. The standard move is one-hot encoding: each base is a length-4 vector, so a window of length \(L\) becomes a \(4 \times L\) matrix.
\[ \text{ACGT} \rightarrow \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}^{\!\top} \]
In words: replace each letter with a 4-slot switch where exactly one slot is on, marking which of A/C/G/T it is; stacking the switches column by column turns a string into a numeric grid a network can convolve over. Also written: \(x_{b,j} = \mathbb{1}[\,\text{seq}_j = b\,]\) for base index \(b \in \{A,C,G,T\}\) and position \(j\) — a 1 wherever the base at position \(j\) equals row \(b\), else 0.
A 1-D convolution with a filter of width \(w\) then scans this matrix. A single filter is, in effect, a learned position weight matrix — the classic way biologists describe a binding motif — and the convolution score is high exactly where the sequence matches the motif.
The convolution score at position \(i\) is just the overlap between the filter \(W\) and the sequence window sitting under it:
\[ s_i = \sum_{b=1}^{4}\sum_{k=1}^{w} W_{b,k}\, x_{b,\,i+k-1} \]
In words: slide the motif filter along the sequence, and at each spot multiply matching cells and add them up; the sum spikes wherever the window looks like the motif. Also written: \(s_i = \langle W, x_{[:,\,i:i+w]}\rangle_F\) — the Frobenius inner product (elementwise multiply-then-sum) of the filter with the aligned sequence patch.
This is the lineage worth knowing:
| Model | Year (approx.) | Input window | Architecture | Predicts |
|---|---|---|---|---|
| DeepBind | 2015 | ~tens of bp | shallow CNN | TF / RBP binding affinity |
| DeepSEA | 2015 | 1,000 bp | deep CNN | 919 chromatin features (DNase, TF, histone) |
| Basset / Basenji | 2016–18 | up to 100 kb | CNN + dilated conv | chromatin accessibility, expression tracks |
| Enformer | 2021 | ~200 kb | CNN + transformer | thousands of expression / epigenetic tracks |
The arc is one of reach. DeepBind reads a binding site. DeepSEA reads a kilobase and predicts a panel of regulatory states. Enformer attends across 200,000 bases so a model can connect an enhancer to the gene it regulates a hundred kilobases away — a dependency far too long for convolutions alone, which is precisely why attention entered the field.
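To make the "reach" argument concrete, here is a minimal sketch of how far a stack of 1-D convolutions can see. It assumes stride-1 layers with no pooling; the helper name `receptive_field`, the kernel width, and the layer counts are illustrative choices, not the published configuration of any model in the table.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in bases) of stacked stride-1 1-D convolutions.

    Each layer with kernel k and dilation d widens the view by (k - 1) * d.
    """
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d
    return rf

# Illustrative stacks (hypothetical, not taken from any published model).
plain = receptive_field(3, [1] * 10)                        # ten ordinary layers
dilated = receptive_field(3, [2 ** i for i in range(10)])   # dilation doubles per layer

print("plain conv stack:  ", plain, "bp")
print("dilated conv stack:", dilated, "bp")
```

Ten plain layers see 21 bp, while doubling the dilation at each layer reaches 2,047 bp with the same number of weights. Under these assumptions a stack would still need around sixteen such layers to span 100 kb, whereas a single attention layer can relate any two positions in its window directly.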
A from-scratch motif scanner
The cleanest way to feel what these models learn is to skip the training and just score a sequence against a known motif, then see where deep nets improve on it.
import numpy as np

BASES = "ACGT"

def onehot(seq):  # string of length L -> 4 x L matrix
    idx = {b: i for i, b in enumerate(BASES)}
    m = np.zeros((4, len(seq)))
    for j, b in enumerate(seq):
        m[idx[b], j] = 1
    return m

# a learned-ish position weight matrix for motif "GAC"
pwm = np.array([  # rows A,C,G,T; cols pos1,2,3
    [0.1, 0.9, 0.0],  # A: favoured at pos2
    [0.0, 0.0, 0.8],  # C: favoured at pos3
    [0.8, 0.1, 0.1],  # G: favoured at pos1
    [0.1, 0.0, 0.1],  # T: never favoured
])
seq = "TGACGTGACA"
x = onehot(seq)
w = pwm.shape[1]
scores = [np.sum(x[:, i:i+w] * pwm) for i in range(x.shape[1] - w + 1)]
print([round(float(s), 2) for s in scores])
peak = int(np.argmax(scores))
print("best match at pos", peak, "->", seq[peak:peak+w])
# assert the scanner finds GAC at the highest-scoring position
assert seq[peak:peak+w] == "GAC"

A real CNN learns the PWM from data, stacks many of them, and lets later layers learn that this motif near that motif means something a single filter never could. But the unit of meaning is still the motif, and that continuity is why interpretability tools (below) can map a trained network back to biology.
The same scanner as a PyTorch Conv1d
The hand-written sliding sum above is exactly what a 1-D convolution layer computes (PyTorch's Conv1d slides the filter without flipping it, i.e. cross-correlation). In PyTorch you get it for free, and the filter weights become learnable, which is the key step from this toy toward DeepBind:
import torch
import torch.nn as nn

# one-hot sequence as (batch=1, channels=4, length=10)
seq = "TGACGTGACA"
idx = {b: i for i, b in enumerate("ACGT")}
x = torch.zeros(1, 4, len(seq))
for j, b in enumerate(seq):
    x[0, idx[b], j] = 1.0
conv = nn.Conv1d(in_channels=4, out_channels=1, kernel_size=3, bias=False)
with torch.no_grad():  # plant a hand-set "GAC" motif filter
    conv.weight.copy_(torch.tensor([[[0.1, 0.9, 0.0],     # A across 3 positions
                                     [0.0, 0.0, 0.8],     # C
                                     [0.8, 0.1, 0.1],     # G
                                     [0.1, 0.0, 0.1]]]))  # T
scores = conv(x).squeeze()  # one score per window
print("peak window:", int(scores.argmax()))  # where the GAC-like pattern fires
# In a real model these weights are trained by backprop, not hand-set.

A genomics CNN is a regex engine that learned its own patterns. Early filters are fuzzy keywords (binding motifs); deeper layers are the grammar combining them. When the model says “this region is active,” the honest follow-up is which learned patterns fired here, and that question has a concrete answer.
28.6 — The AlphaFold Idea: From Sequence to Structure
DNA’s cousin problem lives one level up. A gene’s protein-coding region specifies a chain of amino acids; that chain folds into a 3-D shape; the shape determines what the protein does. Predicting the fold from the sequence — the protein folding problem — sat unsolved for fifty years. AlphaFold2 (2021) brought it within striking distance of experimental accuracy, and the conceptual move is worth understanding even if you never train one.
The key insight is not “a bigger network.” It is evolutionary coupling. Across thousands of species, the same protein appears with mutations. If two amino-acid positions are physically touching in the folded structure, a mutation at one tends to be compensated by a mutation at the other — otherwise the protein breaks and that lineage dies out. So correlated mutations across a multiple-sequence alignment (MSA) are a noisy signal of physical contact.
The doodle below shows the tell: two coupled columns in the alignment light up together, again and again, across species — that synchrony is the contact signal.
flowchart TB
    A["Query protein sequence"] --> B["Search databases<br/>build MSA<br/>(thousands of homologs)"]
    B --> C["Evoformer<br/>attention over<br/>sequences × residue-pairs"]
    C --> D["Structure module<br/>predicts 3-D coordinates"]
    D --> E["Folded structure<br/>+ per-residue confidence (pLDDT)"]
    E -.refine.-> C
AlphaFold’s Evoformer passes information back and forth between two representations, one indexed by (sequence, residue) and one by (residue, residue) pairs, so the evolutionary signal in the alignment and the geometric constraints of a structure inform each other. A final structure module turns the refined representations (the query sequence's row together with the pair representation) into atomic coordinates.
A worked micro-example of the coupling signal, no deep learning required:
import numpy as np

# 4 aligned homologs, 5 residue positions (toy); positions are 0-based
msa = np.array([
    list("MKVLA"),
    list("MRVLD"),  # pos1 K->R AND pos4 ... track pos 1 & 4 together
    list("MKVLA"),
    list("MRVLD"),
])

def col(j):
    return msa[:, j]

# mutual information between two columns ~ "do they vary together?"
def mutinfo(a, b):
    mi = 0.0
    for x in set(a):
        for y in set(b):
            pxy = np.mean((a == x) & (b == y))
            px, py = np.mean(a == x), np.mean(b == y)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

print("MI(pos1,pos4) =", round(mutinfo(col(1), col(4)), 3))  # high: they co-vary
print("MI(pos1,pos2) =", round(mutinfo(col(1), col(2)), 3))  # low: pos2 never changes
# high MI => likely in contact in the folded structure

The mutual information this code computes has a tidy definition worth pinning down, because it is the literal measure of “do these two columns vary together”:
\[ I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \]
In words: compare how often two positions take each pair of values together against how often they would by pure chance if independent; the more the joint pattern beats the independent guess, the more they are coupled. Also written: \(I(X;Y) = H(X) + H(Y) - H(X,Y)\) — the entropy of each column minus their joint entropy, i.e. the uncertainty you remove about one position by knowing the other.
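The entropy form of the definition can be checked directly on the toy alignment. The sketch below is a pure-Python illustration; the helper names `entropy`, `mutual_information`, and `column` are made up for this example, and with only four sequences the estimate is far too noisy to mean anything biologically.

```python
import math
from collections import Counter

def entropy(symbols):
    """Shannon entropy (in nats) of a sequence of hashable symbols."""
    n = len(symbols)
    return -sum((c / n) * math.log(c / n) for c in Counter(symbols).values())

def mutual_information(a, b):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired observations."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

# Rows of the toy alignment; columns are indexed from 0.
rows = ["MKVLA", "MRVLD", "MKVLA", "MRVLD"]

def column(j):
    return [r[j] for r in rows]

mi_coupled = mutual_information(column(1), column(4))    # K/R vs A/D: vary together
mi_constant = mutual_information(column(1), column(2))   # K/R vs a column that never changes
print("coupled columns: ", round(mi_coupled, 3))
print("constant partner:", round(mi_constant, 3))
```

The coupled pair reaches log 2 (about 0.693 nats), the maximum possible for two equally frequent states, while pairing with a constant column gives zero because a column that never varies carries no information.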
Positions 1 and 4 (0-based columns of the toy alignment) mutate in lockstep (K with A, R with D); the model treats that as evidence they sit close in 3-D. AlphaFold replaces this crude mutual-information statistic with learned attention, but the biology it exploits is the same. The practical payoff: a near-complete catalog of predicted human protein structures, each annotated with a per-residue confidence (pLDDT). That annotation matters enormously, because a structure you cannot trust is worse than no structure. Confidence is a first-class output, not an afterthought.
AlphaFold predicts a static, folded structure for a single chain well; it is far weaker on disordered regions, on the effect of a single point mutation, and on dynamics or binding. Treating a confident fold as a confident functional answer is the most common misuse. pLDDT tells you the model trusts the geometry — not that the geometry answers your biological question.
28.7 — Variant-Effect Prediction: Does This Mutation Matter?
Now the question that touches patients. A person’s genome differs from the reference at millions of positions. Most differences are harmless. A few cause disease. Variant-effect prediction asks, for a given change, how likely is it to be functionally consequential?
The deep-learning framing is elegant and reuses everything above. Take a trained sequence model — DeepSEA, Enformer, a protein language model. Feed it the reference sequence and read its prediction. Feed it the mutated sequence and read again. The difference in predictions is the model’s estimate of the variant’s effect. No labeled disease examples required for the base signal — the model was trained to predict regulatory or structural state, and the variant’s effect falls out of the gap.
flowchart LR
    R["Reference sequence<br/>...A C [G] T A..."] --> M1["Trained model"]
    V["Variant sequence<br/>...A C [T] T A..."] --> M2["Trained model"]
    M1 --> P1["pred_ref<br/>(e.g. chromatin openness)"]
    M2 --> P2["pred_var"]
    P1 --> D["effect = | pred_var - pred_ref |"]
    P2 --> D
    D --> S["score → prioritize / flag"]
For coding variants the analog is a protein language model (ESM-style): score the reference amino-acid sequence and the mutant, and a variant the model finds “surprising” (low probability) is more likely to be deleterious. For non-coding variants — the vast majority, and the ones classical methods handled worst — a regulatory model like Enformer predicts whether the change shifts a gene’s expression.
The protein-language-model version of “surprise” has a clean formula — it is just the log-likelihood gap between the mutant and reference amino acid at the changed position:
\[ \Delta = \log p(x_{\text{mut}} \mid \text{context}) - \log p(x_{\text{ref}} \mid \text{context}) \]
In words: ask the model how natural the new amino acid looks versus the original one in the same surroundings; a strongly negative score means the mutation is one the model rarely sees in real proteins, hinting it is damaging. Also written: \(\Delta = \log\dfrac{p(x_{\text{mut}}\mid \text{context})}{p(x_{\text{ref}}\mid \text{context})}\) — the log-likelihood ratio of mutant to reference under the language model.
A tiny illustration of the difference-of-predictions recipe, with a stand-in scoring function:
# stand-in for a trained model: scores "regulatory activity" of a window.
# real model = DeepSEA/Enformer; here a toy that rewards the motif "GATA".
def model_score(seq):
    return sum(seq[i:i+4] == "GATA" for i in range(len(seq) - 3))

ref = "CCGATACC"
var = "CCGAGACC"  # single base T->G breaks the GATA motif
effect = model_score(var) - model_score(ref)
print("ref:", model_score(ref), "var:", model_score(var), "effect:", effect)
assert effect < 0  # variant destroys a regulatory motif -> flagged

The variant erases a binding motif, the predicted activity drops, and the effect score is non-zero. That is the entire logic, scaled up to real networks and thousands of output tracks.
A real protein-LM variant score with ESM
The toy above becomes real with a few lines of Hugging Face Transformers: load a pretrained protein language model (ESM-2), read the per-position amino-acid probabilities, and take the log-likelihood gap for the mutation.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
esm = AutoModelForMaskedLM.from_pretrained("facebook/esm2_t6_8M_UR50D").eval()
seq = "MKTAYIAKQR"  # toy wild-type protein; mutate 0-based position 3 (A) -> G
pos, ref_aa, mut_aa = 3, "A", "G"
assert seq[pos] == ref_aa  # guard against off-by-one position errors
ids = tok(seq, return_tensors="pt")
with torch.no_grad():
    logp = torch.log_softmax(esm(**ids).logits, dim=-1)[0]  # (L+specials, vocab)
tpos = pos + 1  # +1 for the leading <cls> token
score = (logp[tpos, tok.convert_tokens_to_ids(mut_aa)]
         - logp[tpos, tok.convert_tokens_to_ids(ref_aa)]).item()
print("variant log-likelihood ratio:", round(score, 3))  # very negative => likely deleterious

This unmasked (wild-type marginal) score is one of the strategies used by ESM-1v-style zero-shot variant predictors; masking the mutated position before scoring is a common alternative. Either way there are no disease labels, just the language model’s sense of what a “natural” protein looks like.
What makes this hard is not the recipe but the evaluation. A model that scores variants must be checked against ground truth — known pathogenic vs. benign variants, or experimental measurements from a massively parallel reporter assay (MPRA) where thousands of variants are tested in the lab at once. And the metric must respect that pathogenic variants are rare: accuracy is meaningless when 99.9% of variants are benign.
| Metric | Why it matters for variant prediction |
|---|---|
| AUROC | overall ranking, but optimistic under heavy class imbalance |
| AUPRC | far more honest when positives (pathogenic) are rare |
| Calibration | a “0.9 pathogenic” score must mean ~90% truly are |
| Enrichment in known genes | do top-scored variants land in disease genes? |
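The gap between the first two rows of the table is easy to see on synthetic data. The sketch below is a pure-Python illustration with hand-rolled `auroc` and `average_precision` helpers (average precision is used here as the AUPRC estimate); the class sizes and score distributions are assumptions chosen only to mimic heavy imbalance.

```python
import random

def auroc(scores, labels):
    """Probability a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Step-wise estimate of the area under the precision-recall curve."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank   # precision at each recalled positive
    return total / hits

rng = random.Random(0)
# Synthetic data (assumption): 20 pathogenic vs 5,000 benign, modestly separated scores.
n_pos, n_neg = 20, 5000
labels = [1] * n_pos + [0] * n_neg
scores = ([rng.gauss(2.0, 1.0) for _ in range(n_pos)]
          + [rng.gauss(0.0, 1.0) for _ in range(n_neg)])

roc, ap = auroc(scores, labels), average_precision(scores, labels)
prevalence = n_pos / (n_pos + n_neg)
print("AUROC:", round(roc, 3), "AUPRC:", round(ap, 3), "prevalence:", round(prevalence, 4))
```

On a set like this the AUROC should look strong while the AUPRC stays much lower, because even a small false-positive rate across 5,000 benign variants outnumbers the 20 true positives. The useful baseline for AUPRC is the prevalence, not 0.5.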
Rule of thumb: a zero-shot variant score (mutant-minus-reference under a sequence model) is a prior, not a verdict. It ranks candidates cheaply across the whole genome, but before any clinical use it must be recalibrated and checked against labelled pathogenic/benign sets: the language model knows what proteins look like, not what hurts a patient.
28.8 — Regulatory-Motif Discovery and Why Interpretability Is the Product
A genomics model that predicts accurately but explains nothing is, in biomedicine, only half-built. The reason is not aesthetic. A biologist wants to know which sequence features drive a prediction, because that is the testable hypothesis — the thing you can knock out in the lab to confirm. Interpretability here is not a sanity check bolted on at the end; it is the deliverable.
The workhorse is attribution: for each base in the input, how much did it contribute to the output? Methods like in-silico mutagenesis (mutate each base, measure the prediction change) or gradient-based attributions (DeepLIFT, integrated gradients) produce a per-base importance score. Plotting those scores as a letter-height “sequence logo” reveals the motifs the model actually used.
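In-silico mutagenesis is simple enough to sketch in full. The block below assumes a stand-in scorer, `toy_model`, that just counts occurrences of "GATA"; a real pipeline would call a trained network instead. The choice to summarize each position by its largest drop is one convention among several.

```python
BASES = "ACGT"

def toy_model(seq):
    """Stand-in for a trained network (assumption): counts 'GATA' occurrences."""
    return sum(seq[i:i + 4] == "GATA" for i in range(len(seq) - 3))

def in_silico_mutagenesis(seq, model):
    """Per-base importance: largest drop in prediction over the 3 possible substitutions."""
    base_score = model(seq)
    importance = []
    for j, ref in enumerate(seq):
        drops = [base_score - model(seq[:j] + alt + seq[j + 1:])
                 for alt in BASES if alt != ref]
        importance.append(max(drops))
    return importance

seq = "CCGATACC"
scores = in_silico_mutagenesis(seq, toy_model)
for base, s in zip(seq, scores):
    print(base, s)
```

Only the four bases of the motif receive non-zero importance. The cost is three extra forward passes per base, which is the practical reason gradient-based attributions are often preferred on long windows.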
One widely used attribution, integrated gradients, has an intuitive reading: walk slowly from a blank baseline up to the real input and add up how sensitive the output was along the way.
\[ \text{IG}_i = (x_i - x_i')\int_{0}^{1} \frac{\partial f\big(x' + \alpha(x - x')\big)}{\partial x_i}\,d\alpha \]
In words: start from a neutral baseline sequence, gradually morph it into the real one, and accumulate each base’s influence on the prediction over that path; the total is how much that base mattered. Also written: \(\text{IG}_i \approx (x_i - x_i')\cdot\frac{1}{m}\sum_{k=1}^{m}\frac{\partial f(x' + \frac{k}{m}(x-x'))}{\partial x_i}\) — the integral replaced by an average over \(m\) small steps along the baseline-to-input line.
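The Riemann-sum form translates almost line for line into code. This sketch uses a tiny logistic function with hand-picked weights `W` and an analytic gradient `grad_f` as stand-ins (both hypothetical); with a real network the gradient would come from autodiff, and the all-zeros baseline is an assumption.

```python
import math

W = [1.5, -2.0, 0.5]   # hypothetical weights of a tiny logistic "model"

def f(x):
    """Toy differentiable model: sigmoid of a weighted sum."""
    z = sum(w * xi for w, xi in zip(W, x))
    return 1.0 / (1.0 + math.exp(-z))

def grad_f(x):
    """Analytic gradient of f; a real network would use autodiff here."""
    p = f(x)
    return [p * (1.0 - p) * w for w in W]

def integrated_gradients(x, baseline, steps=200):
    """Riemann-sum approximation of IG along the straight baseline-to-input path."""
    totals = [0.0] * len(x)
    for k in range(1, steps + 1):
        alpha = k / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_f(point)):
            totals[i] += g
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, totals)]

x, baseline = [1.0, 0.5, 2.0], [0.0, 0.0, 0.0]
ig = integrated_gradients(x, baseline)
print("attributions:", [round(a, 4) for a in ig])
print("sum of IG:   ", round(sum(ig), 4))
print("f(x) - f(x'):", round(f(x) - f(baseline), 4))
```

The last two printed numbers should nearly agree: integrated gradients satisfies completeness, meaning the attributions sum to the change in output between baseline and input, up to the discretization error of the `steps`-point sum.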
The problem: a single sequence’s attributions are noisy. The same true motif shows up in thousands of windows, each slightly different and each with messy per-base scores. TF-MoDISco (Transcription Factor Motif Discovery from Importance Scores) solves this by aggregating. It scans attributions across the whole dataset, extracts the high-importance subsequences (“seqlets”), clusters them, and averages each cluster into a clean, consensus motif. The pipeline turns a haystack of noisy per-base scores into a short list of recurring patterns you can match against databases of known transcription-factor binding sites.
flowchart LR
    A["Trained model"] --> B["Per-base attributions<br/>over many sequences"]
    B --> C["Extract seqlets<br/>(high-importance stretches)"]
    C --> D["Cluster similar seqlets"]
    D --> E["Average → consensus motifs"]
    E --> F["Match to known TF database<br/>(novel motif = hypothesis)"]
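The extract-and-average idea can be shown in a heavily simplified form. This is not the TF-MoDISco algorithm: it takes one fixed-width seqlet per sequence, skips clustering entirely (it assumes a single motif), and uses a majority vote instead of averaging importance scores. The helper names, the threshold, and the attribution values are all made up for illustration.

```python
from collections import Counter

def extract_seqlet(seq, importance, width, min_total):
    """Return the highest-importance window of `width` bases, or None if too weak."""
    best_start = max(range(len(seq) - width + 1),
                     key=lambda i: sum(importance[i:i + width]))
    if sum(importance[best_start:best_start + width]) < min_total:
        return None
    return seq[best_start:best_start + width]

def consensus(seqlets):
    """Most common base at each position across equal-length seqlets."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqlets))

# Hypothetical (sequence, per-base attribution) pairs.
examples = [
    ("CCGATACC", [0.0, 0.1, 0.9, 0.8, 0.9, 0.7, 0.1, 0.0]),
    ("TGATTGCA", [0.1, 0.8, 0.9, 0.7, 0.8, 0.0, 0.1, 0.0]),  # noisy copy of the motif
    ("AAGGATAT", [0.0, 0.0, 0.1, 0.9, 0.8, 0.9, 0.8, 0.1]),
    ("ACGTTGCA", [0.1, 0.0, 0.1, 0.0, 0.1, 0.1, 0.0, 0.1]),  # no strong signal
]

candidates = (extract_seqlet(seq, imp, width=4, min_total=2.0) for seq, imp in examples)
seqlets = [s for s in candidates if s is not None]
print("seqlets:  ", seqlets)
print("consensus:", consensus(seqlets))
```

Even in this toy, aggregation does the work the text describes: one seqlet is a noisy variant, yet the per-position vote recovers the clean motif, and the sequence with no strong attributions contributes nothing.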
When a TF-MoDISco motif matches a known transcription factor, the model has rediscovered real biology from scratch — strong evidence it learned signal, not artifact. When it surfaces a motif no database knows, that is a genuine hypothesis worth an experiment. Either way, the attribution-to-motif loop is how a black-box predictor becomes a scientific instrument.
Validation and calibration are non-negotiable
Two failure modes haunt this field, and both are about trust, not accuracy.
First, data leakage through the genome’s structure. Nearby genomic regions are correlated, and the same biological elements recur. Split your train and test sets by random window and you will leak — a near-identical region sits on both sides, and your reported accuracy is fiction. The discipline is to split by chromosome: hold out entire chromosomes for testing so no region near a training example can sneak into evaluation. The same logic forbids letting two homologous proteins straddle the train/test boundary in folding work.
Random train/test splits are the single most common way genomics deep-learning papers overstate performance. Because regulatory elements and homologs recur across the genome, a random split places near-duplicates on both sides and the model partly memorizes the test set. Always split by chromosome (or by sequence-identity clusters for proteins), and report performance on held-out chromosomes only.
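A chromosome-level split needs only a grouping key. The sketch below assumes each example is a dict with a `chrom` field; the record layout, coordinates, sequences, and the choice of chr8 as the held-out chromosome are all placeholders.

```python
def split_by_chromosome(records, test_chroms):
    """Partition records so that no chromosome contributes to both train and test."""
    test_chroms = set(test_chroms)
    train = [r for r in records if r["chrom"] not in test_chroms]
    test = [r for r in records if r["chrom"] in test_chroms]
    return train, test

# Hypothetical windows; coordinates and sequences are placeholders.
records = [
    {"chrom": "chr1", "start": 1000, "seq": "ACGTAC"},
    {"chrom": "chr1", "start": 1200, "seq": "CGTACG"},   # near-duplicate of its neighbour
    {"chrom": "chr2", "start": 5000, "seq": "TTGACA"},
    {"chrom": "chr8", "start": 300,  "seq": "GATAAG"},
    {"chrom": "chr8", "start": 450,  "seq": "ATAAGC"},
]

train, test = split_by_chromosome(records, test_chroms=["chr8"])
train_chroms = {r["chrom"] for r in train}
held_out = {r["chrom"] for r in test}
assert train_chroms.isdisjoint(held_out)   # leakage check worth keeping in every pipeline
print("train:", sorted(train_chroms), "test:", sorted(held_out))
```

The two neighbouring chr1 windows land on the same side by construction, which a random split cannot guarantee. For proteins the same function would work with a sequence-identity cluster ID in place of the chromosome name.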
Second, uncalibrated confidence in a clinical setting. A variant-effect model that outputs 0.9 must be right about 90% of the time at that score, or a clinician acting on it makes systematic errors. This is why AlphaFold ships pLDDT, why variant scores are checked for calibration and not just AUROC, and why the honest output of a biomedical model is a calibrated probability with an interpretable explanation, never a bare number. The cost of a confidently wrong prediction is measured in misdiagnoses and wasted experiments — so validation against held-out biology, calibrated uncertainty, and attribution-based interpretability are the three things that turn an impressive model into a usable one.
| Requirement | Why it is non-negotiable in biomedicine |
|---|---|
| Chromosome-level splits | random splits leak via genomic correlation; inflate accuracy |
| Calibration | a clinician acts on the probability, not the rank |
| Interpretability (attributions, TF-MoDISco) | the explanation is the testable hypothesis |
| External validation (MPRA, held-out cohorts) | in-silico accuracy must survive contact with the wet lab |
Measuring calibration in practice
Intuition: a well-calibrated model is like an honest weather forecaster — when it says “70% chance of rain” across many days, it should actually rain on about 70% of them. We quantify the gap with Expected Calibration Error (ECE): bin predictions by confidence, and in each bin compare the average confidence to the actual hit rate.
\[ \text{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}\,\big|\,\text{acc}(B_m) - \text{conf}(B_m)\,\big| \]
In words: group predictions into confidence buckets, and for each bucket measure how far its claimed confidence sits from its real accuracy; average those gaps, weighted by how many predictions fall in each bucket. Also written: \(\text{ECE} = \mathbb{E}_{\hat p}\big[\,|\,P(\text{correct}\mid \hat p) - \hat p\,|\,\big]\) — the expected absolute gap between predicted confidence and true correctness probability.
import numpy as np

def ece(probs, labels, bins=10):
    edges = np.linspace(0, 1, bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # bins are (lo, hi]; the first bin also includes probs exactly equal to 0
        m = (probs <= hi) & ((probs > lo) | (lo == 0))
        if m.sum() == 0:
            continue
        conf, acc = probs[m].mean(), labels[m].mean()  # mean confidence vs. observed positive rate
        total += (m.sum() / len(probs)) * abs(acc - conf)
    return total

rng = np.random.default_rng(0)
probs = rng.uniform(0, 1, 2000)
labels = (rng.uniform(0, 1, 2000) < probs).astype(int)  # perfectly calibrated by construction
print("ECE (well-calibrated):", round(ece(probs, labels), 3))        # near 0
print("ECE (overconfident):  ", round(ece(probs**0.5, labels), 3))  # inflated -> larger gap

A low ECE is what lets a clinician read a model’s “0.9” as a genuine 90%: the difference between a number and a trustworthy number.
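Measuring miscalibration is half the job; the other half is correcting it. One simple option is histogram binning, sketched below in pure Python: learn the observed positive rate per score bin on a held-out calibration set, then replace each new score with its bin's rate. Platt scaling and isotonic regression are common alternatives. The synthetic overconfident model, the helper names, and the bin count are assumptions for illustration.

```python
import random

def fit_histogram_calibrator(scores, labels, bins=10):
    """Learn, per score bin, the observed positive rate on a calibration set."""
    sums, counts = [0.0] * bins, [0] * bins
    for s, y in zip(scores, labels):
        b = min(int(s * bins), bins - 1)
        sums[b] += y
        counts[b] += 1
    # Fall back to the bin midpoint when a bin received no calibration examples.
    return [sums[b] / counts[b] if counts[b] else (b + 0.5) / bins for b in range(bins)]

def apply_calibrator(table, scores):
    """Map each raw score to the calibrated rate of its bin."""
    bins = len(table)
    return [table[min(int(s * bins), bins - 1)] for s in scores]

def ece(probs, labels, bins=10):
    """Expected Calibration Error with equal-width bins (pure-Python version)."""
    total = 0.0
    for b in range(bins):
        idx = [i for i, p in enumerate(probs) if min(int(p * bins), bins - 1) == b]
        if idx:
            conf = sum(probs[i] for i in idx) / len(idx)
            acc = sum(labels[i] for i in idx) / len(idx)
            total += len(idx) / len(probs) * abs(acc - conf)
    return total

rng = random.Random(0)

def sample(n):
    """Synthetic overconfident model (assumption): reports sqrt(p) when the truth is p."""
    true_p = [rng.random() for _ in range(n)]
    return [p ** 0.5 for p in true_p], [int(rng.random() < p) for p in true_p]

cal_scores, cal_labels = sample(5000)   # held-out calibration set
new_scores, new_labels = sample(5000)   # fresh data to evaluate on

table = fit_histogram_calibrator(cal_scores, cal_labels)
before = ece(new_scores, new_labels)
after = ece(apply_calibrator(table, new_scores), new_labels)
print("ECE before:", round(before, 3), "after:", round(after, 3))
```

The calibrator is fitted on one sample and evaluated on another, mirroring the rule that calibration data must be separate from both training and test data. Ranking metrics are largely unaffected by the mapping, apart from ties created inside a bin, which is why calibration has to be checked in addition to AUROC.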
28.9 — Quick reference
| Term / formula | What it means | When / why it matters |
|---|---|---|
| Non-stationarity | The feature→target relationship drifts over time | The core hazard in finance; forces time-aware validation and retraining |
| Walk-forward split (TimeSeriesSplit) | Train only on the past, test on the genuine future | The only honest evaluation when data is time-ordered |
| Lookahead bias | Future information leaks into training | The deadliest finance bug — a brilliant backtest that loses money live |
| VaR\(_\alpha\) | The loss level not exceeded on \(\alpha\) of days | Sizes positions / regulatory capital; ignores how bad the tail gets |
| Expected Shortfall | Average loss beyond the VaR cutoff | Tail-aware risk; Basel’s preferred measure post-2008 |
| Sharpe ratio | Excess return ÷ volatility, annualized by \(\sqrt{T}\) | Risk-adjusted reward; backtests >2 usually signal a leak |
| Sensitivity / Specificity / PPV | Caught / cleared / truly-sick-among-flagged | Replace accuracy on imbalanced medical data |
| Bayes’ rule for PPV | PPV depends on prevalence, not just test quality | Why a great test still floods rare-disease screening with false alarms |
| External validation | Test on a genuinely different site/source | Catches models that learned the scanner, not the disease |
| PSI | Symmetric (Jeffreys) distance between train & live histograms | Cheap drift alarm: <0.1 stable, 0.1–0.25 watch, >0.25 act |
| Distribution shift (covariate / label / concept) | Inputs, base rates, or the input→output rule change | The default in production; demands monitoring, not a one-time fit |
| Cost-sensitive threshold \(C_{FP}/(C_{FP}+C_{FN})\) | Flag when probability clears a cost-set cutoff | Why 0.5 is almost always wrong when miss vs. false-alarm costs differ |
| One-hot DNA → Conv1d | Each base a 4-vector; a filter is a learned motif (PWM) | The base unit of genomics deep learning |
| Evolutionary coupling (MSA mutual info) | Co-varying alignment columns hint at 3-D contact | The signal AlphaFold exploits to fold proteins |
| Variant effect = pred(var) − pred(ref) | Difference in a sequence model’s output | Zero-shot scoring of mutations; no disease labels needed |
| Chromosome-level split | Hold out whole chromosomes for test | Random splits leak via genomic correlation and inflate accuracy |
| ECE | Mean gap between claimed confidence and real accuracy | Turns a model’s “0.9” into a trustworthy 90% for clinical use |
| Goodhart’s law | A targeted measure stops being a good measure | Why optimizing a proxy (clicks, catch-rate) corrupts the real goal |
| A/B test | Randomized live comparison of old vs. new model | Gold standard for proving a model moves the business metric |
28.10 — Key takeaways
- The hard part of applied ML is almost never the model — it is data quality, distribution shift, regulation, and aligning evaluation with the real business goal.
- Finance is dominated by non-stationarity and regulation: use walk-forward (time-ordered) validation, fear lookahead bias, prefer interpretable models where the law demands explanations, and size risk with tail-aware measures (VaR and Expected Shortfall).
- Healthcare raises the stakes: accuracy is misleading on rare diseases (use sensitivity/specificity/PPV, and remember PPV follows Bayes’ rule on prevalence), models must be externally validated on new sites, and bias plus interpretability are non-negotiable.
- Recommenders and security/fraud are cross-industry workhorses — ranking-and-scoring over imbalanced event streams with feedback loops — whose patterns transfer across domains.
- Expect distribution shift by default; monitor it (e.g. PSI) and retrain. Use human-in-the-loop for high-stakes decisions, and set thresholds from the cost asymmetry, not the default 0.5.
- Genomics & biology reuse the sequence-model toolkit (one-hot → CNN/attention) but demand chromosome-level splits, calibrated confidence, and attribution-based interpretability — the explanation is the deliverable.
- Always evaluate against the business metric, ideally via a live A/B test, and beware Goodhart’s law when optimizing proxies.
28.11 — See also
- Anomaly & Fraud Detection — the mechanics behind the security/fraud workhorse.
- Recommender Systems — the full treatment of the recommendation workhorse.
- Time Series & Forecasting — walk-forward validation and non-stationarity in depth.
- Model Evaluation & Tuning — metrics, confusion matrices, calibration, and validation strategy.
- Explainable AI & Interpretability — the interpretability demanded by finance and healthcare.
- AI Ethics, Fairness & Safety — bias, fairness constraints, and high-stakes accountability.
- MLOps & Deployment — monitoring, drift detection, and retraining in production.
- Causal Inference — A/B testing and reasoning about interventions versus correlations.
↪ The thread continues → Chapter 29 · 🔧 MLOps & Deployment
Building a model that works in a notebook is the easy 10%. Shipping it so it keeps working — versioned, monitored, retrained — is MLOps, where the other 90% lives.