Chapter 27 — 🚨 Anomaly & Fraud Detection

Anomaly detection is the art of finding the needle that does not belong in the haystack: a fraudulent charge among millions of honest ones, a hacked login among normal sessions, a failing bearing among healthy machines. It sits at the applied edge of machine learning, drawing on statistics, density estimation, and deep learning, but it carries its own brutal twist — the thing you most want to catch is rare, often unlabeled, and constantly changing as adversaries adapt. This chapter builds the toolkit from simple statistical alarms up to modern detectors, always keeping one eye on the realities of fraud and security.

🧭 In context: Applied ML Systems · used to flag rare, costly, or malicious events (fraud, intrusion, faults, disease) · the one key idea — model what “normal” looks like, then measure how far each new point strays from it.

💡 Remember this: You can almost never enumerate the anomalies, so model what normal looks like, score how far each new point strays from it, and pin the alert threshold to your alert budget.

27.1 — The Problem: Rare Events, Extreme Imbalance, Few or No Labels

An anomaly (or outlier) is an observation that deviates so much from the rest that it was likely generated by a different process than the one that produced the bulk of the data. That definition is easy to state. The hard part is not saying what an anomaly is but finding one in practice, and three difficulties explain why.

The first difficulty is rarity. Card fraud runs around 0.1% of transactions; in network intrusion or manufacturing faults the positive rate can be one in a million. A model that simply predicts “normal” for everything is 99.9% accurate and 100% useless. This is the extreme class imbalance problem, and it breaks every habit trained on balanced data — accuracy becomes meaningless, default 0.5 thresholds make no sense, and naive resampling distorts the very rarity you are trying to model.

The second difficulty is that labels are scarce or absent. Nobody hands you a clean column saying “this was fraud.” Labels arrive late (a chargeback lands 60 days after the transaction), are noisy (a customer forgets a purchase and disputes it), and cover only the fraud you already caught — the cleverest attacks are, by definition, unlabeled. So we often cannot train a standard classifier at all; instead we must learn the shape of “normal” and treat departures from it as suspicious.

The third difficulty is that the anomalies are neither uniform nor static. There is no single “fraud signature” to memorize, and the patterns shift weekly as fraudsters probe your defenses. The target is adversarial and drifting, so a detector frozen in time decays fast.

flowchart TD
    A[New observation] --> B{How well does it<br/>fit 'normal'?}
    B -->|Fits| C[Pass: ignore]
    B -->|Strays far| D[Flag: alert / review]
    D --> E[Analyst or chargeback<br/>eventually labels it]
    E -.late, noisy feedback.-> B

The common thread across fraud, intrusion, equipment faults, and rare-disease screening is the same statistical shape: a dense mass of normal points and a thin, ill-defined scattering of anomalies you cannot fully enumerate. Because you can describe “normal” far better than you can describe “bad,” the whole field tilts toward modeling normality.

Tip

Rule of thumb: if you have plenty of clean labels for both classes, treat the task as imbalanced classification (Chapter 9). Reach for anomaly detection methods when the positives are rare, undefined, or unlabeled — when you can describe “normal” far better than you can describe “bad.”

27.2 — Statistical Methods: z-score, Gaussian, EWMA

The oldest idea in the book is also one of the most effective: fit a simple probability model to “normal,” then flag whatever is improbable under it. If you can write down what typical data looks like, anything sitting in the far tail is your anomaly.

Start with the z-score, the workhorse for a single roughly bell-shaped feature. Standardize each point by how many standard deviations it sits from the mean, $z = (x - \mu)/\sigma$, and flag points beyond a cutoff such as $|z| > 3$ (only about 0.3% of draws from a true Gaussian land that far out). As a worked example, suppose daily logins average $\mu = 200$ with $\sigma = 25$. A day with 400 logins gives $z = (400-200)/25 = 8$: wildly anomalous, and a plausible fingerprint of a credential-stuffing attack.

In words: the z-score says how many standard-deviation steps a point is above or below the average — big steps mean strange. Also written: $z = \dfrac{x-\mu}{\sigma}$, equivalently $x = \mu + z\,\sigma$ (the point reconstructed from its score).
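
The login example can be checked in a few lines. This is a minimal sketch, assuming the baseline $\mu$ and $\sigma$ are already known from history; the helper name `zscore_flags` and the sample counts are illustrative, not part of any library.

```python
import numpy as np

def zscore_flags(x, mu, sigma, cutoff=3.0):
    """Return z-scores and a boolean mask of points with |z| > cutoff."""
    z = (np.asarray(x, dtype=float) - mu) / sigma
    return z, np.abs(z) > cutoff

# Hypothetical daily login counts; baseline mu=200, sigma=25 as in the text.
logins = [190, 215, 205, 400, 180]
z, flags = zscore_flags(logins, mu=200, sigma=25)
print(z)      # the 400-login day scores z = 8
print(flags)  # only that day exceeds the |z| > 3 cutoff
```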

Real data, though, has many correlated features, and that is where the multivariate Gaussian earns its keep. Fit a mean vector $\mu$ and a covariance matrix $\Sigma$, then score each point by its Mahalanobis distance, which stretches and rotates the notion of “far” to follow the data’s natural correlations:

\[D_M(x) = \sqrt{(x-\mu)^\top \Sigma^{-1} (x-\mu)}\]

In words: measure distance from the center, but first squeeze the axes so that directions where the data naturally spreads out count for less — it is “how surprising is this point given how the features usually move together.” Also written: $D_M(x) = \lVert \Sigma^{-1/2}(x-\mu)\rVert_2$, the ordinary Euclidean distance after whitening the data; for a single feature it collapses to $D_M = |z|$.

The intuition is that a point can look perfectly normal on every single axis yet be anomalous jointly. A high transaction amount is fine on its own, late-night timing is fine on its own, but that amount at that hour may be very rare. Mahalanobis distance catches exactly this joint oddity; per-axis z-scores miss it entirely. The diagram below shows the geometry: the red dot is comfortably within range on each axis, yet far off the correlation ellipse that defines what “together” looks like.
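
A small numeric check makes the joint-oddity point concrete. The sketch below assumes a hand-picked "normal" model (zero mean, unit variances, correlation 0.9) rather than one fitted from data, and compares two points with identical per-axis z-scores.

```python
import numpy as np

# Assumed "normal" model: zero mean, unit variances, correlation 0.9.
mu = np.zeros(2)
Sigma = np.array([[1.0, 0.9],
                  [0.9, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x, mu, Sigma_inv):
    """Mahalanobis distance of a single point x from the centre mu."""
    d = np.asarray(x, dtype=float) - mu
    return float(np.sqrt(d @ Sigma_inv @ d))

with_trend = [1.5, 1.5]      # both features high together: follows the correlation
against_trend = [1.5, -1.5]  # same per-axis size, but moves against the correlation

for p in (with_trend, against_trend):
    print(p, "per-axis |z| =", np.abs(p), "D_M =", round(mahalanobis(p, mu, Sigma_inv), 2))
# Both points have |z| = 1.5 on each axis, yet D_M is about 1.54 versus 6.71.
```

Per-axis thresholds treat the two points identically; only the distance that accounts for covariance separates them.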

For data that arrives over time, “normal” itself drifts, and a fixed $\mu$ goes stale within days. The fix is the EWMA (exponentially weighted moving average), a running estimate that weights recent points more heavily: $s_t = \alpha x_t + (1-\alpha)s_{t-1}$, where the smoothing factor $\alpha \in (0,1)$ controls how fast the model forgets. Track the smoothed mean alongside a smoothed variance, and raise an alarm when a new point $x_t$ falls outside a band such as $s_{t-1} \pm 3\sigma_{t-1}$, built only from data seen before that point. This single recurrence is the quiet backbone of streaming dashboards and security-operations alerting.

In words: the new running average is a blend (a slice $\alpha$ of the latest reading plus the rest from yesterday’s average), so old data fades away smoothly instead of being remembered forever. Also written: expanding the recurrence gives a geometric weighting $s_t = \alpha\sum_{i=0}^{t-1}(1-\alpha)^i x_{t-i} + (1-\alpha)^t s_0$, where the weight on a point $i$ steps in the past decays as $(1-\alpha)^i$ and the starting value $s_0$ keeps whatever weight is left over.

The band below tracks a live signal: a shaded $\pm 3\sigma$ corridor follows the drifting EWMA, normal jitter stays inside it, and the lone spike that pokes through the top is exactly what fires the alarm.

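import numpy as np

def ewma_detect(x, alpha=0.3, k=3.0, warmup=5):
    """Return indices where x leaves the running k-sigma EWMA band."""
    s, v, alerts = x[0], 0.0, []          # running mean, variance, hit-list
    for t in range(1, len(x)):
        diff = x[t] - s                    # deviation from the band centre s_{t-1}
        # test BEFORE updating; stay silent during warm-up, while v is still
        # near its 0 start and almost any deviation would look like an alarm
        if t >= warmup and abs(diff) > k*np.sqrt(v + 1e-9):
            alerts.append(t)
        s += alpha*diff                    # update smoothed mean
        v = (1-alpha)*(v + alpha*diff**2)  # update smoothed variance
    return alerts

# spike at index 5 is flagged; gradual drift is absorbed
print(ewma_detect(np.array([10,11,9,10,12,40,11,10.])))  # -> [5]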

In practice you rarely hand-roll the multivariate Gaussian — scikit-learn ships it as a robust covariance estimator that returns Mahalanobis distances directly, and it down-weights contamination in the “normal” set as it fits:

from sklearn.covariance import EllipticEnvelope
import numpy as np
X = np.vstack([np.random.randn(500, 2), [[6, 6]]])   # 500 normal + 1 outlier
ee = EllipticEnvelope(contamination=0.01, random_state=0).fit(X)
print(ee.predict(X)[-1])            # -1 -> the [6,6] point flagged
print(ee.mahalanobis(X[-1:]))       # large squared Mahalanobis distance

Warning

Z-scores assume a Gaussian. On skewed, heavy-tailed data — transaction amounts, packet sizes, file sizes — the “$3\sigma$” rule either fires constantly or never. Log-transform the feature first, or switch to robust statistics: use the median and the MAD (median absolute deviation) in place of the mean and $\sigma$, since a few extreme values cannot drag them around.

The robust version of the z-score is worth spelling out, because it is the one you should actually reach for on messy real-world features. Replace the mean with the median and $\sigma$ with a scaled MAD, $\text{MAD} = \text{median}(|x_i - \text{median}(x)|)$:

\[z_{\text{robust}} = \frac{x - \text{median}(x)}{1.4826 \cdot \text{MAD}}.\]

In words: the same “how many steps from the middle” idea, but using the median and the typical gap from the median — quantities a handful of wild values cannot hijack. Also written: since $1.4826 \cdot \text{MAD} \to \sigma$ for Gaussian data, this is just $z_{\text{robust}} = (x - \tilde{\mu})/\hat{\sigma}_{\text{MAD}}$ with robust drop-in estimates $\tilde{\mu}, \hat{\sigma}_{\text{MAD}}$.
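
To see why the robust version matters, compare both scores on a tiny sample with one wild value. This is an illustrative sketch; the helper names `classic_z` and `robust_z` are made up for the example.

```python
import numpy as np

def classic_z(x):
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def robust_z(x):
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return (x - med) / (1.4826 * mad)   # undefined when MAD is 0 (more than half the values identical)

x = [10, 11, 9, 10, 12, 90]
print(np.round(classic_z(x), 2))  # the 90 scores only about 2.2, below a cutoff of 3
print(np.round(robust_z(x), 2))   # the 90 scores about 53.6, unmistakable
```

The outlier inflates the mean and $\sigma$ enough to hide itself from the classic z-score, while the median and MAD barely move.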

27.3 — Distance and Density Methods: kNN, LOF, DBSCAN

When the data refuses to be neatly Gaussian, drop the distribution assumption entirely and let geometry do the talking. The governing intuition is simple: normal points huddle together in dense regions, while anomalies sit alone in sparse, lonely corners of feature space.

The most direct expression of that idea is the kNN distance. Score each point by the distance to its $k$-th nearest neighbor (or by the average distance to its $k$ neighbors). Normal points sit inside dense clusters with close neighbors and score low; outliers are far from everyone and score high. It is simple and surprisingly strong, but it has one blind spot — it assumes a single global density across the whole dataset.
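
A kNN-distance score takes only a few lines with scikit-learn's `NearestNeighbors`. The sketch below uses synthetic data and an arbitrary choice of $k = 5$; both are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((100, 2)), [[6.0, 6.0]]])  # 100 normal + 1 outlier

k = 5
# Ask for k+1 neighbours because each point's nearest neighbour is itself (distance 0).
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)
knn_score = dist[:, k]            # distance to the k-th true neighbour

# The [6, 6] point sits at index 100 and should carry the largest score.
print("most anomalous index:", int(np.argmax(knn_score)))
```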

That blind spot is exactly what LOF (Local Outlier Factor) repairs. Real data contains clusters of different densities: a sparse-but-legitimate region of large corporate transactions sitting beside a dense region of small retail ones. A single global distance threshold would wrongly condemn the entire sparse cluster. LOF sidesteps this by comparing each point’s local density to the local densities of its neighbors. Roughly,

\[\text{LOF}(p) = \frac{\text{average local density of } p\text{'s neighbors}}{\text{local density of } p}.\]

In words: divide how crowded your neighbors’ neighborhoods are by how crowded yours is — if your neighbors live in packed streets while you sit on an empty lane, the ratio shoots up and you are a local outlier. Also written: with local reachability density $\text{lrd}$, $\text{LOF}(p) = \dfrac{1}{|N_k(p)|}\sum_{o \in N_k(p)} \dfrac{\text{lrd}(o)}{\text{lrd}(p)}$, the mean of the neighbors’ density ratios.

A value of $\text{LOF}(p) \approx 1$ means “about as dense as my neighborhood” (normal), while $\text{LOF}(p) \gg 1$ means “I sit in a far sparser pocket than my neighbors do” (a local outlier). The picture below makes it concrete: the red point is an outlier relative to its own tight cluster, even though, in absolute distance, it is closer to that cluster than the loose cluster’s members are to one another.

The third tool, DBSCAN, is a clustering algorithm (Chapter 11) that moonlights as an outlier detector. It grows clusters outward from dense core points — those with at least min_samples neighbors inside a radius eps — and absorbs everything density-reachable from them. Whatever cannot join any dense region is left over and labeled noise (label $-1$). Those leftovers are your anomalies, delivered for free, with no need to specify the number of clusters in advance.
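
Reading anomalies off DBSCAN's noise label might look like the sketch below. The data is synthetic, and the `eps` and `min_samples` values are hand-picked for this toy cluster; real data needs its own tuning.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# One tight cluster plus a single far-away point (synthetic, for illustration).
X = np.vstack([0.3 * rng.standard_normal((200, 2)), [[6.0, 6.0]]])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
noise_idx = np.where(labels == -1)[0]     # label -1 marks points in no dense region
print("noise points:", noise_idx)         # includes index 200, the [6, 6] point
```

A few fringe points of the cluster may also come back as noise, which is a reminder that `eps` acts as the alert threshold here.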

from sklearn.neighbors import LocalOutlierFactor
import numpy as np
X = np.vstack([np.random.randn(100,2), [[6,6]]])  # 100 normal + 1 outlier
lof = LocalOutlierFactor(n_neighbors=20)
y = lof.fit_predict(X)          # -1 = outlier, 1 = inlier
print("flagged:", np.where(y==-1)[0])  # includes index 100, the [6,6] point

All three methods lean on a meaningful distance, and that is their Achilles’ heel. In high dimensions the curse of dimensionality flattens every pairwise distance toward the same value, so “far” stops meaning anything, and the naive algorithms scale as $O(n^2)$. The standard remedies are to standardize the features first and to reduce dimensionality before measuring distances (Chapter 7).

27.4 — Isolation Forest

Almost every method so far first models normality and then measures deviation from it. Isolation Forest flips that logic on its head with a beautifully lazy insight: anomalies are few and different, which means they are easy to isolate. So instead of describing normal and checking who departs from it, just see how little effort it takes to fence each point off by itself.

The mechanism is randomized partitioning. Pick a random feature, pick a random split value between its min and max, and repeat — each split carves the space into smaller boxes, building a random tree. A weird, isolated point gets fenced into its own box after only a handful of cuts; a normal point buried deep in a dense cluster needs many cuts before it is finally alone. The anomaly score is therefore the average path length (the number of splits) needed to isolate a point, averaged over many random trees. A short path means anomalous.

Raw path length is awkward to compare across datasets, because trees get deeper as you add more points. So we divide each point’s mean path length by $c(n)$, the typical path length in a tree of $n$ points, and feed the ratio through a negative power of 2. That turns it into a clean 0-to-1 score: near 1 means “isolated far faster than usual” (anomaly), near 0.5 means “took about the usual number of cuts” (normal).

The score has a tidy closed form. With $E[h(x)]$ the mean path length of point $x$ across the forest and $c(n)$ the expected path length in a tree of $n$ points,

\[s(x, n) = 2^{-\,\frac{E[h(x)]}{c(n)}}.\]

In words: turn “how few cuts did it take to isolate this point” into a 0-to-1 score — a very short path drives the exponent toward 0 and the score toward 1 (anomaly); a long path pushes the score toward 0.5 or below (normal). Also written: equivalently $s = 2^{-E[h(x)]/c(n)}$ with $c(n) = 2H(n-1) - \tfrac{2(n-1)}{n}$ and $H(i)$ the $i$-th harmonic number; $s \to 1$ as $E[h(x)] \to 0$ and $s \to 0$ as $E[h(x)] \to n-1$.
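
The score formula is easy to evaluate by hand. Here is a small sketch that computes $c(n)$ from the exact harmonic sum and plugs in a few path lengths; the function names are illustrative and this is not scikit-learn's internal code.

```python
import numpy as np

def c(n):
    """Normaliser c(n) = 2 H(n-1) - 2 (n-1) / n, with H the harmonic number (needs n >= 2)."""
    harmonic = np.sum(1.0 / np.arange(1, n))   # H(n-1) = 1 + 1/2 + ... + 1/(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def iforest_score(mean_path, n):
    """Anomaly score s = 2 ** (-E[h(x)] / c(n))."""
    return 2.0 ** (-mean_path / c(n))

n = 256
print(round(c(n), 2))              # about 10.25 for a 256-point subsample
print(iforest_score(2.0, n))       # short path: score well above 0.5
print(iforest_score(c(n), n))      # exactly typical path: score 0.5
print(iforest_score(n - 1, n))     # longest possible path: score near 0
```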

flowchart TD
    R[Random split on random feature] --> N1[normal point:<br/>still mixed in]
    R --> A1[anomaly:<br/>already alone ✂️]
    N1 --> N2[split again]
    N2 --> N3[split again...]
    N3 --> NN[isolated after many cuts<br/>= long path = normal]
    A1 --> AA[isolated in 1-2 cuts<br/>= short path = anomaly]

The animation below shows the core intuition in miniature: a single random slice (the sweeping dashed line) drops right between the dense cluster and the lone outlier, fencing the outlier off in one cut while the cluster still needs many more.

To see why it works, take the tiny dataset $\{10, 11, 9, 10, 12, \mathbf{90}\}$. A single random cut placed anywhere between 12 and 90 immediately isolates the 90 — path length 1 — whereas teasing the 9 apart from the surrounding 10s takes several lucky cuts in a row. Averaged over 100 trees, the 90 ends up with by far the shortest mean path, and so the highest anomaly score.
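
The tiny dataset can be simulated directly. The sketch below is a simplified one-dimensional stand-in for an isolation tree: it follows a single point through fresh random cuts, with no subsampling and no depth limit, so it illustrates the idea rather than reproducing the full algorithm.

```python
import numpy as np

def path_length(x, data, rng, depth=0):
    """Number of random cuts needed before x sits alone (1-D isolation tree)."""
    lo, hi = data.min(), data.max()
    if len(data) <= 1 or lo == hi:        # alone, or only duplicates left
        return depth
    split = rng.uniform(lo, hi)           # random cut between min and max
    side = data[data < split] if x < split else data[data >= split]
    return path_length(x, side, rng, depth + 1)

data = np.array([10.0, 11.0, 9.0, 10.0, 12.0, 90.0])
rng = np.random.default_rng(0)
n_trees = 100

mean_path = {
    float(x): np.mean([path_length(x, data, rng) for _ in range(n_trees)])
    for x in np.unique(data)
}
for value, length in mean_path.items():
    print(f"{value:5.1f}  mean path length = {length:.2f}")
# The 90 should show the shortest mean path of all, close to 1.
```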

The reason Isolation Forest is a genuine workhorse is operational, not just conceptual. It runs in linear time with low memory (each tree is built on a small subsample, 256 points by default), it needs no distance metric at all, and it handles high-dimensional data far more gracefully than LOF. For tabular anomaly detection it is the sensible default first try.

from sklearn.ensemble import IsolationForest
import numpy as np
X = np.vstack([np.random.randn(500,3), [[8,8,8]]])
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
print(iso.predict(X)[-1])      # -1 -> the [8,8,8] point flagged

Warning

The contamination parameter sets the expected fraction of anomalies and therefore the alert threshold. Set it wrong and you systematically over- or under-alert. Prefer ranking points by the raw score (score_samples) and then choosing the cutoff from validation data, rather than trusting a guessed contamination rate.
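
Ranking by raw score instead of trusting `contamination` could look like the following sketch. The alert budget of 5 is a hypothetical number, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.standard_normal((500, 3)), [[8.0, 8.0, 8.0]]])

iso = IsolationForest(random_state=0).fit(X)   # contamination left at its default
scores = -iso.score_samples(X)                 # flip sign so higher = more anomalous

alert_budget = 5                               # hypothetical: analysts can review 5 cases
top = np.argsort(scores)[::-1][:alert_budget]  # indices of the most anomalous points
print("review queue:", top)                    # includes index 500, the [8, 8, 8] point
```

With labeled validation data available, the cutoff would be chosen from those scores rather than from a fixed budget alone.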

27.5 — One-Class SVM

When you happen to have a clean sample of only normal data and want to wrap a tight boundary around it, the One-Class SVM is the classic instrument. It is the support-vector machine (Chapter 9) bent into a one-class world: rather than separating two labeled classes, it learns a single frontier that encloses the normal data, treating the origin as the lone “outlier” that everything else should be pushed away from.

Using the kernel trick — typically an RBF kernel — it maps the points into a high-dimensional space and finds the maximum-margin hyperplane that separates the data from the origin there. Back in the original feature space, that hyperplane corresponds to a flexible, possibly non-convex contour drawn snugly around the normal region. Any new point that lands outside the contour scores as an anomaly.

The key knob is $\nu \in (0,1]$, which plays a double role: it is an upper bound on the fraction of training points allowed to fall outside the boundary (treated as outliers or noise) and simultaneously a lower bound on the fraction of points retained as support vectors. A small $\nu$ forces the frontier to enclose nearly all of the training data; a large $\nu$ lets it shrink around the dense core and reject more points as anomalous.
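
The effect of $\nu$ is easiest to see by fitting the same data several times and counting how many training points end up outside the frontier. This is a sketch on synthetic data; the three $\nu$ values are arbitrary.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_norm = rng.standard_normal((400, 2))          # synthetic "normal" training data

rejected = {}
for nu in (0.01, 0.05, 0.20):
    oc = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X_norm)
    rejected[nu] = float(np.mean(oc.predict(X_norm) == -1))  # share left outside
    print(f"nu={nu:.2f}  training points outside the frontier: {rejected[nu]:.1%}")
# The rejected share should roughly track nu: the larger nu is, the more points are excluded.
```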

from sklearn.svm import OneClassSVM
import numpy as np
X_norm = np.random.randn(400, 2)                 # clean "normal" training data
oc = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_norm)
X_test = np.vstack([np.random.randn(5, 2), [[5, 5]]])
print(oc.predict(X_test))     # +1 = inside frontier, -1 = anomaly ([5,5])

The catch is practical. One-Class SVM is sensitive to the choice of kernel and the gamma setting, it scales poorly past tens of thousands of points (roughly $O(n^2)$ to $O(n^3)$), and it assumes the training data is genuinely clean — a few hidden anomalies in the “normal” set distort the boundary. On large or contaminated tabular data, Isolation Forest usually wins on speed and robustness; One-Class SVM shines on smaller, carefully curated sets of normal data. When you need the same boundary idea but at scale, sklearn.linear_model.SGDOneClassSVM gives a linear-time approximation that streams through the data.

Warning

The “clean normals only” promise quietly fails if your training set is not clean. Because a single hidden anomaly can pull the RBF boundary out to enclose it, One-Class SVM trained on lightly contaminated data learns a frontier loose enough to wave the next anomaly straight through. Scrub the training set, or raise $\nu$ to let the model treat some training points as outliers it is allowed to exclude.

27.6 — Reconstruction-Based Methods: Autoencoder Error

Once the data is high-dimensional or richly structured — images, windows of sensor readings, sequences of events — the geometry-based methods stall, and deep learning takes over. An autoencoder (Chapter 18) is a neural network trained to copy its input to its output through a deliberately narrow bottleneck. That squeeze is the whole point: forced to pass everything through a few numbers, the network can only succeed by learning the underlying regularities of the data and discarding the rest.

The trick that turns it into a detector is to train it only on normal data. It becomes an expert at reconstructing normal patterns and a klutz at everything else. Feed it an anomaly whose shape it has never seen, and it reconstructs the input badly. The reconstruction error $\lVert x - \hat{x}\rVert^2$ is therefore the anomaly score, and you raise an alert whenever it exceeds a threshold.

In words: measure how far the network’s redrawn version is from the original — small for the familiar normal shapes it was trained on, large for anything strange it has never learned to draw. Also written: $\lVert x - \hat{x}\rVert^2 = \sum_{j}(x_j - \hat{x}_j)^2$, the summed squared per-feature gap (the squared Euclidean distance between input and reconstruction).

flowchart LR
    X[input x] --> E[encoder] --> Z[bottleneck z] --> D[decoder] --> Xh[reconstruction x̂]
    Xh --> L["error = ‖x − x̂‖²<br/>large ⇒ anomaly"]

As a worked example, picture an autoencoder trained on vibration windows from a healthy machine. It learns to rebuild their smooth periodic shape with a reconstruction error of around 0.02. Now feed it a window from a cracked gear, which contains a sharp spike the network has never encountered. It smooths the spike away as if it were noise, the reconstruction misses badly, and the error jumps to 0.5 — far above a threshold set at, say, the 99th percentile of the errors seen on clean training data.

import numpy as np
# tiny linear autoencoder via PCA-style bottleneck (concept demo)
def recon_error(X, W):                 # W: orthonormal bottleneck basis
    Z = X @ W                          # encode into the subspace
    Xh = Z @ W.T                       # decode back out
    return np.sum((X-Xh)**2, axis=1)   # per-sample reconstruction error
# train on normal X_norm -> learn W (top principal components);
# score new points by error. Anomalies lie off the learned
# subspace and so reconstruct poorly, earning a high error.
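
To make the concept demo concrete, the basis $W$ can be learned with an SVD and used to score new points. The sketch below assumes synthetic normals lying near the line $y = x$, and it adds mean-centering to the reconstruction step, a detail the demo above leaves out.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic normals lying close to the line y = x (a 1-D subspace of 2-D space).
t = rng.standard_normal(200)
X_norm = np.column_stack([t, t]) + 0.05 * rng.standard_normal((200, 2))

mean = X_norm.mean(axis=0)
# Rows of Vt are the principal directions; keep the top one as the bottleneck basis.
_, _, Vt = np.linalg.svd(X_norm - mean, full_matrices=False)
W = Vt[:1].T                                   # shape (2, 1), orthonormal column

def recon_error(X, W, mean):
    Xc = X - mean                              # centre with the training mean
    Xh = (Xc @ W) @ W.T                        # encode, then decode
    return np.sum((Xc - Xh) ** 2, axis=1)

X_new = np.array([[1.0, 1.0],                  # on the learned line
                  [2.0, -2.0]])                # far off the line
print(recon_error(X_new, W, mean))             # small error, then a large one
```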

In practice you reach for a real deep autoencoder. Here is the idiomatic PyTorch version — train on normals, then threshold the per-sample error at a high percentile of clean validation errors:

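import torch, torch.nn as nn

# Synthetic stand-ins so the snippet runs end to end; replace them with your own
# standardized float tensors of shape (n_samples, 30).
torch.manual_seed(0)
X_norm, X_val = torch.randn(1000, 30), torch.randn(200, 30)     # normal-only train / validation
X_new = torch.cat([torch.randn(5, 30), 6 * torch.ones(1, 30)])  # last row is far off-distribution

ae = nn.Sequential(
    nn.Linear(30, 16), nn.ReLU(), nn.Linear(16, 4),   # encoder -> bottleneck
    nn.ReLU(), nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 30),  # decoder
)
opt, loss_fn = torch.optim.Adam(ae.parameters(), 1e-3), nn.MSELoss()

for _ in range(50):                       # train ONLY on normal data
    opt.zero_grad()
    xh = ae(X_norm)
    loss = loss_fn(xh, X_norm)
    loss.backward(); opt.step()

with torch.no_grad():                     # no gradients needed for scoring
    err = ((ae(X_new) - X_new) ** 2).mean(dim=1)                         # per-sample error
    thr = torch.quantile(((ae(X_val) - X_val) ** 2).mean(dim=1), 0.99)   # 99th percentile on clean validation
anomalies = err > thr                     # boolean mask of flagged points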

The same principle extends well beyond plain autoencoders: convolutional autoencoders handle images, LSTM autoencoders handle sequences (Chapter 16), and variational autoencoders replace the raw error with a probabilistic reconstruction likelihood, giving a more principled anomaly score.

Tip

The threshold matters as much as the model itself. Plot the distribution of reconstruction errors on a clean validation set and choose a percentile (the 99th, say) tied to how many alerts your team can realistically review per day. A perfect scorer with a badly placed threshold is still a bad detector.
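
Translating an alert budget into a percentile is simple arithmetic. The sketch below uses made-up numbers for daily volume and review capacity, and a synthetic array standing in for clean validation errors.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for reconstruction errors on a clean validation set (assumed data).
val_errors = rng.gamma(shape=2.0, scale=0.01, size=10_000)

daily_volume = 50_000        # hypothetical events scored per day
review_capacity = 100        # hypothetical alerts the team can review per day

alert_rate = review_capacity / daily_volume          # 0.002, so flag the top 0.2%
threshold = np.quantile(val_errors, 1.0 - alert_rate)

flagged_share = np.mean(val_errors > threshold)
print(f"threshold = {threshold:.4f}, share flagged on validation = {flagged_share:.4%}")
```

Here the budget implies the 99.8th percentile rather than the 99th; the right percentile falls out of the capacity numbers, not the other way round.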

27.7 — Supervised vs Unsupervised vs Semi-Supervised Framing

How much labeled data you actually have decides the entire approach, and there are three distinct regimes to recognize.

The supervised regime applies when you have labeled examples of both normals and anomalies. Here you train an ordinary classifier — often gradient-boosted trees (Chapter 10) — paired with imbalance handling such as class weights, focal loss, or careful resampling. This is the most accurate option when labels are plentiful and the anomaly types are stable, but it carries a fundamental blind spot: it can only recognize the kinds of fraud it has already seen, and is helpless against a genuinely novel attack pattern.

The unsupervised regime applies when you have no labels at all. You lean on the core assumption that anomalies are rare and different, and let the data reveal them through Isolation Forest, LOF, DBSCAN, or statistical scores. The upside is that no labels are needed; the downsides are that you must scrape together whatever ground truth you can for validation, and that “rare and different” will sometimes flag a perfectly benign novelty.

The semi-supervised regime is the sweet spot for fraud and intrusion. You train on a sample of normal-only data — which is abundant and cheap, and can usually be assumed clean — and then flag deviations from it. One-Class SVM and autoencoders live here. The appeal is that you need only clean normals, never labeled anomalies, and the approach naturally catches never-before-seen attacks because anything unlike normal is suspect.

Framing	Labels needed	Catches novel anomalies?	Typical tools
Supervised	Both classes	No (only known types)	XGBoost, neural nets + class weights
Unsupervised	None	Yes	Isolation Forest, LOF, DBSCAN
Semi-supervised	Normal only	Yes	One-Class SVM, autoencoders

In production you rarely commit to just one. A common and powerful pattern layers them: an unsupervised or semi-supervised detector catches the unknown unknowns, and its analyst-confirmed hits are fed back as fresh labels to a supervised model — a feedback loop that sharpens the system over time.

Tip

Rule of thumb for picking a regime: count your labels first. Both classes labeled and stable → supervised. Clean normals only → semi-supervised (it catches novel attacks for free). Nothing labeled → unsupervised, and start scraping together a small validation set, because you cannot tune a threshold you cannot measure.

27.8 — Handling Imbalance in the Supervised Layer

When you do have labels and train a classifier on them, the 0.1% positive rate fights you at every step, and there is a small, well-worn toolbox for pushing back. The intuition: a learner shown 999 honest transactions for every fraud will, left alone, decide the cheapest way to be “right” is to call everything honest. You have to make the rare class matter in the loss.

The first lever is class weights — tell the model that a mistake on the rare class costs many times more than a mistake on the common one. This is a one-line change in most frameworks (class_weight="balanced" in scikit-learn, scale_pos_weight in XGBoost) and is usually the first thing to try because it touches no data.

The second lever is resampling: either oversample the minority (duplicate or synthesize fraud cases) or undersample the majority (throw away honest cases). The best-known synthesizer is SMOTE, which invents new minority points by interpolating between a real one and its nearest minority neighbors, fattening the rare class without naive duplication. Crucially, resampling must happen inside cross-validation folds, never before the split — otherwise synthetic copies leak from train into test and the scores lie.

The third lever reshapes the loss itself. Focal loss multiplies the standard cross-entropy by a factor $(1-p_t)^\gamma$ that shrinks the contribution of easy, confidently-correct examples so the model spends its effort on the hard, rare ones:

\[\text{FL}(p_t) = -\,(1-p_t)^{\gamma}\,\log(p_t).\]

In words: start from ordinary log-loss, then dial down how much each easy example counts — the more confident and correct the prediction ($p_t$ near 1), the closer its weight falls to zero, leaving the hard rare cases to dominate the gradient. Also written: with $\gamma = 0$ it collapses back to plain cross-entropy $-\log(p_t)$; add a class-balancing weight $\alpha_t$ and it becomes $\text{FL} = -\alpha_t(1-p_t)^{\gamma}\log(p_t)$.
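
A few numbers show how strongly the modulating factor discounts easy examples. This is a plain NumPy sketch of the per-example loss, not a drop-in training loss for any particular framework.

```python
import numpy as np

def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """Per-example focal loss; p_t is the predicted probability of the true class."""
    p_t = np.clip(np.asarray(p_t, dtype=float), 1e-12, 1.0)   # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

p_t = np.array([0.9, 0.6, 0.1])            # easy, medium, hard example
ce = focal_loss(p_t, gamma=0.0)            # gamma = 0 is plain cross-entropy
fl = focal_loss(p_t, gamma=2.0)
print(np.round(ce, 4))                     # [0.1054 0.5108 2.3026]
print(np.round(fl, 4))                     # [0.0011 0.0817 1.8651]
```

With $\gamma = 2$ the easy example's loss shrinks by a factor of 100, while the hard one keeps about 81% of its cross-entropy.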

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.ensemble import HistGradientBoostingClassifier

# resample INSIDE the pipeline so it stays within each CV fold (no leakage)
pipe = Pipeline([
    ("smote", SMOTE(sampling_strategy=0.1, random_state=0)),  # fraud raised to 10% of the majority count
    ("clf", HistGradientBoostingClassifier(class_weight="balanced")),
])
pipe.fit(X_train, y_train)        # y_train: 0=normal, 1=fraud (highly imbalanced)

Warning

Do not chase a balanced training set as a goal in itself. Over-aggressive oversampling teaches the model that fraud is far more common than it really is, inflating false positives in production. Resample modestly (often to 5–10% positives, not 50%), and always tune the final decision threshold on data with the real imbalance, as covered in Section 27.10.

27.9 — Point vs Contextual vs Collective Anomalies

Anomalies come in three different kinds, and matching the detection method to the kind is half the battle — a tool tuned for one kind is often blind to the others.

A point anomaly is a single instance that is odd entirely on its own — a $50,000 charge on a card that has never once exceeded $500. This is the simplest case, and it is exactly what z-scores, Isolation Forest, and kNN target directly.

A contextual anomaly is a value that is perfectly normal in general but abnormal in its particular context, where the context is usually time or location. Spending $200 on heating is unremarkable in January but strange in July; a London login is fine for a London user but suspicious moments after that same user logged in from Tokyo. The raw value alone is innocent — only the context exposes it. You handle these by engineering context features (hour, season, location, the user’s own baseline) or by modeling per-context, for instance one EWMA per user or a seasonal baseline (Chapter 22).
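
Modeling per context can be as simple as keeping one baseline per context and scoring each value against its own. The spend histories below are invented for illustration.

```python
import numpy as np

# Hypothetical heating spend history per context (month), in dollars.
history = {
    "Jan": [180, 210, 195, 205, 190],
    "Jul": [20, 25, 15, 30, 22],
}

def contextual_z(value, context, history):
    """z-score of value against the baseline of its own context only."""
    past = np.asarray(history[context], dtype=float)
    return (value - past.mean()) / past.std()

for month in ("Jan", "Jul"):
    print(month, round(contextual_z(200, month, history), 1))
# The same 200-dollar bill is unremarkable in January (|z| < 1) and extreme in July (z > 30).
```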

A collective anomaly is a sequence or group that is anomalous taken together even though every individual member looks normal. A single network packet is fine, but a burst of thousands within one second is a DDoS attack; one small transfer is fine, but a rapid chain of them is structuring to dodge reporting limits. The oddity lives in the pattern, not in any one element, so you detect these with sequence models or windowed aggregates rather than per-point scores.

flowchart LR
    P[Point<br/>single odd value] --> M1[z-score, IForest, kNN]
    C[Contextual<br/>odd given time/place] --> M2[context features,<br/>EWMA, seasonal model]
    G[Collective<br/>odd as a sequence] --> M3[sequence models,<br/>windowed aggregates]

Warning

Pointing a point-anomaly detector at a collective problem fails silently. Every individual packet in a DDoS, and every small transfer in a structuring scheme, is normal in isolation, so a point detector sees nothing wrong and raises no alarm at all. Aggregate the data into windows first, then detect on the aggregates.
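
Windowed aggregation might be sketched as follows. The packet timestamps are synthetic, and the "10 times the median rate" rule is an assumed threshold chosen only to keep the example short.

```python
import numpy as np

# Synthetic packet arrival times in milliseconds: a steady 5 packets per second
# for one minute, plus a burst of 500 packets packed into second 30.
steady = np.arange(0, 60_000, 200)
burst = 30_000 + np.arange(0, 1_000, 2)
ts_ms = np.sort(np.concatenate([steady, burst]))

window = ts_ms // 1_000                          # one-second window index per packet
counts = np.bincount(window, minlength=60)       # packets per window

baseline = np.median(counts)                     # robust "typical" rate
flagged = np.where(counts > 10 * baseline)[0]    # assumed rule: 10x the typical rate
print("typical rate:", baseline, "flagged windows:", flagged)  # window 30 stands out
```

No individual packet carries any signal here; the anomaly exists only in the per-window count.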

27.10 — Evaluation Under Imbalance: PR-AUC and Threshold Setting

When positives are 0.1% of the data, the usual metrics quietly lie to you. Accuracy is worthless — predicting “all normal” scores 99.9%. Even ROC-AUC, normally a reliable summary, turns misleadingly rosy here: it averages over the false-positive rate, and when the negatives vastly outnumber the positives, a flood of false alarms barely moves the FPR even as it wrecks real-world usability.

The honest lens is precision and recall. Precision asks: of the alerts you raised, what fraction were truly fraud? It captures the cost of false alarms — wasted analyst time and blocked legitimate customers. Recall asks: of all the real fraud out there, what fraction did you catch? It captures the cost of misses — money walking out the door. The PR-AUC, the area under the precision–recall curve, summarizes this tradeoff across all thresholds, and crucially its baseline is the positive rate itself rather than a fixed 0.5 — which is exactly why it stays honest under extreme imbalance where ROC-AUC flatters.

The two quantities are worth writing down in terms of the raw counts of true positives (TP), false positives (FP), and false negatives (FN):

\[\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}.\]

In words: precision is “of everything I alerted on, how much was real”; recall is “of everything that was real, how much did I catch.” One guards against crying wolf, the other against sleeping through the fire. Also written: precision is the positive predictive value $P(\text{fraud}\mid\text{alert})$ and recall is the true-positive rate / sensitivity $P(\text{alert}\mid\text{fraud})$; their harmonic mean is $F_1 = \dfrac{2\,\text{Precision}\cdot\text{Recall}}{\text{Precision}+\text{Recall}}$.

Work a concrete example. Out of 100,000 transactions, 100 are fraud (0.1%). Model A flags 200 transactions, of which 80 turn out to be fraud:

Precision $= 80/200 = 0.40$, and Recall $= 80/100 = 0.80$.
$F_1 = 2 \cdot \dfrac{0.40 \cdot 0.80}{0.40 + 0.80} = 0.53$.
False positives $= 200 - 80 = 120$ — meaning 120 honest customers were inconvenienced to catch 80 fraudsters.
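The same arithmetic can be checked in a few lines of Python. The sketch below only restates the Model A numbers from the example; the false-positive rate is added as an extra quantity to show how small it stays even when most alerts are wrong.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from raw confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total, fraud, flagged, caught = 100_000, 100, 200, 80
tp = caught                    # flagged and truly fraud
fp = flagged - caught          # flagged but honest
fn = fraud - caught            # fraud that slipped through
tn = total - tp - fp - fn      # honest and not flagged

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
fpr = fp / (fp + tn)           # the quantity on the ROC curve's x-axis
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f} FPR={fpr:.4f}")
# precision=0.40 recall=0.80 F1=0.53 FPR=0.0012
```

An FPR of roughly 0.12% looks excellent on a ROC curve, yet 60% of the alerts are false, which is the gap between the two views that this section describes.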

Threshold setting is a business decision, never a default. The model emits a continuous score; you choose the cutoff that balances precision against recall given two real constraints — how many alerts your analysts can actually review, and the asymmetric costs of the two error types (a missed $10,000 fraud versus one annoyed customer). Fix the threshold on a validation set, monitor it in production, and move it as conditions shift.
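One way to make those two constraints concrete is to search the candidate thresholds for the cheapest one that fits the review budget. The sketch below is illustrative only: the scores, labels, per-error costs, and the budget of four alerts are hypothetical values, and `pick_threshold` is a helper written for this example.

```python
def pick_threshold(scores, labels, cost_fn, cost_fp, max_alerts):
    """Return (threshold, cost) minimizing total cost with at most max_alerts alerts.

    An event is alerted when its score is >= threshold.
    Returns None if no threshold fits the budget.
    """
    best = None
    for thr in sorted(set(scores)):
        alerts = [s >= thr for s in scores]
        if sum(alerts) > max_alerts:
            continue  # over the analysts' review budget
        fp = sum(1 for a, y in zip(alerts, labels) if a and not y)
        fn = sum(1 for a, y in zip(alerts, labels) if y and not a)
        cost = cost_fn * fn + cost_fp * fp
        if best is None or cost < best[1]:
            best = (thr, cost)
    return best

# Hypothetical validation scores (higher = more anomalous) and fraud labels.
scores = [0.05, 0.10, 0.20, 0.30, 0.55, 0.60, 0.70, 0.80, 0.90, 0.95]
labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1]

threshold, cost = pick_threshold(scores, labels, cost_fn=500, cost_fp=10, max_alerts=4)
print(threshold, cost)   # 0.8 510
```

Without the budget, a cutoff of 0.55 would catch every fraud at a cost of only 30. Capping the queue at four alerts forces the cutoff up to 0.80 and accepts one missed fraud, which is the sense in which capacity, not statistics alone, sets the threshold.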

In code, you rank by score and read the curve directly, then pick the operating point your constraints allow:

import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# model is a fitted detector whose score_samples is LOWER for anomalies (e.g. IsolationForest)
scores = -model.score_samples(X_val)         # negate so higher = more anomalous
print("PR-AUC:", average_precision_score(y_val, scores))   # honest summary

prec, rec, thr = precision_recall_curve(y_val, scores)
# prec and rec have one more entry than thr; drop the final (recall = 0) point
ok = prec[:-1] >= 0.40                       # precision floor set by analyst capacity
if ok.any():
    chosen = thr[ok][np.argmax(rec[:-1][ok])]    # among those, maximize recall
    print("operating threshold:", chosen)

Tip

Report PR-AUC and the full precision–recall curve, not a single accuracy number. Then state the operating point in plain words: “at this threshold we catch 80% of fraud at 40% precision, costing about 120 false alerts per 100k transactions.” That sentence is what the business actually decides on.

27.11 — Streaming and Online Detection

Fraud and intrusion happen now, and a model that scores yesterday’s data in a nightly batch simply lets the money walk out the door. Streaming (online) detection scores each event the instant it arrives, in milliseconds, while continuously updating its sense of what normal looks like.

Three demands separate the streaming setting from comfortable batch work. The first is low latency: you must score the transaction before it is authorized, not after the fact. The second is bounded memory: you cannot store the entire history, so you keep compact running summaries — EWMA states, probabilistic sketches, sliding windows — rather than the full dataset. The third is concept drift: “normal” itself evolves with holiday spending, new products, and fresh attack tactics, so the model must adapt or it slowly rots. EWMA forgets old data by construction, and tree- or forest-based online variants periodically refresh themselves to keep pace.
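A minimal sketch of such a running summary is an EWMA mean and variance that scores each event before folding it in. The class name, the forgetting rate of 0.1, the warm-up length, and the cutoff of 4 are assumptions for illustration; the variance recursion is the standard exponentially weighted form.

```python
import math

class EwmaDetector:
    """Running mean/variance with exponential forgetting; O(1) memory."""

    def __init__(self, alpha=0.1, warmup=10):
        self.alpha = alpha      # forgetting rate: larger adapts faster
        self.warmup = warmup    # events to observe before scoring starts
        self.n = 0
        self.mean = 0.0
        self.var = 0.0

    def score(self, x):
        """Absolute z-score of x against the current baseline (0 during warm-up)."""
        if self.n < self.warmup or self.var == 0.0:
            return 0.0
        return abs(x - self.mean) / math.sqrt(self.var)

    def update(self, x):
        """Fold x into the running mean and variance."""
        if self.n == 0:
            self.mean = x
        else:
            diff = x - self.mean
            self.mean += self.alpha * diff
            self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        self.n += 1

detector = EwmaDetector(alpha=0.1)
# Hypothetical transaction amounts with one extreme value at index 11.
stream = [20.0, 22.0, 19.0, 21.0, 20.0, 23.0, 18.0, 21.0, 20.0, 22.0, 19.0, 400.0, 21.0]
alerts = []
for i, amount in enumerate(stream):
    if detector.score(amount) > 4.0:   # score first, against the old baseline
        alerts.append(i)
    detector.update(amount)            # then learn
print(alerts)   # [11]
```

The detector holds three numbers regardless of how long the stream runs. Note that it also absorbs the 400.0 into its baseline, which inflates the variance for the events that follow.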

flowchart LR
    S[event stream] --> F[extract features<br/>+ running stats]
    F --> M[online scorer<br/>EWMA / online IForest]
    M --> T{score &gt; threshold?}
    T -->|yes| AL[alert / block / step-up auth]
    T -->|no| OK[allow]
    AL --> FB[analyst feedback]
    OK --> U[update running stats]
    FB --> U
    U -.adapt to drift.-> M

A modern streaming detector worth knowing by name is Half-Space Trees (and the closely related Robust Random Cut Forest used in AWS), an online cousin of Isolation Forest that maintains its random trees over a sliding window and updates them as events flow past. The river library — the de-facto Python toolkit for online ML — exposes it with a one-sample-at-a-time API that mirrors how the data actually arrives:

from river import anomaly, preprocessing

model = preprocessing.MinMaxScaler() | anomaly.HalfSpaceTrees(seed=0)
for x in transaction_stream:          # x is a dict of features, one event
    score = model.score_one(x)        # anomaly score; higher = more anomalous
    if score > 0.7:                   # illustrative cutoff; tune on validation data
        raise_alert(x)                # keep out of training until an analyst clears it
    else:
        model.learn_one(x)            # update the baseline only with unflagged events

A pragmatic production architecture pairs two layers. A fast streaming layer blocks the obviously bad and challenges the merely suspicious with step-up authentication, such as a one-time passcode. A slower batch layer then retrains overnight on richer features and confirmed labels. This split between a real-time speed layer and a thorough batch layer is the classic lambda architecture applied to detection.

Warning

Online adaptation cuts both ways. If you let the model quietly absorb fraudulent traffic into its idea of “normal,” a slow, patient attacker can poison the baseline until the attack looks ordinary and slips through unflagged. Gate the updates on confirmed labels, and cap how fast the baseline is allowed to move.
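Both safeguards fit in a single update rule. The following is a sketch under stated assumptions: `guarded_update` is a hypothetical helper, and the forgetting rate and the cap of 2.0 per event are illustrative values, not recommendations.

```python
def guarded_update(mean, x, *, confirmed_normal, alpha=0.1, max_step=2.0):
    """EWMA mean update that resists baseline poisoning.

    - Label gate: events not confirmed normal leave the baseline untouched.
    - Rate cap: one event can move the baseline by at most max_step.
    """
    if not confirmed_normal:
        return mean
    step = alpha * (x - mean)
    step = max(-max_step, min(max_step, step))
    return mean + step

baseline = 20.0
# A 400.0 charge that analysts confirmed as fraud: ignored entirely.
after_fraud = guarded_update(baseline, 400.0, confirmed_normal=False)
# The same value wrongly marked normal: the cap limits the damage to one step.
after_mislabel = guarded_update(baseline, 400.0, confirmed_normal=True)
print(after_fraud, after_mislabel)   # 20.0 22.0
```

An uncapped EWMA with the same forgetting rate would have jumped from 20 to 58 on that one mislabeled event. With the cap, an attacker needs many accepted events to shift the baseline, which gives monitoring time to notice.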

27.12 — A Worked Fraud Example and the Security Tie-In

To pull the chapter together, walk a concrete card-fraud pipeline from raw transaction to action, and then see how the very same machinery powers security operations.

It begins with data and features. Each transaction carries an amount, a merchant category, a timestamp, and a location, on top of which you engineer the context that actually carries the signal: the amount versus the user’s 30-day mean, the time since their last transaction, the distance from their last location, the count of transactions in the past hour. The bulk of the effort lives here, because good features beat fancy models almost every time.
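Three of those context features can be computed from a small per-user history, as in the sketch below. The `UserContext` class, its feature names, and the sample transactions are hypothetical; the distance-from-last-location feature is left out for brevity.

```python
from collections import deque

class UserContext:
    """Per-user running context that keeps only the last 30 days of history."""

    DAY = 86_400  # seconds in a day

    def __init__(self):
        self.history = deque()  # (timestamp_seconds, amount), oldest first

    def features(self, ts, amount):
        """Context features for a new transaction, computed before it is recorded."""
        # Drop transactions older than 30 days to bound memory.
        while self.history and ts - self.history[0][0] > 30 * self.DAY:
            self.history.popleft()
        amounts = [a for _, a in self.history]
        mean_30d = sum(amounts) / len(amounts) if amounts else None
        return {
            "amount_vs_30d_mean": amount / mean_30d if mean_30d else None,
            "secs_since_last": ts - self.history[-1][0] if self.history else None,
            "tx_count_last_hour": sum(1 for t, _ in self.history if ts - t <= 3600),
        }

    def record(self, ts, amount):
        self.history.append((ts, amount))

ctx = UserContext()
# Hypothetical history: one 50.00 purchase per day for 10 days.
for day in range(10):
    ctx.record(day * UserContext.DAY, 50.0)

# A 600.00 charge arriving 20 minutes after the last purchase.
feats = ctx.features(9 * UserContext.DAY + 1200, 600.0)
print(feats)
# {'amount_vs_30d_mean': 12.0, 'secs_since_last': 1200, 'tx_count_last_hour': 1}
```

The raw amount of 600 says little on its own. The ratio of 12 against this user's own baseline is what a downstream model can act on.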

Next comes the model stack. An unsupervised Isolation Forest scores every transaction in real time, needing no labels and catching novel fraud. In parallel, a supervised gradient-boosted model trained on confirmed chargebacks scores the patterns you already understand well. You combine the two scores into a single risk number.

Then the threshold and action, tiered by that risk: a low score is allowed through, a medium score triggers step-up authentication such as a one-time passcode, and a high score blocks the transaction and queues it for analyst review. Crucially, every confirmed outcome — analyst verdicts and chargebacks alike — flows back as a label to retrain the supervised model overnight, closing the loop.
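The combine-and-tier step might look like the sketch below. It assumes both scores have already been rescaled to the range 0 to 1 (a raw Isolation Forest score would need that conversion first); the blend weight and the two tier cutoffs are placeholder values that a real system would tune on validation data.

```python
def risk_tier(unsup_score, sup_prob, w_unsup=0.4, low=0.3, high=0.7):
    """Blend two scores in [0, 1] into one risk number and map it to an action."""
    risk = w_unsup * unsup_score + (1 - w_unsup) * sup_prob
    if risk >= high:
        action = "block_and_review"
    elif risk >= low:
        action = "step_up_auth"
    else:
        action = "approve"
    return risk, action

# Hypothetical (unsupervised score, supervised probability) pairs.
for u, s in [(0.1, 0.05), (0.8, 0.2), (0.9, 0.95)]:
    risk, action = risk_tier(u, s)
    print(f"{risk:.2f} {action}")
# 0.07 approve
# 0.44 step_up_auth
# 0.93 block_and_review
```

The middle case shows why the unsupervised layer earns its place: the supervised model sees little risk, but an unusual transaction still triggers a step-up challenge.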

flowchart TD
    TX[transaction] --> FE[feature engineering<br/>amount-vs-baseline, velocity, geo]
    FE --> U[unsupervised<br/>Isolation Forest]
    FE --> S[supervised<br/>gradient boosting]
    U --> C[combine scores]
    S --> C
    C --> D{risk tier}
    D -->|low| OK[approve]
    D -->|medium| OTP[step-up auth]
    D -->|high| BLK[block + review]
    BLK --> LAB[confirmed labels]
    LAB -.nightly retrain.-> S

The security tie-in is that this exact machinery underpins cybersecurity detection, only the features change. Intrusion detection (a network IDS) flags anomalous traffic — unusual ports, payload sizes, or connection rates, which is frequently a collective anomaly. UEBA (User and Entity Behavior Analytics) builds a per-user behavioral baseline and flags departures from it: impossible travel (two logins on different continents minutes apart, a textbook contextual anomaly), sudden privilege escalations, or off-hours bulk data access — the classic signature of a compromised account or a malicious insider. Malware and DGA detection scores process behavior or the character entropy of domain names. In a security operations center, all of these scores flow into a SIEM that correlates and ranks the alerts for human triage.
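The impossible-travel check reduces to a speed calculation between consecutive logins. The sketch below uses the haversine great-circle distance; the 1,000 km/h ceiling (roughly airliner speed) and the login tuples are assumptions for illustration, and real systems also have to allow for VPNs and imprecise IP geolocation.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))   # 6371 km = mean Earth radius

def impossible_travel(login_a, login_b, max_kmh=1000.0):
    """True if the speed implied by two logins exceeds max_kmh.

    Each login is a tuple (timestamp_seconds, latitude, longitude).
    """
    (t1, lat1, lon1), (t2, lat2, lon2) = login_a, login_b
    hours = abs(t2 - t1) / 3600
    km = haversine_km(lat1, lon1, lat2, lon2)
    if hours == 0:
        return km > 0
    return km / hours > max_kmh

london = (0, 51.5074, -0.1278)
new_york_10_min_later = (600, 40.7128, -74.0060)
new_york_next_day = (86_400, 40.7128, -74.0060)
print(impossible_travel(london, new_york_10_min_later))   # True
print(impossible_travel(london, new_york_next_day))       # False
```

Neither login is odd in isolation. Only the pair, read against the time between them, is anomalous, which is what makes this a contextual check.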

The deep parallel is that fraud and intrusion are both adversarial anomaly detection. The adversary actively studies your detector and adapts to it, so static models decay and a constant feedback loop — confirmed cases sharpening the next model — is not a luxury but a condition of survival.

Tip

Whether the label is “fraud” or “intruder,” the recipe is identical: engineer behavioral and contextual features, baseline what normal looks like, score the deviations, tier the response by risk, and feed confirmed outcomes back into the model.

27.13 — Calibrated Thresholds and Conformal Anomaly Detection

Every method so far emits a raw score — a Mahalanobis distance, a path-length score, a reconstruction error — but a raw score is not a guarantee. The intuition: a number like “0.83” tells you this point looks weird, but it does not promise “at most 1% of normal traffic will ever score this high.” Turning a score into a controlled false-alarm rate is the job of calibration, and the cleanest tool for it is conformal anomaly detection.

The idea is disarmingly simple. Hold out a clean calibration set of known-normal points and record their anomaly scores. For a new point with score $s$, the conformal p-value is the fraction of calibration scores at least as extreme:

\[p(x) = \frac{1 + |\{i : s_i \ge s(x)\}|}{n + 1}.\]

In words: ask “out of my pile of normal examples, what fraction looked at least this anomalous?” — if almost none did, the p-value is tiny and the point is a confident anomaly. Also written: $p(x) = \dfrac{1 + \#\{\text{calibration scores} \ge s(x)\}}{n+1}$, the empirical right-tail rank of the new score among $n$ calibration scores, smoothed by the $+1$ so it can never be exactly zero.

As a worked example, suppose your calibration set has $n = 99$ clean scores and a new transaction scores higher than all but two of them, so exactly $2$ calibration scores are at least as extreme as the new one. Then $p(x) = (1 + 2)/(99 + 1) = 0.03$. At $\alpha = 0.01$ this point is not flagged ($0.03 > 0.01$); at $\alpha = 0.05$ it is. The same number, different alert budgets. The $+1$ smoothing also means the smallest possible p-value here is $1/100 = 0.01$, never zero.
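The calculation is short enough to verify directly. In the sketch below the calibration scores are simply the integers 1 to 99, a stand-in chosen so that exactly two of them are at least as extreme as the new score.

```python
def conformal_p_value(cal_scores, new_score):
    """Right-tail rank of new_score among calibration scores, with +1 smoothing."""
    at_least_as_extreme = sum(s >= new_score for s in cal_scores)
    return (1 + at_least_as_extreme) / (len(cal_scores) + 1)

# Hypothetical calibration scores 1..99 (higher = more anomalous).
cal_scores = list(range(1, 100))
new_score = 97.5   # only 98 and 99 are at least as extreme

p = conformal_p_value(cal_scores, new_score)
print(p, p <= 0.01, p <= 0.05)                # 0.03 False True
print(conformal_p_value(cal_scores, 1000))    # 0.01, the floor for n = 99
```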

The payoff is a distribution-free guarantee: if you flag every point with $p(x) \le \alpha$, then a genuinely normal point is flagged with probability at most $\alpha$, whatever the underlying score distribution. There is no Gaussian assumption and no tuning; the one requirement is that new normal points are exchangeable with the calibration set, that is, drawn from the same “normal” process. Set $\alpha = 0.01$ and you have capped your false-positive rate at 1% by construction, which is exactly the knob a security team wants when its alert budget is fixed.

import numpy as np
# scores on a held-out CLEAN calibration set, and on new points
cal = model.score_samples(X_cal) * -1        # higher = more anomalous
new = model.score_samples(X_new) * -1
# conformal p-value: right-tail rank of each new score among calibration scores
p = (1 + (cal[None, :] >= new[:, None]).sum(axis=1)) / (len(cal) + 1)
flagged = p <= 0.01                          # guaranteed <= 1% false-alarm rate
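A small simulation makes the guarantee tangible. The sketch below draws normal scores from an exponential distribution (chosen only because it is skewed and clearly non-Gaussian), calibrates on 999 of them, and measures how often fresh normal points are flagged at $\alpha = 0.01$. The sample sizes and the seed are arbitrary.

```python
import random
from bisect import bisect_left

def false_alarm_rate(rng, alpha=0.01, n_cal=999, n_new=2000):
    """Fraction of NORMAL points flagged at level alpha, for one fresh calibration set."""
    # Exponential scores: skewed and clearly non-Gaussian.
    cal = sorted(rng.expovariate(1.0) for _ in range(n_cal))
    flagged = 0
    for _ in range(n_new):
        s = rng.expovariate(1.0)                  # a new normal point
        at_least = n_cal - bisect_left(cal, s)    # calibration scores >= s
        p = (1 + at_least) / (n_cal + 1)
        flagged += p <= alpha
    return flagged / n_new

rng = random.Random(0)
rates = [false_alarm_rate(rng) for _ in range(50)]
average_rate = sum(rates) / len(rates)
print(f"average false-alarm rate: {average_rate:.4f}")
```

The average lands close to 1% without the code ever using the shape of the score distribution. Individual calibration sets come out slightly above or below that figure, because the guarantee holds on average over calibration draws, so a larger calibration set gives a tighter rate.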

Tip

Conformal calibration cleanly separates the two decisions you were silently mixing before: the model decides how to score, and $\alpha$ decides how aggressive to be, with a real false-positive guarantee attached. The guarantee holds only while the calibration set still represents current normal traffic. When normal drifts, the stored calibration scores go stale, so refresh the calibration set periodically to keep the rate honest.

27.14 — Drift Detection and Active Learning

Two slow leaks sink anomaly systems that look healthy on launch day, and both deserve their own machinery. The intuition: the world moves underneath a frozen detector (today’s “normal” is not last quarter’s), and the labels you need to improve it are scarce and expensive — so you must both notice when the ground shifts and spend your few labels wisely.

The first leak is concept drift: the distribution of normal behavior changes (a holiday spending surge, a new product line, a migrated data pipeline) and a detector tuned to the old normal starts crying wolf, or goes quiet. You catch it by watching the score distribution itself, not the rare labels. A practical, label-free monitor compares the recent window of anomaly scores against a reference window, using either a two-sample test such as Kolmogorov–Smirnov or a divergence measure such as the population stability index (PSI); a significant shift means “normal has moved, re-baseline.” This is cheaper and faster than waiting for chargebacks to reveal that recall collapsed weeks ago.

from scipy.stats import ks_2samp
# reference vs. recent windows of the detector's own scores (no labels needed)
stat, pval = ks_2samp(ref_scores, recent_scores)
if pval < 0.01:                  # score distribution shifted -> normal drifted
    trigger_rebaseline()         # refit calibration quantile / retrain detector
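PSI, the other monitor mentioned above, has no SciPy one-liner but is simple to compute by hand. The sketch below bins both windows at the reference quantiles and sums $(q_i - p_i)\ln(q_i/p_i)$ over the bins. The bin count, the small floor `eps`, and the synthetic score windows are assumptions; reading a PSI above about 0.25 as a major shift is a common rule of thumb rather than a formal test.

```python
import math

def psi(reference, recent, n_bins=10, eps=1e-6):
    """Population stability index between two samples of scores.

    Bin edges are quantiles of the reference sample; eps floors empty bins
    so the logarithm stays finite.
    """
    ref = sorted(reference)
    edges = [ref[int(len(ref) * k / n_bins)] for k in range(1, n_bins)]

    def proportions(sample):
        counts = [0] * n_bins
        for s in sample:
            counts[sum(s > e for e in edges)] += 1   # bin = number of edges below s
        return [max(c / len(sample), eps) for c in counts]

    p_ref, p_new = proportions(reference), proportions(recent)
    return sum((q - p) * math.log(q / p) for p, q in zip(p_ref, p_new))

# Hypothetical score windows.
ref_scores = [i / 1000 for i in range(1000)]                       # reference window
stable_scores = [i / 500 for i in range(500)]                      # same shape, fewer points
shifted_scores = [min(1.0, 0.3 + i / 1000) for i in range(1000)]   # mass pushed upward

psi_stable = psi(ref_scores, stable_scores)
psi_shifted = psi(ref_scores, shifted_scores)
print(psi_stable < 0.1, psi_shifted > 0.25)   # True True
```

Unlike the KS p-value, PSI does not shrink toward zero just because the windows are large, which makes it a steadier dashboard number for high-volume streams.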

The second leak is the label famine: confirmed anomalies trickle in slowly, so you cannot afford to label at random. Active learning closes the gap by sending the most informative cases to your analysts — and the natural choice in detection is the band of points sitting right at the decision boundary, where the model is least sure. Each analyst verdict on a borderline case sharpens the threshold and feeds the supervised layer far faster than labeling obvious cases would.

flowchart LR
    SC[score every event] --> B{near threshold?<br/>uncertain}
    B -->|yes| Q[queue for analyst<br/>active-learning sample]
    B -->|clearly low| OK[allow]
    B -->|clearly high| BL[block + review]
    Q --> LB[label]
    BL --> LB
    LB -.retrain / re-calibrate.-> SC
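The boundary-band selection described in this section is a form of uncertainty sampling, and its simplest version is a sort by distance to the threshold. In the sketch below, `select_for_labeling`, the scores, the threshold of 0.5, and the budget of three labels are all illustrative.

```python
def select_for_labeling(scores, threshold, budget):
    """Indices of the `budget` events whose scores sit closest to the threshold."""
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i] - threshold))
    return ranked[:budget]

# Hypothetical anomaly scores for one batch of events.
scores = [0.02, 0.48, 0.97, 0.55, 0.10, 0.51, 0.85, 0.30]
queue = select_for_labeling(scores, threshold=0.5, budget=3)
print(queue)   # [5, 1, 3] -> the events scored 0.51, 0.48, 0.55
```

The event scored 0.97 is almost certainly fraud and the one scored 0.02 almost certainly fine, so an analyst verdict on either would teach the model little.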

The two mechanisms reinforce each other: drift detection tells you when the model has gone stale, and active learning gives you the cheapest path to refreshing it — a label budget spent on the cases that move the boundary most. Together they turn a static detector into one that survives an adversary who is, by design, always changing the game.

Warning

Adapting to drift and resisting baseline poisoning (Section 27.11) pull in opposite directions: you want the model to adapt to benign drift but resist adversarial drift. Gate adaptation on confirmed-normal data and cap the update rate, so a patient attacker cannot smuggle the new “normal” past your KS test one small step at a time.

27.15 — Explaining and Acting on Alerts

A detector that fires without saying why is operationally useless: an analyst staring at a flagged transaction needs a reason before they can confirm, dispute, or escalate it, and a customer whose card is declined deserves a defensible explanation. The intuition is that a score is a verdict, but an investigation needs evidence — so a production system must turn each high score into a short, human-readable story of which features drove it.

The practical tool is per-prediction attribution. For supervised models, SHAP values decompose a single score into additive per-feature contributions, so an alert arrives annotated: “+0.4 from amount being 12× the 30-day mean, +0.3 from a new device, +0.2 from impossible travel.” For the unsupervised side, the same idea applies more crudely — report which features had the largest standardized deviation, or for Isolation Forest, which splits isolated the point fastest.

import shap
explainer = shap.TreeExplainer(supervised_model)     # gradient-boosted fraud model
sv = explainer.shap_values(flagged_tx)               # one row = one alert
top = sorted(zip(feature_names, sv[0]), key=lambda p: -abs(p[1]))[:3]
print("alert driven by:", top)   # e.g. [('amount_vs_baseline', 0.41), ...]
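For the unsupervised side, the "largest standardized deviation" report can be sketched without any attribution library. The helper below ranks features by robust z-score (median and MAD) against a reference set of known-normal rows; the function name, the feature names, and the sample rows are hypothetical.

```python
from statistics import median

def top_deviations(row, reference_rows, feature_names, k=3):
    """Rank the features of `row` by robust z-score against known-normal rows."""
    out = []
    for j, name in enumerate(feature_names):
        col = [r[j] for r in reference_rows]
        med = median(col)
        mad = median(abs(v - med) for v in col)
        scale = 1.4826 * mad or 1.0   # avoid dividing by zero on constant features
        out.append((name, (row[j] - med) / scale))
    return sorted(out, key=lambda p: -abs(p[1]))[:k]

feature_names = ["amount_vs_baseline", "secs_since_last", "tx_last_hour"]
# Hypothetical normal transactions for one customer segment.
reference = [[1.0, 3600, 1], [0.8, 7200, 0], [1.2, 5400, 1], [0.9, 4000, 2], [1.1, 6000, 1]]
flagged_row = [12.0, 90, 6]

explanation = top_deviations(flagged_row, reference, feature_names)
for name, z in explanation:
    print(f"{name}: {z:+.1f}")
```

This is cruder than SHAP because it ignores interactions between features, but it needs no labels and gives an analyst a first place to look.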

Beyond the single alert, two operational realities shape the whole system. The first is the alert budget: analysts can review only so many cases a day, so the threshold (Section 27.10) is set as much by review capacity as by statistics, and alerts are ranked so the riskiest reach a human first. The second is the action tier itself — not every alert should block. A graded response (allow, step-up authentication, hold for review, hard block) lets the system act proportionally to the score and to the cost of being wrong, sparing the honest customer a declined card while still stopping the obvious fraud.

Tip

Treat explanations as part of the model’s output, not an afterthought. An alert shipped with its top three driving features is one an analyst can action in seconds; a bare score is one they must reverse-engineer — and under a tight alert budget, that difference decides whether the system is usable at all.

27.16 — Quick reference

Term / method	What it is	When / why to use it
z-score $z=(x-\mu)/\sigma$	Standard deviations from the mean; flag $\lvert z\rvert>3$	One roughly Gaussian feature; fastest possible alarm
Robust z / MAD	Median + scaled MAD in place of mean + $\sigma$	Skewed, heavy-tailed features where a few extremes wreck $\sigma$
Mahalanobis $D_M$	Distance after whitening by $\Sigma^{-1}$	Correlated features; catches joint oddities per-axis z misses
EWMA $s_t=\alpha x_t+(1-\alpha)s_{t-1}$	Running mean/variance that forgets old data	Streaming signals where “normal” drifts over time
kNN distance	Distance to $k$-th nearest neighbor	Simple geometric outliers under a single global density
LOF	Local density ratio vs. neighbors	Clusters of different densities; finds local outliers
DBSCAN	Density clustering; noise points = anomalies	Outliers fall out of the clustering for free; no cluster count needed
Isolation Forest $s=2^{-E[h(x)]/c(n)}$	Short random-tree path length = anomaly	Default first try on tabular data; linear time, no metric
One-Class SVM ($\nu$)	RBF frontier wrapped around normals	Small, clean, normal-only sets; catches novel anomalies
Autoencoder error $\lVert x-\hat x\rVert^2$	Reconstruction error after training on normals	High-dim / structured data (images, sequences)
SMOTE + class weights / focal loss	Imbalance handling in the supervised layer	When both classes are labeled but positives are rare
Point / contextual / collective	Odd alone / odd-in-context / odd-as-sequence	Match detector to the anomaly type before scoring
PR-AUC, precision, recall	Honest metrics under extreme imbalance	Always judge by these, never accuracy or ROC-AUC
Conformal p-value $p=\frac{1+\#\{s_i\ge s\}}{n+1}$	Right-tail rank vs. clean calibration scores	Distribution-free guarantee: flag $p\le\alpha$, FP rate $\le\alpha$
KS test / PSI on scores	Two-sample drift check on the score distribution	Label-free trigger to re-baseline before recall collapses
Active learning	Query the points nearest the boundary	Spend a scarce label budget where it sharpens the threshold
SHAP / top deviations	Per-feature attribution for one alert	Turn a bare score into an analyst-actionable story

27.17 — Key takeaways

Anomaly detection means modeling normal and measuring deviation — reach for it when positives are rare, undefined, or unlabeled, precisely where ordinary classification breaks.
Statistical methods (z-score, Gaussian/Mahalanobis, EWMA) are simple, fast, and often enough; log-transform or use robust stats (median/MAD) on skewed, heavy-tailed data.
Distance and density methods (kNN, LOF, DBSCAN) catch geometric and local-density outliers but scale poorly and degrade in high dimensions.
Isolation Forest is the fast, robust default for tabular data — anomalies are easy to isolate, so a short random-tree path length flags them.
One-Class SVM and autoencoder reconstruction error are semi-supervised: train on clean normals, flag whatever fits or reconstructs poorly — and both catch novel anomalies.
When labels exist, handle imbalance in the supervised layer with class weights, modest resampling (SMOTE inside the CV fold), or focal loss — never chase a 50/50 training set.
Match the type to the method: point (single odd value), contextual (odd given time or place), collective (odd as a sequence) — and aggregate into windows before hunting collective anomalies.
Under extreme imbalance, judge models by PR-AUC and the precision–recall curve, never accuracy or ROC-AUC, and treat the threshold as a business decision tied to alert capacity and asymmetric costs.
Fraud and intrusion are adversarial and streaming: score in real time, adapt to drift, guard against baseline poisoning, and loop confirmed labels back into the model.
Calibrate the threshold with a clean calibration set — conformal p-values convert a raw score into a distribution-free false-alarm guarantee, pinning the FP rate to your alert budget without any Gaussian assumption.
Watch for drift in the score distribution itself (KS test / PSI) so you re-baseline before recall quietly collapses, and spend scarce labels with active learning on the uncertain points near the boundary.
Ship each alert with an explanation (SHAP or top deviating features) and a graded action tier, since an unexplained, all-or-nothing alert is rarely usable under a real review budget.

27.18 — See also

Classification Algorithms — supervised detectors, support-vector machines, and decision boundaries.
Ensemble Methods — gradient-boosted trees for the supervised fraud layer, and the forest idea behind Isolation Forest.
Clustering & Unsupervised Learning — DBSCAN and density estimation repurposed as detection tools.
Model Evaluation & Tuning — precision, recall, PR/ROC curves, and threshold selection under imbalance.
Dimensionality Reduction — taming high-dimensional feature spaces before distance-based detection.
Generative Models — autoencoders and variational autoencoders for reconstruction-based scoring.
Recurrent & Sequence Models — LSTM autoencoders and sequence models for collective and temporal anomalies.
Time Series & Forecasting — seasonal baselines and contextual anomalies over time.
Explainable AI & Interpretability — SHAP and per-prediction attribution for turning scores into actionable alerts.
ML Across Industries — fraud, security, and fault detection deployed in real production settings.
MLOps & Deployment — streaming inference, monitoring, drift handling, and retraining loops.
Probability & Statistics — the distributions and tail reasoning behind statistical detectors and conformal p-values.
Data Preprocessing — feature engineering, scaling, and the context features that make detection work.

↪ The thread continues → Chapter 28 · 🏦 ML Across Industries

These patterns recur across every sector. The next chapter zooms out to how ML actually lands in finance, healthcare, and beyond — and why it so often fails there.
