Kader Mohideen

In this chapter

  • 29.1 — Model deployment & serving
  • 29.2 — ML pipelines & workflows
  • 29.3 — Containers & orchestration
  • 29.4 — Monitoring, versioning & CI/CD
  • 29.5 — Scalability & distributed training
  • 29.6 — Spark / Hadoop / distributed data
  • 29.7 — Distribution Shift and Monitoring in Depth
  • 29.8 — Testing in Production and the Serving Stack
  • 29.9 — Serving generative models: LLMOps
  • 29.10 — Cost, efficiency & the FinOps of serving
  • 29.11 — Security, governance & responsible deployment
  • 29.12 — Quick reference
  • 29.13 — Key takeaways
  • 29.14 — See also

Chapter 29 — 🔧 MLOps & Deployment

A trained model sitting in a notebook is worth nothing; a model serving predictions reliably, at scale, that you can monitor, roll back, and retrain is worth everything. MLOps (Machine Learning Operations) is the discipline of getting models from a researcher’s laptop into production and keeping them healthy there — the bridge between data science and software/ops engineering. It sits at the very end of the ML workflow, downstream of training and evaluation, and it is where most real-world ML effort actually goes.

🧭 In context: Production engineering for ML · used to deploy, scale, monitor, and continuously retrain models · the key idea is that a model is a living artifact — versioned, observed, and re-shipped like any other software, but with data and the model as extra moving parts.

💡 Remember this: a model in production is a living artifact that silently rots as the world drifts, so the real work is the loop around it — version it, serve it, watch it, and re-ship it the moment the data moves.

The full lifecycle ties the whole chapter together:

flowchart LR
  D[Data] --> P[Pipeline / DAG]
  P --> T[Train + track experiment]
  T --> R[Model registry]
  R --> Deploy[Deploy: batch / online]
  Deploy --> Serve[Serve via gateway]
  Serve --> M[Monitor: drift + metrics]
  M -->|drift detected| CT[Continuous training]
  CT --> P
  M -->|healthy| Serve

[Animation: the same loop in motion (data → train → deploy → serve → monitor → retrain), with the artifact flowing back around when drift is detected.] The whole chapter is one turn of this wheel.

29.1 — Model deployment & serving

Deployment is the act of making a trained model available to produce predictions for real consumers. The first and most consequential design choice is batch versus online serving.

Batch (offline) serving runs predictions on a schedule over a large set of inputs and stores the results. Think of a churn classifier that scores every customer every night and writes a churn_score column to a table; the app just reads the table. It is simple, cheap, and throughput-optimized, but predictions are stale between runs and you cannot score an input the system has never seen.

Online (real-time) serving loads the model behind a service that answers one request at a time with low latency. Think of a fraud model scoring a card swipe in 40 ms while the customer waits. It is fresh and handles novel inputs, but you now own a live service with latency budgets, autoscaling, and uptime.

Here is the same input flowing through both paths, so the contrast is concrete:

flowchart LR
  subgraph Batch
    DB[(all customers)] --> Job[nightly scoring job]
    Job --> Tbl[(scores table)]
    Tbl --> App1[app reads cached score]
  end
  subgraph Online
    Req[one card swipe] --> Svc[model service]
    Svc -->|40 ms| Resp[score returned live]
  end

[Animation: batch writes the whole scores table at once on a slow nightly heartbeat and is stale between runs; online answers each swipe live, in about 40 ms.]

| Dimension | Batch | Online |
|---|---|---|
| Latency | minutes–hours | milliseconds |
| Freshness | stale until next run | always current |
| Cost model | cheap, bulk compute | always-on service |
| Failure blast radius | rerun the job | user-facing outage |
| Typical use | reports, nightly scoring | fraud, ads, search |

Online models are usually exposed as a REST endpoint: an HTTP POST carrying the features as JSON, returning the prediction as JSON. A minimal serving wrapper:

# A tiny online endpoint — the mechanism, not production-grade
from flask import Flask, request, jsonify
import joblib
model = joblib.load("model.pkl")     # load ONCE at startup, not per request
app = Flask(__name__)

@app.post("/predict")
def predict():
    x = request.json["features"]      # e.g. [0.2, 1.4, 3.0]
    y = model.predict([x])[0]         # single-row inference
    return jsonify(prediction=float(y))

The crucial line is loading the model once at startup: reloading a model per request is the classic latency-killing mistake. Suppose loading the artifact takes 800 ms and inference takes 5 ms. Load-per-request gives every caller an 805 ms wait; load-once pays the 800 ms a single time at startup and gives every caller a 5 ms response. Same code, two lines apart, a roughly 160× difference in per-request latency.
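
The load-once rule above can be checked with a counter instead of a real model file. This is a minimal sketch: `load_model`, `predict_reload`, `get_model`, and `predict_cached` are hypothetical names for illustration, and the "model" is a stand-in function rather than a joblib artifact.

```python
from functools import lru_cache

LOAD_CALLS = 0  # counts how many times the (pretend) artifact is read from disk


def load_model():
    """Stand-in for joblib.load: expensive in reality, merely counted here."""
    global LOAD_CALLS
    LOAD_CALLS += 1
    return lambda features: sum(features)  # toy "model"


def predict_reload(features):
    # Anti-pattern: pays the load cost on every single request.
    return load_model()(features)


@lru_cache(maxsize=1)
def get_model():
    # Loads on first use, then keeps returning the same cached object.
    return load_model()


def predict_cached(features):
    return get_model()(features)


for _ in range(100):
    predict_reload([0.2, 1.4, 3.0])
reload_loads = LOAD_CALLS          # one load per request

LOAD_CALLS = 0
for _ in range(100):
    predict_cached([0.2, 1.4, 3.0])
cached_loads = LOAD_CALLS          # a single load shared by all requests

print(reload_loads, cached_loads)
```

Loading at import time, as the Flask example does, achieves the same effect without a cache; the lazy variant is just one way to express "load once" when import-time loading is inconvenient.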

In the real world you rarely hand-roll the Flask layer. A framework such as FastAPI adds automatic request validation and async support on top of plain routing. The same model behind FastAPI looks like this:

# Idiomatic online serving with FastAPI + pydantic validation
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

model = joblib.load("model.pkl")          # load ONCE at import time
app = FastAPI()

class Req(BaseModel):                      # rejects malformed input at the door
    features: list[float]

@app.post("/predict")
def predict(req: Req):
    y = model.predict([req.features])[0]
    return {"prediction": float(y)}
# run: uvicorn serve:app --workers 4

For dedicated ML serving, BentoML, TensorFlow Serving, TorchServe, and NVIDIA Triton add dynamic request batching (coalescing many concurrent calls into one GPU forward pass), model versioning, and GPU scheduling out of the box — the difference between a demo endpoint and one that holds a latency SLA under load.

There is also a third path between the two extremes worth naming: streaming (near-real-time) serving. Instead of a request/response service or a once-a-night job, predictions are produced continuously as events flow through a stream (Kafka, Kinesis, Flink). Think of an IoT pipeline scoring sensor readings the moment they land, or a recommendation system updating as a user clicks. It gives you freshness close to online serving without a synchronous latency budget on the user’s critical path — the consumer reads the latest score from a topic or cache.

| Mode | Trigger | Latency | Example |
|---|---|---|---|
| Batch | schedule (cron) | minutes–hours | nightly churn scoring |
| Streaming | event arrives | seconds | IoT anomaly flags |
| Online | synchronous request | milliseconds | fraud at card swipe |

As soon as you have more than one model, you put a model gateway in front of them. A gateway is a single entry point that routes a request to the right model and version, applies auth and rate limits, logs every request/response for monitoring, and can split traffic between versions (for canary/shadow, §29.4). It decouples callers (who just hit one stable URL) from the messy reality of many model versions behind it.

flowchart LR
  C[Client app] --> G[Model gateway]
  G -->|auth, route, log| V1[model v1]
  G --> V2[model v2 canary]
  G --> Sh[shadow model]

Tip

Default to batch until a real-time requirement forces your hand. A nightly job you can rerun is dramatically less operational burden than a 24/7 low-latency service.

Latency that the user actually feels: tail latency and SLOs

The intuition: averages lie. If 99 requests take 20 ms and one takes 5 seconds, the average is a comfortable 70 ms — but one user in a hundred just waited five seconds, and at a billion requests that is ten million furious users. What you serve is not the average; it is the tail.

This is why production teams quote latency as percentiles, written p50 / p95 / p99: the p99 is the value that 99% of requests come in under. A Service Level Objective (SLO) is a promise about that number — “p99 latency under 150 ms” — and the serving stack is tuned to hold it, not the mean.

| Symbol | Reads as | Why it matters |
|---|---|---|
| p50 (median) | half of requests are faster | “typical” experience |
| p95 | 95% are faster | the common slow case |
| p99 | 99% are faster | the SLO you’re usually held to |
| p99.9 | the worst 1-in-1000 | what a power user hits hourly |

In words: the p99 is the slowest response among the fastest 99% of requests — the worst case you promise not to exceed except one time in a hundred.

[Figure: latency histogram. A fat cluster of fast requests sits on the left and a long thin tail crawls to the right; the mean falls inside the comfortable bulk, while the p99 lives far out in the tail.] That gap is the whole point: the mean tells you about the bulk, but what you serve is the tail.

A subtle, costly fact: batching buys throughput at the expense of the tail. Coalescing requests (§29.9) raises throughput, but a request that arrives just as a batch closes must wait for the next one, padding the tail. The fix is a bounded batch window (e.g. “wait at most 5 ms, then fire”) so the tail penalty is capped by design.
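
To see why a bounded window caps the tail, here is a small simulation sketch. `form_batches` is a hypothetical helper, not the API of any serving framework, and the arrival times, the 5 ms window, and the batch size of 4 are made-up illustration values.

```python
def form_batches(arrivals_ms, max_wait_ms=5.0, max_batch=4):
    """Group sorted arrival times into batches.

    A batch opens when its first request arrives and fires when it is full
    or when max_wait_ms has elapsed, whichever comes first.
    Returns a list of (fire_time_ms, [arrival times in the batch]).
    """
    batches = []
    i = 0
    n = len(arrivals_ms)
    while i < n:
        deadline = arrivals_ms[i] + max_wait_ms   # window starts at first arrival
        batch = [arrivals_ms[i]]
        i += 1
        while i < n and len(batch) < max_batch and arrivals_ms[i] <= deadline:
            batch.append(arrivals_ms[i])
            i += 1
        # A full batch fires as soon as its last member arrives; otherwise at the deadline.
        fire = batch[-1] if len(batch) == max_batch else deadline
        batches.append((fire, batch))
    return batches


arrivals = [0.0, 1.0, 2.0, 9.0, 9.5, 20.0]
batches = form_batches(arrivals)
# Queueing delay each request pays before its batch runs.
waits = [fire - t for fire, batch in batches for t in batch]
print(batches)
print(max(waits))
```

No request waits longer than `max_wait_ms`, however sparse the traffic, which is exactly the cap the bounded window is meant to provide.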

# p99 from a stream of measured latencies: the number your SLO is written against
import numpy as np
latencies_ms = np.array([12, 14, 13, 15, 11, 250, 14, 13, 16, 12])  # one slow outlier
print("p50", np.percentile(latencies_ms, 50))   # 13.5 ms: looks great
print("p99", np.percentile(latencies_ms, 99))   # ~229 ms: the real story
# assert the SLO in a load test, not after a 3am page.
# On this sample the check fails on purpose: the one outlier drags p95 to ~145 ms.
try:
    assert np.percentile(latencies_ms, 95) < 100, "p95 SLO breached"
except AssertionError as err:
    print("SLO check failed:", err)

29.2 — ML pipelines & workflows

Think of a recipe where some steps must happen before others — you can’t ice the cake before you bake it, but you can whip the frosting while the cake is in the oven. A pipeline is that recipe written down so a machine can run it, and run independent steps at the same time.

Training a model is never one step; it is ingest → validate → preprocess → train → evaluate → register. An ML pipeline encodes these steps and their dependencies as a DAG (Directed Acyclic Graph) — nodes are tasks, edges are “must run before”, and “acyclic” guarantees no step waits on itself, so a valid run order always exists.

flowchart LR
  I[ingest] --> V[validate]
  V --> P[preprocess]
  P --> Tr[train]
  P --> Te[build test set]
  Tr --> E[evaluate]
  Te --> E
  E --> Reg[register model]

Orchestration is the machinery that runs the DAG: it works out the order, runs independent branches in parallel, retries failed tasks, and resumes from where a failed run stopped. In the DAG above, an orchestrator sees that train and build test set both depend only on preprocess, so it runs them concurrently, then waits for both before starting evaluate — that “wait for all parents” rule is the whole job. The common engines differ in flavor:

| Tool | Flavor | Sweet spot |
|---|---|---|
| Airflow | Python DAGs, schedule-driven | general data/ML batch workflows |
| Kubeflow | Kubernetes-native, container per step | heavy ML on K8s, GPU steps |
| Prefect | Python-first, dynamic flows | lighter setup, dynamic/parametrized runs |
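
The “wait for all parents” rule described above can be sketched with Python’s standard-library `graphlib`. This is an illustration of the scheduling idea only, not how any of the listed engines is implemented; the task names mirror the DAG in this section.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it (its parents).
dag = {
    "validate": {"ingest"},
    "preprocess": {"validate"},
    "train": {"preprocess"},
    "build_test_set": {"preprocess"},
    "evaluate": {"train", "build_test_set"},
    "register": {"evaluate"},
}

sorter = TopologicalSorter(dag)
sorter.prepare()                        # raises CycleError if the graph has a cycle

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # every task whose parents are all done
    waves.append(ready)                 # a real orchestrator would run these in parallel
    sorter.done(*ready)                 # mark finished, which unlocks the children

for n, wave in enumerate(waves):
    print(n, wave)
```

The fourth wave contains both `train` and `build_test_set`, and `evaluate` only becomes ready once both have been marked done.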

The property that makes pipelines safe to retry is idempotency: running a step twice with the same inputs leaves the same result as running it once. Concretely, a non-idempotent step does INSERT (a retry doubles the rows); an idempotent step does UPSERT keyed on a partition, or writes to output/run_id=2026-06-25/ and overwrites that path. Because orchestrators retry aggressively, a non-idempotent step turns one transient network blip into corrupted data.

Trace it once: a task writes 1,000 score rows, then the node dies after the write but before reporting success. The orchestrator marks the task failed and retries. The INSERT version now has 2,000 rows (1,000 of them duplicates) and every downstream average is wrong; the overwrite version re-writes the same 1,000 rows to the same partition path and the table is exactly as if the retry never happened. The only difference is which write verb you chose.

# Non-idempotent vs idempotent write
# (db.insert and write_parquet stand in for whatever storage client you use)
def bad(rows, db):
    db.insert("scores", rows)            # retry -> duplicate rows

def good(rows, day):
    path = f"s3://bucket/scores/day={day}/part.parquet"
    write_parquet(path, rows, mode="overwrite")  # retry -> identical state

Warning

The most common pipeline bug is a step that appends. The first retry after a timeout silently double-counts. Make every write either overwrite-by-partition or upsert-by-key.
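
The retry trace above can be replayed in memory. In this sketch a plain list plays the role of the append-only table and a dict keyed by partition plays the role of the overwrite path; `append_write` and `overwrite_partition` are hypothetical names, not a storage API.

```python
def append_write(table, rows):
    """Non-idempotent: every call adds the rows again."""
    table.extend(rows)


def overwrite_partition(store, day, rows):
    """Idempotent: the partition key fully determines what is stored."""
    store[day] = list(rows)


rows = [(f"cust_{i}", 0.5) for i in range(1000)]

table = []
append_write(table, rows)      # first attempt: the write lands, the ack is lost
append_write(table, rows)      # the orchestrator retries
appended = len(table)

store = {}
overwrite_partition(store, "2026-06-25", rows)   # first attempt
overwrite_partition(store, "2026-06-25", rows)   # retry
overwritten = len(store["2026-06-25"])

print(appended, overwritten)
```

The append path ends with twice the rows, while the partition overwrite ends in the same state as a single successful run.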

In practice you write this DAG declaratively in an orchestrator. The same preprocess → train → evaluate shape in modern Prefect is just decorated Python functions — the framework infers the dependency edges from how you call them:

from prefect import flow, task

@task(retries=2)                       # automatic retry == why idempotency matters
def preprocess(raw): ...
@task
def train(features): ...
@task
def evaluate(model, features): ...

@flow                                  # the DAG
def training_flow(raw):
    feats = preprocess(raw)
    model = train(feats)               # depends on preprocess
    return evaluate(model, feats)      # depends on both

In Airflow, the same graph is expressed with operators and explicit >> edges (preprocess >> train >> evaluate); in Kubeflow, each step is a containerized component wired into a pipeline. The flavor differs, the DAG is the same.

Validate the data before you train: the gate that catches silent corruption

The intuition: a kitchen inspects ingredients before cooking, not after plating. A pipeline that trains on whatever arrives will happily learn from a column that silently turned to nulls, a feature whose unit changed, or a category that vanished — and you only find out weeks later when predictions look strange. A data validation step is a cheap gate that fails the run loudly before a single GPU-hour is spent.

You assert expectations about each batch — schema, ranges, null rates, allowed categories — and halt the DAG when they break. Great Expectations, Pandera, and TFX’s TFDV are the common tools; the idea is the same as a unit test, applied to data instead of code:

# Pandera — a schema the data must satisfy or the pipeline stops
import pandera as pa
from pandera import Column, Check

schema = pa.DataFrameSchema({
    "age":    Column(int,   Check.in_range(0, 120)),      # no negative / 200-yr-olds
    "income": Column(float, Check.greater_than_or_equal_to(0)),
    "region": Column(str,   Check.isin(["NA", "EU", "APAC"])),  # closed vocabulary
})

def validate(df):
    schema.validate(df, lazy=True)   # lazy=True collects every violation, then raises -> fails the DAG node
    return df

This is the inexpensive insurance that turns “model mysteriously degraded” into “the run failed at 2 a.m. with a clear schema error,” long before the bad data reaches training.

29.3 — Containers & orchestration

“It works on my machine” is the disease; containers are the cure. A Docker container packages your code together with its exact Python version, libraries, system packages, and the model artifact into one immutable image. Anyone — your laptop, CI, the production cluster — runs the same bytes, so the environment stops being a variable.

A serving image is just a recipe:

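FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt   # pin versions here
COPY serve.py model.pkl ./
# 4 workers; bind to 0.0.0.0 so the server is reachable from outside the container
CMD ["gunicorn", "-w", "4", "-b", "0.0.0.0:8080", "serve:app"]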

docker build turns this into an image (the blueprint); docker run starts a container (a running instance). The image is to the container what a class is to an object: build once, run many identical instances. Pin your dependency versions in requirements.txt — an unpinned scikit-learn will silently upgrade on the next build and can change predictions, so the image you tested and the image you ship are no longer the same bytes.

One container is easy. A hundred of them across machines — with restarts on crash, scaling under load, rolling updates, and network routing — is Kubernetes (K8s), the standard container orchestrator. You declare the desired state (“run 3 replicas of model-v2”) and K8s continuously reconciles reality to match: a crashed pod is restarted, a dead node’s pods are rescheduled elsewhere.

flowchart TB
  subgraph K8s cluster
    D[Deployment: desired = 3 replicas] --> P1[pod]
    D --> P2[pod]
    D --> P3[pod]
    S[Service / load balancer] --> P1
    S --> P2
    S --> P3
  end
  U[traffic] --> S

[Animation: a Deployment with desired = 3 replicas. One pod dies, the count drops to 2, and K8s creates a replacement to bring it back to 3.]

The reconciliation loop, in words: you declare “3 replicas.” A node hosting pod P2 dies. K8s sees that observed = 2 while desired = 3, so it schedules a replacement pod on a healthy node and the load balancer starts routing to it, with no human paged and no script run. This control loop, comparing desired against observed and acting on the gap, is the same idea as the orchestrator’s “wait for parents” in §29.2, just applied to running containers instead of pipeline steps.
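
The desired-versus-observed comparison can be sketched as a toy control loop. This is not the Kubernetes API: `reconcile` and the pod dictionaries are hypothetical, and real controllers also handle scheduling, health probes, and much more.

```python
import itertools

_ids = itertools.count(1)   # source of fresh pod names


def reconcile(desired, pods):
    """One pass of a toy control loop: return the pod list after acting on the gap."""
    healthy = [p for p in pods if p["healthy"]]        # dead pods are dropped
    while len(healthy) < desired:                       # too few: create replacements
        healthy.append({"name": f"pod-{next(_ids)}", "healthy": True})
    return healthy[:desired]                            # too many: scale down


pods = reconcile(3, [])              # initial rollout creates three pods
pods[1]["healthy"] = False           # the node hosting the second pod dies
observed_before = sum(p["healthy"] for p in pods)
pods = reconcile(3, pods)            # the next loop iteration repairs the gap
observed_after = len(pods)
names = [p["name"] for p in pods]
print(observed_before, observed_after, names)
```

Nothing in the loop knows why the pod died; it only compares counts and acts, which is what makes the declarative model robust.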

The mental model: Docker makes one box reproducible; Kubernetes keeps a fleet of those boxes alive and balanced.

A K8s Deployment is declared as YAML — the desired state the control loop reconciles toward. The minimal serving spec is short enough to read end to end:

# deployment.yaml — "always keep 3 of this image alive and healthy"
apiVersion: apps/v1
kind: Deployment
metadata: { name: fraud-model }
spec:
  replicas: 3                      # desired state; K8s reconciles toward it
  selector: { matchLabels: { app: fraud-model } }
  template:
    metadata: { labels: { app: fraud-model } }
    spec:
      containers:
        - name: serve
          image: registry/fraud-model:v2   # the exact bytes you tested
          resources:
            requests: { cpu: "500m", memory: "1Gi" }   # scheduler sizing
            limits:   { cpu: "1",    memory: "2Gi" }
          readinessProbe:                    # don't route traffic until model is loaded
            httpGet: { path: /healthz, port: 8080 }

The readinessProbe is the K8s-level twin of “load the model once at startup” from §29.1: a pod is kept out of the load balancer until its model is loaded and /healthz returns OK, so no user ever hits a half-warmed replica.

Tip

You rarely need Kubernetes on day one. A single container behind a managed service (Cloud Run, ECS, a serverless container) carries you a long way — reach for K8s when you genuinely need fleet-scale orchestration.

29.4 — Monitoring, versioning & CI/CD

Software, once correct, stays correct; an ML model silently rots because the world it models drifts away from its training data. So MLOps wraps the model in a feedback loop of versioning, monitoring, and automated re-shipping.

Experiment tracking records, for every training run, the code version, data version, hyperparameters, and resulting metrics (tools: MLflow, Weights & Biases). Without it you cannot answer “which run produced the model in prod?” — the single most-asked question during an incident.

A few lines of MLflow is all it takes to make a run reproducible and registrable — the same calls that later let an incident responder pull the exact artifact:

import mlflow
# assumes `model` is an already-fitted scikit-learn estimator
with mlflow.start_run():
    mlflow.log_params({"max_depth": 6, "lr": 0.1})    # what you tried
    mlflow.log_metric("val_auc", 0.91)                # what you got
    mlflow.sklearn.log_model(model, "model",
                             registered_model_name="churn")  # -> new version in the registry
# later, once a version has been moved to the Production stage:
# mlflow.sklearn.load_model("models:/churn/Production")

A model registry is the versioned store of trained model artifacts with stage tags (Staging, Production, Archived); a data registry / feature store does the same for datasets and features. Together they make a deployment reproducible: prod model v7 ⇄ training run #142 ⇄ dataset d-2026-06. When a prediction looks wrong six weeks later, this chain is what lets you pull the exact model, code, and data that produced it instead of guessing.

Drift detection is the early-warning system. Data drift is the input distribution shifting (a new device sends sensor values in a new range); concept drift is the input→output relationship itself changing (post-pandemic, the same customer features now imply different churn). You detect data drift by comparing the live feature distribution against the training distribution. Concept drift can leave the inputs looking unchanged, so confirming it usually requires ground-truth labels or a proxy for prediction quality.

flowchart LR
  Train[training distribution] --> Cmp{compare bins}
  Live[live distribution] --> Cmp
  Cmp -->|PSI < 0.1| OK[stable: keep serving]
  Cmp -->|PSI 0.1–0.25| Watch[moderate: watch]
  Cmp -->|PSI > 0.25| Retrain[significant: retrain]

[Figure: the training histogram as a frozen reference, with the live histogram sliding to the right as the world moves.] PSI is just a number that grows as the live bars pull away from the training bars.

Worked example — Population Stability Index (PSI), the workhorse drift metric. Bin a feature, compare training proportions to live proportions:

\[\text{PSI} = \sum_i (a_i - e_i)\,\ln\!\frac{a_i}{e_i}\]

In words: for each bin, multiply how much its share changed by the log of how many times it grew or shrank, then add up the bins — a single number that is big only when a bin moved a lot both in absolute share and in ratio.

Also written: \(\text{PSI} = D_{\mathrm{KL}}(a\Vert e) + D_{\mathrm{KL}}(e\Vert a)\). Plainly: PSI measures the “distance” between the live histogram and the training one in both directions and adds them up. Measuring both directions is what makes it symmetric — you get the same number whether you call the live or the training set the reference. (Each \(D_{\mathrm{KL}}\) is the standard KL divergence; their sum is the Jeffreys divergence.)

Here \(e_i\) is the expected (training) fraction in bin \(i\) and \(a_i\) is the actual (live) fraction. Rule of thumb: \(<0.1\) stable, \(0.1\)–\(0.25\) moderate, \(>0.25\) significant (retrain).
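
The equivalence between the summation form and the two-sided KL form of PSI takes one line of algebra, using \(\ln(e_i/a_i) = -\ln(a_i/e_i)\):

\[
D_{\mathrm{KL}}(a\Vert e) + D_{\mathrm{KL}}(e\Vert a)
= \sum_i a_i \ln\frac{a_i}{e_i} + \sum_i e_i \ln\frac{e_i}{a_i}
= \sum_i a_i \ln\frac{a_i}{e_i} - \sum_i e_i \ln\frac{a_i}{e_i}
= \sum_i (a_i - e_i)\,\ln\frac{a_i}{e_i}
= \text{PSI}
\]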

Do one bin by hand to demystify the formula. Say bin 3 held 20% of training data (\(e_3 = 0.20\)) but now holds 35% of live data (\(a_3 = 0.35\)). That bin contributes \((0.35 - 0.20)\ln(0.35/0.20) = 0.15 \times \ln(1.75) = 0.15 \times 0.56 = 0.084\). One shifted bin already adds 0.084; a few more like it push the sum past the 0.25 retrain line. The term is large when a bin both changed a lot in absolute share and changed a lot in ratio.

import numpy as np
def psi(expected, actual, bins=10):
    cuts = np.percentile(expected, np.linspace(0, 100, bins + 1))
    cuts[0], cuts[-1] = -np.inf, np.inf
    e = np.histogram(expected, cuts)[0] / len(expected)
    a = np.histogram(actual,   cuts)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # avoid log(0)
    return np.sum((a - e) * np.log(a / e))

train = np.random.normal(0, 1, 10000)
live  = np.random.normal(0.5, 1, 10000)   # shifted mean
print(round(psi(train, live), 3))         # ~0.2x -> drift flagged

Continuous training (CT) closes the loop: when drift crosses a threshold (or on a schedule), the pipeline automatically retrains on fresh data, evaluates, and — only if the new model beats the incumbent on a holdout — promotes it. This is what makes ML CI/CD different from ordinary software CI/CD: you are also continuously integrating new data, not just new code.

flowchart LR
  Drift[drift > threshold] --> Re[retrain on fresh data]
  Re --> Ev{beats incumbent on holdout?}
  Ev -->|yes| Promote[promote to Production]
  Ev -->|no| Keep[keep incumbent, alert]
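
The promotion gate in the flowchart above can be sketched as a small decision function. The names `drift_level` and `continuous_training_step`, the `min_gain` margin, and the choice to retrain only in the “significant” band are assumptions for illustration; real systems often also retrain on a schedule.

```python
def drift_level(psi):
    """Map a PSI value to the rule-of-thumb bands used in this section."""
    if psi < 0.1:
        return "stable"
    if psi <= 0.25:
        return "moderate"
    return "significant"


def continuous_training_step(psi, incumbent_metric, train_and_eval, min_gain=0.0):
    """Retrain only on significant drift; promote only if the candidate wins.

    train_and_eval is a caller-supplied function that retrains on fresh data
    and returns the candidate's holdout metric (higher is better).
    min_gain is a hypothetical safety margin the candidate must clear.
    """
    if drift_level(psi) != "significant":
        return "keep serving"
    candidate_metric = train_and_eval()
    if candidate_metric > incumbent_metric + min_gain:
        return "promote"
    return "keep incumbent, alert"


print(continuous_training_step(0.05, 0.90, lambda: 0.95))
print(continuous_training_step(0.30, 0.90, lambda: 0.93))
print(continuous_training_step(0.30, 0.90, lambda: 0.88))
```

The key design choice is that retraining never promotes on its own: the candidate must beat the incumbent on the same holdout before it replaces the model in Production.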

The three CI/CD pipelines of MLOps are worth separating, because they answer to different triggers — a useful mental map of the whole chapter’s automation:

flowchart LR
  CodeChg[code change] --> CI[CI: test + build image] --> CD[CD: deploy service]
  DataChg[new data / drift] --> CT[CT: retrain + validate]
  CT --> Reg[(registry)]
  CI --> Reg
  Reg --> CD

In words: classic software ships code through CI/CD; MLOps adds a third pipeline, continuous training, that ships a new model when the data changes — all three converging on the registry that the serving layer reads from.

When it is time to release a new version, the deployment strategy controls risk:

| Strategy | How it works | Buys you |
|---|---|---|
| Shadow | new model gets a copy of live traffic, its outputs logged but not served | risk-free validation on real traffic |
| Canary | route a small % (e.g. 5%) to the new model, watch metrics, ramp up | limited blast radius |
| Blue-green | full second environment (green); flip all traffic at once, instant rollback | fast cutover and rollback |

flowchart LR
  T[traffic] --> Can{canary router}
  Can -->|95%| Old[model v1]
  Can -->|5%| New[model v2]
  New -.metrics ok.-> Ramp[ramp to 100%]
  New -.metrics bad.-> Roll[rollback to v1]
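
One common way to implement the canary split is to hash a stable request or user ID into a bucket, so the same caller always lands on the same version. The sketch below assumes that approach; `route` and the `user-N` IDs are hypothetical, and a production gateway would typically provide this as configuration rather than code.

```python
import hashlib


def route(request_id, canary_percent=5):
    """Deterministically send about canary_percent% of IDs to the new model."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket in [0, 100)
    return "v2" if bucket < canary_percent else "v1"


assignments = [route(f"user-{i}") for i in range(10_000)]
share_v2 = assignments.count("v2") / len(assignments)
print(round(share_v2, 3))   # close to 0.05, not exactly 0.05
```

Ramping up is then a matter of raising `canary_percent`, and rollback is setting it to 0; because the assignment is deterministic, users already on v2 stay on v2 as the percentage grows.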

Warning

Accuracy alone is a lagging, often unavailable signal, because ground-truth labels can arrive days late. Monitor leading signals too: input drift, prediction distribution, null rates, feature ranges, and latency. By the time accuracy visibly drops, you have been serving bad predictions for a while.

29.5 — Scalability & distributed training

When a model or its dataset outgrows a single machine, you split the work across devices. There are two orthogonal axes.

The intuition: imagine eight students each grading a different stack of the same exam, then meeting to average their grading adjustments so everyone’s answer key stays identical. Each works in parallel on their own pile; the meeting (the all-reduce) keeps them in sync.

Data parallelism is the common one: replicate the full model on every GPU, give each a different shard of the batch, compute gradients locally, then all-reduce (average) the gradients so every replica applies the same update and stays in sync. It scales throughput nearly linearly until communication cost dominates, and works whenever the model fits on one device.

[Animation: all-reduce across four GPUs. GPU0 to GPU3 each compute a local gradient, the gradients are averaged, and every GPU ends up holding the same averaged update, so the weights stay identical.]

The averaged gradient that every replica applies is just the mean of the \(K\) local gradients:

\[\bar{g} = \frac{1}{K}\sum_{k=1}^{K} g_k\]

In words: add up each GPU’s locally computed gradient and divide by the number of GPUs — every replica then steps with the same averaged gradient, so their weights never diverge.

Also written: \(\bar{g} = \frac{1}{K}\big(g_1 + g_2 + \cdots + g_K\big)\), which equals the gradient computed in one pass over the full concatenated batch whenever the shards are equal-sized and the loss is a mean over examples. Under those conditions data parallelism is exact, not an approximation.
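
The exactness claim can be checked numerically on a toy problem. This sketch assumes a one-parameter linear model with mean-squared-error loss and equal-sized shards; the data values are made up, and the "GPUs" are just list slices.

```python
import math


def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]

full = grad_mse(w, xs, ys)                  # one device, whole batch

K = 4                                       # four simulated GPUs
size = len(xs) // K                         # equal-sized shards
local = [
    grad_mse(w, xs[k * size:(k + 1) * size], ys[k * size:(k + 1) * size])
    for k in range(K)
]
averaged = sum(local) / K                   # what all-reduce would produce

print(full, averaged)
```

The two numbers agree up to floating-point rounding. With unequal shard sizes a plain average would weight examples unevenly, which is why the equal-shard assumption matters.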

Model parallelism is for when the model itself does not fit on one device (think very large neural networks). You split the model’s layers/tensors across GPUs; an input flows through GPU 0’s layers, then GPU 1’s, and so on. It removes the memory ceiling but adds cross-device traffic and can leave GPUs idle waiting on each other (pipeline bubbles).

flowchart TB
  subgraph "Data parallel"
    B[batch] --> S1[shard 1 -> full model GPU0]
    B --> S2[shard 2 -> full model GPU1]
    S1 --> AR[all-reduce grads]
    S2 --> AR
  end
  subgraph "Model parallel"
    X[input] --> L1[layers 1-4 GPU0] --> L2[layers 5-8 GPU1] --> Y[output]
  end

The all-reduce step is the heart of data parallelism, so it is worth seeing the arithmetic. Two GPUs each compute a gradient on their own half of the batch; all-reduce averages them so both end the step with identical weights:

import numpy as np
# Each GPU's local gradient (computed on its own shard of the batch)
g0 = np.array([0.4, -0.2, 1.0])   # GPU 0
g1 = np.array([0.6,  0.0, 0.4])   # GPU 1
avg = (g0 + g1) / 2               # all-reduce: sum, then divide by #GPUs
print(avg)                        # [0.5, -0.1, 0.7] applied on BOTH GPUs
# both replicas now hold the same weights -> they stay in lock-step

You almost never wire the all-reduce by hand — frameworks do it. In PyTorch, wrapping a model in DistributedDataParallel (DDP) inserts the all-reduce into the backward pass automatically; you launch one process per GPU and the gradients stay synced for you:

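import os
import torch, torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")               # GPU collective backend
local_rank = int(os.environ["LOCAL_RANK"])    # set by torchrun for each process
torch.cuda.set_device(local_rank)
model = torch.nn.Linear(10, 1)                # placeholder: substitute your own nn.Module
model = DDP(model.to(local_rank), device_ids=[local_rank])
# from here, loss.backward() all-reduces grads across GPUs for free
# launch with: torchrun --nproc_per_node=8 train.py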

For models too large even with DDP, FSDP (Fully Sharded Data Parallel) and the DeepSpeed ZeRO family additionally shard the optimizer states, gradients, and parameters across GPUs, cutting per-device memory. They are the bridge between data parallelism and the model parallelism described above.

Historically, a parameter server coordinated data-parallel training: worker nodes compute gradients and push them to central server nodes that hold the canonical weights, update them, and return fresh weights. It is simple but the servers become a bandwidth bottleneck, which is why modern stacks favor peer-to-peer all-reduce (workers exchange gradients directly with one another, so there is no central choke point).

Worked sanity-check on why all-reduce matters: with 8 GPUs and a 100M-parameter model in fp32, each GPU holds \(100\text{M} \times 4\,\text{bytes} = 400\,\text{MB}\) of gradients that must be synchronized every step. Over thousands of steps and across the cluster this adds up to terabytes of traffic, which is why gradient communication, not raw compute, often limits scaling, and why efficient collective algorithms (ring all-reduce) exist.
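To put a number on the traffic, here is a minimal sketch of the usual ring all-reduce cost model, in which each GPU sends \(2(N-1)/N\) times the gradient payload per sync (a reduce-scatter pass followed by an all-gather pass). The function name is hypothetical, and the model ignores latency, overlap with compute, and gradient compression.

```python
def ring_allreduce_bytes_sent(n_params, bytes_per_param, n_gpus):
    """Bytes each GPU sends in one ring all-reduce (reduce-scatter, then all-gather)."""
    payload = n_params * bytes_per_param            # size of the full gradient
    return 2 * (n_gpus - 1) / n_gpus * payload      # 2(N-1) chunks of payload/N each

payload_mb = 100_000_000 * 4 / 1e6
sent_mb = ring_allreduce_bytes_sent(100_000_000, 4, 8) / 1e6
print(f"gradient payload: {payload_mb:.0f} MB, sent per GPU per step: {sent_mb:.0f} MB")
print(f"per GPU over 10,000 steps: {sent_mb * 10_000 / 1e6:.1f} TB")
```

Under this model the per-GPU traffic approaches twice the payload as the GPU count grows but never exceeds it, which is the property that makes the ring layout scale.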

Tip

Reach for data parallelism first: it is simpler and covers most cases. Only move to model parallelism when the model itself genuinely won’t fit in one device’s memory.

29.6 — Spark / Hadoop / distributed data

Before you can train at scale you must process data at scale, and a single machine’s RAM and disk are the bottleneck. This is the big-data layer.

Hadoop was the foundational generation: HDFS (Hadoop Distributed File System) splits a huge file into blocks replicated across many machines, and MapReduce is its compute model — a map step transforms records in parallel across the cluster, a reduce step aggregates the results. Its weakness is that MapReduce writes intermediate results to disk between every stage, which is slow for the iterative passes ML needs.

Apache Spark is the successor that mostly displaced raw MapReduce by keeping intermediate data in memory across stages — often 10–100× faster for iterative workloads. Spark’s core abstraction is the RDD/DataFrame: a distributed collection partitioned across the cluster, on which you express transformations (map, filter, groupBy) that are lazy — nothing runs until an action (count, collect, write) forces execution, letting Spark optimize the whole plan first.

The canonical word-count, which is “hello world” for distributed data, makes the map→shuffle→reduce shape concrete:

# PySpark: counts words across a cluster, lazily
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("wordcount").getOrCreate()

counts = (spark.read.text("s3://logs/*.txt")
    .rdd.flatMap(lambda r: r[0].split())   # map: explode each line into words
    .map(lambda w: (w, 1))                  # pair each word with 1
    .reduceByKey(lambda a, b: a + b))       # shuffle + reduce: sum per word
counts.saveAsTextFile("s3://out/")          # ACTION -> triggers the job

The map→shuffle→reduce dance is the whole trick, so here it is on "the cat the" spread over two machines: each node maps its words to (word, 1) pairs, the shuffle drags every matching key onto one reducer, and the reducer sums. Watch the two "the" pairs migrate together:

[Figure: map → shuffle → reduce. Node A maps “the cat” to (the,1), (cat,1); node B maps “the” to (the,1); the reducer emits (the,2), (cat,1).]

Trace "the cat the" across two partitions: the map stage emits (the,1) (cat,1) on one node and (the,1) on another; Spark shuffles so all "the" pairs land on the same reducer; reduceByKey then sums them to (the,2) (cat,1). The shuffle, moving same-key records onto one machine, is both the magic and the cost of distributed aggregation.
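The same trace can be reproduced on one machine in plain Python. This is a toy sketch, not how Spark is implemented: the two "nodes" are dictionary entries, and `reducer_for` is a hypothetical deterministic partitioner standing in for Spark's hash partitioning.

```python
from collections import defaultdict

partitions = {"node A": "the cat", "node B": "the"}

# map: each node emits (word, 1) pairs from its own partition
mapped = {node: [(w, 1) for w in text.split()] for node, text in partitions.items()}

# shuffle: route every pair to a reducer chosen by its key, so equal keys meet
n_reducers = 2

def reducer_for(key):
    return sum(key.encode()) % n_reducers   # toy, deterministic partitioner

shuffled = defaultdict(list)
for pairs in mapped.values():
    for word, one in pairs:
        shuffled[reducer_for(word)].append((word, one))

# reduce: each reducer sums the values for the keys it received
counts = {}
for pairs in shuffled.values():
    for word, one in pairs:
        counts[word] = counts.get(word, 0) + one

print(counts)
```

Both (the,1) pairs are routed to the same reducer because the partitioner depends only on the key, which is the property the shuffle relies on.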

flowchart LR
  Raw[raw data: TBs in HDFS / S3] --> Spark[Spark cluster: clean + join + aggregate]
  Spark --> FT[feature table]
  FT --> Train[train model]

For ML specifically, Spark gives you MLlib (distributed training of classic models) and, more often today, acts as the feature-engineering engine that turns raw terabytes into the clean training table your deep-learning framework then consumes.

Warning

A Spark job that runs fine on a sample can explode in production on a skewed key — one value (say a null user-id) lands billions of rows on a single partition, and that one task hangs while the rest finish. Watch partition sizes, not just total volume.
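A minimal sketch of what "watch partition sizes" can mean, in plain Python rather than Spark: assign keys to partitions and compare the largest partition with the mean. The helper name and the byte-sum partitioner are hypothetical stand-ins; on a real cluster you would read per-task row counts from the Spark UI or count rows per key before joining.

```python
from collections import Counter

def skew_ratio(keys, n_partitions):
    """Largest partition size divided by the mean partition size (1.0 = perfectly even)."""
    sizes = Counter(sum(str(k).encode()) % n_partitions for k in keys)  # toy partitioner
    return max(sizes.values()) / (len(keys) / n_partitions)

balanced = [f"user{i}" for i in range(1000)]
skewed = ["null"] * 900 + [f"user{i}" for i in range(100)]   # one hot key dominates

print(skew_ratio(balanced, 10))
print(skew_ratio(skewed, 10))
```

A ratio near 1 means tasks finish together; a ratio of 9 or more means one task does most of the work while the rest of the cluster waits.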

29.7 — Distribution Shift and Monitoring in Depth

A deployed model is a photograph of the world taken on the day its training data was collected. The world keeps moving; the photograph does not. Distribution shift is the slow (or sudden) divergence between the data a model was trained on and the data it now sees in production. A fraud model trained before a new payment app launched, a demand forecaster trained before a viral TikTok, a clinical model trained at one hospital and deployed at another — all are looking at a world that no longer matches their photograph. Accuracy decays even though not a single line of code changed.

The key insight is that “the data changed” is too vague to act on. There are three distinct things that can drift, they have different causes, and, crucially, you detect them with different signals. Let \(X\) be the input features and \(Y\) the label. The data follow a joint distribution that factors two ways, \(P(X, Y) = P(Y \mid X)\,P(X) = P(X \mid Y)\,P(Y)\), and the model approximates \(P(Y \mid X)\). Each type of shift names which factor moves while its partner stays fixed.

| Type of shift | What moves | Formal statement | Plain example |
|---|---|---|---|
| Covariate shift | The inputs | \(P(X)\) changes, \(P(Y\mid X)\) fixed | New user demographic signs up; the mapping from features to label is still valid |
| Label shift | The output mix | \(P(Y)\) changes, \(P(X\mid Y)\) fixed | A disease becomes 5× more prevalent; symptoms per case unchanged |
| Concept drift | The rule itself | \(P(Y \mid X)\) changes | “Normal” spending behavior redefined post-pandemic; same features, new meaning |

Tip

The cheap-to-detect cases are the ones visible without labels. Covariate shift shows up directly in unlabeled production inputs, which you have immediately, and label shift usually shows up as a change in the mix of the model’s predictions. Concept drift bends \(P(Y \mid X)\); to see it you usually need labels, which arrive late or never. That asymmetry drives the whole monitoring design: watch the inputs continuously for early warning, and treat the (delayed) accuracy drop as confirmation.

flowchart TD
    A[Production data arrives] --> B{Labels available?}
    B -->|No, only X| C[Monitor P of X<br/>covariate / label shift]
    B -->|Yes, X and Y| D[Monitor performance<br/>accuracy, AUC, calibration]
    C --> E{Drift score > threshold?}
    D --> F{Metric drop > threshold?}
    E -->|yes| G[Alert: investigate]
    F -->|yes| G
    G --> H{Concept drift confirmed<br/>by labelled slice?}
    H -->|yes| I[Trigger retraining]
    H -->|no, just covariate| J[Maybe reweight / collect data]

Detecting input shift without labels

For a single feature, the workhorse is a two-sample test comparing a reference window (a frozen slice of training or early-production data) against a current window (recent traffic). For continuous features the Kolmogorov–Smirnov statistic, the largest gap between the two empirical CDFs, is the standard choice; for categorical features, use a chi-square test or the Population Stability Index (PSI).
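As a minimal sketch of that definition, the two-sample KS statistic can be computed with NumPy by evaluating both empirical CDFs at every observed value and taking the largest gap. The function name and the synthetic windows are illustrative assumptions; in practice a library routine such as `scipy.stats.ks_2samp` also returns a p-value.

```python
import numpy as np

def ks_statistic(ref, cur):
    """Largest vertical gap between the empirical CDFs of two samples."""
    ref, cur = np.sort(ref), np.sort(cur)
    grid = np.concatenate([ref, cur])                      # every point where a CDF jumps
    cdf_ref = np.searchsorted(ref, grid, side="right") / len(ref)
    cdf_cur = np.searchsorted(cur, grid, side="right") / len(cur)
    return float(np.max(np.abs(cdf_ref - cdf_cur)))

rng = np.random.default_rng(0)
ref = rng.normal(0, 1, 5000)          # reference window
same = rng.normal(0, 1, 5000)         # current window, no shift
drift = rng.normal(0.5, 1.3, 5000)    # current window, shifted and wider

print(ks_statistic(ref, same), ks_statistic(ref, drift))
```

The statistic lies between 0 (identical samples) and 1 (no overlap), so it can be thresholded directly as an effect size even when the p-value is not used.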

PSI is the one you will meet most in industry because it is one number per feature and has battle-tested rules of thumb. Bin the feature, then:

\[\mathrm{PSI} = \sum_{i=1}^{b} \left( a_i - e_i \right)\, \ln\!\frac{a_i}{e_i}\]

where \(e_i\) is the expected fraction of mass in bin \(i\) (reference) and \(a_i\) the actual fraction (current).

In words: add up, across all \(b\) bins, the change in each bin’s share scaled by the log of its ratio. The result is one scalar that grows as the current histogram pulls away from the reference one.

Also written: in vectorized NumPy this is simply ((a - e) * np.log(a / e)).sum() over the two bin-share vectors \(a\) and \(e\), the exact line the code below computes.

The convention: \(\mathrm{PSI} < 0.1\) no real shift, \(0.1\)–\(0.25\) moderate, \(> 0.25\) significant (investigate).

import numpy as np

def psi(expected, actual, bins=10):
    # bin edges from the reference distribution
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual,   edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)  # no log(0)
    return float(np.sum((a - e) * np.log(a / e)))

ref = np.random.normal(0, 1, 10000)        # training distribution
same = np.random.normal(0, 1, 10000)       # no shift
drift = np.random.normal(0.5, 1.3, 10000)  # shifted + wider

print(round(psi(ref, same),  4))   # ~0.001  -> stable
print(round(psi(ref, drift), 4))   # ~0.30   -> significant shift

The shifted-and-widened distribution lands above the \(0.25\) line, while the no-shift case sits near zero. That single scalar, computed per feature each day, is what most production dashboards actually alert on.

In production you rarely compute these by hand across hundreds of features — a library does the sweep and renders a report. Evidently is the common open-source choice:

# Evidently's legacy Report API; import paths differ in newer releases
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(metrics=[DataDriftPreset()])      # one drift test per column, chosen automatically
report.run(reference_data=ref_df, current_data=live_df)   # two pandas DataFrames
report.save_html("drift.html")                    # per-feature drift dashboard

Pair it with a scheduler (an Airflow/Prefect task on a daily cron) and you have a standing drift monitor that emits an alert the moment any feature crosses its threshold.

Warning

Run a KS or chi-square test on a million rows and everything is “significant” — with enough data the p-value detects shifts too tiny to matter. This is why PSI (an effect size, not a p-value) and fixed thresholds dominate in practice. When you do use hypothesis tests, correct for the many features you are testing (Bonferroni/FDR) or you will drown in false alarms.
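The two corrections named above can be sketched in a few lines of plain Python. The p-values are made up for illustration, one per monitored feature; the function names are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Flag feature i only if p_i <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Flag the k smallest p-values, where k is the largest rank with p_(k) <= k/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    flagged = [False] * m
    for i in order[:cutoff]:
        flagged[i] = True
    return flagged

p = [0.001, 0.008, 0.02, 0.045, 0.30]   # illustrative per-feature p-values
print(bonferroni(p))
print(benjamini_hochberg(p))
```

Bonferroni is the stricter of the two; Benjamini-Hochberg tolerates a controlled fraction of false alarms in exchange for catching more real shifts, which tends to suit wide feature sets.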

A neat trick when you want a single global drift detector across all features at once is the domain classifier: pool reference and current samples, label them 0 and 1, and train a classifier to tell them apart. If it can’t beat 0.5 AUC (chance level), the two distributions are indistinguishable, so there is no detectable covariate shift. If it scores 0.85 AUC, something has clearly moved, and the classifier’s feature importances point straight at which features moved.
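Here is a minimal NumPy-only sketch of a domain classifier, assuming a plain logistic regression trained by gradient descent and scored on a held-out half. Because the model is linear, this version only picks up mean shifts; a tree ensemble would be the usual choice for richer changes. All names and the synthetic data are illustrative.

```python
import numpy as np

def auc(scores, labels):
    """AUC via the rank-sum identity (assumes no tied scores)."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def domain_classifier_auc(reference, current, steps=500, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    X = np.vstack([reference, current])
    y = np.concatenate([np.zeros(len(reference)), np.ones(len(current))])
    X = (X - X.mean(0)) / (X.std(0) + 1e-12)          # standardize features
    idx = rng.permutation(len(X))
    train, test = idx[: len(X) // 2], idx[len(X) // 2:]
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):                            # gradient descent on log-loss
        p = 1 / (1 + np.exp(-(X[train] @ w + b)))
        grad = p - y[train]
        w -= lr * X[train].T @ grad / len(train)
        b -= lr * grad.mean()
    return float(auc(X[test] @ w + b, y[test])), w    # held-out AUC, per-feature weights

rng = np.random.default_rng(42)
reference = rng.normal(0, 1, size=(2000, 3))
current = rng.normal(0, 1, size=(2000, 3))
shifted = current.copy()
shifted[:, 1] += 1.0                                  # only feature 1 moves

auc_same, _ = domain_classifier_auc(reference, current)
auc_shift, weights = domain_classifier_auc(reference, shifted)
print(auc_same, auc_shift, np.abs(weights).argmax())
```

The largest absolute weight plays the role of the feature importance here: it lands on the one column that was shifted.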

Distinguishing the three shifts in practice

A small worked scenario makes the diagnosis concrete. A loan-default model, reference vs. this month:

| Signal observed | Covariate | Label | Concept |
|---|---|---|---|
| Input distribution \(P(X)\) moved (PSI high)? | yes | no | maybe |
| Label mix shifted while each class’s inputs \(P(X\mid Y)\) look unchanged? | no | yes | maybe |
| Inputs and label mix both stable, yet accuracy fell? | no | no | yes |

If PSI flags income and region but, on the labeled subset that has matured, the model is still well-calibrated, you have covariate shift: the model is sound, you simply need training data covering the new region. If the realized default rate doubled while defaulters and non-defaulters each look the same as before, that is label shift: recalibrate the threshold or reweight. If inputs and label rates are both steady yet accuracy has cratered, the relationship itself changed: concept drift, and only retraining on fresh labels fixes it.

From detection to action: retraining triggers and continual learning

Detecting drift is half the job; deciding when to retrain is the other half. Three trigger styles, in increasing sophistication:

flowchart LR
    subgraph Triggers
      S[Scheduled<br/>e.g. weekly] 
      P[Performance-based<br/>AUC drops below SLA]
      D[Drift-based<br/>PSI > 0.25 on key feature]
    end
    S --> R[Retrain candidate]
    P --> R
    D --> R
    R --> V[Validate vs current model<br/>on held-out + recent slice]
    V -->|better| Promote[Promote to challenger / prod]
    V -->|worse| Keep[Keep current, log]

Scheduled retraining is the simplest and surprisingly often the right answer — it has no detector to misfire, and if relabeling is cheap and drift is gradual, a weekly cron job beats an elaborate trigger nobody trusts. Performance-based triggers fire when a live metric breaches its SLA, but they require labels and so react late. Drift-based triggers fire on \(P(X)\) movement and so react early, at the cost of false alarms when a covariate shift is harmless.
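The three trigger styles can be combined in one small decision function. This is a sketch under assumed thresholds (a 7-day schedule, an AUC SLA of 0.80, a PSI limit of 0.25); the function and parameter names are hypothetical, and a non-empty result only creates a retraining candidate, it does not deploy anything.

```python
def retrain_reasons(days_since_retrain, live_auc, psi_by_feature,
                    max_age_days=7, auc_sla=0.80, psi_limit=0.25):
    """Return the list of triggers that fired; an empty list means keep the current model."""
    reasons = []
    if days_since_retrain >= max_age_days:                 # scheduled trigger
        reasons.append("scheduled")
    if live_auc is not None and live_auc < auc_sla:        # performance trigger (needs labels)
        reasons.append("performance")
    drifted = sorted(f for f, v in psi_by_feature.items() if v > psi_limit)
    if drifted:                                            # drift trigger (labels not needed)
        reasons.append("drift:" + ",".join(drifted))
    return reasons

print(retrain_reasons(3, None, {"income": 0.31, "age": 0.04}))
print(retrain_reasons(8, 0.75, {}))
```

Passing `live_auc=None` models the common case where labels have not matured yet, so only the schedule and drift triggers can fire.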

The danger in all automated retraining is the feedback loop: the model influences what data it later sees (a fraud model blocks transactions, so it never observes their outcomes), and naive retraining bakes that bias in. Always validate a freshly retrained candidate against the incumbent on a common, recent, honestly labeled slice before promotion — never auto-deploy on “drift detected” alone.

When full retraining from scratch is too expensive, continual learning stages the update: warm-start from the current weights and fine-tune on recent data. The staging discipline that keeps this safe — hold out a recent window, retrain, compare, promote only on a win, keep the old model one click away — is exactly the testing-in-production machinery of the next section.

Warning

Continual fine-tuning on a stream invites catastrophic forgetting: the model overfits last week and loses skill on the long tail it learned months ago. Mitigations are a replay buffer (mix old data into each update) and validating on a stable held-out set that spans the full history, not just recent traffic.
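A replay buffer can be as simple as reserving a fixed fraction of every update batch for older examples. The sketch below assumes a 25% replay fraction and in-memory lists; both the helper name and the fraction are illustrative choices, not recommendations.

```python
import random

def build_update_batch(recent, replay_buffer, batch_size, replay_fraction=0.25, seed=0):
    """Mix recent examples with a sample of older ones so each update still sees the long tail."""
    rng = random.Random(seed)
    n_old = min(round(batch_size * replay_fraction), len(replay_buffer))
    n_new = min(batch_size - n_old, len(recent))
    batch = rng.sample(recent, n_new) + rng.sample(replay_buffer, n_old)
    rng.shuffle(batch)                       # avoid ordering old and new in separate blocks
    return batch

recent = [("new", i) for i in range(100)]    # last week's traffic
replay = [("old", i) for i in range(1000)]   # sample of historical data
batch = build_update_batch(recent, replay, batch_size=8)
print(batch)
```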

29.8 — Testing in Production and the Serving Stack

Offline metrics lie by omission. A model can post a higher validation AUC and still lose money in production because of latency under real load, a feature that is computed differently at serving time than in training, or a subtle interaction with downstream business logic that no offline harness reproduces. The only fully faithful test environment is production itself. The art is exposing the new model to real traffic while keeping the blast radius small and the rollback instant. That is what the release strategies below buy you.

Four release strategies, from safest to most informative

flowchart TB
    subgraph Shadow["Shadow (mirror)"]
      U1[Live traffic] --> P1[Prod model] --> R1[Served to user]
      U1 -.copy.-> N1[New model] -.-> Log1[(Log only,<br/>not served)]
    end
    subgraph Canary["Canary (small %)"]
      U2[Live traffic] --> Split2{Router}
      Split2 -->|95%| P2[Prod model]
      Split2 -->|5%| N2[New model]
    end
    subgraph BG["Blue-Green (instant swap)"]
      U3[Live traffic] --> Router3{Router}
      Router3 -->|all| Blue[Blue = current]
      Green[Green = new, warm] -.flip on go.-> Router3
    end

The canary ramp is the strategy you will reach for most, and it is easier to feel as a slider: traffic creeps from the old model to the new one — 5%, 25%, 50%, 100% — and at any step a bad metric flips it instantly back to zero. Watch the new model’s share climb while it stays healthy:

[Figure: canary ramp. The new model’s traffic share steps from 5% to 25%, 50%, and 100% while it stays healthy; any bad metric snaps it back to 0%.]
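One common way to implement the slider is hash-based bucketing: hash each user id into a fixed range and send the lowest buckets to the new model. The sketch below is an assumption about how such a router could look, not a description of any particular gateway; `route` and its parameters are hypothetical names.

```python
import hashlib

def route(user_id: str, canary_percent: float) -> str:
    """Send a stable canary_percent of users to the new model; everyone else stays on prod."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000    # stable bucket in 0..9999
    return "new" if bucket < canary_percent * 100 else "prod"

for pct in (5, 25, 50, 100, 0):                            # ramp up, then roll back
    share = sum(route(f"user-{i}", pct) == "new" for i in range(20_000)) / 20_000
    print(pct, round(share, 3))
```

Because the bucket depends only on the user id, a user who saw the new model at 5% keeps seeing it at 25%, and setting the percentage to 0 is the instant rollback.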

Shadow (dark launch). The new model receives a copy of every live request and produces predictions, but its output is logged, never served. Zero user risk — you are comparing the two models’ outputs on identical real inputs, and surfacing operational problems (latency, crashes, feature-pipeline errors) before any customer is affected. The limit: shadow mode cannot measure outcomes that depend on acting on the prediction. A shadow recommender’s clicks are unobservable because nobody saw its recommendations.

Canary. Route a small slice — say 5% — of real users to the new model and watch their metrics against the 95% control. If error rates or business KPIs degrade, roll back having harmed only a sliver of traffic. You ramp 5% → 25% → 50% → 100% as confidence grows. Canary measures real outcomes, unlike shadow, but exposes real users, so it comes second.

Blue-green. Keep two complete, warm environments: blue (current) and green (new). All traffic hits blue; green is fully deployed and tested out of band. To release, flip the router so 100% goes to green. Rollback is flipping back — seconds, not a redeploy. The cost is running double the infrastructure during the transition.

Interleaving. A trick specific to ranking and recommendation. Instead of splitting users between two rankers (A/B) you splice both rankers’ results into one list shown to every user — alternating or team-draft style — and attribute each click to whichever ranker contributed that item. Because every user experiences both, interleaving detects the better ranker with far fewer impressions (often 10–100× less traffic) than a classic A/B split, which is decisive when you have limited traffic or many rankers to compare.
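A minimal sketch of team-draft interleaving, assuming two ranked lists of document ids: in each round a coin flip decides which ranker drafts first, each ranker adds its best not-yet-shown item, and clicks are credited to the team that drafted the clicked item. Function names and the example rankings are illustrative.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k, seed=0):
    """Build one list of up to k items, remembering which ranker drafted each."""
    rng = random.Random(seed)
    shown, team = [], {}
    while len(shown) < k:
        order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]   # who drafts first
        added = False
        for name in order:
            ranking = ranking_a if name == "A" else ranking_b
            pick = next((d for d in ranking if d not in team), None)
            if pick is not None and len(shown) < k:
                shown.append(pick)
                team[pick] = name
                added = True
        if not added:                      # both rankings exhausted
            break
    return shown, team

def credit(team, clicked):
    """Count clicks per ranker; the ranker with more credited clicks wins the impression."""
    wins = {"A": 0, "B": 0}
    for d in clicked:
        if d in team:
            wins[team[d]] += 1
    return wins

a = ["d1", "d2", "d3", "d4"]
b = ["d3", "d1", "d5", "d6"]
shown, team = team_draft_interleave(a, b, k=4)
print(shown, team, credit(team, ["d5", "d3", "d1"]))
```

The per-round coin flip keeps either ranker from systematically owning the top slot, which is what makes the click counts comparable.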

| Strategy | User risk | Measures real outcomes? | Extra infra | Best for |
|---|---|---|---|---|
| Shadow | none | no (logs only) | model replica | catching ops/skew bugs pre-launch |
| Canary | low (small %) | yes | router + monitoring | gradual rollout of any model |
| Blue-green | low (instant rollback) | yes (after flip) | 2× environment | fast, atomic cutover |
| Interleaving | low | yes, click-level | result splicer | comparing rankers cheaply |

Champion / challenger

These strategies generalize into an evergreen pattern. The champion is the model currently serving production. One or more challengers run alongside — in shadow, or on a canary slice — continuously evaluated against the champion on the same live traffic. A challenger that beats the champion on the agreed metric over a sufficient window is promoted to champion; the old champion is demoted but kept warm for one-click rollback. This is the operational form of the retraining loop from the previous section: every freshly retrained candidate enters as a challenger, proves itself on production traffic, and only then takes the throne.

flowchart LR
    Champ[Champion<br/>serving 100%] -->|same live traffic| Eval{Challenger beats<br/>champion over window?}
    Chal[Challenger<br/>shadow / canary] --> Eval
    Eval -->|yes| Promote[Promote challenger,<br/>demote champion to warm standby]
    Eval -->|no| Retire[Retire challenger,<br/>keep champion]

The hidden failure mode: training–serving skew

The most common reason a model that aced offline evaluation underperforms live is training–serving skew: a feature is computed one way in the training pipeline and a different way at serving time. The classic culprits are a Python/pandas transform in the offline notebook reimplemented by hand in the online Java service, a unit mismatch (cents vs. dollars), or a time-leak where a training feature secretly used future information unavailable at request time.

A tiny worked example. Suppose a feature is “average purchase over the last 30 days.” Offline, the data scientist computes it over a calendar month; online, the service uses a rolling 30-day window that happens to exclude today’s pending order. Same name, different number, silently wrong predictions:

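import numpy as np
# toy order amounts; the third order is still pending and booked as 0
orders = {"jan": np.array([100, 200, 0]), "last_30d_settled": np.array([100, 200])}

# training (offline): calendar-month mean over every order, pending one included
feat_train = orders["jan"].mean()               # (100+200+0)/3 = 100.0

# serving (online): rolling 30d, drops the still-pending order
feat_serve = orders["last_30d_settled"].mean()  # (100+200)/2 = 150.0
# -> model sees 150 in prod but learned the meaning of 100. Skew.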

The structural fix is a feature store: a single system that computes each feature once and serves the same value to both training and inference. It has two synchronized faces — an offline store (columnar, high-throughput, for building training sets with point-in-time-correct joins that prevent leakage) and an online store (low-latency key-value, for fetching a feature vector in single-digit milliseconds at request time). Because both read from one feature definition, the calendar-vs-rolling discrepancy above cannot arise.
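The core idea, one definition evaluated at an explicit timestamp, can be sketched without any feature-store product. Below, a single hypothetical function `avg_purchase_30d` serves both paths: training calls it with the label's timestamp (a point-in-time-correct value), serving calls it with the current date. The tuple layout of `orders` is an assumption for the example.

```python
from datetime import date, timedelta

def avg_purchase_30d(orders, as_of):
    """Mean settled order amount in the 30 days up to and including `as_of`.

    `orders` is a list of (order_date, amount, settled) tuples.
    """
    start = as_of - timedelta(days=30)
    amounts = [amt for d, amt, settled in orders if settled and start < d <= as_of]
    return sum(amounts) / len(amounts) if amounts else 0.0

orders = [
    (date(2024, 1, 5), 100.0, True),
    (date(2024, 1, 20), 200.0, True),
    (date(2024, 1, 31), 50.0, False),    # still pending
    (date(2024, 2, 10), 400.0, True),    # lies in the future for a Jan 31 request
]

# the training row labelled on Jan 31 and a request served on Jan 31 get the same value
print(avg_purchase_30d(orders, date(2024, 1, 31)))
print(avg_purchase_30d(orders, date(2024, 2, 10)))
```

Passing `as_of` explicitly is what rules out both failure modes from the example: the window and the settled filter are defined once, and no order dated after the request can leak into a training row.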

flowchart LR
    Raw[(Raw data)] --> FD[Feature definition<br/>computed ONCE]
    FD --> Off[(Offline store<br/>columnar, point-in-time joins)]
    FD --> On[(Online store<br/>low-latency KV)]
    Off --> Train[Training pipeline]
    On --> Serve[Inference service]
    Train --> MS[(Model store / registry<br/>versions, metrics, stage)]
    MS --> Serve

Its sibling is the model store (or model registry): the versioned catalog of trained model artifacts, each tagged with its training data snapshot, hyperparameters, offline metrics, and lifecycle stage (staging / production / archived). The registry is what makes champion/challenger and blue-green mechanical rather than heroic — promoting a challenger is a stage transition in the registry, and rollback is pointing the serving layer at the previous version’s artifact.

Tip

Feature store and model store answer the two halves of reproducibility. The feature store guarantees the same inputs offline and online; the model store guarantees the same artifact you evaluated is the one you serve. Skew enters wherever one of those guarantees is missing — which is why shadow mode, whose entire job is comparing offline-expected against online-actual on identical inputs, is the cheapest insurance you can buy before a real launch.

29.9 — Serving generative models: LLMOps

The intuition: everything so far assumed a model takes a feature vector and returns one number, fast and cheap. A large language model breaks all three assumptions at once — it takes variable-length text, returns one token at a time in a loop, and a single answer can cost a hundred forward passes instead of one. Serving it well is a different sport, and the practice has its own name: LLMOps.

Why the classic serving playbook strains:

| Classic ML serving | Generative / LLM serving |
|---|---|
| one forward pass per request | one pass per output token (autoregressive loop) |
| fixed-size input/output | variable prompt length, variable answer length |
| millisecond latency, kilobyte model | seconds-long generations, multi-GB weights |
| metric = accuracy/AUC | metric = quality, plus tokens/sec and $/1k tokens |
| cache the prediction | cache the KV state and the prompt prefix |

Autoregressive generation is the root of the difference: the model produces token 1, appends it to the input, produces token 2 from the longer sequence, and so on. The cost of a 500-token answer is roughly 500 sequential forward passes. The whole loop is best felt — watch each token get emitted, appended, and fed back in as the input to produce the next:

[Figure: autoregressive decode. One token per forward pass; each new token is appended to the prompt and becomes input for the next step, with the KV cache reused.]

The single most important optimization is the KV cache — the attention keys and values computed for tokens already generated are stored and reused, so each new token is one cheap step instead of re-reading the whole sequence. The serving cost then splits cleanly into two phases:

flowchart LR
  P[Prompt] --> Pre[Prefill:<br/>process whole prompt once<br/>compute-bound, fills KV cache]
  Pre --> Dec[Decode:<br/>1 token per step, reuse KV cache<br/>memory-bandwidth-bound]
  Dec -->|loop until EOS| Dec
  Dec --> Out[Completion]

In words: prefill reads the entire prompt in one compute-heavy pass and populates the KV cache; decode then emits tokens one at a time, each step cheap but bandwidth-bound — which is why long answers, not long prompts, dominate latency.

Three latency numbers matter, not one:

| Metric | Reads as | Driven by |
|---|---|---|
| TTFT (time to first token) | how long until the answer starts | prefill (prompt length) |
| TPOT / ITL (time per output token) | how fast it streams after that | decode (model size, batch) |
| end-to-end | TTFT + TPOT × output length | both |

A chat UI feels fast when TTFT is low and tokens stream, even if the full answer takes seconds — which is why generations are streamed to the client token-by-token rather than returned whole.
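The end-to-end formula from the table is easy to turn into a budget check. The latency numbers below are made up for illustration; real TTFT and TPOT depend on the model, hardware, and load.

```python
def end_to_end_latency(ttft_s, tpot_s, output_tokens):
    """End-to-end latency in seconds: time to first token plus per-token decode time."""
    return ttft_s + tpot_s * output_tokens

# long prompt, short answer vs. short prompt, long answer (illustrative numbers)
long_prompt = end_to_end_latency(ttft_s=1.2, tpot_s=0.02, output_tokens=50)
long_answer = end_to_end_latency(ttft_s=0.3, tpot_s=0.02, output_tokens=500)
print(f"long prompt, short answer: {long_prompt:.1f} s")
print(f"short prompt, long answer: {long_answer:.1f} s")
```

With these assumed numbers, the longer answer costs far more wall-clock time than the longer prompt, even though streaming makes its first token arrive sooner.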

The throughput trick that makes LLM serving economical is continuous (in-flight) batching: instead of waiting for a fixed batch of requests to all finish (they finish at wildly different lengths), the server swaps a completed sequence out and a waiting one in at every decode step, keeping the GPU saturated. Combined with PagedAttention (managing the KV cache in non-contiguous “pages” like OS virtual memory, eliminating fragmentation), this is what serving engines like vLLM, TGI, and TensorRT-LLM deliver — often an order-of-magnitude more throughput than naive batching. In practice you point a server at a model and get an OpenAI-compatible endpoint:

# vLLM: continuous batching + PagedAttention; offline API shown here, server command on the last line
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # weights loaded ONCE
params = SamplingParams(temperature=0.7, max_tokens=256)

out = llm.generate(["Explain KV caching in one sentence."], params)
print(out[0].outputs[0].text)
# serve: `vllm serve meta-llama/Llama-3.1-8B-Instruct`  -> POST /v1/chat/completions
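The scheduling idea behind continuous batching can be seen in a toy simulation that counts decode steps only. It is a sketch, not vLLM's scheduler: it ignores prefill, KV-cache memory limits, and the fact that step time grows with batch size, and the output lengths are invented.

```python
def static_batching_steps(lengths, batch_size):
    """Fixed batches: every batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """In-flight batching: a finished sequence's slot is refilled before the next step."""
    waiting, running, steps = list(lengths), [], 0
    while waiting or running:
        while waiting and len(running) < batch_size:   # admit waiting requests into free slots
            running.append(waiting.pop(0))
        running = [r - 1 for r in running]             # one decode step for every active sequence
        running = [r for r in running if r > 0]        # completed sequences leave the batch
        steps += 1
    return steps

lengths = [10, 200, 15, 20, 180, 12, 25, 30]   # output tokens per request (illustrative)
print(static_batching_steps(lengths, 4), continuous_batching_steps(lengths, 4))
```

In this example the static schedule is held up twice by a long sequence, while the continuous schedule finishes when the single longest request does.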

Two cost levers are unique to this regime. Prompt caching reuses the prefill of a shared prefix — a long system prompt sent on every request is processed once and its KV state reused, cutting TTFT and cost for every subsequent call. And because most LLM apps call a hosted API rather than self-host, the cost unit is tokens, not requests: you are billed per input and output token, and output tokens usually cost several times more than input tokens (they each require a full forward pass).

# Token cost is the LLM FinOps unit — estimate before you ship a prompt
def cost(in_tok, out_tok, in_price=3.0, out_price=15.0):  # $ per 1M tokens
    return (in_tok * in_price + out_tok * out_price) / 1e6

# a verbose 800-token system prompt on 1M calls/day, 200-token answers:
print(round(cost(800 + 50, 200) * 1_000_000, 2))  # ~$5550/day
# trim the system prompt to 200 tokens -> the input line drops ~3x. Prompt length IS spend.

Evaluating generative output is the other hard part: there is no single ground-truth label to compute accuracy against. The toolkit is layered — automatic reference metrics where you do have a gold answer, an LLM-as-judge (a strong model scores outputs against a rubric) for open-ended quality at scale, and human review on a sampled slice for the cases that matter. Guardrails sit on the serving path too: output filtering for toxicity and PII, and grounding checks for hallucination (see Large Language Models and Ethics & Responsible AI).

Warning

Do not treat a hosted LLM endpoint as a stable artifact. The provider can update the model under the same name, silently shifting behavior — the generative version of concept drift. Pin model versions, keep a regression suite of prompts with expected-quality bars, and re-run it on every provider update exactly as you would gate a retrained model.

29.10 — Cost, efficiency & the FinOps of serving

A model that is accurate but ruinously expensive to serve never ships twice. Once a model leaves the notebook, its dominant cost is usually not training but inference: training happens once (or weekly); inference runs on every request, forever. A fraud model scoring a billion card swipes a month at even a tenth of a cent each is a million-dollar-a-month line item. MLOps owns this bill, and the discipline of watching and trimming it is sometimes called ML FinOps.

The intuition is a utility meter you forgot was running. A GPU instance left at 8% utilization costs the same as one at 90% — you pay for the hardware you reserved, not the work you did. The whole game is closing the gap between reserved and used.

The big levers, cheapest-effort first:

| Lever | What it does | Typical win |
|---|---|---|
| Right-size the instance | match GPU/CPU/memory to the actual model | stop paying for an A100 to serve a logistic regression |
| Dynamic batching | coalesce concurrent requests into one forward pass | 5–20× throughput per GPU |
| Autoscale to zero | spin replicas down when idle (serverless containers) | pay only for traffic, not for 3 a.m. silence |
| Quantization / distillation | smaller, faster model artifact (see Efficient Inference) | 2–4× latency + cost cut |
| Spot / preemptible nodes | cheap interruptible compute for batch and training | 60–90% off, never for latency-critical online |
| Caching | memoize repeated identical requests | free hits on hot inputs |

A worked back-of-envelope makes the batching lever vivid. Suppose one GPU forward pass takes 20 ms whether it processes 1 request or 16 (the GPU is bandwidth-bound, not compute-bound at this size). Serving one-at-a-time, you handle \(1000/20 = 50\) requests per second. Batch 16 together and you still spend 20 ms per pass but clear 16 requests in it — \(50 \times 16 = 800\) req/s on the same hardware. Same bill, 16× the work, which is exactly the gap between reserved and used closing.

A second, real-world worked example — autoscale-to-zero on a spiky internal tool. Say an analytics endpoint gets traffic only during business hours: ~8 hours of real load, 16 hours of silence. An always-on GPU replica bills 24 h/day. Scale it to zero when idle and you bill ~8 h/day — a 3× cut on that line item for one config flag, paid for only by a few seconds of cold-start on the first morning request. This is why “pay for traffic, not for 3 a.m. silence” is the cheapest lever on the table for any service whose load is bursty rather than constant.

[Figure: reserved vs. used GPU capacity. No batching: ~8% used, paying for idle. Batched: ~90% used, same bill, more work.]

Tip

Track cost per 1,000 predictions as a first-class metric next to latency and accuracy. A 2% accuracy gain that triples serving cost is often a worse model in production terms. The cheapest inference is the one you never run — cache, batch, and autoscale-to-zero before you reach for bigger hardware.

29.11 — Security, governance & responsible deployment

A deployed model is software exposed to the internet, plus a data asset, plus a decision-maker that can harm people. Each of those three adds an attack surface and a governance duty that classic software ops does not have. Skipping this section is how an ML system that “works” still ends up on the front page.

The ML-specific threats. Beyond ordinary API security (auth, rate limits, input validation — never skip these at the trust boundary), models face attacks aimed at the model itself:

  • Adversarial examples — inputs perturbed just enough to flip the prediction (a few pixels change a “stop sign” into “speed limit” to the computer-vision model). Mitigation: input sanitization, adversarial training, anomaly detection on inputs.
  • Data / model poisoning — an attacker injects crafted samples into your training data (easy when you scrape the web or accept user feedback), planting a backdoor that survives into the deployed model. Mitigation: provenance tracking, training-data validation, anomaly filtering.
  • Model extraction / inversion — by querying your endpoint enough, an adversary reconstructs the model or recovers private training records. Mitigation: rate limits, query monitoring, differential privacy.
  • Prompt injection / jailbreaks (for LLM-backed services) — untrusted text in the input hijacks the model’s instructions. Mitigation: input/output filtering, privilege separation, never trusting model output as a command.

flowchart LR
    In[Untrusted input] -->|sanitize, rate-limit| Gate[Security gateway]
    Gate --> Model[Model service]
    Model -->|filter, log| Out[Output]
    Train[(Training data)] -->|provenance + validation| Model
    style Gate fill:#f59e0b,fill-opacity:0.2

Governance: the paper trail that proves you were responsible. When a model makes a decision that affects someone — a loan denial, a medical flag, a content takedown — regulators and your own incident response need to answer who shipped what, trained on which data, and why. The artifacts that provide this:

  • Model cards — a short document shipped with each model stating its intended use, training data, evaluation results broken down by subgroup (so disparate performance is visible, not hidden in an aggregate), and known limitations.
  • Lineage / audit logs — the registry chain from §29.4 (prod v7 ⇄ run #142 ⇄ dataset d-2026-06) is also a compliance artifact: it is what lets you reproduce and explain a contested decision months later.
  • Bias & fairness checks as a pipeline gate — fairness is not a one-time audit but a test in the CI pipeline that blocks promotion if a subgroup metric regresses, exactly like a unit test (see AI Ethics, Fairness & Safety).
  • Access control & PII handling — who can read the feature store, deploy a model, or pull training data; encryption and minimization for sensitive fields.

A tiny worked illustration of why subgroup evaluation is non-negotiable: a model can post 92% overall accuracy yet be 96% accurate on group A (80% of traffic) and only 76% on group B (20% of traffic). The aggregate looks healthy; group B is being failed one time in four. Only the disaggregated number — the kind a model card forces you to publish — surfaces it.

Group | Share of traffic | Accuracy | Hidden in the 92%?
A | 80% | 96% | masks the problem
B | 20% | 76% | the real story
Overall | 100% | 92% | looks fine

The “overall” number is just the traffic-weighted average of the groups — and that is exactly how a minority’s failure hides:

\[\text{acc}_{\text{overall}} = \sum_g w_g\,\text{acc}_g\]

In words: overall accuracy is each group’s accuracy weighted by its share of traffic — so a small group with poor accuracy barely moves the headline number, which is why you must read the groups, not the average.

Plugging in the numbers: \(\text{acc}_{\text{overall}} = 0.80 \times 0.96 + 0.20 \times 0.76 = 0.768 + 0.152 = 0.92\). The 76% group is inside the 92%, just drowned out by its small weight.
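
The same disaggregation is easy to compute from a prediction log. The sketch below uses synthetic records constructed to mirror the worked numbers above (they are not real data), and the function names `accuracy_report` and `overall_accuracy` are made up for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Record = Tuple[str, int, int]  # (group, true label, predicted label)


def accuracy_report(records: Iterable[Record]) -> Dict[str, Tuple[float, float]]:
    """Map each group to (share of traffic, accuracy within that group)."""
    records = list(records)
    total = Counter(group for group, _, _ in records)
    correct = Counter(group for group, y_true, y_pred in records if y_true == y_pred)
    n = len(records)
    return {g: (total[g] / n, correct[g] / total[g]) for g in sorted(total)}


def overall_accuracy(report: Dict[str, Tuple[float, float]]) -> float:
    """Traffic-weighted average of the per-group accuracies."""
    return sum(weight * acc for weight, acc in report.values())


# Synthetic log: 800 requests from group A (768 correct), 200 from group B (152 correct).
records = ([("A", 1, 1)] * 768 + [("A", 1, 0)] * 32 +
           [("B", 1, 1)] * 152 + [("B", 1, 0)] * 48)
report = accuracy_report(records)

for group, (weight, acc) in report.items():
    print(f"{group}: share={weight:.0%} accuracy={acc:.0%}")
print(f"overall: {overall_accuracy(report):.0%}")
```

Reporting the per-group rows alongside the headline figure is the whole point: the weighted sum reproduces the overall number, but only the rows show where the errors concentrate.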

Warning

Treat the model registry as an audit log, not just a convenient artifact store. The same lineage chain that helps you debug a bad prediction is what you hand a regulator after one. Build it before you need it: reconstructing “which model decided this” after the fact is usually impossible.
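
To make "which model decided this" answerable, each logged decision needs to carry the model version, and each version needs to resolve to its run and dataset. A minimal in-memory sketch follows; the names `Lineage` and `AuditRegistry`, the request id, and the field layout are assumptions for illustration, and a real registry would persist these records durably.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Lineage:
    model_version: str
    run_id: str
    dataset_id: str


class AuditRegistry:
    """Hypothetical append-only registry: entries are written once and never edited."""

    def __init__(self) -> None:
        self._models: Dict[str, Lineage] = {}
        self._decisions: Dict[str, str] = {}  # request id -> model version

    def register(self, lineage: Lineage) -> None:
        if lineage.model_version in self._models:
            raise ValueError(f"{lineage.model_version} is already registered")
        self._models[lineage.model_version] = lineage

    def log_decision(self, request_id: str, model_version: str) -> None:
        if model_version not in self._models:
            raise KeyError(f"unregistered model version: {model_version}")
        self._decisions[request_id] = model_version

    def explain(self, request_id: str) -> Lineage:
        """Resolve a contested decision back to its model, run, and dataset."""
        return self._models[self._decisions[request_id]]


registry = AuditRegistry()
registry.register(Lineage("prod v7", "run #142", "d-2026-06"))
registry.log_decision("req-001", "prod v7")  # "req-001" is a made-up request id
print(registry.explain("req-001"))
```

Refusing to log a decision against an unregistered version is the design choice that matters here: it forces the lineage to exist before the model can serve, rather than after an incident.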

29.12 — Quick reference

Term / formula | One-line meaning | When / why it matters
Batch vs online serving | scheduled bulk scoring vs per-request low-latency service | default to batch; pay for online only when freshness is required
Model gateway | single routed entry point for many model versions | auth, logging, and canary/shadow traffic splits in one place
p50 / p95 / p99 | latency percentiles | hold an SLO on the tail (p99), never on the mean
SLO | promised bound on a metric (“p99 < 150 ms”) | what the serving stack is tuned to satisfy
DAG + orchestrator | tasks as a dependency graph, run by Airflow/Kubeflow/Prefect | runs independent branches in parallel, retries, resumes
Idempotency | rerunning a step gives the same result | overwrite-by-partition / upsert makes aggressive retries safe
Container / image | code + deps + artifact frozen into immutable bytes (Docker) | kills “works on my machine”; ship the exact bytes you tested
Kubernetes reconciliation | declare desired replicas; K8s closes the desired-vs-observed gap | self-healing fleet, readiness probes gate traffic
Experiment tracking + registry | log run code/data/params/metrics, version artifacts by stage | answers “which run/data made prod v7?” during an incident
\(\text{PSI}=\sum_i (a_i-e_i)\ln\frac{a_i}{e_i}\) | symmetric drift score between live and training histograms | \(<0.1\) stable, \(0.1\)–\(0.25\) watch, \(>0.25\) retrain
Covariate / label / concept drift | \(P(X)\) / \(P(Y)\) / \(P(Y\mid X)\) moves | inputs warn early (unlabeled); concept needs labels to confirm
Continuous training (CT) | auto-retrain + validate + promote on drift/schedule | the third CI/CD pipeline that ships a model when data moves
Shadow / canary / blue-green | log-only / small-% / instant-swap release | bound blast radius and keep rollback instant
Champion / challenger | incumbent serves; rivals evaluated on live traffic, promoted on a win | the evergreen form of safe testing-in-production
Training–serving skew | a feature computed differently offline vs online | the top reason an offline-good model fails live; fix with a feature store
Data parallelism + all-reduce \(\bar g=\frac1K\sum_k g_k\) | replicate model, shard batch, average gradients | scales most training; exact, until communication dominates
KV cache / continuous batching / PagedAttention | reuse attention state, swap requests per decode step | the throughput levers of LLM serving (vLLM/TGI)
TTFT / TPOT | time to first token / time per output token | streaming UX feels fast on low TTFT even for long answers
Cost per 1,000 predictions | inference unit-cost as a first-class metric | inference dominates lifetime cost; close reserved-vs-used gap
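
The PSI row is the one formula in the table that is worth having as code. A minimal sketch, assuming the inputs are already bin proportions (training share \(e_i\) and live share \(a_i\) per bin); the `eps` floor for empty bins and the function names are choices made for this example, and the band labels simply restate the thresholds from the table.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two histograms given as bin proportions.

    `expected` is the training-time share per bin, `actual` the live share.
    `eps` floors empty bins so the logarithm stays defined (an assumption of
    this sketch, not part of the formula itself).
    """
    if len(expected) != len(actual):
        raise ValueError("histograms must use the same bins")
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


def psi_band(score: float) -> str:
    """Map a PSI score onto the rule-of-thumb bands from the quick reference."""
    if score < 0.1:
        return "stable"
    if score <= 0.25:
        return "watch"
    return "retrain"


training = [0.25, 0.25, 0.25, 0.25]  # synthetic training histogram
live = [0.40, 0.30, 0.20, 0.10]      # synthetic live histogram
score = psi(training, live)
print(f"PSI={score:.3f} -> {psi_band(score)}")
```

Every term \((a_i-e_i)\ln\frac{a_i}{e_i}\) is non-negative and unchanged when the two histograms are swapped, which is why the score is symmetric and zero only when the distributions match.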

29.13 — Key takeaways

  • Deployment splits into batch vs online; default to batch, pay for online only when real-time freshness is required. Front multiple models with a gateway.
  • Latency is a tail, not a mean: hold an SLO on p99, and remember that batching buys throughput at the expense of tail latency unless you bound the batch window.
  • Pipelines are DAGs run by an orchestrator (Airflow/Kubeflow/Prefect); idempotent, overwrite-by-partition steps are what make retries safe, and a data-validation gate catches corruption before training spends a GPU-hour.
  • Docker makes one environment reproducible; Kubernetes keeps a fleet of containers alive via a desired-vs-observed reconciliation loop (declared as YAML, with readiness probes gating traffic).
  • A model rots silently — wrap it in experiment tracking, a model/data registry, and drift detection (PSI), with continuous training as a third CI/CD pipeline that ships a new model when the data changes.
  • Distribution shift comes in three kinds — covariate (\(P(X)\)), label (\(P(Y)\)), concept (\(P(Y\mid X)\)); watch unlabeled inputs for early warning, treat the delayed accuracy drop as confirmation.
  • Ship new versions via shadow → canary → blue-green (interleaving for rankers) to bound blast radius and keep rollback instant; the evergreen form is champion/challenger, and the silent killer is training–serving skew, fixed by a feature store.
  • Data parallelism (replicate model, shard batch, all-reduce) scales most training; model parallelism is for models too big to fit one device; gradient communication is the usual scaling limit.
  • Spark (in-memory) supplanted Hadoop MapReduce (disk-bound) as the big-data processing engine that builds your training tables; mind skewed keys.
  • Generative models need LLMOps — autoregressive decoding, the KV cache, continuous batching + PagedAttention (vLLM/TGI), prompt caching, token-based cost, TTFT/TPOT latency, and LLM-as-judge evaluation; pin provider model versions.
  • Inference dominates lifetime cost; close the reserved-vs-used gap with right-sizing, dynamic batching, autoscale-to-zero, and quantization, and track cost per 1,000 predictions (or per-token) as a first-class metric.
  • A deployed model adds ML-specific attack surfaces (adversarial examples, poisoning, extraction, prompt injection) and governance duties — ship model cards with subgroup metrics, keep the registry as an audit log, and gate promotion on fairness checks.
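
The "tail, not a mean" takeaway is easiest to see with numbers. The sketch below uses a synthetic latency sample and a simple nearest-rank percentile (one of several common percentile definitions, chosen here for clarity); the 150 ms bound is the example SLO from the quick reference.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Synthetic sample: 97 fast requests and 3 slow ones, in milliseconds.
latencies_ms = [20] * 97 + [900] * 3
slo_ms = 150

summary = {
    "mean": mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
for name, value in summary.items():
    verdict = "within SLO" if value < slo_ms else "violates SLO"
    print(f"{name}: {value} ms ({verdict})")
```

In this sample the mean, p50, and p95 all sit comfortably under the bound while p99 does not, so an SLO written against the mean would report a healthy service that is failing three requests in every hundred.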

29.14 — See also

  • AI Infrastructure & Efficient Inference — serving optimizations, quantization, and inference hardware that this chapter’s endpoints run on.
  • Tools & Frameworks — the broader library/framework landscape these pipelines are built from.
  • Model Evaluation & Tuning — the holdout metrics that gate continuous-training promotions.
  • Data Preprocessing — the transformations the pipeline and Spark layer actually perform.
  • Large Language Models — autoregressive decoding, KV caching, and the model-quality side of the LLMOps serving stack in §29.9.
  • Anomaly & Fraud Detection — a canonical online-serving, low-latency consumer of deployed models.
  • Ethics & Responsible AI — the fairness gates, hallucination guardrails, and subgroup auditing this chapter wires into the pipeline.

↪ The thread continues → Chapter 30 · 🚀 AI Infrastructure & Efficient Inference

MLOps coordinates the lifecycle; underneath sits the metal — GPUs, parallelism, quantization — that makes training and inference fast and affordable.



© Kader Mohideen