🚢 ML in Production — MLOps · Lesson 9 — Monitoring in Production: Latency, Drift, and Knowing Before Your Users Do

Lesson 9 — Monitoring in Production: Latency, Drift, and Knowing Before Your Users Do

In the previous lesson you wired CI/CD so that a green pipeline ships the churn model to production automatically. Which raises an uncomfortable question: shipped to what fate? A deployed model is not a finished model: it is a model whose test set is now the real world, and the real world does not hold still. Customers change plans, marketing changes campaigns, an upstream team renames a column, and your accuracy quietly rots while the service keeps returning HTTP 200. In this lesson we build the two nervous systems every production model needs: ops monitoring (is the service healthy?) with Prometheus metrics baked into the FastAPI app from Lesson 6, and ML monitoring (is the model still right?) with Evidently drift reports comparing live traffic against training data. Lesson 10 then closes the loop by turning a drift alert into an automatic retrain, so what we build today is literally the trigger for the finale.

🎯 In this lesson you will: instrument the FastAPI service with Prometheus counters and histograms, query p95 latency and error rate, log predictions for later analysis, run an Evidently data-drift report (reference vs. current window) and export it as HTML, and write alerting thresholds that page a human at the right moments.

Two dashboards, two failure modes

The single most important idea in ML monitoring is that there are two independent ways to be broken, and they need different instruments:

	Ops metrics	ML metrics
Question	Is the service working?	Is the model still valid?
Examples	p95 latency, error rate, RPS, memory	data drift, prediction drift, delayed accuracy
Timescale	seconds to minutes	days to weeks
Detection	instant (a 500 is a 500)	statistical (needs a window of traffic)
Fix	rollback, scale up, restart	retrain, re-feature, investigate upstream data
Alert style	page someone now	file a ticket, investigate this week

A model can be ops-healthy and ML-dead: 8 ms p95, 0% errors, and every prediction wrong because the upstream billing system started reporting monthly_charges in halalas instead of riyals. Conversely it can be ML-healthy and ops-dead: perfectly calibrated predictions timing out at 30 s. You need both dashboards, and confusing them is how teams end up paging the on-call at 3 a.m. because a KS-test p-value dipped — a statistical observation that no human can act on before coffee.

There is a third, sneaky category: delayed labels. For churn, you predict today whether a customer will cancel, but you only know the true answer after the billing cycle closes, 30+ days later. So real accuracy is a metric you can only compute in arrears. The diagram below shows how the three signals arrive on different clocks:

flowchart LR
    subgraph now["⏱ Real-time (seconds)"]
        A[Request] --> B[FastAPI service]
        B --> C["Ops metrics<br/>latency, errors, RPS"]
    end
    subgraph soon["📊 Near-time (hours-days)"]
        B --> D[(Prediction log)]
        D --> E["Drift analysis<br/>inputs & predictions<br/>vs. reference"]
    end
    subgraph later["🐢 Delayed (30+ days)"]
        F[(Ground-truth labels<br/>from billing)] --> G["True accuracy,<br/>precision/recall"]
        D --> G
    end
    C --> H{{Alerting}}
    E --> H
    G --> H

Drift monitoring exists precisely to fill the gap between “prediction made” and “label arrives”: if the inputs stop looking like training data, you have early warning that accuracy is probably degrading, weeks before you can prove it.
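Computing accuracy in arrears amounts to a join between the prediction log and labels that show up a billing cycle later. A minimal sketch of that join, assuming each logged prediction carries a hypothetical customer_id key (not part of this lesson's log schema) and that the rows and labels below are made up for illustration:

```python
from datetime import date, timedelta

LABEL_DELAY = timedelta(days=30)  # billing-cycle delay described above

# Hypothetical logged predictions: (customer_id, prediction date, predicted churn?)
predictions = [
    ("c1", date(2024, 5, 1), True),
    ("c2", date(2024, 5, 1), False),
    ("c3", date(2024, 5, 2), True),
    ("c4", date(2024, 5, 20), False),  # too recent: no label can exist yet
]

# Hypothetical ground truth from billing, known only after the cycle closes.
labels = {"c1": True, "c2": False, "c3": False}


def delayed_metrics(predictions, labels, today):
    """Score only predictions old enough to have a label; count the rest as pending."""
    tp = fp = fn = tn = pending = 0
    for customer_id, made_on, predicted in predictions:
        if today - made_on < LABEL_DELAY or customer_id not in labels:
            pending += 1
            continue
        actual = labels[customer_id]
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    scored = tp + fp + fn + tn
    return {
        "scored": scored,
        "pending": pending,
        "accuracy": (tp + tn) / scored if scored else None,
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
    }


report = delayed_metrics(predictions, labels, today=date(2024, 6, 5))
print(report)
```

The pending count is worth reporting next to the scores: it shows how much recent traffic the accuracy figure says nothing about, which is exactly the gap drift monitoring covers.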

Instrumenting FastAPI with Prometheus

Prometheus works by pull: your service exposes a plain-text /metrics endpoint, and a Prometheus server scrapes it every 15 s or so. Your app never pushes anything, never blocks on a metrics backend, and if Prometheus is down your service doesn’t care. That inversion is why it became the default.

Install the official client into the Lesson 6 service environment (and add it to the requirements.txt that Lesson 5’s Docker image builds from):

pip install prometheus-client==0.20.0

Now extend service/app.py from Lesson 6. Stage one: define the instruments at module level — they must be created once, not per request:

# service/app.py  (additions to the Lesson 6 service)
import time

from fastapi import FastAPI, Request
from prometheus_client import Counter, Histogram, make_asgi_app

# --- instruments: module-level singletons -----------------------------
PREDICTIONS = Counter(
    "churn_predictions_total",
    "Predictions served, by predicted class",
    labelnames=["predicted_class"],           # "churn" / "no_churn"
)

REQUEST_LATENCY = Histogram(
    "churn_request_latency_seconds",
    "End-to-end /predict latency",
    buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5),
)

PREDICTION_SCORE = Histogram(
    "churn_prediction_score",
    "Distribution of predicted churn probability",
    buckets=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9),
)

REQUEST_ERRORS = Counter(
    "churn_request_errors_total",
    "Requests that raised an exception or returned 5xx",
)

Three design decisions worth dwelling on:

Counter vs Histogram. A counter only goes up (total predictions, total errors); Prometheus computes rates from it at query time with rate(). A histogram is a set of counters — one per bucket — from which quantiles like p95 are estimated. You never compute p95 in your app; you record raw observations and let the query layer aggregate. Doing it the other way (computing quantiles in-process and exporting a gauge) is a classic mistake: pre-computed quantiles from different replicas cannot be averaged — the average of two p95s is not the p95 of the combined traffic.
Buckets are your resolution. Prometheus estimates p95 by interpolating within the bucket where the 95th percentile falls. The client’s default buckets are laid out for generic web apps and stretch out to 10 s; ours stop at 2.5 s and keep several boundaries in the region where a scikit-learn model behind FastAPI actually spends its time (5–100 ms). If all your traffic lands in one bucket, your p95 is a guess across that bucket’s whole width.
Label cardinality is a budget. predicted_class has 2 values → 2 time series. If you ever feel tempted to add customer_id as a label: don’t. Each unique label combination is a separate time series that Prometheus has to index and keep for the whole retention period; per-user labels will take Prometheus down. Rule: labels are for dimensions you’d group by on a dashboard, never for identifiers.
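The claim that pre-computed quantiles cannot be averaged is easy to check with a few lines of arithmetic. The sketch below uses two made-up replicas with very different traffic volumes and a simple nearest-rank percentile; the numbers are illustrative, not measurements from the churn service:

```python
def p95(samples):
    """Nearest-rank 95th percentile: smallest value with at least 95% of samples at or below it."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


# Replica A serves 1000 fast requests (10 ms); replica B serves 100 slow ones (500 ms).
replica_a = [0.010] * 1000
replica_b = [0.500] * 100

avg_of_p95s = (p95(replica_a) + p95(replica_b)) / 2
true_p95 = p95(replica_a + replica_b)

print(f"average of per-replica p95s: {avg_of_p95s * 1000:.0f} ms")
print(f"p95 of the combined traffic: {true_p95 * 1000:.0f} ms")
```

The averaged figure lands roughly halfway between the replicas, while the combined p95 sits with the slow requests, because more than 5% of all traffic is slow. Histogram buckets avoid the problem: bucket counts from different replicas can simply be summed before the quantile is estimated.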

Stage two: wire the instruments into the request path. Latency and errors belong in middleware (so they cover every route, including validation failures that never reach your handler); prediction-specific metrics belong in the handler:

app = FastAPI(title="churn-service")

# --- expose /metrics as a sub-application ------------------------------
app.mount("/metrics", make_asgi_app())

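@app.middleware("http")
async def track_requests(request: Request, call_next):
    # The mounted sub-app may be reached as /metrics or /metrics/, so match the prefix.
    if request.url.path.startswith("/metrics"):  # don't measure the measurer
        return await call_next(request)
    start = time.perf_counter()
    try:
        response = await call_next(request)
    except Exception:
        REQUEST_ERRORS.inc()
        raise
    finally:
        # Observe latency for failed requests too, so the error-rate
        # denominator (the histogram's _count) includes them.
        REQUEST_LATENCY.observe(time.perf_counter() - start)
    if response.status_code >= 500:
        REQUEST_ERRORS.inc()
    return response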

make_asgi_app() returns a tiny ASGI app that renders the default registry in Prometheus’s exposition format; mounting it means /metrics is served by the same Uvicorn process — no sidecar, no extra port. Note the guard excluding /metrics itself from latency tracking: a scrape every 15 s is enough traffic to pollute your histogram with cheap non-inference requests, dragging p50 down and hiding regressions.

Stage three: instrument the prediction handler from Lesson 6:

@app.post("/predict")
def predict(payload: CustomerFeatures):          # Pydantic model from Lesson 6
    df = payload.to_frame()                      # 1-row DataFrame, training column order
    proba = float(model.predict_proba(df)[0, 1])
    label = "churn" if proba >= 0.5 else "no_churn"

    PREDICTIONS.labels(predicted_class=label).inc()
    PREDICTION_SCORE.observe(proba)

    return {"churn_probability": proba, "prediction": label}

Hit the service a few times and look at what Prometheus will see:

curl -sL localhost:8000/metrics | grep churn_

churn_predictions_total{predicted_class="no_churn"} 41.0
churn_predictions_total{predicted_class="churn"} 9.0
churn_request_latency_seconds_bucket{le="0.005"} 12.0
churn_request_latency_seconds_bucket{le="0.01"} 47.0
churn_request_latency_seconds_bucket{le="0.025"} 50.0
churn_request_latency_seconds_bucket{le="+Inf"} 50.0
churn_request_latency_seconds_sum 0.412
churn_request_latency_seconds_count 50.0
churn_prediction_score_bucket{le="0.1"} 18.0
...

That’s the whole contract: cumulative buckets (le = “less than or equal”), a sum, and a count. Everything else — rates, quantiles, ratios — is computed at query time in PromQL. The three queries you will actually put on a dashboard:

# p95 latency over the last 5 minutes
histogram_quantile(0.95, rate(churn_request_latency_seconds_bucket[5m]))

# error rate as a fraction of traffic
rate(churn_request_errors_total[5m]) / rate(churn_request_latency_seconds_count[5m])

# fraction of predictions that are "churn" — cheap prediction-drift canary
# sum() on both sides drops instance/job labels, so the two vectors match
sum(rate(churn_predictions_total{predicted_class="churn"}[1h]))
  / sum(rate(churn_predictions_total[1h]))

That last one deserves a highlight: PREDICTION_SCORE and the churn-fraction query give you prediction drift in real time, before any label arrives and before any batch job runs. If the model predicted 18% churners all month and suddenly predicts 45%, either the world changed or your input pipeline broke — both are worth knowing within the hour, and a plain Prometheus alert can catch it.
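To see how the p95 query turns cumulative buckets into a number, here is a small sketch of the interpolation step, applied to the raw bucket counts from the sample /metrics output above. It is a simplified re-implementation for illustration, not Prometheus code: the real query runs on rate() of the buckets rather than raw counts, and the function name is reused here only for readability.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Beyond the last finite bucket there is nothing to interpolate
                # against, so report the highest finite bound.
                return prev_bound
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Cumulative counts from the sample output: le=0.005, 0.01, 0.025, +Inf
sample = [(0.005, 12), (0.01, 47), (0.025, 50), (math.inf, 50)]
p95 = histogram_quantile(0.95, sample)

# Same 50 requests, but every one of them lands in the (0.1, 0.25] bucket.
coarse = [(0.05, 0), (0.1, 0), (0.25, 50), (math.inf, 50)]
p95_coarse = histogram_quantile(0.95, coarse)

print(f"p95 from the sample buckets: {p95 * 1000:.1f} ms")
print(f"p95 when one bucket holds everything: {p95_coarse * 1000:.1f} ms")
```

In the second case the estimate depends only on the bucket edges, not on where the requests actually fell inside the bucket, which is the "guess across the bucket's whole width" problem in concrete form.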

For local verification, a minimal scrape config (prometheus.yml) pointed at the container from Lesson 5:

scrape_configs:
  - job_name: churn-service
    scrape_interval: 15s
    static_configs:
      - targets: ["host.docker.internal:8000"]

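# On Linux, --add-host makes host.docker.internal resolve to the host
# (Docker Desktop provides that name out of the box).
docker run -p 9090:9090 \
  --add-host=host.docker.internal:host-gateway \
  -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml" \
  prom/prometheus:v2.53.0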

Open http://localhost:9090, paste the p95 query, and you have a live latency chart against your Lesson 5 container.

Logging predictions — the raw material for drift

Ops metrics are aggregates; drift analysis needs the actual rows. So the second thing the service must do is write every request’s features and output somewhere a batch job can read. In real systems that’s Kafka, BigQuery, or S3-partitioned Parquet; the pattern is identical with a local JSONL file, so that’s what we’ll use — the analysis code doesn’t care where the rows came from.

# service/predlog.py
import json
from datetime import datetime, timezone
from pathlib import Path

LOG_DIR = Path("prediction_logs")
LOG_DIR.mkdir(exist_ok=True)

def log_prediction(features: dict, proba: float, model_version: str) -> None:
    # Take one timestamp so the record and its day file cannot disagree at midnight.
    now = datetime.now(timezone.utc)
    record = {
        "ts": now.isoformat(),
        "model_version": model_version,       # from the Lesson 4 registry stage
        **features,
        "churn_probability": proba,
    }
    day_file = LOG_DIR / f"{now:%Y-%m-%d}.jsonl"
    with day_file.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

And one line in the handler, after computing proba:

    log_prediction(payload.model_dump(), proba, model_version=MODEL_VERSION)

Two details that pay off later. Log the features the model actually saw, post-validation — not the raw request body — so your drift analysis measures the model’s true input distribution, not client typos that Pydantic already rejected. And log the model version with every row: when Lesson 10’s retraining deploys v7 mid-week, you must be able to separate “the data drifted” from “the model changed” — mixing predictions from two model versions in one drift window is the most common way to hallucinate prediction drift.
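As an illustration of why the model_version field earns its place, the sketch below reads a JSONL log back and groups scores by version before summarizing them. The log rows are made up, and scores_by_version is a hypothetical helper, not part of the lesson's service code:

```python
import json
import tempfile
from collections import defaultdict
from pathlib import Path
from statistics import mean


def scores_by_version(log_dir):
    """Group logged churn probabilities by the model version that produced them."""
    grouped = defaultdict(list)
    for day_file in sorted(Path(log_dir).glob("*.jsonl")):
        for line in day_file.read_text().splitlines():
            if line.strip():
                row = json.loads(line)
                grouped[row["model_version"]].append(row["churn_probability"])
    return dict(grouped)


# Made-up rows: v7 was deployed mid-window and scores higher than v6.
rows = [
    {"model_version": "v6", "churn_probability": 0.10},
    {"model_version": "v6", "churn_probability": 0.20},
    {"model_version": "v7", "churn_probability": 0.40},
    {"model_version": "v7", "churn_probability": 0.50},
]

with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "2024-06-01.jsonl"
    log_file.write_text("".join(json.dumps(r) + "\n" for r in rows))
    grouped = scores_by_version(tmp)

mixed_mean = mean(s for scores in grouped.values() for s in scores)
per_version = {version: mean(scores) for version, scores in grouped.items()}

print("mixed window :", round(mixed_mean, 3))
print("per version  :", {v: round(m, 3) for v, m in per_version.items()})
```

The mixed window has a mean that neither model ever produced on its own. Comparing each version against its own reference keeps a deployment from showing up as drift.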

flowchart TD
    A["/predict handler"] --> B["Prometheus instruments<br/>(aggregates, real-time)"]
    A --> C["Prediction log<br/>(raw rows, JSONL/Parquet)"]
    B --> D["Grafana / Alertmanager<br/>p95, error rate, churn-fraction"]
    C --> E["Evidently batch job<br/>(daily cron / CI step)"]
    F[("Reference window<br/>= training data snapshot<br/>from Lesson 2")] --> E
    E --> G["drift_report.html"]
    E --> H{"drift share > 0.3?"}
    H -- yes --> I["Ticket / trigger Lesson 10 retrain"]
    H -- no --> J["✓ archive report"]

Drift detection with Evidently

Now the ML side. Data drift is a change in the input distribution \(P(X)\); prediction drift is a change in the output distribution \(P(\hat{y})\). Neither proves accuracy dropped — that would require labels — but each is measurable today, and large input drift is the best leading indicator you can get for free.

The mechanic is always reference vs. current: freeze a snapshot of the data the model was trained on (Lesson 2 made this reproducible — same commit, same data hash), take a recent window of production traffic from the prediction log, and compare the two, column by column, with statistical tests.

We’ll use Evidently (the current ≥0.7 API — it changed substantially from the old ColumnMapping era, so beware stale tutorials):

pip install evidently==0.7.5 pandas

Stage one of monitoring/drift_check.py: load the two windows.

# monitoring/drift_check.py
import json
import sys
from pathlib import Path

import pandas as pd

FEATURES_NUM = ["tenure_months", "monthly_charges", "total_charges", "support_tickets"]
FEATURES_CAT = ["contract_type", "payment_method", "internet_service"]
COLUMNS = FEATURES_NUM + FEATURES_CAT

def load_reference(path: str = "data/train_reference.parquet") -> pd.DataFrame:
    """The Lesson 2 training snapshot — same rows the model learned from."""
    return pd.read_parquet(path)[COLUMNS]

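def load_current(log_dir: str = "prediction_logs", days: int = 7) -> pd.DataFrame:
    # Day files sort lexicographically by date, so the tail is the newest window.
    files = sorted(Path(log_dir).glob("*.jsonl"))[-days:]
    # read_text() closes each file; skip blank lines so json.loads never sees "".
    rows = [json.loads(line) for f in files
            for line in f.read_text().splitlines() if line.strip()]
    df = pd.DataFrame(rows)
    return df[COLUMNS + ["churn_probability", "model_version"]]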

The reference window is not negotiable: it must be the actual training data (or a stratified sample of it), pinned by Lesson 2’s data versioning. Using “last month’s production traffic” as reference measures recent change but silently accepts any slow drift that already happened — fine as a second comparison, wrong as the only one.
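The difference between a pinned and a rolling reference shows up clearly under slow drift. The sketch below uses a deliberately crude stand-in for a drift test (relative change in a column mean) and made-up numbers: a hypothetical training mean of 70 for monthly_charges that creeps up 2% per week.

```python
TRAIN_MEAN = 70.0      # hypothetical mean of monthly_charges in the training snapshot
WEEKLY_GROWTH = 1.02   # hypothetical 2% creep per week
THRESHOLD = 0.10       # flag a window whose mean moved more than 10% from its reference

weekly_means = [TRAIN_MEAN * WEEKLY_GROWTH ** week for week in range(1, 13)]


def relative_change(current, reference):
    """Crude drift score: relative move of the mean (a stand-in for a real test)."""
    return abs(current - reference) / reference


# Fixed reference: always compare against the training snapshot.
fixed_alerts = [
    week for week, m in enumerate(weekly_means, start=1)
    if relative_change(m, TRAIN_MEAN) > THRESHOLD
]

# Rolling reference: compare each week against the week before it.
rolling_alerts = []
previous = TRAIN_MEAN
for week, m in enumerate(weekly_means, start=1):
    if relative_change(m, previous) > THRESHOLD:
        rolling_alerts.append(week)
    previous = m

print("fixed reference alerts in weeks  :", fixed_alerts)
print("rolling reference alerts in weeks:", rolling_alerts)
```

Against the training snapshot the creep crosses the threshold after a few weeks and keeps firing; against last week's traffic it never does, because each step is small. That is the sense in which a rolling reference silently accepts drift that has already happened.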

Stage two: run the drift report.

from evidently import Dataset, DataDefinition, Report
from evidently.presets import DataDriftPreset

def run_drift(reference: pd.DataFrame, current: pd.DataFrame):
    definition = DataDefinition(
        numerical_columns=FEATURES_NUM,
        categorical_columns=FEATURES_CAT,
    )
    ref_ds = Dataset.from_pandas(reference[COLUMNS], data_definition=definition)
    cur_ds = Dataset.from_pandas(current[COLUMNS], data_definition=definition)

    report = Report([DataDriftPreset()])
    result = report.run(current_data=cur_ds, reference_data=ref_ds)

    result.save_html("drift_report.html")     # the full visual report
    return result

What DataDriftPreset does under the hood: for each column it picks a two-sample method based on column type and sample size. On small samples (up to 1000 rows) it runs hypothesis tests: Kolmogorov–Smirnov for numerical columns and chi-squared for categorical ones. On larger samples it switches to distance metrics: Wasserstein distance for numerical columns and Jensen–Shannon divergence for categorical ones. The switch matters because with more than 1000 rows p-value tests become uselessly sensitive and flag statistically significant but practically irrelevant differences, so a thresholded effect size gives a more useful verdict. Each column gets a drifted/not-drifted verdict, and the dataset-level verdict fires when the share of drifted columns crosses 50% by default.
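The sensitivity of p-value tests to sample size can be reproduced without any library. The sketch below hand-rolls a two-sample Kolmogorov–Smirnov statistic and its asymptotic p-value, then applies the same small shift (0.02 on a unit-width grid) at two sample sizes. It is an illustration of the statistical effect, not Evidently's implementation, and the evenly spaced "samples" are synthetic so the run is deterministic.

```python
import math


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic p-value from the Kolmogorov distribution (large-sample approximation)."""
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        return 1.0  # the series below converges poorly here; the true value is ~1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))


def shifted_grid_experiment(n, shift=0.02):
    """Reference: an even grid on [0, 1). Current: the same grid moved right by `shift`."""
    reference = [(i + 0.5) / n for i in range(n)]
    current = [x + shift for x in reference]
    d = ks_statistic(reference, current)
    return d, ks_pvalue(d, n, n)


d_small, p_small = shifted_grid_experiment(100)
d_large, p_large = shifted_grid_experiment(100_000)

print(f"n=100      D={d_small:.3f}  p={p_small:.3g}")
print(f"n=100000   D={d_large:.3f}  p={p_large:.3g}")
```

The effect size D is about the same in both runs, yet the p-value goes from "nothing to see" to "overwhelmingly significant" purely because the sample grew. A distance metric with a fixed threshold gives the same verdict at both sizes, which is the behavior a daily monitoring job needs.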

Stage three: extract a machine-readable verdict, because an HTML report nobody opens is not monitoring:

def summarize(result) -> dict:
    payload = json.loads(result.json())
    metrics = {m["metric_id"]: m["value"] for m in payload["metrics"]}
    # DriftedColumnsCount lives in the preset output; key includes its params
    drifted = next(v for k, v in metrics.items() if k.startswith("DriftedColumnsCount"))
    return {
        "n_drifted": int(drifted["count"]),
        "share_drifted": float(drifted["share"]),
    }

if __name__ == "__main__":
    ref, cur = load_reference(), load_current()
    result = run_drift(ref, cur)
    summary = summarize(result)
    print(json.dumps(summary))
    sys.exit(1 if summary["share_drifted"] > 0.3 else 0)

The sys.exit(1) is the whole point: this script is designed to run as a daily cron or a scheduled job in the Lesson 8 CI system, and a non-zero exit is what turns “drift happened” into a red job, a notification, and, in Lesson 10, a retraining trigger. We set our gate at 0.3 (30% of columns drifted), stricter than Evidently’s 0.5 default, because churn features are few and correlated: if 3 of 7 move together, something real happened upstream.
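On the consuming side, a scheduler only sees the exit status. A minimal sketch of such a consumer, where run_gate and the decision names are hypothetical and three one-line Python commands stand in for python monitoring/drift_check.py so the example runs anywhere:

```python
import subprocess
import sys

DRIFT_EXIT = 1  # the status drift_check.py uses for "drift detected"


def run_gate(command):
    """Run a check command and map its exit status to a pipeline decision."""
    status = subprocess.run(command, capture_output=True, text=True).returncode
    if status == 0:
        return "archive-report"
    if status == DRIFT_EXIT:
        return "open-ticket"
    return "job-error"


# Stand-ins for the real drift job.
no_drift = [sys.executable, "-c", "import sys; sys.exit(0)"]
drift = [sys.executable, "-c", "import sys; sys.exit(1)"]
crash = [sys.executable, "-c", "raise RuntimeError('reference file missing')"]

decisions = {
    name: run_gate(command)
    for name, command in [("no_drift", no_drift), ("drift", drift), ("crash", crash)]
}
print(decisions)
```

Note what happens to the crash case: an uncaught Python exception also exits with status 1, so a broken job is indistinguishable from detected drift. If that ambiguity matters for your pipeline, one option is to reserve a different non-zero status for drift and treat 1 as a job failure.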

Stage four: prove it works by manufacturing drift. This doubles as the test that the pipeline detects what it claims to detect:

# monitoring/simulate_drift.py — sanity-check the detector
import numpy as np
from drift_check import load_reference, run_drift, summarize

ref = load_reference()

# a "current" window drawn from reference = no drift expected
calm = ref.sample(2000, random_state=0)

# a shifted window: price increase + contract mix change (a realistic combo)
rng = np.random.default_rng(42)
shifted = ref.sample(2000, random_state=1).copy()
shifted["monthly_charges"] *= rng.normal(1.25, 0.05, len(shifted))  # +25% pricing
shifted["tenure_months"] = (shifted["tenure_months"] * 0.6).astype(int)  # newer cohort
shifted.loc[shifted.sample(frac=0.4, random_state=2).index,
            "contract_type"] = "month-to-month"

for name, window in [("calm", calm), ("shifted", shifted)]:
    s = summarize(run_drift(ref, window))
    print(f"{name:<7}: n_drifted={s['n_drifted']} share_drifted={s['share_drifted']:.3f}")

calm   : n_drifted=0 share_drifted=0.000
shifted: n_drifted=3 share_drifted=0.429

Open drift_report.html from the shifted run: you’ll see per-column distribution overlays exactly like the SVG above — monthly_charges and tenure_months flagged with their drift scores, categorical frequency bars for contract_type side by side. This report is what you attach to the ticket; the JSON summary is what the machine acts on.

One more window to watch: run the same preset on the churn_probability column alone (reference = the model’s scores on a held-out training slice, current = logged production scores). That is prediction drift, and it fires even when no single input column moves much — small correlated input shifts can compound into a large output shift, and the output is ultimately what the business consumes.

Alerting rules of thumb

Metrics without alerts are a screensaver. But alert design is where monitoring efforts die — too noisy and people mute the channel, too quiet and you learn about incidents from customers. The rules of thumb that survive contact with production:

1. Page on symptoms, ticket on causes. Pages (wake-a-human alerts) are only for user-visible breakage: error rate, latency, service down. Drift is a cause — it degrades quality over days, and nothing improves by fixing it at 3 a.m. Drift goes to a ticket queue or a Slack channel reviewed each morning.

2. Alert on rates and windows, never on single events. One 500 means nothing; 5% of requests failing for 5 minutes means everything.

3. Every alert names an action. If the runbook entry for an alert is “look into it,” delete the alert.
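Rule 2 is what the for: clause in a Prometheus alert rule expresses: the condition has to keep holding across evaluations before anything fires. A rough sketch of that idea, with made-up error-rate samples taken once a minute and a hypothetical helper name:

```python
def sustained_breaches(values, threshold, min_consecutive):
    """Return the indices at which the condition has held for
    `min_consecutive` evaluations in a row."""
    firing = []
    streak = 0
    for i, value in enumerate(values):
        streak = streak + 1 if value > threshold else 0
        if streak >= min_consecutive:
            firing.append(i)
    return firing


# One evaluation per minute (illustrative numbers).
spike = [0.00, 0.00, 0.40, 0.00, 0.00, 0.00, 0.00, 0.00]   # a single bad minute
outage = [0.00, 0.06, 0.07, 0.08, 0.09, 0.10, 0.02, 0.00]  # five bad minutes in a row

spike_alerts = sustained_breaches(spike, threshold=0.05, min_consecutive=5)
outage_alerts = sustained_breaches(outage, threshold=0.05, min_consecutive=5)

print("spike  fires at minutes:", spike_alerts)
print("outage fires at minutes:", outage_alerts)
```

The 40% spike is far worse than any single minute of the outage, yet it never fires, while the sustained 6 to 10% failure rate does. That trade is deliberate: it costs a few minutes of detection delay and buys a pager that only rings for problems that persist.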

Concretely, the Prometheus alert rules for the churn service (alerts.yml):

groups:
  - name: churn-service-ops        # PAGE-worthy
    rules:
      - alert: HighErrorRate
        expr: |
          rate(churn_request_errors_total[5m])
            / rate(churn_request_latency_seconds_count[5m]) > 0.05
        for: 5m
        labels: {severity: page}
        annotations:
          summary: ">5% errors for 5m — rollback to previous registry version (Lesson 4)"

      - alert: HighLatencyP95
        expr: |
          histogram_quantile(0.95,
            rate(churn_request_latency_seconds_bucket[5m])) > 0.5
        for: 10m
        labels: {severity: page}
        annotations:
          summary: "p95 > 500ms for 10m — check replica count / model size"

  - name: churn-service-ml         # TICKET-worthy
    rules:
      - alert: PredictionShiftedUp
        expr: |
          (sum(rate(churn_predictions_total{predicted_class="churn"}[6h]))
            / sum(rate(churn_predictions_total[6h])))
          > 1.5 * (sum(rate(churn_predictions_total{predicted_class="churn"}[7d] offset 1d))
            / sum(rate(churn_predictions_total[7d] offset 1d)))
        for: 6h
        labels: {severity: ticket}
        annotations:
          summary: "Churn-rate prediction 50% above trailing week — check drift report"

Note the pattern in PredictionShiftedUp: the threshold is relative to the model’s own recent history (a 7-day window, shifted back with offset 1d), not an absolute number. Absolute thresholds on ML metrics rot as the business changes; ratios against a trailing baseline age far better. The Evidently job complements this with its own thresholds. A reasonable starting matrix:

Signal	Threshold	Severity	Action
Error rate 5m	> 5%	page	rollback via Lesson 4 registry alias
p95 latency 10m	> 500 ms	page	scale out / inspect payload sizes
Prediction churn-share 6h	> 1.5× trailing week	ticket	open latest drift report
Evidently drift share (daily)	> 0.3	ticket	investigate columns, consider retrain
Evidently drift share (daily)	> 0.6	ticket + auto-retrain candidate	Lesson 10 pipeline
Delayed accuracy (monthly)	< training AUC − 0.05	ticket	mandatory retrain

Start loose, tighten based on false-positive counts. A drift alert that fires weekly with no action taken is worse than no alert — it trains everyone to ignore the channel that will one day carry the real one.
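One way to keep such a matrix honest is to store it as data rather than scatter the thresholds across scripts. The sketch below encodes four of the rows; the Rule class, the metric keys, and the sample readings are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    signal: str
    breached: Callable[[dict], bool]
    severity: str
    action: str


RULES = [
    Rule("error_rate_5m", lambda m: m["error_rate_5m"] > 0.05,
         "page", "rollback via Lesson 4 registry alias"),
    Rule("p95_latency_10m", lambda m: m["p95_latency_s"] > 0.5,
         "page", "scale out / inspect payload sizes"),
    Rule("churn_share_6h",
         lambda m: m["churn_share_6h"] > 1.5 * m["churn_share_trailing_week"],
         "ticket", "open latest drift report"),
    Rule("drift_share_daily", lambda m: m["drift_share"] > 0.3,
         "ticket", "investigate columns, consider retrain"),
]


def evaluate(metrics):
    """Return (signal, severity, action) for every rule the readings breach."""
    return [(r.signal, r.severity, r.action) for r in RULES if r.breached(metrics)]


# Made-up readings: the service is healthy, the model's inputs are not.
readings = {
    "error_rate_5m": 0.01,
    "p95_latency_s": 0.12,
    "churn_share_6h": 0.45,
    "churn_share_trailing_week": 0.18,
    "drift_share": 0.43,
}

fired = evaluate(readings)
for signal, severity, action in fired:
    print(f"[{severity}] {signal}: {action}")
```

With these readings only ticket-level rules fire, which matches the ops-healthy, ML-unhealthy case from the start of the lesson. Keeping the action next to the threshold also makes it obvious when a rule has no action worth naming.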

LLM observability: same skeleton, different organs

Everything above transfers to the vLLM service from Lesson 7, but LLM serving adds axes that a churn classifier doesn’t have — worth naming so you know what to reach for, even though deep LLM evaluation is its own discipline:

Ops metrics get token-shaped. Latency splits into time-to-first-token (what streaming users feel) and inter-token latency; throughput is tokens/sec, not requests/sec. vLLM exposes these on its own /metrics endpoint out of the box (vllm:time_to_first_token_seconds, vllm:generation_tokens_total, KV-cache usage) — scrape it with the exact same Prometheus config we wrote above.
Cost is a first-class metric. Add a Counter for prompt and completion tokens per route/tenant; tokens × price is a dashboard, and a per-tenant daily budget alert has saved more than one team from a runaway agent loop.
Tracing replaces the prediction log. An LLM app is a pipeline — retrieval, prompt assembly, model call, tool calls — and debugging quality means seeing the whole trace, not one row. This is what OpenTelemetry-based LLM tracing tools (Langfuse, Arize Phoenix, LangSmith) do: every span records the prompt, completion, token counts, and latency, nested by call.
“Drift” becomes eval scores. There’s no KS test for prose. The equivalent of Evidently here is running a fixed eval set (plus LLM-as-judge scoring on sampled production traffic) on a schedule and alerting on score regressions — same reference-vs-current skeleton, statistical test swapped for an eval harness.

The mental model holds: real-time aggregates for ops, logged raw records for quality analysis, delayed ground truth (human feedback instead of billing labels) closing the loop.
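The cost axis is the easiest of these to prototype. In the sketch below plain dictionaries stand in for Prometheus counters, and the prices, budget, and tenant names are placeholders rather than real rates:

```python
from collections import defaultdict

# Placeholder prices in USD per 1000 tokens (assumptions, not real rates).
PRICE_PER_1K = {"prompt": 0.50, "completion": 1.50}
DAILY_BUDGET_USD = 5.00

# tenant -> running token totals; a stand-in for labeled counters.
usage = defaultdict(lambda: {"prompt": 0, "completion": 0})


def record(tenant, prompt_tokens, completion_tokens):
    """Accumulate token counts the way a counter would: only ever upward."""
    usage[tenant]["prompt"] += prompt_tokens
    usage[tenant]["completion"] += completion_tokens


def cost(tenant):
    """Tokens times price, summed over prompt and completion."""
    return sum(usage[tenant][kind] * PRICE_PER_1K[kind] / 1000 for kind in PRICE_PER_1K)


def over_budget():
    """Tenants whose spend so far today exceeds the daily budget."""
    return sorted(t for t in list(usage) if cost(t) > DAILY_BUDGET_USD)


record("acme", prompt_tokens=2000, completion_tokens=1000)
for _ in range(20):  # an agent stuck in a loop, re-sending a large prompt
    record("looping-agent", prompt_tokens=4000, completion_tokens=1000)

for tenant in sorted(usage):
    print(f"{tenant}: ${cost(tenant):.2f}")
print("over budget:", over_budget())
```

Tenant is a reasonable label here only while the tenant list stays small; the label-cardinality budget from earlier in the lesson applies to cost counters as much as to latency histograms.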

🧪 Your task

The service currently detects input drift, but the earliest business-visible signal is often the model’s output. Your job: write monitoring/prediction_drift.py that compares the distribution of churn_probability between a reference window (the model’s scores on the Lesson 2 validation split — generate and save them if you haven’t) and the last 7 days of the prediction log, using Evidently on that single column. It must save prediction_drift.html and exit non-zero when the score distribution has drifted. Then verify both directions: run it on log data sampled from the reference scores (must pass) and on scores you shift by adding +0.15 with clipping to [0, 1] (must fail).

Hint: you don’t need the whole-dataset preset for one column. Either restrict it with DataDriftPreset(columns=["churn_probability"]) or use the single metric ValueDrift(column="churn_probability") from evidently.metrics. The column is numerical; with a few thousand rows Evidently will use Wasserstein distance, so your shifted test should trip it comfortably.

Solution

# monitoring/prediction_drift.py
import json
import sys
from pathlib import Path

import numpy as np
import pandas as pd
from evidently import Dataset, DataDefinition, Report
from evidently.presets import DataDriftPreset

COL = "churn_probability"

def load_reference_scores(path: str = "data/val_scores.parquet") -> pd.DataFrame:
    """Model's predict_proba on the Lesson 2 validation split, saved at train time:
       pd.DataFrame({COL: model.predict_proba(X_val)[:, 1]}).to_parquet(path)"""
    return pd.read_parquet(path)[[COL]]

def load_current_scores(log_dir: str = "prediction_logs", days: int = 7) -> pd.DataFrame:
    files = sorted(Path(log_dir).glob("*.jsonl"))[-days:]
    # read_text() closes each file; skip blank lines so json.loads never sees "".
    rows = [json.loads(line) for f in files
            for line in f.read_text().splitlines() if line.strip()]
    return pd.DataFrame(rows)[[COL]]

def check(reference: pd.DataFrame, current: pd.DataFrame,
          html_out: str = "prediction_drift.html") -> bool:
    definition = DataDefinition(numerical_columns=[COL])
    ref_ds = Dataset.from_pandas(reference, data_definition=definition)
    cur_ds = Dataset.from_pandas(current, data_definition=definition)

    report = Report([DataDriftPreset(columns=[COL])])
    result = report.run(current_data=cur_ds, reference_data=ref_ds)
    result.save_html(html_out)

    payload = json.loads(result.json())
    metrics = {m["metric_id"]: m["value"] for m in payload["metrics"]}
    drifted = next(v for k, v in metrics.items()
                   if k.startswith("DriftedColumnsCount"))
    return drifted["count"] > 0

if __name__ == "__main__":
    ref = load_reference_scores()

    if "--self-test" in sys.argv:
        rng = np.random.default_rng(0)
        calm = ref.sample(2000, random_state=0)
        shifted = ref.sample(2000, random_state=1).copy()
        shifted[COL] = (shifted[COL] + 0.15).clip(0.0, 1.0)

        assert not check(ref, calm, "pd_calm.html"), "false positive on calm data"
        assert check(ref, shifted, "pd_shifted.html"), "missed an injected +0.15 shift"
        print("self-test OK: calm passes, shifted trips")
        sys.exit(0)

    cur = load_current_scores()
    drifted = check(ref, cur)
    print(json.dumps({"prediction_drift": drifted, "n_current": len(cur)}))
    sys.exit(1 if drifted else 0)

Run python monitoring/prediction_drift.py --self-test; both assertions must pass. Wire the non-self-test invocation into the same daily job as drift_check.py; in Lesson 10, this exit code becomes one of the retraining triggers.

Key takeaways

A production model has two independent failure modes — service breakage and model invalidation — and each needs its own metrics, timescale, and alert style.
Prometheus instrumentation is four module-level instruments and a middleware: record raw counters and histogram observations, compute p95 and rates in PromQL, never pre-aggregate quantiles in the app.
Keep label cardinality tiny; a label is a dashboard dimension, not an identifier.
Log every prediction’s post-validation features, score, and model version — that log is the raw material for all quality analysis.
Drift detection is always reference-vs-current: pin the reference to the actual training snapshot, run Evidently on a schedule, gate on the drifted-column share, and make the job exit non-zero so machines can react.
Delayed labels mean true accuracy is a lagging metric; input drift and prediction drift are the leading indicators that fill the gap.
Page on symptoms (errors, latency), ticket on causes (drift); every alert must name an action, and ML thresholds should be relative to trailing baselines, not absolutes.
LLM observability keeps the same skeleton with token-level ops metrics, per-request tracing, cost counters, and eval suites in place of statistical drift tests.

In the next lesson, Lesson 10, we stop reading drift reports and start acting on them: continuous training, automated promotion through the registry, and closing the full MLOps loop.
