Chapter 31 — 🧰 Tools & Frameworks


Every model in this encyclopedia ultimately runs as code on top of a small number of well-worn libraries. This chapter is the practical workshop tour: the Python data stack that almost all ML touches, the two dominant deep learning frameworks and when to reach for each, the model-sharing ecosystem (Hugging Face) that has become the default way to grab a pretrained network, and the streaming systems that feed models fresh data in real time. The goal is not to make you memorize APIs but to understand the shape of each tool — its core abstraction, the idiom you write over and over, and the tradeoff that made it win.

🧭 In context: Production, Tooling & Infrastructure · the concrete libraries you write ML in · one key idea — each tool exposes a single core abstraction (the array, the DataFrame, the estimator, the tensor-with-gradient, the pretrained model, the event stream) and everything else is convenience around it.

💡 Remember this: every tool here is just convenience wrapped around one core abstraction — learn that abstraction (array, DataFrame, estimator, tensor-with-gradient, pretrained model, event stream) and the API falls out of it.

31.1 — The Python ML stack

Python won machine learning not because it is fast — it is not — but because it is a thin, pleasant scripting layer over fast C and Fortran. The trick is that the hot loops live in compiled libraries, and Python just orchestrates: you write model.fit(X, y) in a high-level language, and underneath the actual matrix multiplies run in optimized native code. Three libraries form the base of almost every ML project: NumPy for numerical arrays, Pandas for tabular data, and scikit-learn for classical models. They stack cleanly: Pandas is built on NumPy, and scikit-learn consumes both.

Think of it like a kitchen. NumPy is the cutting board and knives (raw fast operations on ingredients), Pandas is the labeled pantry with everything in named jars, and scikit-learn is the set of standard recipes that all follow the same steps. You move ingredients up the stack: messy delivery box → labeled pantry → measured bowls → recipe.

flowchart TD
  A[Raw data: CSV, SQL, JSON] --> B[Pandas DataFrame<br/>clean, join, group]
  B --> C[NumPy array<br/>numeric matrix X, y]
  C --> D[scikit-learn<br/>fit / predict / transform]
  D --> E[Model + metrics]
  style B fill:#cde,stroke:#369
  style C fill:#dec,stroke:#393
  style D fill:#edc,stroke:#963

NumPy: the n-dimensional array

The core object is the ndarray — a contiguous block of memory holding numbers of one fixed type (the dtype, e.g. float64), described by a shape (a tuple of dimension sizes). Because the data is contiguous and uniformly typed, operations on it run as single compiled loops rather than per-element Python bytecode. This is vectorization: you express a + b over a million elements as one expression, and NumPy runs the loop in C. A pure-Python for loop over a million numbers might take a second; the vectorized version takes milliseconds, because Python’s per-element overhead (type checks, boxing) vanishes.
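
To make the vectorization claim tangible, here is a minimal sketch that computes the same element-wise sum twice, once with a Python-level loop and once as a single array expression. The array size and the timing approach are arbitrary choices for illustration, and the measured gap will vary by machine, so no specific speedup is claimed.

```python
import time
import numpy as np

n = 200_000                      # arbitrary size chosen for illustration
a = np.arange(n, dtype=np.float64)
b = np.ones(n, dtype=np.float64)

# Pure-Python loop: one interpreter round trip per element.
t0 = time.perf_counter()
loop_result = [a[i] + b[i] for i in range(n)]
loop_seconds = time.perf_counter() - t0

# Vectorized: a single expression, the loop runs in compiled code.
t0 = time.perf_counter()
vec_result = a + b
vec_seconds = time.perf_counter() - t0

same = np.allclose(loop_result, vec_result)
print(f"loop: {loop_seconds:.4f}s  vectorized: {vec_seconds:.4f}s  same values: {same}")
```

Both paths produce identical numbers; only the place where the loop runs differs.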

The mental model: an ndarray is a grid with a shape. A vector is shape (n,), a matrix (rows, cols), a color image (height, width, 3). Operations either work element-wise or follow broadcasting — when two shapes differ, NumPy stretches the smaller one along size-1 (or missing) dimensions instead of physically copying it, so you can add a row vector to every row of a matrix for free.

Worked example: standardize a tiny dataset (subtract the column mean, divide by the column standard deviation) — the single most common preprocessing step, done with broadcasting and no loops.

import numpy as np

X = np.array([[2.0, 100.0],     # 3 samples, 2 features
              [4.0, 300.0],
              [6.0, 500.0]])

mu  = X.mean(axis=0)            # per-column mean -> shape (2,)  = [4.0, 300.0]
sig = X.std(axis=0)            # per-column std  -> shape (2,)  = [1.633, 163.3]
Xs  = (X - mu) / sig           # broadcast (3,2)-(2,) then /(2,)
# row 0 -> [-1.225, -1.225], row 1 -> [0,0], row 2 -> [1.225, 1.225]
print(Xs.mean(axis=0), Xs.std(axis=0))  # ~[0,0]  [1,1]

Notice the loop count in Python: zero. X - mu broadcasts the (2,) mean across all 3 rows; the division does the same. The whole standardization is three array expressions, and axis=0 is the key idea — it says “collapse down the rows, one result per column.”

The standardization formula itself, written once, applied per feature column $j$:

\[z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}, \qquad \mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij}, \qquad \sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{ij}-\mu_j)^2}\]

In words: for each value, subtract that column’s average and divide by that column’s spread, so every feature ends up centered at 0 with a typical size of 1 — putting “price in riyals” and “age in years” on the same scale. Also written: in vectorized form, $Z = (X - \mathbf{1}\,\mu^\top) \oslash (\mathbf{1}\,\sigma^\top)$, where $\mathbf{1}$ is a column of ones, $\oslash$ is element-wise division, and the broadcasting in (X - mu) / sig does exactly this row-replication for you.

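Shape mismatches are where beginners trip, so the broadcasting rule is worth stating exactly: NumPy compares the two shapes from the trailing dimension backward, and two dimensions are compatible when they are equal or when one of them is 1 (a missing leading dimension counts as 1).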
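
A small sketch of how shapes line up under broadcasting, using made-up arrays. It shows a compatible pair, the explicit row replication that broadcasting stands in for, and a mismatch that is fixed by adding a size-1 axis.

```python
import numpy as np

X  = np.arange(6, dtype=float).reshape(3, 2)   # shape (3, 2)
mu = np.array([10.0, 20.0])                    # shape (2,)

# Shapes are compared right to left: (3, 2) vs (2,) is treated as (3, 2) vs (1, 2).
broadcast = X - mu

# The same result with the row replication written out explicitly.
explicit = X - np.ones((3, 1)) @ mu.reshape(1, 2)

print(np.broadcast_shapes(X.shape, mu.shape))   # (3, 2)
print(np.array_equal(broadcast, explicit))      # True

# A per-row vector of shape (3,) does NOT line up with (3, 2)...
per_row = np.array([1.0, 2.0, 3.0])
try:
    X - per_row
except ValueError as err:
    print("mismatch:", err)

# ...until it is given an explicit size-1 trailing axis: (3, 1) vs (3, 2).
fixed = X - per_row[:, None]
print(fixed.shape)                              # (3, 2)
```

The `[:, None]` idiom is the usual fix when a vector should apply per row rather than per column.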

Pandas: labeled tabular data

NumPy arrays are anonymous grids. Real data has named columns of mixed types — a “price” float, a “city” string, a “date” timestamp — plus missing values. Pandas wraps NumPy with two labeled objects: the Series (one labeled column) and the DataFrame (a dict of aligned Series sharing one row index). The index is the superpower: operations align on labels, so adding two DataFrames matches rows by index, not by position — a quiet correctness win when your tables are sorted differently.
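
Label alignment is easiest to see on two tiny Series whose rows are in different orders. The city names and numbers below are invented for illustration; the point is the difference between adding by label and adding by position.

```python
import pandas as pd

jan = pd.Series([100, 140], index=["Riyadh", "Jeddah"])
feb = pd.Series([60, 90],   index=["Jeddah", "Riyadh"])   # same labels, different order

by_label    = jan + feb                   # aligned on the index
by_position = jan.values + feb.values     # plain NumPy: position only

print(by_label["Riyadh"], by_label["Jeddah"])   # 190 200
print(by_position)                              # [160 230]
```

The positional sum quietly mixes Riyadh with Jeddah; the labeled sum does not.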

The idiom you write most is split–apply–combine via groupby: split rows into groups, apply an aggregation to each group, combine the results into one table. The everyday analogy: hand a stack of receipts to a clerk, say “make one pile per city, then total each pile” — groupby is the clerk.

import pandas as pd

df = pd.DataFrame({
    "city":  ["Riyadh", "Riyadh", "Jeddah", "Jeddah"],
    "sales": [100, 140, 90, 60],
})
df["sales"] = df["sales"].fillna(0)             # fill missing values (a no-op on this toy data)
by_city = df.groupby("city")["sales"].mean()    # split-apply-combine
# Riyadh -> 120.0,  Jeddah -> 75.0

flowchart LR
  A["4 rows<br/>(mixed cities)"] -->|split by city| B[Riyadh: 100,140]
  A -->|split by city| C[Jeddah: 90,60]
  B -->|mean| D[120.0]
  C -->|mean| E[75.0]
  D --> F[Result Series]
  E --> F

Pandas is where preprocessing lives — joining tables, parsing dates, encoding categories, filling gaps — before you hand a clean numeric matrix to a model. (Chapter Data Preprocessing covers the what of these transforms; this is the with-what.)

Tip

Rule of thumb: stay vectorized in Pandas too. A df.apply(..., axis=1) row loop is Python speed; the same logic as a column expression (df["a"] * df["b"]) or a groupby runs in C. Reach for .apply only when no vectorized form exists.
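
As a quick illustration of that rule, the sketch below computes the same column two ways on a toy frame (the column names and values are made up). Both give identical results; the row-wise version just does it one Python call at a time.

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})  # toy data

# Row loop: calls a Python function once per row.
slow = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Column expression: one vectorized multiply over whole columns.
fast = df["price"] * df["qty"]

print(fast.tolist())            # [10.0, 40.0, 90.0]
print((slow == fast).all())     # True
```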

scikit-learn: the estimator API

scikit-learn’s lasting contribution is not its algorithms but its consistent interface. Every model is an estimator built around at most three core methods, and they mean the same thing everywhere:

Method	Meaning	Used by
`fit(X, y)`	learn parameters from data	every estimator
`predict(X)`	produce outputs for new rows	classifiers, regressors
`transform(X)`	map inputs to a new representation	scalers, encoders, PCA

Because the interface is uniform, you can swap a LogisticRegression for a RandomForestClassifier by changing one line — the surrounding code does not care. And because transformers and predictors share fit, you can chain them into a Pipeline: a single estimator that applies each step’s transform in order, then fit/predict on the final model. The Pipeline solves a subtle but serious bug — data leakage. If you scale using statistics computed over the whole dataset before splitting, information from the test rows bleeds into training, and your reported accuracy is optimistically inflated. A Pipeline fit inside cross-validation computes the scaler’s mean and standard deviation on the training fold only.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
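from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=5, random_state=0)  # synthetic stand-in data
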
pipe = Pipeline([
    ("scale", StandardScaler()),      # transform: learns mean/std on train fold
    ("clf",   LogisticRegression()),  # predict: final estimator
])
# cross_val_score re-fits the WHOLE pipe per fold -> scaler never sees test rows
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())

flowchart LR
  X[X raw] --> S["StandardScaler<br/>fit_transform on train"]
  S --> C["LogisticRegression<br/>fit"]
  C --> P[predict]
  classDef t fill:#dec,stroke:#393
  classDef m fill:#edc,stroke:#963
  class S t
  class C,P m

Tip

Rule of thumb: if a step learns anything from data (a mean, a vocabulary, a set of PCA axes), it belongs inside the Pipeline so it is re-learned per fold. Stateless constants (e.g. np.log1p) can live outside.

Warning

Common mistake: calling scaler.fit(X) on the full dataset, then splitting. The scaler has already seen the test rows — your reported accuracy is optimistically biased. Always fit on training data only; a Pipeline makes this automatic.
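
The leak can be made visible by comparing what each scaler learned. This is a sketch on synthetic data (the distribution, split size, and seeds are arbitrary assumptions): one scaler is fit on everything, the other on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 1))   # synthetic feature
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

leaky = StandardScaler().fit(X)          # statistics include the test rows
safe  = StandardScaler().fit(X_train)    # statistics from training rows only

print("mean seen by leaky scaler:", leaky.mean_[0])
print("mean seen by safe scaler: ", safe.mean_[0])

# The safe scaler is applied to test data, never re-fit on it.
X_test_scaled = safe.transform(X_test)
```

The two means differ because the leaky scaler has already averaged in rows that are supposed to be unseen.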

Mixed columns and hyperparameter search

Real tables mix numeric and categorical columns, and each kind needs a different transform. The everyday picture: a sorting station where number-jars go down one chute (scale them) and word-jars down another (one-hot encode them), and both feed the same model. scikit-learn’s ColumnTransformer is that station, and it composes inside a Pipeline so leakage protection still holds. Tuning is the same uniform interface: GridSearchCV wraps any estimator (a Pipeline included) and re-fits it for every hyperparameter combination, scoring each by cross-validation.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pre = ColumnTransformer([
    ("num", StandardScaler(),                 ["age", "amount"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
pipe = Pipeline([("pre", pre), ("clf", RandomForestClassifier())])

grid = GridSearchCV(pipe,
    {"clf__n_estimators": [100, 300], "clf__max_depth": [4, 8]},
    cv=5)
# X must be a DataFrame holding the "age", "amount", and "city" columns named above
grid.fit(X, y)                 # 2*2=4 settings x 5 folds = 20 fits, plus 1 final refit
print(grid.best_params_, grid.best_score_)

The clf__n_estimators double-underscore syntax reaches into a named step — <step>__<param> — so you tune a deeply nested component from the top level. The whole search, leakage-safe preprocessing included, is one object you can joblib.dump and ship.
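
The same addressing scheme can be inspected directly on a pipeline, without running a search. A minimal sketch, with step names chosen for illustration:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([("scale", StandardScaler()),
                 ("clf",   RandomForestClassifier())])

# Every tunable knob is exposed under a "<step>__<param>" key.
rf_keys = [k for k in pipe.get_params() if k.startswith("clf__")]
print("clf__max_depth" in rf_keys)            # True

# set_params accepts the same keys that a parameter grid uses.
pipe.set_params(clf__max_depth=4, clf__n_estimators=100)
print(pipe.named_steps["clf"].max_depth)      # 4
```

Listing `pipe.get_params()` is a handy way to discover the exact key for a nested parameter before writing a grid.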

Gradient-boosting libraries: the tabular workhorses

Here is a fact that surprises people who arrive from deep learning: on ordinary tabular data (the spreadsheets, transaction logs, and customer tables that make up most real business problems) a gradient-boosted decision tree often matches or beats a neural network, trains in seconds to minutes on modest hardware, and needs comparatively little tuning. So while scikit-learn ships its own boosting, three specialized libraries dominate this corner: XGBoost, LightGBM, and CatBoost. They all implement the same core idea (covered in depth in Ensemble Methods): build trees one after another, each new tree correcting the errors the running total still makes.

The intuition for why they win on tables: trees split on thresholds (amount > 500?), so they handle mixed scales, skewed distributions, and irrelevant columns gracefully — no standardization needed, no sensitivity to a feature measured in millions sitting next to one measured in fractions. Boosting then stacks many shallow trees into a strong learner. The three libraries differ mainly in engineering tradeoffs:
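
The scale-insensitivity claim can be checked with a single decision tree, the building block of all three libraries. This is a sketch on synthetic integer-valued data (the data, the factor of 1000, and the tree depth are assumptions for illustration): one feature is multiplied by a large constant and the tree is refit.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 1000, size=(300, 2)).astype(float)   # synthetic features
y = (X[:, 0] + X[:, 1] > 1000).astype(int)

X_rescaled = X.copy()
X_rescaled[:, 0] *= 1000.0          # same feature, now in much larger units

tree_a = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
tree_b = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_rescaled, y)

same = np.array_equal(tree_a.predict(X), tree_b.predict(X_rescaled))
print("identical predictions after rescaling:", same)
```

Because a split only asks whether a value falls above or below a threshold, rescaling a column moves the thresholds but not the resulting partition of the rows.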

Library	Distinctive trick	Sweet spot
XGBoost	regularized objective, level-wise trees, battle-tested	the safe default; Kaggle staple
LightGBM	histogram bucketing + leaf-wise growth = very fast	large datasets, many features
CatBoost	native categorical handling, ordered boosting	many categorical columns, less tuning

All three expose a scikit-learn-compatible fit/predict, so they drop straight into a Pipeline or GridSearchCV:

from xgboost import XGBClassifier          # pip install xgboost
from sklearn.model_selection import cross_val_score

clf = XGBClassifier(
    n_estimators=300, max_depth=4, learning_rate=0.1,
    subsample=0.8, eval_metric="logloss")
print(cross_val_score(clf, X, y, cv=5).mean())

# LightGBM is a near drop-in replacement, usually faster on big data:
# from lightgbm import LGBMClassifier; clf = LGBMClassifier(n_estimators=300)

Tip

Rule of thumb: if your data is a table (rows and columns, not images/text/audio), reach for gradient boosting before a neural net. It is the strongest baseline you can stand up in five minutes, and often the final model too. Save the deep network for unstructured data.

31.2 — Deep learning frameworks

Classical ML stops where you need to hand-build a network and have its gradients computed for you. That is the job of a deep learning framework. Both major frameworks rest on the same two pillars: tensors (NumPy-like arrays that also live on GPUs, so the same array math runs thousands of multiplies in parallel) and autograd (automatic differentiation that records every operation and replays it backward to compute gradients — the chain rule, mechanized; see Calculus & Differentiation). They differ mainly in when the computation graph is built and where they are strongest in production.

The computation graph is the central concept: as you compute, the framework builds a directed graph whose nodes are operations and whose edges are tensors. To get gradients, it walks that graph backward. The question that splits the two frameworks is when the graph gets built — every step as you run (dynamic), or once up front (static). The intuition: a dynamic graph is like cooking while reading the recipe line by line (you can improvise mid-dish); a static graph is like printing the whole recipe, locking it, then running it on an assembly line a thousand times (no improvising, but fast and shippable).

The backward pass is the chain rule applied along that graph. For a tiny chain $L = f(g(w))$:

\[\frac{\partial L}{\partial w} = \frac{\partial L}{\partial g}\cdot\frac{\partial g}{\partial w}\]

In words: the sensitivity of the loss to a weight is the product of the local sensitivities along the path from that weight to the loss. Multiply the per-step “if I nudge this, that moves by…” factors together. Also written: for a deep stack of layers $L = f_n(f_{n-1}(\cdots f_1(w)))$, with the convention $f_0 = w$, this generalizes to the product $\frac{\partial L}{\partial w} = \prod_{k=1}^{n} \frac{\partial f_k}{\partial f_{k-1}}$, which is exactly what .backward() (PyTorch) or the gradient tape (TensorFlow) evaluates for every parameter at once.

A two-link worked example with real numbers: let $g = w^2$ and $L = 3g$, with $w = 2$. The local factors are $\frac{\partial g}{\partial w} = 2w = 4$ and $\frac{\partial L}{\partial g} = 3$. Multiply them along the chain: $\frac{\partial L}{\partial w} = 3 \times 4 = 12$. That single multiplication of “how much $L$ moves per unit $g$” times “how much $g$ moves per unit $w$” is the entire backward pass — autograd just does it for millions of links at once.
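
The hand calculation above can be sanity-checked numerically. The sketch below compares the chain-rule product with a central finite difference; the step size `eps` is an arbitrary small value chosen for illustration.

```python
def g(w):
    return w ** 2          # inner link: g = w^2

def L(w):
    return 3 * g(w)        # outer link: L = 3g

w, eps = 2.0, 1e-6

# Chain rule: product of the two local derivatives.
dL_dg = 3.0
dg_dw = 2 * w
analytic = dL_dg * dg_dw                      # 3 * 4 = 12

# Central finite difference: nudge w and watch L move.
numeric = (L(w + eps) - L(w - eps)) / (2 * eps)

print(analytic)            # 12.0
print(round(numeric, 4))   # 12.0
```

Comparing an analytic gradient against a finite difference like this is a common way to test hand-written backward passes.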

PyTorch: dynamic graphs and the training loop

PyTorch builds the graph dynamically — an approach called “define-by-run”. Each line of Python is a graph operation as it executes, so the model is just ordinary control flow. You can put an if or a Python for loop in the middle of a forward pass and it simply works, because the graph is rebuilt fresh on every call. This makes debugging feel like debugging normal code — you can drop a print or a breakpoint anywhere and inspect real numbers.

Autograd is the heart of it. A tensor created with requires_grad=True remembers the operations applied to it. Calling .backward() on a scalar loss walks that history in reverse, depositing $\partial \text{loss} / \partial \text{parameter}$ into each tensor’s .grad. A one-variable demo makes the mechanism concrete and checkable against hand calculus:

import torch
x = torch.tensor(3.0, requires_grad=True)
y = x**2 + 2*x          # y = x^2 + 2x  ->  dy/dx = 2x + 2
y.backward()            # replay backward
print(x.grad)           # 2*3 + 2 = 8.0  ✓ (matches calculus)
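
To show what "remembers the operations applied to it" can mean mechanically, here is a toy reverse-mode autograd for scalars. The `Value` class is a hypothetical teaching device, not PyTorch's implementation: it supports only `+` and `*`, records each result's inputs and local derivatives, and replays them backward.

```python
class Value:
    """A scalar that records how it was computed (a hypothetical teaching class)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # inputs of the op that produced this value
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    __radd__ = __add__
    __rmul__ = __mul__

    def backward(self):
        # Topologically order the graph so each node is handled after its consumers.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad   # chain rule, accumulated


x = Value(3.0)
y = x * x + 2 * x        # y = x^2 + 2x
y.backward()
print(x.grad)            # 8.0
```

Note the `+=` in the backward loop: gradients arriving along different paths are summed, which is also why a real framework needs gradients reset between steps.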

Models subclass nn.Module: you declare layers in __init__ and the data path in forward. Training is an explicit loop you write yourself — four lines that recur in every PyTorch project: zero the old gradients, run the forward pass, backpropagate, take an optimizer step.

import torch, torch.nn as nn

model = nn.Sequential(nn.Linear(2, 8), nn.ReLU(), nn.Linear(8, 1))
opt   = torch.optim.SGD(model.parameters(), lr=0.1)
lossf = nn.MSELoss()

for epoch in range(100):
    opt.zero_grad()              # 1. clear .grad (they accumulate!)
    pred = model(X)              # 2. forward pass
    loss = lossf(pred, y)
    loss.backward()              # 3. autograd fills every .grad
    opt.step()                   # 4. param -= lr * param.grad

The update that step 4 (opt.step()) performs, for plain SGD, is one rule per parameter $\theta$:

\[\theta \leftarrow \theta - \eta\,\nabla_\theta L\]

In words: nudge each weight a small step (size set by the learning rate $\eta$) in the direction that most decreases the loss — downhill on the error surface. Also written: component-wise, $\theta_j \leftarrow \theta_j - \eta\,\frac{\partial L}{\partial \theta_j}$ — exactly the param -= lr * param.grad that opt.step() performs for every entry of every parameter tensor.
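
The update rule can be written out by hand in NumPy to see that nothing more is going on. This sketch fits a linear model to synthetic data with full-batch gradient descent (no mini-batch sampling, for simplicity); the data, the generating weights, the learning rate, and the step count are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))              # synthetic inputs
true_w = np.array([2.0, -3.0])             # weights used to generate y (assumed)
y = X @ true_w

w, lr = np.zeros(2), 0.1
for step in range(200):
    pred = X @ w                            # forward pass
    grad = 2.0 / len(X) * X.T @ (pred - y)  # dL/dw for mean squared error
    w -= lr * grad                          # theta <- theta - eta * grad

print(np.round(w, 3))
```

After enough steps the learned weights sit very close to the ones that generated the data, which is the loss surface's minimum for this noise-free problem.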

flowchart LR
  A[zero_grad] --> B[forward: pred = model X]
  B --> C[loss = lossf pred, y]
  C --> D[loss.backward<br/>fills .grad]
  D --> E[opt.step<br/>updates weights]
  E --> A

What that loop actually does is roll the parameters downhill on the loss surface: each opt.step() is one nudge toward the bottom.

The explicitness is the appeal: nothing is hidden, so research code — custom losses, odd architectures, novel training schemes — is natural to express. This is why PyTorch dominates research. For real datasets you rarely feed X whole; you wrap it in a Dataset/DataLoader that yields shuffled mini-batches, and the same four-line body runs per batch:

from torch.utils.data import TensorDataset, DataLoader

loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)
for epoch in range(100):
    for xb, yb in loader:                 # mini-batches, reshuffled each epoch
        opt.zero_grad()
        loss = lossf(model(xb), yb)
        loss.backward()
        opt.step()

Warning

Common mistake: forgetting opt.zero_grad(). PyTorch accumulates gradients into .grad by design (which is useful when you deliberately split one batch across several backward passes). Skip the reset and every step's update also contains the gradients of all earlier steps, so training misbehaves or diverges with no error message.
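
The accumulation is easy to observe on a single scalar. A minimal sketch, reusing the function $y = x^2 + 2x$ from the earlier autograd demo and assuming PyTorch is installed:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

(x**2 + 2*x).backward()
first = x.grad.item()          # 8.0: the true derivative at x = 3

(x**2 + 2*x).backward()        # no reset in between
accumulated = x.grad.item()    # 16.0: the new gradient was ADDED to the old one

x.grad.zero_()                 # what opt.zero_grad() does for every parameter
(x**2 + 2*x).backward()
after_reset = x.grad.item()    # 8.0 again

print(first, accumulated, after_reset)   # 8.0 16.0 8.0
```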

TensorFlow / Keras: compile, fit, and the production story

TensorFlow historically built a static graph — define the whole computation first, then run data through the frozen graph repeatedly. That is harder to debug (you cannot just print an intermediate value; it is a graph node, not a number) but easier to optimize and to ship as a self-contained artifact. Modern TF runs eagerly by default, like PyTorch, but can compile any function into a static graph with the @tf.function decorator, getting both worlds: easy debugging while developing, a fast frozen graph for production.

Its high-level API, Keras, hides the training loop behind three calls. You compile the model with an optimizer, a loss, and metrics; you fit it on data; you predict. The same network as the PyTorch example, the Keras way:

import tensorflow as tf
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(2,)),                     # declare the input shape explicitly
    keras.layers.Dense(8, activation="relu"),
    keras.layers.Dense(1),
])
model.compile(optimizer="sgd", loss="mse")   # wire up the loop
model.fit(X, y, epochs=100, verbose=0)        # the loop runs for you
model.predict(X_new)

The contrast is the whole story: PyTorch hands you the loop; Keras hands you fit. The same four steps (zero, forward, backward, step) run inside fit — you just do not type them. Keras trades control for brevity, which is exactly right when your model is a standard architecture and you want a baseline fast.

Where TensorFlow still leads is the deployment and edge ecosystem: TF Serving for high-throughput model servers, TensorFlow Lite for phones and microcontrollers, and TensorFlow.js for running models directly in the browser. That maturity at the edge is the main reason teams still pick it.

JAX: functional autograd and compilation

Worth knowing as the third framework: JAX treats differentiation as a function transform. Instead of an object that remembers its history, you write a plain math function and wrap it: grad(f) returns a new function that computes the derivative, jit(f) compiles it to fused GPU/TPU code via XLA, and vmap(f) auto-vectorizes it over a batch dimension. The mental shift: PyTorch attaches gradients to tensors; JAX attaches them to functions.

import jax, jax.numpy as jnp

def loss(w, x, y):
    return jnp.mean((x @ w - y) ** 2)        # mean squared error

grad_fn = jax.jit(jax.grad(loss))            # d loss / d w, compiled
g = grad_fn(w, X, Y)                          # one fused, batched call

This functional style (pure functions, explicit randomness, no in-place state) is why JAX powers much large-scale research and a large share of TPU training. It is less batteries-included than PyTorch for everyday modeling, but very strong when you want gradients of gradients or aggressive compilation.

When to choose which

Concern	PyTorch	TensorFlow / Keras	JAX
Graph	dynamic (define-by-run)	static-capable (`@tf.function`)	functional + `jit` (XLA)
Training loop	you write it (explicit)	`model.fit` (managed)	you write it (functional)
Debugging	normal Python, easy	easier than old TF, still graph-y	pure functions, traces can surprise
Research mindshare	dominant	minority	rising (TPU, large scale)
Edge / mobile / browser	improving (ExecuTorch)	mature (TFLite, TF.js)	limited
Best fit	research, custom training	locked-down production & edge	TPU, high-perf research

flowchart TD
  Q{What matters most?} -->|research, novel<br/>architectures| PT[PyTorch]
  Q -->|fast standard model<br/>fit/compile| K[Keras]
  Q -->|phone / browser /<br/>microcontroller| TF[TF Lite / TF.js]
  Q -->|TPU / heavy<br/>compilation| J[JAX]
  Q -->|just get a<br/>baseline running| K

Tip

Rule of thumb: prototyping or doing research → PyTorch. Need a standard model trained fast, or deployed to a phone or browser → Keras/TensorFlow. Squeezing TPUs or doing math-heavy research → JAX. In practice the gap keeps shrinking, and models from the major frameworks can generally be converted to the shared ONNX format (PyTorch through its built-in exporter, others through converter tools), so a model trained in one framework can be served by another.

ONNX: the shared interchange format

A recurring practical need: train in one framework, serve in another (or on specialized hardware). ONNX (Open Neural Network Exchange) is the common file format that makes this possible — a framework-agnostic description of the computation graph plus weights. The analogy: ONNX is the PDF of models. You author in Word or Pages (PyTorch, TF), export to PDF (ONNX), and anyone can open and run it without your original editor.

import torch
torch.onnx.export(model, X[:1], "model.onnx",   # one sample fixes input shape
                  input_names=["x"], output_names=["y"])

# then serve anywhere ONNX Runtime runs — CPU, GPU, mobile, browser
import onnxruntime as ort
sess = ort.InferenceSession("model.onnx")
out = sess.run(None, {"x": X[:1].numpy()})

ONNX Runtime applies graph optimizations (operator fusion, constant folding) and runs on many backends, so the same exported file serves a cloud GPU, a laptop CPU, or a phone — decoupling the training framework from the serving target.

31.3 — The Hugging Face ecosystem

For most of the last decade, “use a deep model” meant building and training one yourself. Today it usually means downloading one. Hugging Face has become the default hub for pretrained models, datasets, and the glue around them — the GitHub of models. The intuition: instead of forging your own engine, you pull a proven engine off the shelf and bolt it into your car; you only build from scratch when nothing on the shelf fits.

Three libraries matter most:

transformers — thousands of pretrained Transformer models (BERT, GPT-style, ViT, Whisper) behind one uniform API, plus the high-level pipeline that bundles tokenizer + model + post-processing into a single callable.
datasets — memory-mapped, streamable datasets so you can load something larger than RAM with one line.
peft — parameter-efficient fine-tuning (LoRA and friends), which adapts a giant model by training a tiny number of extra weights.

The fastest path from zero to a working model is the pipeline — three lines to a sentiment classifier, with the model downloaded and cached automatically:

from transformers import pipeline

clf = pipeline("sentiment-analysis")          # downloads a default model
print(clf("Hugging Face makes this absurdly easy"))
# [{'label': 'POSITIVE', 'score': 0.9998}]

When you need control, drop one level to the tokenizer + model pair. The tokenizer turns text into the integer IDs the model expects; the model returns raw scores (logits) you convert to probabilities with softmax:

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

name = "distilbert-base-uncased-finetuned-sst-2-english"
tok   = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

batch  = tok(["great film", "total waste"], padding=True, return_tensors="pt")
logits = model(**batch).logits                # raw scores, shape (2, 2)
probs  = torch.softmax(logits, dim=-1)        # -> probabilities per class
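
For readers who want to see what the softmax step does to the logits, here is a NumPy sketch. The logit values are made up, and the max-subtraction is a standard numerical-stability trick rather than anything specific to this model.

```python
import numpy as np

def softmax(logits):
    # Subtract the row max first: same result, but exp() cannot overflow.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

logits = np.array([[-2.0, 3.0],     # made-up scores for two sentences
                   [ 4.0, -1.0]])
probs = softmax(logits)
print(probs.sum(axis=-1))           # each row sums to 1
print(probs.argmax(axis=-1))        # [1 0]
```

The predicted class is the index of the largest probability in each row, which is also the index of the largest logit.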

Fine-tuning and LoRA

Fine-tuning a full large language model means updating all its weights — billions of numbers — which needs heavy GPUs and storage. LoRA (Low-Rank Adaptation) is the lazy, effective shortcut: freeze the original weights and inject small trainable matrices alongside them, training only those. The analogy: rather than re-teaching a fluent translator the whole language, you hand them a small phrasebook of your domain’s terms.

The trick is one observation: the change a layer needs is usually simple, even when the layer is huge. So instead of learning a full grid of edits, LoRA learns two thin strips and multiplies them to reconstruct that grid.

Concrete numbers make it click. Say a layer is a $1000 \times 1000$ weight matrix: that is 1,000,000 numbers to retrain. Pick a tiny rank $r = 8$. Now you train only a $1000 \times 8$ strip and an $8 \times 1000$ strip: $8000 + 8000 = 16{,}000$ numbers, about 1.6% of the original. You froze a million weights and trained sixteen thousand, and their product still adds a meaningful correction.

Formally, LoRA replaces a weight update $\Delta W$ (a big $d \times k$ matrix) with a product of two skinny matrices:

\[W' = W + \Delta W \approx W + B A, \qquad B \in \mathbb{R}^{d\times r},\; A \in \mathbb{R}^{r\times k},\; r \ll \min(d,k)\]

In words: instead of learning a full grid of changes, learn two thin strips whose product reconstructs an approximate change — because the rank $r$ is tiny, you train a few thousand numbers instead of millions. Also written: the effective forward pass is $h = Wx + B(Ax)$, i.e. the frozen layer’s output plus a low-rank correction; only $A$ and $B$ receive gradients, so the trainable count drops from $d\cdot k$ to $r(d+k)$.
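
The parameter arithmetic and the two equivalent forward passes can be checked in a few lines of NumPy. This is a sketch with random stand-in matrices, not a training procedure; it omits the scaling factor that practical LoRA implementations apply to the update, and the "after training" value of `B` is simply random.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 1000, 1000, 8

W = rng.normal(size=(d, k))            # frozen pretrained weight (random stand-in)
A = rng.normal(size=(r, k)) * 0.01     # trainable, small random init
B = np.zeros((d, r))                   # trainable, zero init so BA = 0 at the start

full_params = W.size                   # 1,000,000
lora_params = A.size + B.size          # 16,000
print(lora_params, lora_params / full_params)   # 16000 0.016

x = rng.normal(size=k)
h_start = W @ x + B @ (A @ x)          # equals W @ x while B is still zero

B = rng.normal(size=(d, r)) * 0.01     # pretend training has updated B
h_lora   = W @ x + B @ (A @ x)         # cheap path: never forms the d x k product
h_merged = (W + B @ A) @ x             # same result with the update merged into W
print(np.allclose(h_lora, h_merged))   # True
```

The merged form is why a trained adapter can be folded back into the base weights for inference at no extra cost.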

With peft, wrapping a model for LoRA is a few lines; you then train only the adapter and save just those small weights:

from peft import LoraConfig, get_peft_model

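# target module names vary by architecture; "q_proj"/"v_proj" are the attention projections in many LLMs
cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base_model, cfg)   # base_model: an already-loaded transformers model
model.print_trainable_parameters()
# prints trainable vs. total parameter counts; the trainable share is often a fraction of a percent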

Tip

Rule of thumb: before training anything, search the Hub. If a pretrained model is close, fine-tune it (LoRA if it is large); only train from scratch when nothing fits or your data is truly unlike anything public. Downloading beats forging almost every time.

Warning

Common mistake: loading a tokenizer and a model from different checkpoints. The model expects the exact vocabulary and special tokens its own tokenizer produces; pair a bert-base model with a roberta tokenizer and the IDs are silently wrong: often no crash, just garbage predictions. Always build both from the same name (as in the example above). Also call model.eval() before inference so dropout and batch-norm behave deterministically; from_pretrained returns the model in evaluation mode, but a model you have just trained will still be in training mode.

31.4 — Streaming & real-time

Everything so far assumed a batch: a fixed table sitting on disk, processed once. But many systems must react to data as it arrives — fraud scores within milliseconds of a card swipe, recommendations updated mid-session, dashboards that never stop refreshing. Streaming treats data as an unbounded sequence of events flowing through the system continuously, rather than a finite file processed in one shot. The mental flip: batch is a photo album you flip through once; streaming is a live video feed you never stop watching.

The central new problem is feature freshness: a model’s input features (e.g. “number of transactions in the last 5 minutes”) must be computed from data that is seconds old, not from last night’s batch job. Three tools dominate, and they play different roles — one moves events, two compute over them.

flowchart LR
  P[Producers<br/>apps, sensors] -->|events| K[(Kafka<br/>durable log)]
  K --> S[Spark Streaming<br/>micro-batches]
  K --> F[Flink<br/>true streaming]
  S --> FS[(Feature store<br/>fresh features)]
  F --> FS
  FS --> M[Online model<br/>score / update]
  style K fill:#fde,stroke:#939
  style F fill:#def,stroke:#369

Kafka: the durable event log

Apache Kafka is the pipe everything else plugs into. It is a distributed, append-only log: producers write events to named topics, and the events are kept in order and retained for a configurable time. Consumers read at their own pace, each tracking an offset — its current position in the log. Because the log is durable and replayable, a consumer that crashes can resume exactly where it left off, and a brand-new consumer can replay history from the beginning to rebuild its state. Kafka itself does no computation; it is the reliable backbone that decouples producers from consumers, so a slow or failed consumer never blocks the producers.

The everyday analogy: Kafka is a conveyor belt with numbered slots that never erases. Each worker (consumer) remembers the slot number it last picked up, so a worker who steps away can return and resume from exactly that slot — and a new worker can rewind to slot 0 and replay everything.

# producer: emit a transaction event
from kafka import KafkaProducer      # pip install kafka-python
import json
# with no bootstrap_servers argument, the client connects to localhost:9092
prod = KafkaProducer(value_serializer=lambda v: json.dumps(v).encode())
prod.send("transactions", {"user": 42, "amount": 99.5})
prod.flush()                         # block until buffered events are actually sent

# consumer: read events as they arrive, scoring each
from kafka import KafkaConsumer
con = KafkaConsumer("transactions",
                    value_deserializer=lambda b: json.loads(b))
for msg in con:                      # blocks, yields events forever
    txn = msg.value
    score = model.predict_one(txn)   # model: a hypothetical online scorer, defined elsewhere

[Figure: the log abstraction. Events are appended on the right, and each consumer reads independently at its own offset.]

[Animation: three events move end to end. A producer drops each one on the belt, it travels the durable log, and a consumer picks it up and scores it.]

Spark Streaming vs Flink: micro-batch vs true streaming

Once events are in Kafka, you need an engine to compute over them — windowed aggregations, joins, feature updates. The two leaders differ in their core model:

Spark Structured Streaming uses micro-batching: it slices the stream into tiny batches (say, every second) and runs the ordinary Spark batch engine on each slice. This is simple and reuses all of Spark’s mature batch tooling, but latency is bounded below by the batch interval — you cannot react faster than one batch, so think sub-second to seconds.
Apache Flink is a true streaming engine: it processes each event the moment it arrives, giving millisecond latency. It also has first-class support for event time (ordering events by when they actually happened, not when they were received) and stateful windows, which matters when events arrive late or out of order — a common reality with mobile clients on flaky networks.

The analogy: Spark is a bus that leaves every minute (cheap, predictable, but you wait for the next departure); Flink is a taxi that leaves the instant you arrive (low wait, more moving parts).
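The bus-versus-taxi difference can be put into numbers with a small simulation. This is a sketch under simplifying assumptions, not a benchmark of either engine: arrival times are made up, batch compute time is ignored, and the per-event cost of 5 ms is an arbitrary illustrative constant.

```python
def micro_batch_latencies(arrivals_ms, interval_ms):
    """Latency if each event waits for the next batch boundary (compute time ignored)."""
    latencies = []
    for t in arrivals_ms:
        next_boundary = (t // interval_ms + 1) * interval_ms
        latencies.append(next_boundary - t)
    return latencies


def per_event_latencies(arrivals_ms, cost_ms):
    """Latency if each event is handled on arrival, paying only a fixed per-event cost."""
    return [cost_ms for _ in arrivals_ms]


arrivals_ms = [120, 450, 990, 1010]          # hypothetical arrival times
batch = micro_batch_latencies(arrivals_ms, interval_ms=1000)
stream = per_event_latencies(arrivals_ms, cost_ms=5)

print(batch)    # [880, 550, 10, 990]
print(stream)   # [5, 5, 5, 5]
```

An event that just misses a departure (the one at 1010 ms) waits almost a full interval, which is why micro-batch latency is tied to the batch interval rather than to how fast the computation itself runs.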

A concrete feature-freshness task: maintain a per-user count of transactions over 5-minute windows, used as a fraud feature. In Flink-style pseudocode the windowing is declarative: you say what window you want, and the engine maintains it as events stream in. The example uses tumbling windows for simplicity; a true rolling count would use a sliding window instead:

# Flink-style: tumbling 5-min window count per user (pseudo-API)
(stream
  .key_by(lambda e: e["user"])             # partition by user
  .window(TumblingEventTimeWindows(minutes=5))
  .aggregate(Count())                       # count events per window
  .sink(feature_store))                     # write fresh feature out

A worked trace makes the window concrete. Suppose user 42 swipes at event-times 10:01, 10:03, 10:04, and 10:07, and we use tumbling (non-overlapping) 5-minute windows aligned to the clock:

Window	Events from user 42	Count feature
10:00–10:05	10:01, 10:03, 10:04	3
10:05–10:10	10:07	1

The swipe at 10:07 starts a fresh count because it falls in the next window. This is a windowed approximation of the “transactions in the last few minutes” signal a fraud model wants, recomputed continuously as events land. (If the 10:04 event arrived late at 10:06 due to a flaky network, event-time processing still files it under the 10:00–10:05 window, provided it arrives within the lateness bound the engine is configured to tolerate; processing-time would wrongly drop it into the later one.)
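The trace above can be reproduced in a few lines of plain Python. This is a sketch, not a streaming engine: it groups a finished list of events rather than maintaining state incrementally, it has no watermarks, and the `arrival_time` field is a hypothetical addition that models the delayed 10:04 event.

```python
from collections import Counter

WINDOW_MIN = 5  # tumbling window size in minutes


def to_minutes(hhmm):
    """Convert 'HH:MM' to minutes since midnight."""
    h, m = hhmm.split(":")
    return int(h) * 60 + int(m)


def window_label(minute):
    """Label of the clock-aligned tumbling window containing `minute`."""
    start = (minute // WINDOW_MIN) * WINDOW_MIN
    end = start + WINDOW_MIN
    return f"{start // 60:02d}:{start % 60:02d}-{end // 60:02d}:{end % 60:02d}"


def tumbling_counts(events, time_field):
    """Count events per (user, window), windowing on the chosen timestamp field."""
    counts = Counter()
    for e in events:
        counts[(e["user"], window_label(to_minutes(e[time_field])))] += 1
    return dict(counts)


events = [
    {"user": 42, "event_time": "10:01", "arrival_time": "10:01"},
    {"user": 42, "event_time": "10:03", "arrival_time": "10:03"},
    {"user": 42, "event_time": "10:04", "arrival_time": "10:06"},  # delayed in transit
    {"user": 42, "event_time": "10:07", "arrival_time": "10:07"},
]

by_event_time = tumbling_counts(events, "event_time")
by_arrival = tumbling_counts(events, "arrival_time")

print(by_event_time)  # {(42, '10:00-10:05'): 3, (42, '10:05-10:10'): 1}
print(by_arrival)     # {(42, '10:00-10:05'): 2, (42, '10:05-10:10'): 2}
```

Windowing on event time matches the table; windowing on arrival time shifts the delayed swipe into the later window and changes both counts.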

This is the bridge to online learning (a model that updates incrementally as each labeled event arrives) and online inference (scoring each event live). The feature store sits in the middle: streaming jobs keep it fresh, the model reads from it at score time, and — crucially — the same feature definitions are reused for offline training to avoid train/serve skew (when the features a model sees in production differ from the ones it was trained on).
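One way to picture a shared feature definition is a single function that both the offline and online paths call. The sketch below is illustrative only: `txn_count_last_5min`, `build_training_rows`, and `OnlineFeature` are hypothetical names, timestamps are plain seconds, and a real feature store would add storage, keys per user, and expiry of old events.

```python
def txn_count_last_5min(txn_times_s, now_s, window_s=300):
    """Single source of truth: count transactions in (now - window, now]."""
    return sum(1 for t in txn_times_s if now_s - window_s < t <= now_s)


def build_training_rows(history_s, label_times_s):
    """Offline path: compute the feature as of each historical label time."""
    return [txn_count_last_5min(history_s, now) for now in label_times_s]


class OnlineFeature:
    """Online path: accumulate events as they arrive, query at score time."""

    def __init__(self):
        self.seen = []

    def update(self, t_s):
        self.seen.append(t_s)

    def value(self, now_s):
        return txn_count_last_5min(self.seen, now_s)


history = [0, 60, 200, 400]          # hypothetical transaction times in seconds
offline = build_training_rows(history, label_times_s=[300, 420])

online = OnlineFeature()
for t in (0, 60, 200):
    online.update(t)
at_300 = online.value(300)           # scored before the event at 400 exists
online.update(400)
at_420 = online.value(420)

print(offline, [at_300, at_420])     # [2, 2] [2, 2]
```

Because the window edges and the comparison operators live in one place, a change to the definition reaches training and serving together.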

Tool	Role	Model	Latency	Pick when
Kafka	transport / storage	append-only log	n/a (buffer)	you need a durable, replayable event backbone
Spark Streaming	compute	micro-batch	~seconds	you already use Spark batch and seconds are fine
Flink	compute	true streaming	~milliseconds	you need low latency or correct event-time windows

Tip

Rule of thumb: Kafka is the noun (where events live); Spark and Flink are the verbs (what computes over them). Reach for Flink when latency or out-of-order event time matters; reach for Spark Streaming when you already live in the Spark ecosystem and second-scale latency is acceptable.

Warning

Common mistake: computing a streaming feature differently from its batch (training) version, for example with a slightly different window edge or different null-handling. The model then sees inputs at serving time that never appeared in training. This train/serve skew is a leading cause of “great offline, broken online” models; share one feature definition across both the batch and stream paths.

31.5 — Experiment tracking & model packaging

There is a stage between “I wrote a training loop” and “this model serves real users” that the tools above quietly skip over: keeping track of what you ran. Train fifty variants over a week — different learning rates, feature sets, random seeds — and a folder of model_final_v3_REALLY_final.pt files becomes a graveyard you cannot reason about. Experiment tracking and model packaging tools fix this, and the dominant one is MLflow (with Weights & Biases a popular hosted alternative).

The intuition: a lab notebook for machine learning. A chemist does not trust memory for which reagent concentration gave which yield — they write every run in a notebook. An experiment tracker is that notebook, automated: for each run it records the parameters you chose, the metrics you got, and the artifacts you produced (the trained model file, plots, the exact data version), all queryable later so you can ask “which run had the best validation AUC, and what learning rate did it use?”

flowchart LR
  R[Training run] -->|log_param| P[params: lr, depth]
  R -->|log_metric| M[metrics: auc, loss]
  R -->|log_model| A[artifact: model + env]
  P --> T[(Tracking store<br/>queryable history)]
  M --> T
  A --> Reg[(Model Registry<br/>staging → production)]
  style T fill:#cde,stroke:#369
  style Reg fill:#dec,stroke:#393

The everyday MLflow idiom is a context manager wrapping your training, with three log calls:

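import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)   # toy data for illustration
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
params = {"learning_rate": 0.1, "max_depth": 4}
model = GradientBoostingClassifier(**params)
with mlflow.start_run():
    mlflow.log_params(params)                        # what you chose
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    mlflow.log_metric("val_auc", auc)                # what you got
    mlflow.sklearn.log_model(model, "model")         # the artifact, with its deps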

Two ideas earn their keep here. Reproducibility: a run logs not just the model but the code version, the parameters, and the Python environment, so a result is re-runnable months later rather than a one-time fluke. And the Model Registry: trained artifacts get versioned and promoted through stages (Staging → Production; recent MLflow releases deprecate fixed stages in favor of model version aliases, but the idea is the same), so deployment pulls “the current production model” by name instead of someone hand-copying a file. That handoff, from the experiment notebook to a named, versioned, served model, is exactly the seam where this chapter’s tooling meets the MLOps & Deployment chapter.
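The two ideas can be sketched without any tracking server. The code below is a toy stand-in, not MLflow's API: `Tracker` and `Registry` are hypothetical classes, and the learning rates and AUC values are made-up numbers chosen only to show the query and the promotion step.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    run_id: int
    params: dict
    metrics: dict = field(default_factory=dict)


class Tracker:
    """Queryable history of runs (the lab notebook)."""

    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics):
        run = Run(len(self.runs), dict(params), dict(metrics))
        self.runs.append(run)
        return run

    def best(self, metric):
        return max(self.runs, key=lambda r: r.metrics[metric])


class Registry:
    """Named, versioned models with a stage pointer per name."""

    def __init__(self):
        self.versions = {}   # name -> list of run_ids; list index + 1 is the version
        self.stage = {}      # (name, stage) -> version number

    def register(self, name, run):
        self.versions.setdefault(name, []).append(run.run_id)
        return len(self.versions[name])

    def promote(self, name, version, stage):
        self.stage[(name, stage)] = version

    def resolve(self, name, stage):
        version = self.stage[(name, stage)]
        return version, self.versions[name][version - 1]


tracker = Tracker()
for lr, auc in [(0.3, 0.81), (0.1, 0.87), (0.01, 0.84)]:   # illustrative values
    tracker.log_run({"lr": lr}, {"val_auc": auc})

best = tracker.best("val_auc")             # "which run had the best validation AUC?"
registry = Registry()
version = registry.register("fraud-model", best)
registry.promote("fraud-model", version, "Production")
resolved = registry.resolve("fraud-model", "Production")   # (version, run_id)
```

Deployment code then asks the registry for a name and a stage, and never needs to know which file or run produced the model.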

Tip

Rule of thumb: start logging from your very first serious run, not after the mess. The cost is three lines per run; the payoff is never again wondering which checkpoint produced the number in last month’s slide deck.

31.6 — Quick reference

Tool / term	What it is	When / why
NumPy `ndarray`	typed, contiguous n-dim array	fast vectorized math; the base of the whole stack
Broadcasting	stretch size-1 dims to match shapes	combine differently-shaped arrays with no copy
Pandas `DataFrame`	labeled, index-aligned table	clean/join/group mixed-type tabular data
`groupby`	split–apply–combine	per-group aggregates (mean per city, etc.)
scikit-learn estimator	object with `fit`/`predict`/`transform`	uniform API; swap models by one line
`Pipeline`	chained transformers + estimator	re-fit per fold to prevent data leakage
`ColumnTransformer`	per-column-type transforms	scale numerics + one-hot categoricals together
`GridSearchCV`	CV over hyperparameter grid	leakage-safe tuning of the whole pipeline
Gradient boosting	XGBoost / LightGBM / CatBoost	strongest fast baseline on tabular data
Tensor	n-dim array (CPU or GPU) with autograd	the unit of deep-learning compute
Autograd	record ops, replay backward	`.backward()` gives every gradient (chain rule)
PyTorch loop	`zero_grad→forward→backward→step`	explicit, debuggable; research default
Keras `fit`	managed training loop	standard model trained fast; strong edge story
JAX `grad`/`jit`/`vmap`	functional autograd + compile	TPU and math-heavy, high-performance research
ONNX	framework-agnostic graph + weights	train in one tool, serve in another
LoRA $W+BA$	low-rank adapter, $r(d{+}k)$ params	fine-tune a giant model on a small budget
Kafka	durable, replayable append-only log	the event backbone; decouples producers/consumers
Spark Streaming	micro-batch compute (~seconds)	already in Spark; second-scale latency is fine
Flink	true per-event streaming (~ms)	low latency or correct event-time windows
MLflow	log params/metrics/artifacts + registry	reproducible runs; named, versioned deployment

31.7 — Key takeaways

The Python ML stack layers cleanly: NumPy (typed n-dimensional arrays, vectorized via broadcasting) → Pandas (labeled, index-aligned tabular data, groupby split-apply-combine) → scikit-learn (the uniform fit/predict/transform estimator API).
A scikit-learn Pipeline chains transformers and a final estimator into one object; fitting it per cross-validation fold prevents data leakage from preprocessing. ColumnTransformer routes numeric vs categorical columns, and GridSearchCV tunes the whole thing leakage-safely.
For tabular data, gradient-boosting libraries (XGBoost, LightGBM, CatBoost) are the strongest fast baseline — often beating neural nets — and drop into the scikit-learn API.
Deep learning frameworks share two pillars: GPU tensors and autograd (the mechanized chain rule, triggered by .backward()).
PyTorch uses a dynamic graph and an explicit four-step loop (zero_grad → forward → backward → step) and dominates research. Keras/TensorFlow hide the loop behind compile/fit and lead on edge/mobile/browser deployment (TFLite, TF.js). JAX offers functional autograd (grad/jit/vmap) for TPU and high-performance research. All can be exported to ONNX, natively or through converter tools, the framework-agnostic interchange format that decouples training from serving.
The Hugging Face ecosystem (transformers, datasets, peft) makes downloading a pretrained model the default; LoRA adapts a giant model by training tiny low-rank matrices ($W + BA$) instead of all its weights.
Streaming treats data as unbounded events. Kafka is the durable, replayable log (transport); Spark Streaming computes in micro-batches (~seconds); Flink computes per-event with event-time windows (~milliseconds).
The recurring real-time challenge is feature freshness and avoiding train/serve skew — share one feature definition across batch and stream.
Experiment tracking (MLflow, W&B) is the lab notebook for ML: log params, metrics, and artifacts per run, and promote versioned models through a registry for reproducible, named deployment.

31.8 — See also

Data Preprocessing — what the Pandas and scikit-learn transformers actually do to features.
Ensemble Methods — the boosting and bagging theory behind XGBoost, LightGBM, and CatBoost.
Calculus & Differentiation — the chain rule that autograd mechanizes.
Neural Networks (Core) — the nn.Module layers and training dynamics framed here as code.
Transformers & Attention — the architecture behind most Hugging Face models, and where LoRA fine-tuning is applied.
MLOps & Deployment — feature stores, model serving, experiment tracking, and the production lifecycle around these tools.
AI Infrastructure & Efficient Inference — GPUs, ONNX, and serving the tensors these frameworks produce.
Model Evaluation & Tuning — cross-validation, the setting where Pipelines prevent leakage.
Recommender Systems and Anomaly & Fraud Detection — canonical consumers of streaming features and online inference.

↪ The thread continues → Chapter 32 · 🧭 Search & Problem Solving

We’ve followed the statistical, learn-from-data branch of AI to its tooling. Now we step back to AI’s older, deeper root — the symbolic tradition — beginning with search.
