Chapter 39 — 🌠 Frontier & Emerging Directions

The previous chapters mapped what AI can already do reliably. This one looks at the active research frontier — the ideas that are reshaping how models are trained, adapted, and deployed, and the hard problems still standing between today’s systems and anything we’d call general intelligence. These are not finished techniques but live directions: powerful, partly understood, and changing fast. The goal is to give you durable intuition for each so the next paper or product announcement makes sense.

🧭 In context: the leading edge of ML research · used to build and adapt foundation models that learn with little labeled data, retain skills over time, and act in the world · the one key idea: learn structure from the data itself, then transfer and adapt cheaply instead of training from scratch.

💡 Remember this: almost every frontier idea is the same move — learn structure from cheap, unlabeled data once, then transfer and adapt it cheaply rather than retraining from scratch.

39.1 — Self-supervised and contrastive learning

The bottleneck in classical supervised learning is labels: humans must tag every example, which is slow and expensive. Self-supervised learning (SSL) sidesteps this by inventing a label from the data itself. You hide part of the input and ask the model to predict it. No human annotation is needed, so you can train on the entire internet.

The hidden-prediction task is called a pretext task. Mask a word in a sentence and predict it (this is how BERT is pretrained; GPT-style models use the closely related task of predicting the next token; see Attention & Transformers and Large Language Models). Mask a patch of an image and reconstruct it. Predict whether two video frames are in the right order. The model can only solve these puzzles by learning genuine structure (grammar, object shapes, physics), and that learned structure is what we actually want. SSL is the engine behind most foundation models: a single large model pretrained on broad data, then reused for many tasks.
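
As a rough illustration of how a pretext task manufactures labels from raw text, the sketch below turns one sentence into (masked input, target word) pairs. The function name `make_masked_examples` and the `[MASK]` string are illustrative choices, and real pipelines mask random subword tokens rather than every whole word in turn.

```python
def make_masked_examples(sentence, mask_token="[MASK]"):
    """Turn one unlabeled sentence into (masked sentence, hidden word) training pairs."""
    tokens = sentence.split()
    examples = []
    for i, target in enumerate(tokens):
        # Hide position i; the hidden word itself becomes the label
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        examples.append((" ".join(masked), target))
    return examples

pairs = make_masked_examples("the cat sat on the mat")
for masked_sentence, label in pairs:
    print(f"{masked_sentence!r} -> {label!r}")
```

No human wrote any of those labels: every training pair comes from the sentence itself, which is the whole point of self-supervision.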

Contrastive learning is one especially powerful flavor of SSL. The intuition: pull together things that are “the same” and push apart things that are “different,” without ever naming what they are. Take an image, make two random augmentations of it (a crop, a color shift) — these form a positive pair, two views of the same underlying thing. Every other image in the batch is a negative. Train the encoder so the two positive views land close in embedding space and negatives land far.

The standard objective is the InfoNCE loss. Think of it as a multiple-choice quiz: the anchor $z_i$ must pick its true twin $z_j$ out of a lineup that contains the twin plus every negative. The “score” for each candidate is its similarity to the anchor (cosine, divided by temperature $\tau$); a softmax turns those scores into pick-probabilities, and the loss is just $-\log(\text{probability assigned to the true twin})$, which is low when the twin is the clear winner and high when it is not:

\[ \mathcal{L}_{i} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\displaystyle\sum_{k \neq i} \exp(\text{sim}(z_i, z_k)/\tau)} \]

The denominator sums over every candidate but the anchor itself ($k \neq i$) — so it counts the true twin and all the negatives. That makes this exactly an ordinary softmax cross-entropy whose “correct class” is the positive partner.

In words: “Out of everyone in the lineup, how confidently does the model pick my true twin? — the more confident, the lower the loss.” Also written: numerator term minus a log-sum-exp, $\mathcal{L}_i = -\,\text{sim}(z_i,z_j)/\tau + \log\sum_{k\neq i}\exp(\text{sim}(z_i,z_k)/\tau)$.

import numpy as np
# 3 items, 2 augmented views each -> 6 embeddings; pos pair = same item
np.random.seed(0)
z = np.random.randn(6, 8)
z /= np.linalg.norm(z, axis=1, keepdims=True)   # L2 normalize
sim = z @ z.T / 0.1                              # cosine / temperature
np.fill_diagonal(sim, -1e9)                      # mask self only (k=i)
# positives: 0<->1, 2<->3, 4<->5; denominator = all off-diagonal (pos + negs)
pos = np.array([1,0,3,2,5,4])
num = sim[np.arange(6), pos]                     # positive logit per row
den = np.log(np.exp(sim).sum(1))                 # log-sum over pos + negs
loss = -np.mean(num - den)
print(round(float(loss), 3))   # compare with log(5) ≈ 1.609, the loss of a uniform guess

The reference point is $\log 5 \approx 1.609$: the loss when the model scores all five candidates equally and so guesses uniformly among them. Untrained, random embeddings usually score somewhat worse than that at $\tau = 0.1$, because dividing by a small temperature amplifies chance differences in similarity and some rows end up confidently wrong. As training aligns each positive pair, the numerator grows and the loss falls toward zero; that drop is the learning signal.

[Figure: positives drawn together, negatives pushed apart on the unit circle.]

Tip

The temperature $\tau$ controls how harshly negatives are penalized. Small $\tau$ (e.g. 0.05) makes the model focus on the hardest negatives — sharp but unstable. Larger $\tau$ smooths the gradient. It is one of the most impactful knobs in contrastive training.
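
To see the temperature effect in numbers, here is a minimal sketch that applies a softmax to the same similarity scores at two temperatures. The similarity values are made up for illustration: one positive, one hard negative, two easy negatives.

```python
import math

def softmax(scores, tau):
    """Softmax over similarity scores divided by temperature tau."""
    scaled = [s / tau for s in scores]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hard_negative_share(probs):
    """Fraction of all negative probability mass that sits on the hardest negative."""
    return probs[1] / (1.0 - probs[0])

# Hypothetical cosine similarities: [positive, hard negative, easy negative, easy negative]
sims = [0.9, 0.8, 0.1, -0.3]

sharp = softmax(sims, tau=0.05)
smooth = softmax(sims, tau=0.5)
print("tau=0.05:", [round(p, 3) for p in sharp], round(hard_negative_share(sharp), 3))
print("tau=0.5: ", [round(p, 3) for p in smooth], round(hard_negative_share(smooth), 3))
```

The gradient on each negative is proportional to its softmax probability, so at the small temperature nearly all of the push-apart signal lands on the single hardest negative, while at the larger one it is spread across all of them.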

Warning

Naïve contrastive learning can collapse: the encoder maps everything to the same point, making all positives trivially close. Defenses include enough negatives (large batches or a memory bank), stop-gradient tricks (BYOL, SimSiam), or redundancy-reduction losses (Barlow Twins). If your embeddings all look identical, collapse is the first suspect.
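
One cheap way to check for collapse is to look at how much each embedding dimension varies across a batch. The sketch below is a diagnostic under assumptions: the helper name `mean_feature_std` is hypothetical, and what counts as "too small" depends on your embedding scale.

```python
import random
import statistics

def mean_feature_std(embeddings):
    """Average per-dimension standard deviation across a batch of embeddings."""
    dims = len(embeddings[0])
    per_dim = [statistics.pstdev([e[d] for e in embeddings]) for d in range(dims)]
    return sum(per_dim) / dims

rng = random.Random(0)
# Healthy encoder (simulated): embeddings differ from input to input
healthy = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(64)]
# Collapsed encoder (simulated): every input maps to the same point
collapsed = [[0.3] * 8 for _ in range(64)]

print("healthy:  ", round(mean_feature_std(healthy), 3))
print("collapsed:", round(mean_feature_std(collapsed), 3))
```

A value near zero means every input is landing on essentially the same vector, which is the signature described in the warning above.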

In a real framework. You rarely hand-code InfoNCE in production. A faithful, idiomatic PyTorch version of the SimCLR loss over a batch of $2N$ stacked views:

import torch, torch.nn.functional as F

def info_nce(z, tau=0.1):
    # z: (2N, d) — rows 0..N-1 and N..2N-1 are the two views; i and i+N are positives
    z = F.normalize(z, dim=1)
    sim = z @ z.T / tau                                  # (2N, 2N) cosine / temperature
    n = z.shape[0] // 2
    sim.fill_diagonal_(float("-inf"))                    # mask self (k = i)
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])  # i's positive index
    return F.cross_entropy(sim, targets.to(z.device))    # softmax CE = InfoNCE

z = torch.randn(8, 16)            # 4 items × 2 views
print(round(info_nce(z).item(), 3))   # reference: log(7) ≈ 1.946 is a uniform guess over 7 candidates; random rows usually land above it

The whole loss is a single cross_entropy once you set each row’s target to its positive partner’s index — the framework’s autograd then backpropagates the pull-together / push-apart signal for free.

39.2 — Transfer, multi-task, and meta-learning

Once you have a pretrained foundation model, you rarely train from scratch again. Transfer learning reuses the knowledge in a model trained on one task as a starting point for another. The early layers of a vision model learn edges and textures useful for any image task; you keep them and only retrain the final layers (fine-tuning) on your small target dataset. This is why a hospital with 500 labeled X-rays can build a strong classifier — it stands on a model that already saw millions of images.

Multi-task learning (MTL) trains one model on several tasks at once, sharing a common representation. The tasks regularize each other: features useful for predicting sentiment also help detect topic, so learning both gives a better backbone than either alone. The shared trunk forks into per-task heads.

flowchart LR
  X[Input] --> S[Shared encoder]
  S --> H1[Head A: sentiment]
  S --> H2[Head B: topic]
  S --> H3[Head C: language ID]
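
The shared-trunk layout in the diagram can be sketched in a few lines of plain Python. All weights below are hand-picked placeholders, and only two of the three heads are shown to keep it short; a real model would learn these by minimizing a sum of per-task losses.

```python
def linear(x, W):
    """Matrix-vector product; W holds one row of weights per output unit."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

def relu(v):
    return [max(0.0, a) for a in v]

# Hypothetical weights, chosen by hand for illustration only
W_shared = [[0.5, -0.2, 0.1],
            [0.3, 0.8, -0.5]]                 # 3 inputs -> 2 shared features
heads = {
    "sentiment": [[1.0, -1.0]],               # 2 features -> 1 score
    "topic": [[0.2, 0.4], [-0.3, 0.9]],       # 2 features -> 2 scores
}

def forward(x):
    h = relu(linear(x, W_shared))             # computed once, reused by every head
    return {task: linear(h, W) for task, W in heads.items()}

out = forward([1.0, 2.0, 3.0])
print(out)
```

Because every head reads the same `h`, a gradient from any task's loss flows back into `W_shared`, which is how the tasks end up regularizing each other.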

Transfer learning in PyTorch. The everyday version of transfer is “freeze the backbone, retrain the head.” A few lines with a pretrained ResNet:

import torch, torchvision
net = torchvision.models.resnet18(weights="IMAGENET1K_V1")
for p in net.parameters():            # freeze the pretrained backbone
    p.requires_grad = False
net.fc = torch.nn.Linear(net.fc.in_features, 2)   # new head for YOUR 2 classes
# only net.fc.parameters() carry gradients -> trains on tiny data, fast
opt = torch.optim.Adam(net.fc.parameters(), lr=1e-3)

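Only the final linear layer receives gradient updates; the millions of frozen weights below it stay as ImageNet pretraining left them. (In training mode the BatchNorm layers still update their running statistics; put the network in eval mode with `net.eval()` if those should stay fixed too.) That is the 500-X-rays scenario in code.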

Meta-learning, or learning to learn, goes one level up. Instead of learning a single task, the model learns how to adapt quickly to new tasks from very few examples. The classic algorithm is MAML (Model-Agnostic Meta-Learning). It searches for an initialization $\theta$ such that, for any new task, a single gradient step lands on a good task-specific solution.

The trick is a nested loop. In the inner loop, you take one task and do one (or a few) gradient steps from the shared start; with a single step the adapted parameters are $\theta'_t = \theta - \alpha \nabla_\theta \mathcal{L}_t(\theta)$. In the outer loop, you update the shared $\theta$ so that the loss measured after adaptation is as low as possible:

\[ \theta \leftarrow \theta - \beta \nabla_\theta \sum_{t} \mathcal{L}_t\big(\theta - \alpha \nabla_\theta \mathcal{L}_t(\theta)\big) \]

In words: “Adjust the shared starting weights so that, after one quick practice step on each task, the result is as good as possible — averaged over all tasks.” You optimize the post-practice score, not the score now. Also written: unrolling the inner step, $\theta \leftarrow \theta - \beta \sum_t (I - \alpha \nabla^2_\theta \mathcal{L}_t(\theta))\,\nabla_{\theta'}\mathcal{L}_t(\theta'_t)$ — the $\nabla^2$ (a Hessian) is the second-order term; dropping it ($I - \alpha\nabla^2 \approx I$) gives first-order MAML.

The intuition: you are not optimizing for good performance now, but for being one step away from good performance on whatever comes next — a launchpad, not a destination.
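
The unrolled meta-gradient above is easy to sanity-check numerically. The sketch below assumes a one-dimensional quadratic task loss (so the Hessian is the constant 2) and compares the analytic expression with a finite-difference estimate; the function names are illustrative.

```python
def task_loss(theta, c):
    return (theta - c) ** 2

def task_grad(theta, c):
    return 2.0 * (theta - c)

def post_adapt_loss(theta, c, alpha):
    """Loss after one inner gradient step, viewed as a function of the init theta."""
    adapted = theta - alpha * task_grad(theta, c)
    return task_loss(adapted, c)

def meta_grad(theta, c, alpha):
    """Full second-order MAML gradient: (1 - alpha * Hessian) * grad at adapted params."""
    adapted = theta - alpha * task_grad(theta, c)
    hessian = 2.0                                  # second derivative of (theta - c)^2
    return (1.0 - alpha * hessian) * task_grad(adapted, c)

def first_order_meta_grad(theta, c, alpha):
    """First-order MAML: drop the Hessian factor."""
    adapted = theta - alpha * task_grad(theta, c)
    return task_grad(adapted, c)

theta, c, alpha, eps = 1.0, 3.0, 0.1, 1e-5
numeric = (post_adapt_loss(theta + eps, c, alpha)
           - post_adapt_loss(theta - eps, c, alpha)) / (2 * eps)
exact = meta_grad(theta, c, alpha)
approx = first_order_meta_grad(theta, c, alpha)
print(round(exact, 4), round(numeric, 4), round(approx, 4))
```

The full gradient matches the finite-difference slope, while the first-order version keeps the same sign but a different magnitude. That is the trade first-order MAML makes: a cheaper gradient that still points the right way on problems like this one.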

A tiny worked example makes the launchpad idea concrete. Suppose each task $t$ is “find the minimum of $\mathcal{L}_t(\theta) = (\theta - c_t)^2$” for some task-specific center $c_t$, and the two tasks you train on have centers $c_1 = 2$ and $c_2 = 8$. The gradient is $\nabla\mathcal{L}_t = 2(\theta - c_t)$. With inner step $\alpha = 0.25$, one inner update from a shared start $\theta$ on task $t$ gives $\theta'_t = \theta - 0.25\cdot 2(\theta - c_t) = (\theta + c_t)/2$: it moves halfway to the task’s optimum, so the post-adaptation loss $(\theta'_t - c_t)^2 = (\theta - c_t)^2/4$ still depends on where you started. (With $\alpha = 0.5$ the step would land exactly on $c_t$ from any $\theta$, every initialization would be equally good, and there would be nothing to meta-learn.)

# MAML intuition: best init is the one closest to all task optima at once
import numpy as np
centers = np.array([2.0, 8.0])      # task optima c1, c2
alpha, beta = 0.25, 0.1
theta = 0.0                          # shared init
for _ in range(200):
    grad = 0.0
    for c in centers:
        adapted = theta - alpha*2*(theta - c)       # inner step -> halfway to c
        # chain rule: d/dtheta (adapted-c)^2 = 2*(adapted-c) * d(adapted)/d(theta)
        grad += 2*(adapted - c) * (1 - 2*alpha)
    theta -= beta*grad                              # outer update
print(round(theta, 3))   # -> 5.0, the midpoint: equidistant from both tasks

The meta-learned init converges to $\theta = 5$, the midpoint of the two task centers: the single starting point from which one inner step leaves the lowest total post-adaptation loss across both tasks. That is exactly what “an initialization that adapts fast” means.

flowchart TB
  M[Meta-params θ] --> A1[Inner: adapt to task 1 → θ′₁]
  M --> A2[Inner: adapt to task 2 → θ′₂]
  A1 --> E[Evaluate adapted params]
  A2 --> E
  E --> U[Outer: update θ to improve post-adaptation loss]
  U -.next meta-step.-> M

Tip

A simple rule: transfer = reuse one model on one new task; multi-task = one model, many tasks at once; meta-learning = learn an initialization or strategy that makes future learning fast. In-context learning in LLMs (Chapter 23) is often described as a form of meta-learning that emerged from scale rather than from an explicit meta-objective.

39.3 — Few-shot and zero-shot learning

Humans recognize a new animal from a single picture. Standard deep nets need thousands. Few-shot learning aims to classify from only a handful of labeled examples per class; zero-shot learning classifies categories the model has never seen a single example of, using only a description.

Few-shot is usually framed as N-way K-shot: given a small support set of N classes with K examples each, classify a new query point. Metric-based methods are the cleanest approach: embed everything into a space where same-class points cluster, then classify by distance. Prototypical networks compute one prototype per class — the mean embedding of its K support examples — and assign a query to the nearest prototype.

\[ c_n = \frac{1}{K} \sum_{i \in \text{class } n} f(x_i), \qquad \hat{y} = \arg\min_n \; \lVert f(x_q) - c_n \rVert \]

In words: “Average each class’s few examples into one representative point (its prototype), then label a new point by whichever prototype it sits closest to.” Also written: the assignment is equivalently a softmax over negative squared distances, $\hat y = \arg\max_n \,\text{softmax}_n(-\lVert f(x_q)-c_n\rVert^2)$ — nearest-prototype and lowest-loss class coincide.

import numpy as np
# 2-way 3-shot: class A near (1,1), class B near (4,4)
A = np.array([[1,1.2],[0.8,1],[1.1,0.9]])
B = np.array([[4,3.8],[4.2,4],[3.9,4.1]])
cA, cB = A.mean(0), B.mean(0)          # prototypes
q = np.array([3.7, 4.2])               # query
print("A" if np.linalg.norm(q-cA) < np.linalg.norm(q-cB) else "B")  # -> B
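
The N-way K-shot framing is usually implemented as an episode sampler: each training or evaluation step draws a fresh small task. The sketch below is one plausible way to write it; the function name, the toy dataset, and the example IDs are all hypothetical.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, n_query, rng):
    """Draw one N-way K-shot episode: a labeled support set plus held-out queries."""
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Sample without replacement so no query example leaks into the support set
        items = rng.sample(data_by_class[label], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]
        query += [(x, label) for x in items[k_shot:]]
    return support, query

# Hypothetical dataset: three classes, five example IDs each
data = {
    "zebra": ["z1", "z2", "z3", "z4", "z5"],
    "okapi": ["o1", "o2", "o3", "o4", "o5"],
    "horse": ["h1", "h2", "h3", "h4", "h5"],
}
support, query = sample_episode(data, n_way=2, k_shot=3, n_query=1, rng=random.Random(0))
print(len(support), "support examples,", len(query), "query examples")
```

Prototypes are then computed from `support` only, and accuracy is measured on `query`, so every episode is a miniature train/test split.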

Zero-shot learning needs a bridge between classes and a shared semantic space. Modern systems use text. CLIP (Multimodal AI) trains an image encoder and a text encoder contrastively so that an image and its caption land nearby. To classify into never-seen classes, you embed the candidate label strings (“a photo of a zebra”, “a photo of an okapi”) and pick the label whose text embedding is closest to the image. The model recognizes an okapi it never trained on because the word sits in a meaningful place in the joint space.

Zero-shot with CLIP in three lines. Hugging Face exposes exactly this as a pipeline — no training, just text prompts as the “classes”:

from transformers import pipeline
clf = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")
out = clf("okapi.jpg", candidate_labels=["a photo of a zebra",
                                         "a photo of an okapi",
                                         "a photo of a horse"])
print(out[0]["label"], round(out[0]["score"], 3))   # top label and score; expect the okapi prompt if okapi.jpg (a placeholder path) shows one

The labels are free-form strings you supply at inference, so you can classify into categories the model has never seen a labeled example of — the recognition rides entirely on where each word lands in CLIP’s joint image–text space.
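
Stripped of the encoders, the decision rule is a nearest-neighbor lookup by cosine similarity. The sketch below uses made-up 3-dimensional vectors as stand-ins for the image and text embeddings; real CLIP embeddings have hundreds of dimensions and come from the trained encoders.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors standing in for text-encoder outputs of each candidate label
text_embeddings = {
    "a photo of a zebra": [0.9, 0.1, 0.2],
    "a photo of an okapi": [0.5, 0.8, 0.1],
    "a photo of a horse": [0.7, 0.0, 0.7],
}
# Made-up vector standing in for the image-encoder output
image_embedding = [0.4, 0.9, 0.2]

scores = {label: cosine(image_embedding, vec) for label, vec in text_embeddings.items()}
prediction = max(scores, key=scores.get)
print(prediction, {k: round(v, 3) for k, v in scores.items()})
```

Adding a new class means adding one more string to the dictionary and embedding it; nothing is retrained.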

Warning

Few-shot benchmarks are easy to fool. If your pretraining data already contained the “novel” classes, you are measuring memorization, not generalization. Always check whether the test categories truly never appeared upstream — a common source of inflated zero-shot numbers.

39.4 — Continual / lifelong learning and catastrophic forgetting

Train a network on task A, then train it on task B, and something brutal happens: it forgets A almost entirely. This is catastrophic forgetting. The gradient steps for B overwrite the weights that encoded A, because standard training assumes all data is available at once (the i.i.d. assumption). Continual or lifelong learning is the effort to learn a stream of tasks over time without forgetting earlier ones.

The conflict is fundamental — the stability–plasticity dilemma. Too stable and the model can’t learn anything new; too plastic and it erases the past. Every method is a different balance point.

flowchart LR
  A[Learn Task A] --> B[Learn Task B]
  B --> C{Test on A}
  C -->|naive training| F[Forgotten ✗]
  C -->|continual method| R[Retained ✓]

Three families of defenses:

Family	Idea	Example
Regularization	Penalize changing weights important to old tasks	EWC, SI
Replay	Keep or generate old examples, mix into training	Experience replay
Architectural	Give each task its own parameters	Progressive nets, adapters

The cleanest is Elastic Weight Consolidation (EWC). After learning task A, estimate how important each weight was (via the Fisher information $F_i$ — roughly, how sharply the loss reacts to that weight). When learning B, add a quadratic penalty anchoring important weights near their old values $\theta^*_{A,i}$:

\[ \mathcal{L}_B(\theta) = \mathcal{L}_B^{\text{task}}(\theta) + \frac{\lambda}{2} \sum_i F_i \,(\theta_i - \theta^*_{A,i})^2 \]

In words: “Learn task B, but for each weight pay a fine proportional to how much it mattered to task A times how far you’ve dragged it from where A left it.” Important weights become expensive to move; unimportant ones move free. Also written: in vector form with a diagonal importance matrix, $\mathcal{L}_B(\theta) = \mathcal{L}_B^{\text{task}}(\theta) + \tfrac{\lambda}{2}\,(\theta - \theta^*_A)^\top \mathrm{diag}(F)\,(\theta - \theta^*_A)$ — a Mahalanobis-style pull toward A’s solution, weighted by Fisher information.

Unimportant weights ($F_i \approx 0$) move freely to learn B; important ones are held in place, preserving A. It’s like welding the load-bearing beams while remodeling the rest of the house.
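
Where do the $F_i$ come from? A common estimate is the empirical diagonal Fisher: the average of squared per-example gradients of the log-likelihood, evaluated at task A's solution. The sketch below assumes a two-weight logistic regression and a made-up task-A dataset in which the second feature is almost always near zero.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def diagonal_fisher(theta, examples):
    """Empirical diagonal Fisher: mean of squared per-example log-likelihood gradients."""
    fisher = [0.0] * len(theta)
    for x, y in examples:
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        # For logistic regression, d/d(theta_i) of log p(y | x) is (y - p) * x_i
        for i, xi in enumerate(x):
            fisher[i] += ((y - p) * xi) ** 2
    return [f / len(examples) for f in fisher]

# Hypothetical task-A data: feature 0 carries the signal, feature 1 is almost always ~0
examples = [([2.0, 0.01], 1), ([-1.5, 0.02], 0), ([1.8, -0.01], 1), ([-2.2, 0.0], 0)]
theta_A = [0.5, 0.5]          # assumed solution for task A

F = diagonal_fisher(theta_A, examples)
print([f"{f:.2e}" for f in F])
```

The weight attached to the informative feature gets a Fisher value several orders of magnitude larger, so EWC would hold it in place while leaving the other weight nearly free.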

A two-weight example shows the penalty steering the update. Say task A settled at $\theta^*_A = (1.0,\ 1.0)$ with Fisher importances $F = (10,\ 0.1)$ — weight 0 mattered a lot to A, weight 1 barely at all. Now task B alone would prefer to pull both weights to $(0,\ 0)$. With $\lambda = 1$, the EWC objective is $\mathcal{L}_B^{\text{task}}(\theta) + \tfrac12[\,10(\theta_0-1)^2 + 0.1(\theta_1-1)^2\,]$, and the Fisher-weighted penalty makes the important weight far stickier than the unimportant one.

import numpy as np
theta_A = np.array([1.0, 1.0])   # where task A landed
F       = np.array([10.0, 0.1])  # Fisher importance per weight
lam     = 1.0
theta   = theta_A.copy()         # start from A's solution
# task B "wants" both weights at 0 -> task gradient = (theta - 0)
for _ in range(200):
    g_task = (theta - 0.0)                 # pull toward B's optimum (0,0)
    g_ewc  = lam * F * (theta - theta_A)   # pull back toward A's weights
    theta -= 0.05 * (g_task + g_ewc)
print(np.round(theta, 3))   # -> [0.909, 0.091]: w0 stays near 1, w1 caves to 0

The important weight settles near $0.91$ (held by A), while the unimportant one collapses to $\approx 0.09$ (free to serve B). Setting the gradient to zero in this example, where $\theta^*_{A,i} = 1$ and B’s optimum is $0$, gives $\theta_i + \lambda F_i(\theta_i - 1) = 0$, so each coordinate balances at $\theta_i = \frac{\lambda F_i}{1+\lambda F_i}$: a large $F_i$ pins the weight near its old value and a tiny $F_i$ lets it move, which is exactly the selective rigidity EWC promises.

Tip

Even a tiny replay buffer — keeping just a few percent of old examples and reshuffling them in — is often a stronger and simpler baseline than elaborate regularization. When unsure, try replay first.
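
A two-weight linear model is enough to show both the forgetting and the replay fix. In this sketch each task is a single made-up training example, and the replay "buffer" is just task A's one example interleaved with every task-B step, a much higher replay ratio than the few percent mentioned above, chosen to keep the toy short.

```python
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def loss(w, x, y):
    return (predict(w, x) - y) ** 2

def sgd_step(w, x, y, lr=0.1):
    """One squared-error gradient step for the linear model y_hat = w . x."""
    err = predict(w, x) - y
    return [wi - lr * 2.0 * err * xi for wi, xi in zip(w, x)]

task_a = ([1.0, 1.0], 2.0)     # satisfied whenever w0 + w1 = 2
task_b = ([1.0, 0.0], 3.0)     # satisfied whenever w0 = 3

# Phase 1: learn task A from scratch
w = [0.0, 0.0]
for _ in range(200):
    w = sgd_step(w, *task_a)
w_after_a = list(w)

# Phase 2, naive: train on task B only
w_naive = list(w_after_a)
for _ in range(500):
    w_naive = sgd_step(w_naive, *task_b)

# Phase 2, replay: interleave the stored task-A example with each task-B step
w_replay = list(w_after_a)
for _ in range(500):
    w_replay = sgd_step(w_replay, *task_b)
    w_replay = sgd_step(w_replay, *task_a)

print("naive : loss A =", round(loss(w_naive, *task_a), 4), " loss B =", round(loss(w_naive, *task_b), 4))
print("replay: loss A =", round(loss(w_replay, *task_a), 4), " loss B =", round(loss(w_replay, *task_b), 4))
```

A solution that satisfies both tasks exists ($w = (3, -1)$), but naive training never looks for it: it fixes task B by moving $w_0$ and leaves task A broken. Replaying the old example steers the weights to the joint solution.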

39.5 — Federated and privacy-preserving learning

Sometimes the data can’t move. Hospital records, phone keyboards, bank transactions — privacy, regulation, or sheer size keep it on-device. Federated learning (FL) flips the usual setup: instead of bringing data to the model, you bring the model to the data. Each client trains locally on its own private data and sends back only the model updates; a server averages them into a shared global model. Raw data never leaves the device.

The standard algorithm is FedAvg. The server broadcasts the current weights; each client runs a few local steps; the server averages the returned weights, weighted by how much data each client has.

\[ \theta^{t+1} = \sum_{k=1}^{K} \frac{n_k}{n}\, \theta_k^{t+1}, \qquad n = \sum_k n_k \]

In words: “The new global model is a head-count average of every client’s locally-trained model — clients with more data get a proportionally bigger vote.” Also written: equivalently in update form, $\theta^{t+1} = \theta^t + \sum_k \frac{n_k}{n}\,\Delta_k$ where $\Delta_k = \theta_k^{t+1} - \theta^t$ is client $k$’s local change — averaging weights and averaging updates are the same thing.

flowchart TB
  S[Server: global model θ] -->|broadcast| C1[Client 1 trains locally]
  S -->|broadcast| C2[Client 2 trains locally]
  S -->|broadcast| C3[Client 3 trains locally]
  C1 -->|Δθ₁| AGG[Weighted average]
  C2 -->|Δθ₂| AGG
  C3 -->|Δθ₃| AGG
  AGG -->|new θ| S

import numpy as np
# 3 clients, different data sizes; average their local weights by n_k
w = [np.array([1.0, 2.0]), np.array([1.5, 1.0]), np.array([0.5, 3.0])]
n = np.array([100, 50, 25])                 # examples per client
global_w = sum(nk*wk for nk, wk in zip(n, w)) / n.sum()
print(global_w)   # data-weighted FedAvg update
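
The snippet above shows only the averaging step. A full round also includes the broadcast and the local training, which might look like the sketch below for a one-parameter model $y = \theta x$. The client datasets are invented, and the learning rate and step counts are arbitrary choices.

```python
def local_train(theta, data, lr=0.1, steps=5):
    """A few local gradient steps on mean squared error for the model y = theta * x."""
    for _ in range(steps):
        grad = sum(2.0 * (theta * x - y) * x for x, y in data) / len(data)
        theta -= lr * grad
    return theta

def fedavg_round(theta, clients):
    """Broadcast theta, train locally on each client, return the data-weighted average."""
    n_total = sum(len(d) for d in clients)
    local = [local_train(theta, d) for d in clients]      # raw data never leaves a client
    return sum(len(d) / n_total * t for d, t in zip(clients, local))

# Hypothetical private datasets, all roughly following y = 3x
clients = [
    [(1.0, 3.1), (2.0, 5.9), (1.5, 4.4)],
    [(0.5, 1.6), (1.0, 2.9)],
    [(2.0, 6.2)],
]

theta = 0.0
for _ in range(20):
    theta = fedavg_round(theta, clients)
print(round(theta, 3))
```

The server only ever touches the returned `theta` values, never the `(x, y)` pairs, and the global model still settles close to the slope all three clients share.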

But updates can still leak. Gradients sometimes allow partial reconstruction of the training data. Two tools harden FL further. Differential privacy (DP) adds calibrated noise so that no single individual’s data measurably changes the output — formally, an algorithm is $(\varepsilon, \delta)$-DP if for any two datasets differing in one record, $\Pr[\mathcal{A}(D) \in S] \le e^{\varepsilon}\Pr[\mathcal{A}(D') \in S] + \delta$. Smaller $\varepsilon$ means stronger privacy and noisier results — a direct privacy/accuracy tradeoff. Secure aggregation uses cryptography so the server sees only the sum of updates, never any individual client’s.

In words (the DP definition): “Whether or not any one person is in the dataset, the odds of any outcome change by at most a small factor $e^{\varepsilon}$ (plus a tiny slack $\delta$) — so no result can betray that you were in it.” Also written: as a bounded likelihood ratio, $\frac{\Pr[\mathcal{A}(D)\in S] - \delta}{\Pr[\mathcal{A}(D')\in S]} \le e^{\varepsilon}$ for all neighboring $D, D'$ and all outcome sets $S$.

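A pocket calculation makes the privacy/accuracy knob concrete. The standard Gaussian-mechanism recipe adds noise with standard deviation $\sigma \propto \Delta/\varepsilon$, where $\Delta$ is the sensitivity (the full formula also carries a factor that depends on $\delta$, omitted here for simplicity). Halving $\varepsilon$ (stronger privacy) doubles the noise, so privacy and signal trade off directly: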

import numpy as np
true_update = np.array([0.40, -0.20])     # the gradient we'd send in the clear
sensitivity = 1.0                          # how much one record can shift it
for eps in [8.0, 1.0, 0.25]:               # weaker -> stronger privacy
    sigma = sensitivity / eps              # noise scale grows as eps shrinks
    noisy = true_update + np.random.default_rng(0).normal(0, sigma, size=2)
    print(f"eps={eps:<4}  sigma={sigma:<4}  noisy={np.round(noisy, 3)}")
# small eps = strong privacy = large sigma = the real signal drowns in noise

At $\varepsilon = 8$ the noise barely perturbs the update; at $\varepsilon = 0.25$ it swamps it. There is no free lunch — every bit of guaranteed privacy is paid for in added noise, which is why DP deployments tune $\varepsilon$ to the weakest privacy the use case can ethically accept.
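
The calculation above takes the sensitivity as given. In practice it is usually enforced by clipping each update to a maximum norm before the noise is added. The sketch below shows that clip-then-noise pattern; `noise_multiplier` is an assumed knob, and translating it into a concrete $(\varepsilon, \delta)$ requires a privacy accountant that is not shown here.

```python
import math
import random

def clip(update, max_norm):
    """Scale the update down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(u * u for u in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [u * scale for u in update]

def privatize(update, max_norm, noise_multiplier, rng):
    """Clip to bound one client's influence, then add Gaussian noise scaled to that bound."""
    clipped = clip(update, max_norm)
    sigma = noise_multiplier * max_norm
    return [u + rng.gauss(0.0, sigma) for u in clipped]

rng = random.Random(0)
big_update = [3.0, 4.0]                        # norm 5, will be clipped to norm 1
clipped = clip(big_update, max_norm=1.0)
noisy = privatize(big_update, max_norm=1.0, noise_multiplier=1.0, rng=rng)
print([round(c, 3) for c in clipped], [round(v, 3) for v in noisy])
```

Clipping is what makes the sensitivity a known quantity: no matter how unusual one client's data is, its contribution cannot exceed `max_norm`, so the noise can be calibrated against that bound.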

[Animation: a clean signal (the true update) versus the jitter that differential privacy adds on top; as privacy is turned up, the wobble grows.]

Warning

FL is not automatically private. Sending gradients instead of data feels safe but is not — gradient-inversion attacks can recover images from updates. Privacy guarantees come from DP and secure aggregation layered on top of FL, not from FL alone.
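
The core trick behind secure aggregation can be shown with pairwise masks that cancel in the sum. This is a toy sketch only: a seeded `random.Random` stands in for secrets that each pair of clients would agree on through a key exchange, updates are assumed to be integers (fixed-point encoded), and dropout handling is ignored.

```python
import random

MOD = 2 ** 32     # all arithmetic is modular, as in fixed-point secure aggregation schemes

def mask_updates(updates, rng):
    """Each pair (i, j) shares a random mask: client i adds it, client j subtracts it."""
    masked = list(updates)
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            m = rng.randrange(MOD)            # stand-in for a secret shared by clients i and j
            masked[i] = (masked[i] + m) % MOD
            masked[j] = (masked[j] - m) % MOD
    return masked

updates = [12, 7, 30]                          # hypothetical integer-encoded client updates
masked = mask_updates(updates, random.Random(0))

print("what the server sees:", masked)
print("sum of masked values:", sum(masked) % MOD, " true sum:", sum(updates))
```

Each masked value looks like random noise on its own, yet the masks cancel pairwise, so the server recovers exactly the sum it needs for averaging and nothing about any single client.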

39.6 — AutoML and neural architecture search

Designing a model is full of choices: which architecture, how many layers, which learning rate, what regularization. AutoML automates these choices so that good models can be built without an expert hand-tuning every knob. At its simplest, AutoML is automated hyperparameter optimization (Model Evaluation & Tuning) plus automated feature engineering and model selection — searching the space of pipelines for the best validation score.

The ambitious end is Neural Architecture Search (NAS): letting an algorithm design the network itself. You define a search space of possible building blocks and connections, a search strategy to explore it, and a way to estimate how good each candidate is.

flowchart LR
  SP[Search space: ops & connections] --> ST[Search strategy]
  ST --> EV[Estimate performance]
  EV -->|feedback| ST
  ST --> BEST[Best architecture]

Early NAS trained thousands of networks with reinforcement learning and cost thousands of GPU-days — a result so expensive it was mostly a demonstration of what’s possible. The breakthrough was making the search differentiable. DARTS relaxes the discrete choice “which operation goes on this edge?” into a continuous mixture: each edge computes a softmax-weighted blend of all candidate ops, with weights $\alpha$ you can train by gradient descent.

\[ \bar{o}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o)}{\sum_{o'} \exp(\alpha_{o'})} \, o(x) \]

In words: “Instead of committing to one operation on this edge, run all of them and take a learnable weighted average — then let gradient descent decide which weight should win.” Also written: with mixing weights $w = \text{softmax}(\alpha)$, simply $\bar o(x) = \sum_o w_o\, o(x) = w^\top [\,o_1(x), \dots, o_{|\mathcal O|}(x)\,]$ — a convex combination of the candidate ops’ outputs.

A three-operation example makes the mixture tangible. Suppose an edge can be a $3\times3$ convolution, a skip-connection, or a max-pool, with current architecture parameters $\alpha = (2.0,\ 1.0,\ 0.0)$. The softmax turns these raw scores into mixing weights, and the edge’s output is the weighted blend of all three operations — fully differentiable, so gradient descent can nudge $\alpha$ toward whichever op lowers validation loss.

import numpy as np
alpha = np.array([2.0, 1.0, 0.0])          # raw scores: conv3x3, skip, maxpool
w = np.exp(alpha) / np.exp(alpha).sum()     # softmax -> mixing weights
print(np.round(w, 3))   # -> [0.665, 0.245, 0.090]: conv dominates the blend
# edge output = w[0]*conv(x) + w[1]*skip(x) + w[2]*maxpool(x)

The convolution gets $\approx 67\%$ of the mixture, skip $24\%$, pool $9\%$. After training, you discretize: keep only the highest-weight operation (here the conv) on each edge. This turns an intractable discrete search into ordinary gradient descent, cutting the cost from thousands of GPU-days to a handful.
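
To see gradient descent actually move the architecture weights, the sketch below runs plain gradient steps on $\alpha$ for a single edge. It is a deliberately tiny stand-in: each op's output is a made-up scalar, there is one validation target, and the ops themselves have no trainable weights (real DARTS alternates between updating op weights and $\alpha$).

```python
import math

def softmax(alpha):
    m = max(alpha)
    exps = [math.exp(a - m) for a in alpha]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scalar outputs of the three candidate ops on one validation input
op_outputs = [1.0, 0.2, -0.5]      # conv3x3, skip, maxpool
target = 1.0                        # in this toy, the conv output happens to be right

def blend(alpha):
    return sum(w * o for w, o in zip(softmax(alpha), op_outputs))

def mixture_loss(alpha):
    return (blend(alpha) - target) ** 2

def alpha_gradient(alpha):
    w, b = softmax(alpha), blend(alpha)
    # Chain rule through the softmax: d(blend)/d(alpha_j) = w_j * (o_j - blend)
    return [2.0 * (b - target) * wj * (oj - b) for wj, oj in zip(w, op_outputs)]

alpha = [0.0, 0.0, 0.0]             # start with a uniform mixture
loss_before = mixture_loss(alpha)
for _ in range(100):
    grad = alpha_gradient(alpha)
    alpha = [a - 0.1 * g for a, g in zip(alpha, grad)]

weights = softmax(alpha)
chosen = max(range(len(weights)), key=lambda j: weights[j])   # discretize: keep the top op
print([round(w, 3) for w in weights], "-> keep op", chosen)
```

The weight on the helpful op grows and the others shrink, and the final argmax is the discretization step described above.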

The pragmatic AutoML most people actually run. Full NAS is rare in practice; automated hyperparameter search is everywhere. Optuna does it in a few lines by sampling configurations and scoring each one (it can also prune unpromising trials early, which this minimal example does not use):

import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)

def objective(trial):
    n  = trial.suggest_int("n_estimators", 50, 400)
    d  = trial.suggest_int("max_depth", 2, 16)
    mf = trial.suggest_float("max_features", 0.3, 1.0)
    clf = RandomForestClassifier(n_estimators=n, max_depth=d, max_features=mf)
    return cross_val_score(clf, X, y, cv=5).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=30)
print(study.best_params, round(study.best_value, 3))

This is the AutoML most teams actually use: a search strategy (Optuna’s default TPE sampler) exploring a defined space and scoring each candidate by cross-validation. It is the same three-part loop as NAS, just over hyperparameters instead of wiring.

Warning

NAS notoriously overfits to its benchmark. An architecture tuned to squeeze the last 0.1% on CIFAR-10 often fails to transfer. Strong hand-designed baselines (a well-tuned ResNet or transformer) remain hard to beat, and a random search over a good space is a shockingly strong NAS baseline — always include it.
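
A random-search baseline takes only a few lines, which is part of why it is worth always including. The sketch below uses an invented search space and an invented `proxy_score` function in place of "train the candidate briefly and measure validation accuracy".

```python
import random

# Hypothetical search space: one choice per architectural decision
SEARCH_SPACE = {
    "op": ["conv3x3", "skip", "maxpool"],
    "depth": [2, 4, 8],
    "width": [16, 32, 64],
}

def sample_architecture(rng):
    return {name: rng.choice(options) for name, options in SEARCH_SPACE.items()}

def proxy_score(arch):
    """Made-up stand-in for a real performance estimate (e.g. short training run)."""
    bonus = {"conv3x3": 0.05, "skip": 0.02, "maxpool": 0.0}[arch["op"]]
    return 0.7 + bonus + 0.01 * arch["depth"] - 0.0005 * abs(arch["width"] - 32)

def random_search(n_trials, seed=0):
    rng = random.Random(seed)
    trials = [sample_architecture(rng) for _ in range(n_trials)]
    return max(trials, key=proxy_score), trials

best, trials = random_search(n_trials=20)
print(best, round(proxy_score(best), 4))
```

Any fancier search strategy should be compared against this under the same trial budget; if it cannot beat uniform sampling, the search space is doing the work, not the strategy.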

39.7 — Parameter-efficient fine-tuning (LoRA, adapters, prompts)

Fine-tuning a foundation model the old way means updating all its weights — for a 70-billion-parameter model that is a copy of all 70 billion weights per task, plus optimizer state on top. Imagine being told that to learn a new accent you must re-grow your entire brain. Parameter-efficient fine-tuning (PEFT) is the saner alternative: freeze the giant pretrained model and train only a tiny set of new parameters — often well under 1% of the total — that steer it toward the new task. One frozen base, many small swappable patches.

The dominant method is LoRA (Low-Rank Adaptation). The insight: the change a task needs is low-rank. Rather than learning a full weight update $\Delta W$ (huge — a $d \times d$ matrix), you learn it as a product of two skinny matrices $B$ ($d \times r$) and $A$ ($r \times d$) with a small rank $r$ like 8 or 16. The model uses $W + \Delta W$, but only $A$ and $B$ are trained:

\[ W' = W + \Delta W = W + \frac{\alpha}{r}\, B A, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times d},\; r \ll d \]

In words: “Keep the frozen weight $W$, and add a small correction built from two thin matrices (a down-projection to $r$ dimensions and an up-projection back), scaled by $\alpha/r$. Only those two thin matrices learn.” Also written: counting parameters, the full update has $d^2$ entries but the LoRA factors have only $2dr$; for $d = 4096, r = 8$ that is $65{,}536$ vs $\approx 16.8$ million, a $256\times$ reduction per layer.

A tiny numeric check shows where the savings come from — a full $d\times d$ update versus its rank-$r$ factorization:

import numpy as np
d, r = 4096, 8
full   = d * d                 # parameters in a dense weight update ΔW
lora   = 2 * d * r             # parameters in B (d×r) and A (r×d)
print(full, lora, round(full / lora, 1))   # 16,777,216  65,536  256.0x fewer

# the update IS a real matrix, just built cheaply from two thin ones:
B = np.random.randn(d, r) * 0.01
A = np.random.randn(r, d) * 0.01
dW = (1.0 / r) * B @ A          # full-size ΔW, rank ≤ r, stored as B,A
print(dW.shape, np.linalg.matrix_rank(dW))   # (4096, 4096), rank 8

The reconstructed $\Delta W$ is a full-sized matrix — so at inference you can fold it into $W$ and pay zero extra latency — yet it was stored and trained as two pencil-thin matrices.
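
The folding claim can be checked directly on a toy layer: running the frozen weight and the low-rank branch side by side gives the same output as one merged matrix. The sketch below uses hand-picked numbers with $d = 3$ and $r = 1$; the helper names are illustrative.

```python
def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(P, Q):
    cols = list(zip(*Q))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in P]

# Hypothetical frozen weight (d x d) and LoRA factors with d = 3, r = 1
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0],
     [0.5, 0.5, 1.0]]
B = [[0.1], [0.2], [-0.1]]        # d x r, trainable
A = [[1.0, -1.0, 0.5]]            # r x d, trainable
lora_alpha, r = 2.0, 1
scale = lora_alpha / r

def lora_forward(x):
    """Training-time path: frozen W plus the scaled low-rank branch B(Ax)."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + scale * l for b, l in zip(base, low_rank)]

# Deployment-time path: fold the update into one matrix, W' = W + (alpha / r) * B A
BA = matmul(B, A)
W_merged = [[w + scale * d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, BA)]

x = [1.0, 2.0, 3.0]
print([round(v, 6) for v in lora_forward(x)])
print([round(v, 6) for v in matvec(W_merged, x)])
```

Both paths print the same vector, so the adapter can be trained as a side branch and then merged away, or kept separate when you want to hot-swap adapters on a shared base.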

LoRA is one of a family. Adapters insert tiny trainable bottleneck layers into otherwise frozen transformer blocks. Prompt / prefix tuning prepends a handful of learned “virtual token” vectors to the input and trains only those, leaving the model itself untouched. QLoRA combines LoRA with a 4-bit quantized frozen base, letting you fine-tune a 65B model on a single 48 GB GPU. The common thread: most of the knowledge already lives in the frozen base; a task needs only a small nudge.

In a real framework. Hugging Face’s PEFT library wraps any model in a LoRA config in a few lines:

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")       # frozen backbone
cfg  = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], lora_dropout=0.05)
model = get_peft_model(base, cfg)
model.print_trainable_parameters()
# -> trainable params: ~0.3M || all params: ~124M || trainable%: ~0.24

Only ~0.24% of the parameters carry gradients; the rest stay frozen. Because each adapter is a few megabytes, you can keep dozens of task-specific LoRAs and hot-swap them onto one shared base — the practical reason PEFT now dominates fine-tuning.

Tip

The rank $r$ is the main dial: bigger $r$ = more capacity to adapt but more parameters and more overfitting risk on small data. Start at $r = 8$–$16$, and remember the scaling factor $\alpha/r$ effectively controls how strongly the adapter speaks over the frozen base.

39.8 — Neurosymbolic AI

Neural networks are brilliant at perception and terrible at guarantees. They recognize a cat instantly but can’t reliably tell you that 17 is prime or that “all men are mortal, Socrates is a man” entails Socrates is mortal. Symbolic systems (Search & Problem Solving through Planning, CSP & Game Playing) are the mirror image: rigorous logic and exact reasoning, but brittle and unable to handle raw pixels. Neurosymbolic AI tries to fuse the two — neural perception feeding symbolic reasoning — to get the best of both.

The recurring pattern: a neural network turns messy input into discrete symbols, and a symbolic engine reasons over those symbols with rules and constraints.

flowchart LR
  IMG[Raw input: pixels, text] --> NN[Neural perception]
  NN --> SYM[Symbols / predicates]
  SYM --> REASON[Symbolic reasoning: rules, logic, solver]
  REASON --> OUT[Answer with provable structure]

A concrete example is visual question answering over a scene. The question “How many red cubes are left of the sphere?” is hopeless to answer end-to-end with raw regression, but easy if you decompose it: a neural net detects objects and attributes, then a symbolic program executes the query exactly. Below, the neural perception step has output a small symbolic table of objects, and the symbolic program filters and counts over it — deterministically, with no learning involved in the counting itself.

# Neural net's output: a symbolic scene table (color, shape, x-position)
scene = [
    {"color": "red",  "shape": "cube",   "x": 1},
    {"color": "blue", "shape": "sphere", "x": 5},
    {"color": "red",  "shape": "cube",   "x": 3},
    {"color": "red",  "shape": "sphere", "x": 7},
]
sphere_x = next(o["x"] for o in scene if o["shape"] == "sphere")   # = 5
# count(filter(red, cube, left_of(sphere)))
ans = sum(1 for o in scene
          if o["color"] == "red" and o["shape"] == "cube" and o["x"] < sphere_x)
print(ans)   # -> 2: the two red cubes at x=1 and x=3 are left of the sphere

The neural part handles perception (turning pixels into that table); the symbolic part handles counting and spatial logic exactly. The payoff is data efficiency, interpretability, and the ability to enforce hard constraints: the symbolic stage can be guaranteed never to violate a rule, which pure neural nets cannot promise, although a perception error upstream still feeds it wrong symbols. The challenge is making the symbolic step differentiable enough to train end-to-end, an area where work like DeepProbLog (probabilistic logic with neural predicates) and differentiable theorem provers is active.

Tip

The modern, lightweight version of neurosymbolic AI is an LLM that writes code or calls a tool: the model reasons in language, then offloads exact computation to a Python interpreter or calculator. Same division of labor — neural for the fuzzy parts, symbolic for the exact ones.

39.9 — World models and model-based agents

An agent that learns purely by trial and error in the real world is dangerous and slow — every mistake costs a real crash. Humans avoid this by imagining: we run a mental simulation of “if I do this, what happens next?” before acting. A world model gives an agent that ability — a learned internal model of environment dynamics that predicts the next state (and reward) given the current state and an action.

This is the heart of model-based reinforcement learning (Reinforcement Learning). A model-free agent learns a policy directly from experience; a model-based agent first learns the dynamics, then plans or trains a policy inside its own learned simulator — “dreaming” thousands of rollouts cheaply instead of acting them out for real.

flowchart LR
  E[Real environment] -->|few interactions| WM[Learn world model: s,a → s′,r]
  WM --> IM[Imagine rollouts internally]
  IM --> POL[Train / plan policy in imagination]
  POL -->|act| E
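The first arrow in the diagram, learning the dynamics from a few real interactions, can be sketched with ordinary least squares. The environment `real_step`, its noise level, and the linear model form are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def real_step(s, a):
    # Hypothetical "real environment" standing in for the robot or game.
    return s + a + rng.normal(scale=0.01)

# Collect a few real transitions (s, a, s').
S = rng.uniform(-3, 3, size=50)
A = rng.uniform(-1, 1, size=50)
S_next = np.array([real_step(s, a) for s, a in zip(S, A)])

# Fit a linear world model s' ~ w_s * s + w_a * a by least squares.
X = np.column_stack([S, A])
w, *_ = np.linalg.lstsq(X, S_next, rcond=None)
print(np.round(w, 2))   # both coefficients come out close to 1
```

Fifty transitions are enough here only because the toy dynamics are linear and nearly noise-free; real environments need far richer models.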

A tiny example shows planning inside a learned model. Suppose the agent has learned a one-dimensional dynamics model $s' = s + a$ with reward $r = -(s')^2$ (it wants to reach state 0), starting at $s = 3$. Instead of acting in the real world, it imagines two candidate action sequences and picks the higher-reward one entirely in its head — no real interaction spent.

def imagine(s, actions):           # roll out the LEARNED model, sum rewards
    total = 0.0
    for a in actions:
        s = s + a                   # learned dynamics: s' = s + a
        total += -(s**2)            # learned reward: closer to 0 is better
    return total

s0 = 3.0
planA = [-1.0, -1.0, -1.0]          # ease toward 0
planB = [-3.0,  0.0,  0.0]          # jump straight to 0
print(round(imagine(s0, planA), 1), round(imagine(s0, planB), 1))  # -5.0  0.0
# planB scores higher (0 > -5) -> agent commits to planB, having spent ZERO real steps

Plan B reaches state 0 immediately and scores $0$ versus Plan A’s $-5$, so the agent commits to Plan B, all decided in imagination. The Dreamer line of work scales this idea with a compact latent world model: imagined rollouts step forward a low-dimensional code that captures the dynamics rather than raw pixels, and the policy is trained entirely on those imagined latent trajectories. The payoff is sample efficiency: because most learning happens in the cheap internal simulator, the agent needs far fewer real interactions. This matters enormously for robotics, where real interactions are slow and break hardware.

Warning

A world model is only as good as its predictions. Agents are expert at finding and exploiting the model’s errors — a planner will happily march toward a hallucinated high-reward state the model wrongly predicts. Model exploitation is the central failure mode; honest uncertainty estimates and short planning horizons are the usual guards.
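A rough illustration of why short horizons help: even a small per-step model error compounds over a rollout. The 5% gain error below is an assumed number, and the one-dimensional dynamics are the toy ones used earlier in this section.

```python
def rollout(s, actions, gain):
    # gain = 1.0 is the true dynamics s' = s + a; other values model an error
    for a in actions:
        s = s + gain * a
    return s

s0 = 0.0
for horizon in [1, 5, 20]:
    actions = [1.0] * horizon
    true_s = rollout(s0, actions, gain=1.0)
    model_s = rollout(s0, actions, gain=1.05)   # assumed 5% model error
    print(horizon, round(abs(model_s - true_s), 2))
# prints 0.05, 0.25, 1.0: the gap grows with the planning horizon
```

In this linear toy the error grows proportionally to the horizon; with nonlinear learned models it can grow much faster.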

39.10 — Embodied AI and robotics learning

Most ML lives in a disembodied world of static datasets. Embodied AI studies agents with a body — a robot arm, a legged robot, a navigating drone — that must perceive, act, and learn through physical interaction. The defining difference is the closed perception–action loop: the agent’s actions change what it sees next, so data is not given but generated by its own behavior.

Physical learning is hard for reasons software agents never face. Real robots are slow (one trajectory takes seconds, not microseconds), fragile (failures break hardware), and noisy (no two motors behave identically). The dominant strategy is sim-to-real transfer: train in a fast physics simulator where you can run millions of trials safely, then deploy on the real robot.

flowchart LR
  SIM[Fast simulator] -->|millions of trials| POL[Learned policy]
  POL --> DR[Domain randomization]
  DR -->|robust policy| REAL[Real robot]
  REAL -.fine-tune.-> POL

The gap between simulation and reality — the reality gap — is the core obstacle. The key trick is domain randomization: during training, randomly vary the simulator’s physics (friction, masses, lighting, sensor delay) across a wide range. A policy forced to succeed under all those variations treats the real world as just one more variation it can handle.

A small example shows why training across a range beats training on a single guessed value. Suppose the true (unknown) friction is $0.30$. A policy tuned to a single assumed friction performs worst when its assumption is far from reality; a policy trained across the whole range $[0.1, 0.5]$ pays a small, bounded cost everywhere — including at the true value.

import numpy as np
true_mu = 0.30
# error of a policy that assumed a single friction value, tested at true_mu
def err_single(assumed):  return abs(assumed - true_mu)
# error of a policy trained across the whole range: ~avg distance to the range
def err_dr(lo, hi):       return np.mean([abs(m - true_mu) for m in np.linspace(lo, hi, 9)])
print(round(err_single(0.10), 3),   # 0.2    unlucky single guess -> big error
      round(err_single(0.45), 3),   # 0.15   another unlucky guess
      round(err_dr(0.1, 0.5), 3))   # 0.111  randomized: lower average error, no bad luck

The single-value policies gamble on guessing right and lose badly when wrong ($0.20$, $0.15$); the domain-randomized policy carries a smaller average error (about $0.11$) because it never depended on any one precise value. By refusing to let the policy hinge on a number it cannot know, you make it robust to that number being unknown. A newer thread is robot foundation models (e.g. vision-language-action models): large policies pretrained across many robots and tasks that map camera images plus a language instruction directly to motor actions, bringing the foundation-model recipe to the physical world.

Tip

Domain randomization embodies a deep principle: when you don’t know a parameter’s true value, don’t guess it — train across its whole plausible range. Robustness to a distribution beats precision on a wrong point estimate.
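In practice the principle usually shows up as a sampler that draws fresh simulator parameters at the start of every episode. The sketch below is a minimal version; the parameter names and ranges in `RANGES` are hypothetical and would normally come from measurements of the real robot.

```python
import numpy as np

# Hypothetical parameter ranges; a real project would set these from
# measurements of the robot and its environment.
RANGES = {
    "friction":       (0.1, 0.5),
    "mass_scale":     (0.8, 1.2),
    "sensor_delay_s": (0.00, 0.04),
}

def sample_sim_params(rng):
    # Draw one simulator configuration uniformly from each range.
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}

rng = np.random.default_rng(0)
for episode in range(3):
    params = sample_sim_params(rng)
    # A training loop would reset the simulator with `params` here.
    print(episode, {k: round(float(v), 3) for k, v in params.items()})
```

Uniform sampling is the simplest choice; wider ranges buy robustness at the cost of a harder training problem.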

39.11 — Reasoning models and test-time compute

For years the recipe for better models was “make them bigger.” A newer axis has opened: spend more compute at inference time, letting the model think longer before answering. The shift is from scaling training to scaling test-time compute — and it has produced a class of reasoning models that dramatically outperform their size on math, code, and logic.

The seed idea is chain-of-thought (CoT): prompt the model to “think step by step,” producing intermediate reasoning before the final answer. Generating the reasoning explicitly lets the model break a hard problem into easy steps, much as showing your work helps a student avoid arithmetic slips. Reasoning models internalize this — they are trained (often with reinforcement learning on verifiable answers) to generate long, self-correcting chains of thought as a matter of course.

From there, you can spend test-time compute in richer ways:

flowchart TB
  Q[Question] --> CoT[Single chain of thought]
  Q --> SC[Self-consistency: sample many chains, vote]
  Q --> ToT[Tree of thoughts: branch, evaluate, backtrack]
  CoT --> A[Answer]
  SC --> A
  ToT --> A

Self-consistency samples many independent chains and takes a majority vote over the final answers — different reasoning paths that agree are more likely correct. The mechanism is simple enough to show directly: imagine five sampled chains that, through different routes, land on the final answers below. Three say $12$, one says $11$, one says $9$; the majority vote returns $12$, discarding the two unlucky chains that slipped.

from collections import Counter
# final answers from 5 independently sampled chains of thought
answers = [12, 11, 12, 9, 12]
vote = Counter(answers).most_common(1)[0][0]
print(vote)   # -> 12: majority agrees, outvoting the two stray chains

Why does voting help so reliably? If each chain is independently correct with probability $p > 0.5$, the chance the majority of $n$ chains is correct climbs toward 1 as $n$ grows — the same statistics that make a panel of mediocre judges beat one expert. The probability the majority is right is $\sum_{k > n/2} \binom{n}{k} p^k (1-p)^{n-k}$.

In words: “Add up the chance that more than half the chains happen to be correct — and if each chain is even slightly better than a coin flip, that sum rushes toward certainty as you sample more chains.” Also written: $\Pr[\text{majority correct}] = \Pr[\,\text{Binomial}(n, p) > n/2\,]$ — the upper tail of a binomial distribution.

from math import comb
def majority_correct(n, p):                 # prob the majority vote is right
    return sum(comb(n, k) * p**k * (1-p)**(n-k) for k in range(n//2 + 1, n + 1))
for n in [1, 5, 21]:
    print(n, round(majority_correct(n, 0.6), 3))   # 0.6 -> 0.683 -> 0.826
# each chain only 60% reliable, yet 21-chain vote is ~83% correct

Even when any single chain is only moderately reliable, agreement across independent chains concentrates probability on the correct answer — which is why self-consistency reliably lifts accuracy at the cost of generating several chains instead of one. Tree-of-thoughts goes further, exploring a branching tree of partial reasoning steps with evaluation and backtracking, turning inference into a search (echoing Chapter 32). The common thread: a compute–accuracy tradeoff at inference. More thinking — more samples, deeper search — buys more accuracy, and for the first time you can dial that knob after the model is trained.
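One way to read the tradeoff is to ask how many chains a target accuracy costs. The sketch below reuses the binomial formula and searches for the smallest odd number of chains that reaches the target; the helper name `chains_needed` and the 90% target are assumptions, and the independence of chains is an idealization that real samples only approximate.

```python
from math import comb

def majority_correct(n, p):
    # probability that more than half of n independent chains are right
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

def chains_needed(p, target, max_n=999):
    # smallest odd n whose majority vote reaches the target accuracy
    for n in range(1, max_n + 1, 2):
        if majority_correct(n, p) >= target:
            return n
    return None   # target not reachable within max_n chains

for p in [0.6, 0.7, 0.8]:
    print(p, chains_needed(p, target=0.9))
```

The weaker each individual chain, the more samples the same accuracy costs, which is the compute knob in concrete form.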

Warning

A written chain of thought is not a faithful window into the model’s actual computation — models can reach the right answer for reasons their stated reasoning doesn’t reflect, and can produce plausible-but-wrong rationalizations. Treat CoT as a performance technique, not a guaranteed explanation (see Explainable AI & Interpretability).

39.12 — Open questions on the road toward general intelligence

Artificial general intelligence (AGI), a system that learns and reasons across the full breadth of tasks a human can rather than excelling at narrow ones, has no agreed definition and has not been built. Today’s frontier models are astonishingly broad yet still fail in ways that reveal how far we are from general intelligence. It’s worth naming the open problems honestly rather than repeating the hype.

flowchart LR
  ROOT[Open problems] --> R[Robustness]
  ROOT --> RE[Reasoning]
  ROOT --> G[Grounding]
  ROOT --> EF[Efficiency]
  ROOT --> AL[Alignment]
  R --> R1[Out-of-distribution failure]
  R --> R2[Adversarial brittleness]
  RE --> RE1[Reliable multi-step logic]
  RE --> RE2[True causality, not correlation]
  G --> G1[Connecting symbols to the world]
  G --> G2[Learning from interaction]
  EF --> EF1[Human-level data efficiency]
  EF --> EF2[Energy cost]
  AL --> AL1[Specifying what we want]
  AL --> AL2[Safety at scale]

Several stand out. Generalization beyond the training distribution: models still break on inputs unlike anything they’ve seen, where humans adapt gracefully. Genuine reasoning and causality: fluent text can mask shallow pattern-matching, and distinguishing cause from correlation (Causal Inference) remains shaky. Data and energy efficiency: a child learns language from tens of millions of words, while frontier models need trillions of tokens and megawatts of power. Continual learning: humans learn for a lifetime without catastrophic forgetting (39.4), while deployed models are essentially frozen after training. And alignment (AI Ethics, Fairness & Safety): as systems grow more capable, reliably specifying and verifying that they do what we intend becomes the defining safety challenge.

Tip

A useful frame: progress in AI has come less from one master algorithm than from removing bottlenecks — first compute, then data, then labels (via self-supervision), now reasoning depth (via test-time compute). The next leap likely comes from removing whichever bottleneck binds hardest next: probably data efficiency, continual learning, or grounding in the world.

39.13 — Meta-Learning and Multi-Task Learning in Depth

Imagine a tutor who has taught hundreds of students. When a new student walks in, the tutor doesn’t start from a blank slate — they already know roughly which explanations land, which mistakes are common, and how to adapt within the first few minutes. Meta-learning (“learning to learn”) aims for exactly this: instead of training one model for one task, we train across many tasks so that adapting to a new task takes only a handful of examples and a few gradient steps. Multi-task learning is the sibling idea — train one model on several tasks at once so they share structure and regularize each other.

The two differ in goal. Multi-task learning wants a single model that is good at all the tasks it was trained on. Meta-learning wants a starting point (or a comparison rule) that makes a brand-new, unseen task easy to pick up. We treat learning episodes, not just examples, as the unit of training.

The episodic setup

Meta-learning is trained on a distribution of tasks $p(\mathcal{T})$. Each task $\mathcal{T}_i$ comes as a tiny dataset split into a support set (the few labeled examples you adapt on) and a query set (held-out examples you’re scored on). A “5-way 1-shot” classification episode means 5 classes, 1 support example each — adapt, then classify the queries.

flowchart LR
  D["Task distribution p(T)"] --> S1[Episode 1: support + query]
  D --> S2[Episode 2: support + query]
  D --> S3[Episode N: support + query]
  S1 --> M[Meta-learner]
  S2 --> M
  S3 --> M
  M --> A[Fast adaptation to NEW task]

The whole game is the outer loop (across tasks, slow, updates the meta-parameters) wrapped around an inner loop (within a task, fast, adapts to the support set).
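Building one episode is mostly bookkeeping. The sketch below samples an N-way K-shot episode from a labeled array; the function name `sample_episode` and the toy dataset are assumptions made for illustration.

```python
import numpy as np

def sample_episode(X, y, n_way, k_shot, n_query, rng):
    # Pick n_way classes, then k_shot support and n_query query examples per class.
    classes = rng.choice(np.unique(y), size=n_way, replace=False)
    support, query = [], []
    for new_label, c in enumerate(classes):
        idx = rng.permutation(np.flatnonzero(y == c))[: k_shot + n_query]
        support += [(i, new_label) for i in idx[:k_shot]]
        query += [(i, new_label) for i in idx[k_shot:]]
    s_idx, s_lab = map(np.array, zip(*support))
    q_idx, q_lab = map(np.array, zip(*query))
    return (X[s_idx], s_lab), (X[q_idx], q_lab)

# Toy dataset (an assumption): 20 classes, 10 examples each, 8 features.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(20), 10)
X = rng.normal(size=(200, 8))

(sx, sy), (qx, qy) = sample_episode(X, y, n_way=5, k_shot=1, n_query=3, rng=rng)
print(sx.shape, qx.shape)   # (5, 8) (15, 8)
```

Labels are remapped to 0..N-1 inside each episode, so the model cannot memorize global class identities and has to rely on the support set.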

Optimization-based: MAML and Reptile

MAML (Model-Agnostic Meta-Learning) asks a sharp question: what single set of weights $\theta$ is positioned so that one or a few gradient steps on any task’s support set lands in a good spot for that task’s queries? It learns an initialization, not a final answer.

The inner loop adapts $\theta$ to task $i$ with one SGD step on the support loss:

\[\theta_i' = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}^{\text{support}}(\theta)\]

In words: “Take the shared weights and do one practice step on this task’s few examples to get task-specialized weights.” Also written: as a gradient-descent map $\theta_i' = g(\theta) = \theta - \alpha\, \mathbf{g}_i$ where $\mathbf{g}_i = \nabla_\theta \mathcal{L}_{\mathcal{T}_i}^{\text{support}}(\theta)$.

The outer loop then nudges the original $\theta$ so that these adapted weights $\theta_i'$ do well on the query set:

\[\theta \leftarrow \theta - \beta \nabla_\theta \sum_i \mathcal{L}_{\mathcal{T}_i}^{\text{query}}(\theta_i')\]

In words: “Adjust the shared starting weights so that the practiced weights score well on held-out examples — improving how well adaptation generalizes, not how well it memorizes.” Also written: since $\theta_i'$ depends on $\theta$, the chain rule gives $\theta \leftarrow \theta - \beta \sum_i (\partial \theta_i'/\partial\theta)^\top \nabla_{\theta_i'}\mathcal{L}_i^{\text{query}}$, and $\partial\theta_i'/\partial\theta = I - \alpha \nabla^2_\theta \mathcal{L}_i^{\text{support}}$ — the second-order term.

Notice the subtlety: the outer gradient differentiates through the inner update, so it contains a second derivative (a gradient of a gradient). That is the source of MAML’s power and its cost.
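A scalar example makes the second-order factor visible. Assume, purely for illustration, a support loss $(\theta - a)^2$ and a query loss $(\theta' - b)^2$ for one task; the Hessian of the support loss is then the constant $2$, so $\partial\theta'/\partial\theta = 1 - 2\alpha$. The sketch compares the first-order gradient, the full MAML gradient, and a finite-difference check.

```python
alpha = 0.1            # inner learning rate (illustrative)
a, b = 2.0, 3.0        # hypothetical support and query targets for one task

def inner_step(theta):
    # support loss (theta - a)^2, one SGD step
    return theta - alpha * 2 * (theta - a)

def query_loss_after_adapt(theta):
    # query loss (theta' - b)^2 evaluated at the adapted weight
    return (inner_step(theta) - b) ** 2

theta = 0.0
theta_adapted = inner_step(theta)

first_order = 2 * (theta_adapted - b)          # ignores d theta'/d theta
full_maml = (1 - alpha * 2) * first_order      # (1 - alpha * Hessian), Hessian = 2

# Finite-difference check of the full gradient.
eps = 1e-6
numeric = (query_loss_after_adapt(theta + eps)
           - query_loss_after_adapt(theta - eps)) / (2 * eps)

print(round(first_order, 3), round(full_maml, 3), round(numeric, 3))   # -5.2 -4.16 -4.16
```

The full gradient matches the numerical derivative, while the first-order version differs by exactly the factor $1 - 2\alpha$.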

Tip

The intuition: MAML doesn’t seek weights that are good now; it seeks weights that are one step away from being good on anything. It optimizes for adaptability, not performance.

Reptile is the lazy, beautiful shortcut. Skip the second derivatives entirely. For each task, just run a few SGD steps to get $\tilde\theta_i$, then move the meta-parameters toward that adapted point:

\[\theta \leftarrow \theta + \beta\,(\tilde\theta_i - \theta)\]

In words: “After practicing a while on a task, take a small step in the direction the weights drifted — repeat across tasks and you settle near a point all tasks agree on.” Also written: as an exponential moving average toward adapted solutions, $\theta \leftarrow (1-\beta)\,\theta + \beta\,\tilde\theta_i$.

That’s it — no differentiating through the inner loop. Surprisingly, this still works, because averaging “where each task pulled the weights” finds a point near all of them in a way that correlates with MAML’s objective.

Here is Reptile’s entire core, from scratch:

import numpy as np

def reptile_step(theta, task_grad_fn, inner_steps=5, alpha=0.05, beta=0.1):
    w = theta.copy()                      # clone meta-weights
    for _ in range(inner_steps):          # inner loop: adapt on this task
        w -= alpha * task_grad_fn(w)      # plain SGD, no graph kept
    return theta + beta * (w - theta)     # outer: step toward adapted point

# tiny sanity check: tasks are "fit a scalar to a target", target ~ task
def make_task():
    target = np.random.randn()
    return lambda w: 2 * (w - target)     # grad of (w - target)^2

theta = np.array([0.0])
for _ in range(2000):
    theta = reptile_step(theta, make_task())
print(theta)   # converges near the MEAN target (~0), a good init for any task

The toy result is telling: with targets drawn around 0, Reptile parks $\theta$ near the mean — the spot from which any task is reachable in a few steps. That is the meta-learning prize in miniature.

Warning

MAML’s second-order gradients are memory-hungry and unstable on deep nets; the common first-order approximation (FOMAML) drops them and often matches full MAML, which is itself a hint that Reptile’s first-order trick is not a coincidence.

Metric-based: Prototypical and Matching Networks

Optimization-based methods adapt weights. Metric-based methods don’t adapt at all — they learn an embedding space where classification is just “find the nearest thing.” The adaptation is cheap because comparison, not training, does the work.

Prototypical Networks are the cleanest version. Embed every support example with a learned encoder $f_\phi$. For each class $c$, average its support embeddings into a single prototype:

\[\mathbf{p}_c = \frac{1}{|S_c|}\sum_{x_j \in S_c} f_\phi(x_j)\]

In words: “A class’s prototype is just the average location of its few examples in the learned feature space — its center of mass.” Also written: $\mathbf{p}_c = \frac{1}{|S_c|}\,F_c^\top \mathbf{1}$, the column-mean of the stacked support-embedding matrix $F_c$.

Classify a query $x$ by softmax over negative distances to the prototypes:

\[P(y = c \mid x) = \frac{\exp(-\|f_\phi(x) - \mathbf{p}_c\|^2)}{\sum_{c'} \exp(-\|f_\phi(x) - \mathbf{p}_{c'}\|^2)}\]

In words: “Score each class by how close the query lands to its center, then turn those closeness scores into probabilities — nearest center wins, but softly.” Also written: since $-\|f-\mathbf p_c\|^2 = 2\,\mathbf p_c^\top f - \|\mathbf p_c\|^2 - \|f\|^2$ and $\|f\|^2$ cancels in the softmax, this is a linear classifier with weights $2\mathbf p_c$ and biases $-\|\mathbf p_c\|^2$.

A worked micro-example in 2-D. Class A support embeddings are $(1,1)$ and $(1,2)$, so $\mathbf{p}_A=(1,1.5)$. Class B support is $(5,5)$ and $(5,6)$, so $\mathbf{p}_B=(5,5.5)$. A query at $(2,2)$ has squared distance $1+0.25=1.25$ to A and $9+12.25=21.25$ to B — clearly class A. No weight updates, just arithmetic in embedding space.
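The same micro-example in NumPy, as a minimal sketch. The embeddings are taken as given, so the encoder $f_\phi$ is assumed to have already been applied.

```python
import numpy as np

support = {
    "A": np.array([[1.0, 1.0], [1.0, 2.0]]),
    "B": np.array([[5.0, 5.0], [5.0, 6.0]]),
}
query = np.array([2.0, 2.0])

classes = list(support)
prototypes = np.stack([support[c].mean(axis=0) for c in classes])   # class means
sq_dist = ((prototypes - query) ** 2).sum(axis=1)                   # squared Euclidean

logits = -sq_dist
probs = np.exp(logits - logits.max())    # numerically stable softmax
probs /= probs.sum()

print(sq_dist)                           # squared distances 1.25 and 21.25
print(classes[int(probs.argmax())])      # A
```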

Matching Networks are the soft-attention cousin: instead of collapsing each class to one prototype, they compare the query to every support example and take an attention-weighted vote over their labels — useful when classes are multimodal and a single mean would blur them.
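A minimal sketch of that attention-weighted vote, using cosine similarity and a plain softmax. The embeddings are hypothetical and chosen so that class 0 has two separate modes, a case where a single averaged prototype would land near the origin.

```python
import numpy as np

def matching_predict(query, support_x, support_y, n_classes):
    # cosine similarity between the query and every support embedding
    q = query / np.linalg.norm(query)
    s = support_x / np.linalg.norm(support_x, axis=1, keepdims=True)
    sims = s @ q
    # softmax attention over support examples
    attn = np.exp(sims - sims.max())
    attn /= attn.sum()
    # attention-weighted vote over one-hot labels
    one_hot = np.eye(n_classes)[support_y]
    return attn @ one_hot

# Hypothetical embeddings: class 0 has two separate modes, class 1 has one.
support_x = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
support_y = np.array([0, 0, 1])
probs = matching_predict(np.array([-0.9, 0.1]), support_x, support_y, n_classes=2)
print(probs.round(3), probs.argmax())
```

The query sits next to the second mode of class 0, and the vote follows that neighbor rather than a blurred class mean.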

The trade: metric methods are fast and stable but assume a fixed comparison rule (Euclidean or cosine) is enough; optimization methods are more flexible but pay in compute and tuning.

Multi-task architectures and the gradient-conflict problem

Multi-task learning shares a backbone across tasks. The first design choice is how much to share.

Hard parameter sharing uses one shared trunk and small per-task heads — the classic, strong-regularizer default. Soft parameter sharing gives each task its own network but penalizes the weights for drifting apart, trading parameters for flexibility.

flowchart TB
  X[Input] --> B[Shared backbone]
  B --> H1[Head: task A]
  B --> H2[Head: task B]
  B --> H3[Head: task C]

FiLM (Feature-wise Linear Modulation) is a lightweight middle path: keep one shared backbone but let a task identifier produce per-task scale $\gamma$ and shift $\beta$ vectors that modulate the features, $\text{FiLM}(h) = \gamma \odot h + \beta$. One body, conditioned on which task it’s doing — cheap and surprisingly expressive.

In words: “Run one shared network, but for each task, stretch and slide its features with a task-specific dial — same machinery, different settings per task.” Also written: per-channel, $\text{FiLM}(h)_c = \gamma_c\, h_c + \beta_c$ — an affine (scale-and-shift) transform applied feature-wise, with $\gamma, \beta$ predicted from the task id.
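As a sketch, FiLM is a two-line function once the per-task $\gamma$ and $\beta$ exist. Here they are random tables indexed by task id, an assumption standing in for whatever small network predicts them in a trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tasks, n_features = 3, 4           # illustrative sizes

# Per-task scale and shift; in a trained model these would come from a small
# learned network or embedding table indexed by the task id.
gamma = 1.0 + 0.1 * rng.normal(size=(n_tasks, n_features))
beta = 0.1 * rng.normal(size=(n_tasks, n_features))

def film(h, task_id):
    # feature-wise affine transform: gamma * h + beta
    return gamma[task_id] * h + beta[task_id]

h = rng.normal(size=(2, n_features))   # shared-backbone features for a batch of 2
out_a = film(h, task_id=0)
out_b = film(h, task_id=1)
print(out_a.shape, np.allclose(out_a, out_b))   # (2, 4) False
```

The same features `h` produce different outputs per task, while the backbone that computed `h` is shared.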

The deep problem in multi-task training is gradient conflict: two tasks can want the shared weights to move in opposing directions. When their gradients point more than 90° apart, the sum can cancel useful signal, and a loud task can drown a quiet one.

PCGrad (“gradient surgery”) detects conflict by the dot product of two task gradients. If $g_i \cdot g_j < 0$, it projects $g_i$ onto the plane orthogonal to $g_j$, removing only the conflicting component:

\[g_i \leftarrow g_i - \frac{g_i \cdot g_j}{\|g_j\|^2}\, g_j\]

In words: “If task $i$’s gradient pushes partly against task $j$’s, subtract off just that opposing part — keep what helps $i$, drop what fights $j$.” Also written: as a projection, $g_i \leftarrow (I - \hat g_j \hat g_j^\top)\, g_i$ with $\hat g_j = g_j/\|g_j\|$ — orthogonal projection of $g_i$ out of $g_j$’s direction.

Geometrically the surgery snips off only the component of $g_i$ that points back along $g_j$, leaving the part that runs perpendicular — the help survives, the conflict is removed.

import numpy as np

def pcgrad(grads):                         # grads: list of per-task gradient vectors
    out = [g.copy() for g in grads]
    for i in range(len(out)):
        for j in range(len(grads)):
            if i == j: continue
            dot = out[i] @ grads[j]
            if dot < 0:                    # conflict: remove the bad component
                out[i] -= dot / (grads[j] @ grads[j]) * grads[j]
    return sum(out)                        # combined, de-conflicted update

# task A pulls +x, task B pulls -x but +y; the +x/-x clash is surgically removed
gA = np.array([1.0,  0.0])
gB = np.array([-1.0, 1.0])
print(pcgrad([gA, gB]))                    # -> [0.5 1.5]; the plain sum gA + gB would be [0. 1.]

GradNorm attacks the magnitude side instead: it dynamically rescales each task’s loss weight so that all tasks train at a similar rate, preventing the fast learner from hogging the shared backbone. PCGrad fixes direction; GradNorm balances speed. In practice they target the two ways multi-task training goes wrong — conflicting gradients and imbalanced ones — and are often worth more than yet another architecture tweak.

Tip

Before reaching for gradient surgery, try the cheapest fix first: just tune the per-task loss weights (even a fixed hand-set ratio). A lot of “gradient conflict” is really one task’s loss being numerically larger; rebalancing the weights often recovers most of the benefit with none of the machinery.
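A quick diagnostic helps decide which problem you have. The sketch below measures the cosine between two task gradients (direction) and their norm ratio (magnitude); the gradient values are hypothetical and the helper name `diagnose` is an assumption.

```python
import numpy as np

def diagnose(g_a, g_b):
    # cosine < 0 means the tasks pull the shared weights in opposing directions;
    # a norm ratio far from 1 means one task dominates the summed gradient.
    cos = g_a @ g_b / (np.linalg.norm(g_a) * np.linalg.norm(g_b))
    ratio = np.linalg.norm(g_a) / np.linalg.norm(g_b)
    return cos, ratio

# Hypothetical gradients on the shared backbone.
g_a = np.array([10.0, 0.5])    # numerically large loss
g_b = np.array([0.1, 0.2])     # numerically small loss

cos, ratio = diagnose(g_a, g_b)
print(round(cos, 2), round(ratio, 1))

# Rebalance with a fixed loss weight on task A, then re-check.
w_a = 1.0 / ratio
cos2, ratio2 = diagnose(w_a * g_a, g_b)
print(round(cos2, 2), round(ratio2, 1))
```

Here the cosine is positive, so there is no directional conflict at all; the only issue is scale, and a single loss weight fixes it.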

39.14 — Self-Supervised and Contrastive Learning in Depth

Labels are expensive; raw data is nearly free. Self-supervised learning (SSL) turns that asymmetry into a strategy: hide part of the data and train the model to predict it from the rest. The “labels” are manufactured from the input itself, so a billion unlabeled images or a trillion words of text become supervision for free. The payoff is a representation — a feature space — that you can later fine-tune on a small labeled task.

The intuition is that to fill in a blank well, a model must understand structure. Predict the missing word and you learn syntax and semantics; predict the missing image patch and you learn objects and textures. Good predictions require good understanding, so the pretext task smuggles real learning in through the back door.

Pretext tasks: the first generation

Early SSL invented clever puzzles: predict the rotation applied to an image (0°, 90°, 180°, 270°), solve a jigsaw of shuffled patches, or colorize a grayscale photo. These pretext tasks worked but felt arbitrary — the puzzle was a means to an end, and the features were a side effect. The field then split into three cleaner families: contrastive, non-contrastive, and masked modeling.

Contrastive learning: SimCLR, MoCo, and InfoNCE

The contrastive idea is almost embarrassingly simple: pull together two views of the same thing; push apart views of different things. Take an image, make two random augmentations (crop, color jitter, blur) — these form a positive pair. Every other image in the batch is a negative. Train the embedding so positives are close and negatives are far.

The workhorse loss is InfoNCE (named after Noise-Contrastive Estimation). For a positive pair $(i, j)$ with embeddings $z_i, z_j$ and similarity $\text{sim}$ (cosine), over a batch of $2N$ views:

\[\mathcal{L}_{i,j} = -\log \frac{\exp(\text{sim}(z_i, z_j)/\tau)}{\sum_{k \ne i} \exp(\text{sim}(z_i, z_k)/\tau)}\]

In words: “Among everyone in the batch, treat finding my true partner as a classification problem — penalize the model when the right partner isn’t the clear winner.” Also written: as a temperature-scaled softmax cross-entropy, $\mathcal{L}_{i,j} = -\,s_{ij}/\tau + \log\sum_{k\ne i}\exp(s_{ik}/\tau)$ with $s_{ab} = \text{sim}(z_a, z_b)$.

It is just a softmax classification: “which of all these candidates is my true partner?” The temperature $\tau$ sharpens or softens the contrast.

flowchart LR
  IMG[Image] --> A1[Augment view 1]
  IMG --> A2[Augment view 2]
  A1 --> E1[Encoder f]
  A2 --> E2[Encoder f]
  E1 --> Z1[z_i]
  E2 --> Z2[z_j]
  Z1 -->|pull together| Z2
  Z1 -.->|push apart| N[Negatives in batch]

A worked number. Suppose temperature $\tau = 0.1$, the positive pair has cosine similarity $0.9$, and two negatives sit at $0.1$ and $0.2$. The logits are $9$, $1$, and $2$. Softmax over $\{9,1,2\}$ puts about $\frac{e^{9}}{e^{9}+e^{1}+e^{2}} \approx 0.9988$ on the positive, so the loss is $\approx -\log 0.9988 \approx 0.0012$: near zero, because the positive already dominates. Shrink the gap (positive at $0.3$, logit $3$) and the softmax puts only about $0.67$ on the positive, so the loss balloons to about $0.41$, pushing the encoder to separate them.

import numpy as np

def info_nce(z_i, z_j, negatives, tau=0.1):
    sim = lambda a, b: a @ b / (np.linalg.norm(a)*np.linalg.norm(b))
    pos = np.exp(sim(z_i, z_j) / tau)
    denom = pos + sum(np.exp(sim(z_i, n) / tau) for n in negatives)
    return -np.log(pos / denom)

z_i = np.array([1.0, 0.0]); z_j = np.array([0.95, 0.31])   # similar view
negs = [np.array([0.0, 1.0]), np.array([-1.0, 0.1])]       # dissimilar
print(round(info_nce(z_i, z_j, negs), 4))                  # small loss

SimCLR showed this works spectacularly with three ingredients: heavy augmentation, a projection head (a small MLP applied before computing the loss, then discarded), and — crucially — large batches, because negatives come from the batch and more negatives mean a harder, more informative contrast.
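To see where the in-batch negatives come from, here is a minimal batch version of the loss over $2N$ views, written as a sketch. The row layout (views of the same image in adjacent rows) and the function name `nt_xent` are assumptions made for this example; the projection head and augmentations are left out.

```python
import numpy as np

def nt_xent(z, tau=0.1):
    # z has shape (2N, d); rows 2k and 2k+1 are the two views of image k
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / tau
    np.fill_diagonal(sim, -np.inf)            # exclude k == i from the softmax
    partner = np.arange(len(z)) ^ 1           # 0<->1, 2<->3, ...
    log_denom = np.log(np.exp(sim).sum(axis=1))
    log_prob = sim[np.arange(len(z)), partner] - log_denom
    return -log_prob.mean()

rng = np.random.default_rng(0)
base = rng.normal(size=(8, 16))                                         # 8 "images"
views = np.repeat(base, 2, axis=0) + 0.05 * rng.normal(size=(16, 16))   # two noisy views each
random_z = rng.normal(size=(16, 16))                                    # unrelated embeddings

print(nt_xent(views) < nt_xent(random_z))   # True: aligned pairs give lower loss
```

Every row uses the other $2N - 2$ non-partner rows as negatives, which is why the batch size directly sets the number of negatives.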

That batch dependence is a real burden. MoCo (Momentum Contrast) removes it by keeping a queue of negatives from recent batches and encoding them with a slowly-updated momentum encoder ($\theta_k \leftarrow m\,\theta_k + (1-m)\,\theta_q$, with $m$ near $0.999$). The queue decouples the number of negatives from the batch size, so you get thousands of negatives on modest hardware.

In words (the momentum update): “The negative-encoder’s weights are a slow-motion copy of the main encoder (mostly themselves, with a tiny dribble of the latest weights mixed in), so the negatives stay consistent across batches.” Also written: as an exponential moving average, $\theta_k = m\,\theta_k^{\text{old}} + (1-m)\,\theta_q$; with $m=0.999$ the key encoder moves roughly $1000\times$ slower than the query encoder.
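Both MoCo ingredients fit in a few lines. This is a sketch with stand-in weight vectors and placeholder string keys rather than real encoders; the queue size of 6 is chosen only to make the eviction visible.

```python
from collections import deque
import numpy as np

m = 0.999
theta_q = np.ones(4)          # query-encoder weights (stand-in vector)
theta_k = np.zeros(4)         # key-encoder weights

def momentum_update(theta_k, theta_q, m):
    return m * theta_k + (1 - m) * theta_q

theta_k = momentum_update(theta_k, theta_q, m)
print(theta_k)                # each entry moved 0.1% of the way toward theta_q

# Fixed-size FIFO queue of negative keys: newest batch in, oldest out.
queue = deque(maxlen=6)       # tiny size for illustration
for step in range(4):
    batch_keys = [f"key_{step}_{i}" for i in range(2)]   # placeholder keys
    queue.extend(batch_keys)
print(list(queue))            # keys from the 3 most recent batches
```

The queue length is independent of the batch size of 2, which is the decoupling described above.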

Warning

Contrastive methods learn weak shortcut features if negatives are too easy, and they are sensitive to the augmentation recipe: too weak and positives are trivially close, too strong and they no longer share content. The augmentation policy is a real hyperparameter, not a detail.

Non-contrastive learning: BYOL and DINO

Here is the puzzle that made the field nervous: do we even need negatives? If you only pull positives together with nothing pushing apart, the obvious solution is collapse — map everything to the same vector, loss zero, useless features. BYOL (Bootstrap Your Own Latent) showed you can avoid collapse without a single negative.

BYOL runs two networks: an online network that has an extra predictor head, and a target network that is an exponential moving average of the online one. The online network tries to predict the target’s embedding of a different augmented view. The asymmetry — the predictor on one side, the stop-gradient and EMA on the other — is what prevents collapse. The target is a slow-moving, more stable teacher the student chases but never quite catches, and that lag keeps the representation from degenerating.

flowchart LR
  V1[View 1] --> ON[Online encoder] --> PR[Predictor] --> P[prediction]
  V2[View 2] --> TG[Target encoder EMA] --> T[target / stop-grad]
  P -->|match| T
  ON -.EMA update.-> TG
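To make the diagram concrete, here is a minimal NumPy sketch of one BYOL-style forward pass. It assumes toy linear encoders and random arrays standing in for the two augmented views; the names `byol_loss`, `ema_update`, and `tau` are illustrative. NumPy has no autograd, so the stop-gradient appears only as a comment: in a real framework the target would be detached from the graph.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def byol_loss(prediction, target):
    """Squared distance between unit vectors, which equals 2 - 2 * cosine similarity."""
    p, t = l2_normalize(prediction), l2_normalize(target)
    return np.mean(np.sum((p - t) ** 2, axis=-1))

def ema_update(target_w, online_w, tau=0.99):
    """The target network trails the online network as an exponential moving average."""
    return tau * target_w + (1.0 - tau) * online_w

rng = np.random.default_rng(0)
view_1 = rng.normal(size=(8, 16))       # stand-in for augmented view 1
view_2 = rng.normal(size=(8, 16))       # stand-in for augmented view 2

online_w = rng.normal(size=(16, 4))     # toy linear online encoder
predictor_w = rng.normal(size=(4, 4))   # toy linear predictor head (online side only)
target_w = online_w.copy()              # target starts as a copy of the online encoder

prediction = view_1 @ online_w @ predictor_w
target = view_2 @ target_w              # treated as a constant (stop-gradient)
loss = byol_loss(prediction, target)

# After the optimizer step on the online weights (omitted here), the target is nudged toward them.
target_w = ema_update(target_w, online_w)
```

The asymmetry is visible in the code: only the online branch passes through `predictor_w`, and only the target branch is updated by averaging rather than by gradients.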

DINO (self-distillation with no labels) brings the same teacher–student idea to Vision Transformers. The student matches the teacher’s output distribution over views, with centering (subtract a running mean to stop one dimension dominating) and sharpening (low teacher temperature) as the twin tricks that prevent collapse. DINO’s surprise bonus: the attention maps of the trained ViT segment objects with no segmentation labels ever shown — emergent structure, free.
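Centering and sharpening are easier to see in code. The sketch below is a simplified NumPy illustration, not DINO's actual implementation: the logits are random, and the temperatures and center momentum are illustrative values chosen for the example.

```python
import numpy as np

def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits / temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def softmax(logits, temperature):
    return np.exp(log_softmax(logits, temperature))

def dino_loss(student_logits, teacher_logits, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy from the centered, sharpened teacher distribution to the student."""
    # Centering subtracts a running mean; the low teacher temperature sharpens the result.
    teacher_probs = softmax(teacher_logits - center, t_teacher)  # constant target in practice
    student_log_probs = log_softmax(student_logits, t_student)
    return -np.mean(np.sum(teacher_probs * student_log_probs, axis=-1))

def update_center(center, teacher_logits, momentum=0.9):
    """Running mean of teacher outputs, used to stop one dimension from dominating."""
    return momentum * center + (1.0 - momentum) * teacher_logits.mean(axis=0)

rng = np.random.default_rng(0)
student_logits = rng.normal(size=(8, 16))   # student outputs for one view
teacher_logits = rng.normal(size=(8, 16))   # teacher outputs for another view
center = np.zeros(16)

loss = dino_loss(student_logits, teacher_logits, center)
center = update_center(center, teacher_logits)
```

The two tricks pull in opposite directions: centering alone would push the teacher toward a uniform output, sharpening alone toward a single dominant dimension, and together they balance.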

Masked modeling and JEPA

The other great branch borrows directly from language modeling’s masked-prediction recipe. MAE (Masked Autoencoder) masks a large fraction of image patches — typically 75% — feeds only the visible ones to an encoder, then asks a lightweight decoder to reconstruct the missing pixels.

flowchart LR
  IMG[Patches] --> MASK[Mask 75 percent]
  MASK --> ENC[Encoder: visible only] --> DEC[Decoder] --> REC[Reconstruct masked pixels]

Two design choices make MAE efficient and effective. The high mask ratio makes the task hard enough to force real understanding (you can’t cheat by copying neighbors). And feeding only visible patches to the encoder cuts compute dramatically, since the encoder processes a quarter of the patches. After pretraining you throw the decoder away and keep the encoder.
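The "visible patches only" step is just random index selection. The following NumPy sketch shows it under assumed sizes: 196 patches of 768 values each, which is what a 224-pixel image cut into 16-by-16 RGB patches would give. The function name `random_mask` is hypothetical.

```python
import numpy as np

def random_mask(num_patches, mask_ratio, rng):
    """Split patch indices into a visible set and a masked set, chosen uniformly at random."""
    num_keep = int(num_patches * (1.0 - mask_ratio))
    perm = rng.permutation(num_patches)
    visible_idx = np.sort(perm[:num_keep])
    masked_idx = np.sort(perm[num_keep:])
    return visible_idx, masked_idx

rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 768))   # random stand-in for flattened image patches

visible_idx, masked_idx = random_mask(len(patches), mask_ratio=0.75, rng=rng)

encoder_input = patches[visible_idx]        # the encoder sees only these rows
reconstruction_target = patches[masked_idx] # the decoder is scored on these rows
```

With a 75% mask the encoder receives 49 of the 196 patches. Since self-attention cost grows quadratically with the number of tokens, the saving in the attention layers is larger than the fourfold reduction in sequence length suggests.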

But reconstructing pixels spends capacity on irrelevant detail — exact textures, lighting — that a good representation shouldn’t care about. JEPA (Joint-Embedding Predictive Architecture) is the answer: predict in representation space, not pixel space. Mask part of the input, encode the visible part, and predict the embeddings of the masked part rather than its raw values. You’re predicting “what would the features there look like,” which lets the model ignore unpredictable low-level noise and focus on structure.

flowchart LR
  CTX[Context patches] --> CE[Context encoder] --> PRED[Predictor]
  TGT[Target patches] --> TE[Target encoder EMA] --> TZ[target embeddings]
  PRED -->|predict in embedding space| TZ
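A rough NumPy sketch of the JEPA objective follows. It is heavily simplified and should be read as an assumption-laden toy: the encoders are single linear maps, the predictor works from a mean-pooled context summary instead of per-position queries, and squared error is used as one plausible distance in embedding space.

```python
import numpy as np

rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 32))     # 16 patches with 32 raw features each (toy sizes)
context_idx = np.arange(0, 12)          # patches the context encoder is allowed to see
target_idx = np.arange(12, 16)          # patches whose embeddings must be predicted

context_w = rng.normal(size=(32, 8)) * 0.1        # toy linear context encoder
target_w = context_w.copy()                       # target encoder: EMA copy of the context encoder
predictor_w = rng.normal(size=(8, 4 * 8)) * 0.1   # maps the context summary to 4 target embeddings

# Encode the visible context and pool it into a single summary vector.
context_summary = (patches[context_idx] @ context_w).mean(axis=0)

# Predict the embeddings of the masked patches, not their raw values.
predicted = (context_summary @ predictor_w).reshape(4, 8)

# Targets come from the EMA encoder and are treated as constants (stop-gradient).
target_embeddings = patches[target_idx] @ target_w

loss = np.mean((predicted - target_embeddings) ** 2)

# After the optimizer step on the context encoder (omitted), the target trails it.
target_w = 0.996 * target_w + 0.004 * context_w
```

Note that the loss compares 8-dimensional embeddings rather than the 32 raw features per patch, which is the point of predicting in representation space.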

JEPA sits at an interesting crossroads of everything above: it predicts like masked modeling, but in latent space like the non-contrastive methods, using a stop-gradient/EMA target to avoid collapse like BYOL. The unifying view across this whole section is that SSL is the search for a prediction task hard enough to demand understanding but not so literal that it wastes effort on noise — and the frontier has been steadily moving that prediction from raw pixels toward abstract representations.

Where this shows up. These are not just lab curiosities. DINO-style features power off-the-shelf image search and zero-shot segmentation; MAE pretraining is a common warm-start for medical-imaging and satellite models where labels are scarce; and the “predict in representation space” idea behind JEPA underlies video and world-model pretraining where reconstructing every pixel would be hopeless. When a team says they “pretrained a backbone on unlabeled data and fine-tuned on our small set,” they are very likely running one of the six methods in the table below.

Method	Family	Needs negatives?	Predicts in	Anti-collapse trick
SimCLR	Contrastive	Yes (large batch)	Embedding	Negatives
MoCo	Contrastive	Yes (queue)	Embedding	Negatives + momentum
BYOL	Non-contrastive	No	Embedding	Predictor + EMA target
DINO	Non-contrastive	No	Distribution	Centering + sharpening
MAE	Masked	No	Pixels	Reconstruction target
JEPA	Masked / joint	No	Embedding	EMA target + stop-grad

39.15 — Quick reference

Term / method	One-line meaning	When / why it matters
Self-supervised learning (SSL)	Invent a label from the data itself (predict a hidden part)	Train on unlabeled data at internet scale; the engine of foundation models
InfoNCE / contrastive	Softmax CE that pulls positive pairs close, pushes negatives far	Learn embeddings without labels; watch the temperature $\tau$ and collapse
Transfer learning	Reuse a pretrained model; retrain only the head	Strong classifier from tiny labeled data (e.g. 500 X-rays)
Multi-task learning	One shared backbone, many per-task heads	Tasks regularize each other; mind gradient conflict (PCGrad / GradNorm)
MAML / Reptile	Learn an initialization one step away from good on any task	Few-shot adaptation; Reptile skips MAML’s second-order cost
Prototypical networks	Classify by nearest class-mean in embedding space	Few-shot, no weight updates — just arithmetic
Zero-shot (CLIP)	Classify via closeness to label text embeddings	Recognize never-seen classes from a description alone
EWC	Quadratic penalty pinning high-Fisher weights to old values	Continual learning without catastrophic forgetting
FedAvg	Data-weighted average of clients’ local models	Train where data can’t move; add DP + secure aggregation for privacy
$(\varepsilon,\delta)$-DP	Output barely changes if any one record is added/removed	Provable privacy; smaller $\varepsilon$ = more noise = less accuracy
DARTS (NAS)	Softmax mixture over candidate ops, trained by gradient descent	Differentiable architecture search; still benchmark it against random-search baselines
LoRA / PEFT	Freeze base, train a rank-$r$ update $\frac{\alpha}{r}BA$ (<1% of weights)	Cheap fine-tuning; hot-swap many task adapters onto one base
Neurosymbolic	Neural perception → symbols → exact symbolic reasoning	Data efficiency, interpretability, hard-constraint guarantees
World model	Learned dynamics $s,a\to s',r$ to plan in imagination	Sample-efficient model-based RL; beware model exploitation
Domain randomization	Train across a range of sim parameters, not one guess	Robust sim-to-real transfer when the true value is unknown
Chain-of-thought / self-consistency	Think step by step; sample many chains and majority-vote	Test-time compute buys accuracy; CoT isn’t a faithful explanation

39.16 — Key takeaways

Self-supervised learning removes the label bottleneck by predicting hidden parts of the data; it is the engine behind every foundation model. Contrastive learning (InfoNCE) pulls positive pairs together and pushes negatives apart, with the denominator running over the positive plus all negatives — watch for collapse.
Transfer, multi-task, and meta-learning form a ladder: reuse one model, train on many tasks at once, or learn an initialization that adapts fast (MAML seeks a starting point from which a gradient step or two reaches each task’s optimum).
Few-shot classifies from a handful of examples (prototypical networks); zero-shot classifies unseen classes via shared text embeddings (CLIP).
Catastrophic forgetting is the central obstacle to continual learning; defenses are regularization (EWC pins high-Fisher weights), replay, and per-task parameters — replay is a strong simple baseline.
Federated learning trains on data that never moves (FedAvg); true privacy needs differential privacy and secure aggregation on top.
AutoML / NAS automate model design; DARTS made architecture search differentiable via a softmax mixture over candidate ops — but always compare against strong hand-designed and random-search baselines.
Parameter-efficient fine-tuning (LoRA, adapters, prompt tuning) freezes the giant base model and trains a small set of new parameters, typically under 1% of the base model’s size. It is the practical way to adapt foundation models and hot-swap many tasks onto one backbone.
Neurosymbolic AI fuses neural perception with symbolic reasoning for data efficiency, interpretability, and hard constraints; the lightweight modern form is an LLM calling tools.
World models let agents learn dynamics and plan in imagination for huge sample efficiency; beware model exploitation. Embodied AI closes the perception–action loop and bridges the reality gap with sim-to-real and domain randomization.
Reasoning models scale test-time compute (chain-of-thought, self-consistency voting, tree-of-thoughts), trading inference compute for accuracy — but CoT is not a faithful explanation.
The road to general intelligence is gated by open problems: out-of-distribution robustness, genuine reasoning and causality, data/energy efficiency, continual learning, and alignment.

39.17 — See also

Attention & Transformers — the architecture underlying self-supervised foundation models.
Large Language Models — in-context learning, chain-of-thought, and the scaling story in depth.
Multimodal AI — CLIP and the joint image–text space behind zero-shot recognition.
Reinforcement Learning — model-free vs. model-based agents that world models build on.
Model Evaluation & Tuning — hyperparameter optimization, the foundation AutoML automates.
Causal Inference — cause vs. correlation, one of the core open reasoning problems.
Explainable AI & Interpretability — why a chain of thought is not a guaranteed explanation.
AI Ethics, Fairness & Safety — alignment, the defining challenge as capability scales.

↪ The thread continues → Chapter 40 · 🔗 Graph Machine Learning

The frontier isn’t only about scale; whole data shapes stayed underserved. Networks of relationships — molecules, social graphs, knowledge bases — demand their own deep learning.
