Chapter 19 — 👁️ Computer Vision

Computer vision is the subfield of applied AI that teaches machines to extract meaning from pixels — to answer what is in an image, where it is, and which exact pixels belong to it. It sits at the intersection of deep learning (Neural Networks, CNNs, sequence models, Transformers, generative models) and a specific data type: the 2-D grid of brightness values that is a digital image. This chapter walks the ladder of vision tasks from easiest to hardest — classification, then detection, then segmentation — and then the practical workhorses (OCR, pose, tracking, augmentation) that turn those models into real systems.

🧭 In context: Applied AI built on CNNs and Transformers · used to recognize, locate, and trace objects in images and video · the one key idea — vision tasks form a ladder of spatial precision, from one label per image to one label per pixel.

💡 Remember this: Every vision model is a function from a grid of pixel numbers to a structured answer, and the tasks form a ladder of spatial precision — pick the lowest rung (classify → detect → segment) that solves your problem.

Before any of these tasks, it helps to be concrete about what a “pixel” actually is. A grayscale image is a 2-D grid of numbers, each typically an 8-bit value from 0 (black) to 255 (white). A color image is three such grids stacked (red, green, blue), so a \(224 \times 224\) color photo is really a \(224 \times 224 \times 3\) block of numbers: 150,528 of them. Every vision model, no matter how fancy, is ultimately a function from that block of numbers to some structured answer.
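To make the grid view concrete, here is a minimal sketch using a made-up 5×5 grayscale image (the plus-sign pattern, the `render` helper, and its threshold of 128 are illustrative assumptions, not part of any library). It prints the same data twice: once as raw numbers, once as a crude text picture.

```python
import numpy as np

# A hypothetical 5x5 grayscale image: a bright plus sign on a dark background.
img = np.array([
    [  0,   0, 255,   0,   0],
    [  0,   0, 255,   0,   0],
    [255, 255, 255, 255, 255],
    [  0,   0, 255,   0,   0],
    [  0,   0, 255,   0,   0],
], dtype=np.uint8)

def render(gray, threshold=128):
    """Draw bright pixels as '#' and dark pixels as '.', one text row per image row."""
    return "\n".join("".join("#" if v >= threshold else "." for v in row) for row in gray)

print(img)          # the grid of numbers the model receives
print(render(img))  # a rough picture of what a person would see
```

The model only ever gets the first view; the second exists purely for human eyes.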

Loading a real image with a framework makes the “grid of numbers” idea tangible. The same picture is just a tensor you can slice, average, and reshape:

import numpy as np
from PIL import Image
img = Image.open("cat.jpg").convert("RGB")
arr = np.asarray(img)              # shape (H, W, 3), dtype uint8, values 0..255
print(arr.shape, arr.dtype)        # e.g. (224, 224, 3) uint8
print(arr[0, 0])                   # the top-left pixel, e.g. [137 122  98]
print(arr.mean(axis=(0, 1)))       # average R, G, B over the whole image

The three core tasks differ only in how precisely they localize. A picture is worth a thousand words here.

flowchart LR
    A[Input image] --> B[Classification<br/>one label / image]
    A --> C[Detection<br/>box + label / object]
    A --> D[Segmentation<br/>label / pixel]
    B --> B1["'cat'"]
    C --> C1["'cat' @ box(x,y,w,h)"]
    D --> D1["mask: which pixels are cat"]

19.1 — Image Classification & Recognition

Image classification is the foundational vision task: given a whole image, output a single category label. “Is this a cat or a dog?” The model never says where the cat is — it commits to one answer for the entire frame. This is the rung everything else is built on, because the features a classifier learns (edges, textures, parts, objects) are reused by detectors and segmenters.

Two architectures dominate, and the difference is easiest to feel as a contrast in how they look at an image. A CNN is like reading a page through a small magnifying glass slid across it: it sees a little neighborhood at a time and builds up understanding locally. A Vision Transformer is more like glancing at the whole page and letting every word talk to every other word at once. The CNN’s local habit is a built-in head start (nearby pixels usually belong together); the ViT’s everything-sees-everything freedom needs more data to pay off, but with enough data it can match or surpass the CNN.

The dominant tool since the early 2010s has been the convolutional neural network (CNN): a stack of learned filters that slide across the image detecting local patterns, getting more abstract with depth (Chapter 15 owns the mechanics). More recently the Vision Transformer (ViT) chops the image into a grid of patches, treats each patch like a word token, and runs self-attention over them (Chapter 17 owns attention). CNNs bake in the assumption that nearby pixels matter most (a strong, data-efficient inductive bias); ViTs drop that assumption and instead learn relationships from large data via self-attention, winning when data is plentiful.
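The “chop into patches” step is easy to see in code. The sketch below is only the tokenization stage, under the assumption of a 16×16 patch size (a common choice, not something this chapter fixes); the `patchify` name is hypothetical, and a real ViT would follow it with a learned linear projection and position embeddings.

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into non-overlapping patch x patch tiles, each flattened."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0, "image size must be a multiple of patch"
    tiles = img.reshape(H // patch, patch, W // patch, patch, C)
    tiles = tiles.transpose(0, 2, 1, 3, 4)            # (rows, cols, patch, patch, C)
    return tiles.reshape(-1, patch * patch * C)       # one row per patch "token"

img = np.random.default_rng(0).integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
tokens = patchify(img)
print(tokens.shape)   # (196, 768): a 14 x 14 grid of patches, 16*16*3 numbers each
```

A \(224 \times 224\) image therefore becomes a sequence of 196 tokens, which is what self-attention then operates on.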

How the output works. The final layer produces one logit (raw score) per class. A softmax turns the logits into probabilities that sum to 1:

\[p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}\]

In words: exponentiate every class score so it becomes positive, then divide each by the total so the scores become a probability distribution that adds to 1.

Also written: \(\mathbf{p} = \operatorname{softmax}(\mathbf{z})\), the vectorized form over the whole logit vector \(\mathbf{z}\).

The predicted class is the one with the highest probability. We measure quality with top-1 accuracy (the single top guess is correct) and top-5 accuracy (the true label is among the model’s five most confident guesses). Top-5 is forgiving on datasets like ImageNet where many classes are near-synonyms (e.g. dog breeds).

Tip

Top-5 was popular precisely because ImageNet has 1000 fine-grained classes — many images legitimately contain several objects, so demanding the exact top guess understates a model that clearly “sees” the right thing.

A worked softmax over three classes:

import numpy as np
logits = np.array([2.0, 1.0, 0.1])          # raw scores for [cat, dog, fox]
exp = np.exp(logits - logits.max())          # subtract max for numerical stability
probs = exp / exp.sum()
print(probs.round(3))                         # [0.659 0.242 0.099]
print("top-1:", probs.argmax())               # 0 -> 'cat'
# top-5 here would just be all classes ranked; true label in top-5 == correct
assert abs(probs.sum() - 1.0) < 1e-9          # probabilities are normalized
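Top-k accuracy over a batch can be sketched the same way. The logits and labels below are invented purely for illustration, and `topk_accuracy` is a hypothetical helper; because softmax preserves ordering, ranking raw logits gives the same answer as ranking probabilities.

```python
import numpy as np

def topk_accuracy(logits, labels, k=1):
    """Fraction of rows whose true label is among the k highest-scoring classes."""
    topk = np.argsort(logits, axis=1)[:, -k:]          # indices of the k largest logits per row
    hits = (topk == labels[:, None]).any(axis=1)
    return hits.mean()

# Hypothetical logits for 4 images over 6 classes, plus their true labels.
logits = np.array([
    [4.0, 1.0, 0.5, 0.2, 0.1, 0.0],   # true 0: top guess is right
    [1.0, 3.0, 2.5, 0.2, 0.1, 0.0],   # true 2: second guess is right
    [0.0, 0.1, 0.2, 0.3, 0.4, 5.0],   # true 0: lowest score, outside the top 5
    [2.0, 1.9, 1.8, 1.7, 1.6, 0.0],   # true 4: fifth guess is right
])
labels = np.array([0, 2, 0, 4])
print(topk_accuracy(logits, labels, k=1))   # 0.25
print(topk_accuracy(logits, labels, k=5))   # 0.75
```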

Doing it for real with a pretrained model. In practice you almost never train a classifier from scratch — you grab a model pretrained on ImageNet and run it (or fine-tune it). A handful of lines with torchvision:

import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights
weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval()         # pretrained on ImageNet
preprocess = weights.transforms()                # resize/crop/normalize to match training
img = Image.open("cat.jpg").convert("RGB")       # force 3 channels (grayscale/RGBA files would break the model)
x = preprocess(img).unsqueeze(0)                 # add batch dim -> (1,3,224,224)
with torch.no_grad():                            # inference only: no gradients needed
    logits = model(x)
probs = logits.softmax(dim=1)
top5 = probs.topk(5)
for p, idx in zip(top5.values[0], top5.indices[0]):
    print(f"{weights.meta['categories'][int(idx)]:<20} {p.item():.3f}")

Face recognition is a specialized recognition task. Instead of classifying into a fixed list, it maps each face to a fixed-length embedding vector such that the same person’s faces land close together and different people land far apart. At inference you compare a new face’s embedding to a gallery by distance — so the system recognizes people it was never explicitly trained to classify (open-set recognition). Training uses losses like triplet loss, which pulls an anchor toward a positive (same person) and pushes it from a negative (different person):

\[\mathcal{L} = \max\big(0,\ \|a - p\|^2 - \|a - n\|^2 + \alpha\big)\]

In words: the loss is zero once the anchor is at least a margin \(\alpha\) closer to the same-person photo than to the different-person photo; otherwise it grows with how badly that gap is violated.

Also written: \(\mathcal{L} = [\,d(a,p) - d(a,n) + \alpha\,]_+\) with \(d(\cdot,\cdot)\) the squared Euclidean distance and \([\,\cdot\,]_+ = \max(0,\cdot)\) the hinge.

Here \(\alpha\) is the margin. The idea is geometric: think of every face as a point on a sphere (embeddings are usually normalized to unit length). Training tugs the anchor and positive together while shoving the negative at least a margin \(\alpha\) farther away. Once trained, “is this the same person?” becomes a simple distance check.
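The loss is short enough to write out directly. This is a minimal sketch for a single triplet of 2-D unit vectors; the points and the margin value `alpha=0.2` are illustrative assumptions, and a real system would compute this over batches of learned embeddings.

```python
import numpy as np

def triplet_loss(a, p, n, alpha=0.2):
    """Hinge on squared Euclidean distances: max(0, ||a-p||^2 - ||a-n||^2 + alpha)."""
    d_ap = np.sum((a - p) ** 2)
    d_an = np.sum((a - n) ** 2)
    return max(0.0, d_ap - d_an + alpha)

anchor   = np.array([1.0, 0.0])
positive = np.array([0.8, 0.6])     # same person, fairly close to the anchor
far_neg  = np.array([-1.0, 0.0])    # different person, far away
near_neg = np.array([0.8, -0.6])    # different person, as close as the positive

print(triplet_loss(anchor, positive, far_neg))    # 0.0: the margin is already satisfied
print(triplet_loss(anchor, positive, near_neg))   # about 0.2: the negative must be pushed away
```

Only the second triplet produces a gradient, which is why practical training pipelines spend effort mining “hard” negatives.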

The same embedding trick powers image retrieval and is the conceptual bridge to the metric learning used elsewhere in vision. Once faces are embeddings, “same person?” is one cosine-similarity call:

import torch
import torch.nn.functional as F
# emb_a, emb_b: 512-dim face embeddings from a model like FaceNet / ArcFace
emb_a = F.normalize(torch.randn(1, 512), dim=1)
emb_b = F.normalize(torch.randn(1, 512), dim=1)
sim = F.cosine_similarity(emb_a, emb_b).item()
print("same person" if sim > 0.6 else "different")   # 0.6 = tuned threshold

Transfer learning and fine-tuning

Think of a pretrained network as someone who already knows how to see — edges, textures, shapes — and only needs to be told which categories you care about. Transfer learning reuses a model trained on a huge dataset (ImageNet) as the starting point for your own, much smaller, problem. This is the single most important practical technique in vision: it routinely turns “I have 500 labelled images” from hopeless into a working model.

There are two common modes. Feature extraction freezes the pretrained backbone and trains only a fresh classifier head on top — cheap, fast, and a strong baseline when your data is scarce or very similar to ImageNet. Fine-tuning unfreezes some or all of the backbone and continues training at a small learning rate, letting the features adapt to your domain — better when you have more data or a domain far from natural photos (medical scans, satellite imagery).

import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights
model = resnet18(weights=ResNet18_Weights.DEFAULT)
for p in model.parameters():        # 1) freeze the backbone
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 3)   # 2) new head: 3 classes (only this trains)
# train as usual; optimizer sees only model.fc parameters.
# To fine-tune later: unfreeze with `p.requires_grad = True` and use a tiny LR (e.g. 1e-5).

Tip

Rule of thumb: little data + similar domain → freeze and train a head. More data or a distant domain → fine-tune with a learning rate roughly 10× smaller than you’d use from scratch, so you nudge the pretrained features rather than smashing them.

19.2 — Object Detection

Classification answers what; object detection answers what and where, for every object at once. The output is a list of bounding boxes — each a rectangle \((x, y, w, h)\) — paired with a class label and a confidence score. A single street photo might return ten boxes: three cars, two pedestrians, a traffic light. This is harder than classification because the number of objects is unknown and they can overlap.

The R-CNN family took the “propose then classify” route. The original R-CNN used an external algorithm to suggest ~2000 candidate regions, then ran a CNN on each — accurate but painfully slow. Fast R-CNN ran the CNN once over the whole image and cropped features per region. Faster R-CNN replaced the external proposer with a learned Region Proposal Network (RPN), making the whole thing end-to-end trainable. These two-stage detectors are accurate but heavier.

Single-stage detectors — YOLO (“You Only Look Once”) and SSD (“Single Shot Detector”) — skip the proposal step. They divide the image into a grid and predict boxes and classes directly in one forward pass. Faster, ideal for real-time/video, historically a touch less accurate on small objects — though modern YOLO versions have largely closed that gap.

	Two-stage (R-CNN family)	Single-stage (YOLO, SSD)
Pipeline	Propose regions, then classify	Predict boxes + classes in one pass
Speed	Slower	Fast, real-time capable
Accuracy	Historically higher, esp. small objects	Very close on modern versions
Typical use	Offline / precision-critical	Video, edge, live streams

flowchart TD
    A[Input image] --> B[Backbone CNN/ViT<br/>extract features]
    B --> C{Detector type}
    C -->|Two-stage| D[Region Proposal Network<br/>~candidate boxes]
    D --> E[Per-region classify + refine box]
    C -->|Single-stage| F[Dense grid prediction<br/>boxes + classes in one pass]
    E --> G[Raw boxes + scores]
    F --> G
    G --> H[NMS<br/>drop duplicate overlapping boxes]
    H --> I[Final detections]

Anchors are the trick that lets a grid predict variably-shaped objects. At each grid cell the model places several pre-defined box templates (anchor boxes) of different sizes and aspect ratios — a tall one for pedestrians, a wide one for cars — and learns offsets that nudge each anchor to fit the real object. This turns an open-ended “draw a box” problem into a bounded “adjust these templates” problem.
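The “adjust these templates” step can be sketched as follows. It assumes the center/size offset parameterization popularized by the R-CNN family (shift the center by a fraction of the anchor size, scale width and height through an exponential); the anchor shapes, the offsets, and the `decode` name are made up for illustration, and individual detectors differ in the details.

```python
import numpy as np

def decode(anchor, offsets):
    """Apply (dx, dy, dw, dh) offsets to an anchor given as (cx, cy, w, h)."""
    cx, cy, w, h = anchor
    dx, dy, dw, dh = offsets
    return np.array([
        cx + dx * w,        # shift the center by a fraction of the anchor width
        cy + dy * h,        # ... and by a fraction of the anchor height
        w * np.exp(dw),     # rescale width (exp keeps it positive)
        h * np.exp(dh),     # rescale height
    ])

# Hypothetical anchors sharing one grid-cell center (64, 64): tall, square, wide.
anchors = [(64, 64, 32, 96), (64, 64, 64, 64), (64, 64, 96, 32)]
offsets = (0.25, -0.5, 0.0, np.log(0.5))   # made-up network outputs for illustration
for a in anchors:
    print(a, "->", decode(a, offsets))
```

With all offsets at zero the anchor is returned unchanged, so the network only has to learn small corrections.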

Non-Maximum Suppression (NMS) cleans up the result. A confident object fires several overlapping boxes; NMS keeps the highest-scoring box and deletes any other box overlapping it too much (by IoU, below). The algorithm is dead simple:

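import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    # boxes: sequence of (x1, y1, x2, y2); scores: matching confidences.
    # Relies on the iou(a, b) helper defined in §19.3; usually run once per class.
    order = np.argsort(scores)[::-1].tolist()  # high score first
    keep = []
    while order:
        i = order.pop(0)                       # take best remaining box
        keep.append(i)
        # drop every remaining box that overlaps the kept one too much
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep                                # survivors = final detections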

Running a real detector now takes only a few lines with Ultralytics YOLO, a practical default for many detection projects:

from ultralytics import YOLO
model = YOLO("yolov8n.pt")                  # nano model, pretrained on COCO
results = model("street.jpg")               # runs detection + NMS internally
for box in results[0].boxes:
    cls = model.names[int(box.cls)]
    print(cls, box.conf.item(), box.xywh[0].tolist())   # label, confidence, (center x, center y, w, h)

We evaluate detectors with mAP (mean Average Precision): for each class you sweep the confidence threshold, trace a precision–recall curve, take the area under it (the Average Precision), then average over classes. A detection counts as correct only if its box overlaps the true box enough — which brings us to IoU.

The mAP formula is just an average of averages:

\[\text{mAP} = \frac{1}{C}\sum_{c=1}^{C} \text{AP}_c, \qquad \text{AP}_c = \int_0^1 p_c(r)\,dr\]

In words: for each class, measure the area under its precision-against-recall curve (that’s its Average Precision), then take the plain mean of those areas across all \(C\) classes.

Also written: in practice the integral is a finite sum over recall points, \(\text{AP}_c \approx \sum_k (r_k - r_{k-1})\,p_{\text{interp}}(r_k)\), using interpolated precision \(p_{\text{interp}}\).
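The finite-sum form of Average Precision can be sketched for one class as below. It assumes each detection has already been marked as a true or false positive by IoU matching, and it uses all-point interpolation (one common convention; benchmarks differ in how they sample recall). The scores, flags, and the `average_precision` name are illustrative.

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """Area under the interpolated precision-recall curve for one class."""
    order = np.argsort(scores)[::-1]                 # most confident detections first
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    recall = cum_tp / n_gt
    precision = cum_tp / np.arange(1, len(tp) + 1)
    # Interpolate: precision at recall r becomes the best precision at any recall >= r.
    p_interp = np.maximum.accumulate(precision[::-1])[::-1]
    r_prev = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - r_prev) * p_interp))

# Hypothetical class with 2 ground-truth objects and 4 detections.
scores = [0.9, 0.8, 0.7, 0.6]
is_tp  = [True, False, True, False]   # whether each detection matched a ground-truth box
ap = average_precision(scores, is_tp, n_gt=2)
print(ap)                             # 0.5 * 1.0 + 0.5 * (2/3), about 0.833
```

mAP is then the plain mean of this value over all classes.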

19.3 — Intersection over Union (IoU): a worked example

Intersection over Union (IoU) is the single most important number in detection and segmentation. It measures how well a predicted box matches a ground-truth box: the area where they overlap divided by the area they jointly cover.

\[\text{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}}\]

In words: of all the area the two boxes touch in total, what fraction do they both cover? All overlap and no slop gives 1; no overlap gives 0.

Also written: \(\text{IoU}(A,B) = \dfrac{|A \cap B|}{|A \cup B|} = \dfrac{|A \cap B|}{|A| + |B| - |A \cap B|}\).

IoU is 0 for boxes that don’t touch and 1 for a perfect match. A detection is usually called a true positive if IoU ≥ 0.5 against a ground-truth box of the right class (the PASCAL VOC convention; COCO-style mAP averages over IoU thresholds from 0.5 to 0.95).

Let’s compute one concretely. Predicted box A spans corners \((1,1)\) to \((4,4)\); ground-truth box B spans \((2,2)\) to \((5,5)\). Both are \(3 \times 3 = 9\) in area.

The overlap region runs from \((2,2)\) to \((4,4)\) — a \(2 \times 2\) square, so intersection = 4. The union is both areas minus the double-counted overlap: \(9 + 9 - 4 = 14\). Therefore:

\[\text{IoU} = \frac{4}{14} \approx 0.286\]

That is below 0.5, so this prediction would be scored a false positive despite clearly being close. Here it is in code:

def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)        # intersection corners
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    iw, ih = max(0, ix2 - ix1), max(0, iy2 - iy1)  # clamp at 0 if no overlap
    inter = iw * ih
    union = (ax2-ax1)*(ay2-ay1) + (bx2-bx1)*(by2-by1) - inter
    return inter / union if union else 0.0

assert abs(iou((1,1,4,4), (2,2,5,5)) - 4/14) < 1e-9   # 0.286, our worked case
assert iou((0,0,2,2), (5,5,7,7)) == 0.0               # disjoint -> 0

Warning

The intersection is not just “overlap width × height” eyeballed: you must take the max of the left/top corners and the min of the right/bottom corners, then clamp negative dimensions to zero. Without the clamp, boxes that don’t even touch get a nonzero “intersection”: negative when they are separated along one axis, and positive when they are separated along both (two negative dimensions multiply to a positive area), silently corrupting your mAP.

A subtle limit of plain IoU. If two boxes don’t overlap at all, IoU is flatly 0 — whether they’re a hair apart or on opposite sides of the image. That gives a gradient-based loss nothing to push on. Generalized IoU (GIoU) fixes this by subtracting a penalty for the empty space in the smallest box that encloses both, so the metric keeps decreasing as the boxes drift apart:

\[\text{GIoU} = \text{IoU} - \frac{|C \setminus (A \cup B)|}{|C|}\]

In words: start from IoU, then dock points for how much of the tightest enclosing box \(C\) is wasted gap that neither box fills — so even non-overlapping boxes get a meaningful, improvable score.

Also written: \(\text{GIoU} = \text{IoU} - \dfrac{|C| - |A \cup B|}{|C|}\), ranging in \([-1, 1]\) rather than IoU’s \([0,1]\).

This is a box-regression term used inside many modern detection losses (including the DETR matcher in §19.9).
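A minimal GIoU sketch for axis-aligned boxes in \((x_1, y_1, x_2, y_2)\) form follows. It assumes both boxes have positive area, and the `giou` name is illustrative. The last two calls show the point of the metric: both pairs have IoU 0, yet the farther pair scores lower.

```python
def giou(a, b):
    """Generalized IoU for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    iw = max(0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # C: the smallest axis-aligned box enclosing both a and b.
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (c_area - union) / c_area

print(giou((1, 1, 4, 4), (2, 2, 5, 5)))     # 4/14 - 2/16, about 0.161
print(giou((0, 0, 2, 2), (5, 5, 7, 7)))     # -41/49, about -0.837
print(giou((0, 0, 2, 2), (8, 8, 10, 10)))   # -92/100 = -0.92: farther apart, lower score
```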

19.4 — Image Segmentation

Segmentation is the most precise rung: instead of a box, it labels every pixel. There are three flavors, and the distinction matters.

Semantic segmentation labels each pixel with a class — every “car” pixel gets the same “car” label. It does not separate two cars touching each other; they merge into one blob of “car.”
Instance segmentation labels each pixel with a class and an object identity — car #1, car #2, car #3 are distinct masks. It typically ignores amorphous “stuff” like sky or road.
Panoptic segmentation unifies both: every pixel gets a class, and countable “things” (cars, people) also get instance IDs, while “stuff” (sky, grass) just gets a class. It is the complete pixel-level scene description.

flowchart LR
    A[Pixels] --> S[Semantic<br/>class per pixel<br/>cars merge]
    A --> I[Instance<br/>separate objects<br/>car#1, car#2]
    S --> P[Panoptic<br/>things get IDs<br/>+ stuff gets class]
    I --> P

U-Net is the classic semantic-segmentation architecture, born in biomedical imaging. It is an encoder–decoder: the encoder downsamples the image to capture what is present (losing spatial detail), the decoder upsamples back to full resolution to recover where. Its signature feature is skip connections that copy high-resolution features from encoder to decoder, so fine boundaries aren’t lost in the bottleneck — giving the network its U shape.

Mask R-CNN is the classic instance-segmentation model. It extends Faster R-CNN (from §19.2): for every detected box it adds a small extra branch that predicts a binary mask — a pixel-level “object / not object” map inside that box. So you get detection (box + class) and a precise mask per instance, which is exactly what instance segmentation needs.

Segmentation quality is also scored with IoU, but now over pixel sets rather than box areas — often reported as mean IoU (mIoU) across classes. The mechanics are identical to §19.3: count the pixels both masks agree on (intersection), divide by the pixels either mask claims (union). A tiny worked example:

import numpy as np
pred = np.array([[1,1,0],     # 1 = "car" pixel, 0 = background
                 [1,1,0],
                 [0,0,0]])
true = np.array([[1,1,0],
                 [1,0,0],
                 [0,1,0]])
inter = np.logical_and(pred==1, true==1).sum()   # pixels both call "car" -> 3
union = np.logical_or (pred==1, true==1).sum()    # pixels either calls "car" -> 5
print("pixel IoU =", inter/union)                  # 0.6
assert inter == 3 and union == 5

Tip

Pick the lightest task that solves your problem. If you only need to know whether a defect exists, classify. If you need to count objects, detect. Only reach for segmentation when you genuinely need the pixel outline (medical contours, photo background removal, autonomous-driving free-space) — it is the most expensive to label and train.

Dice loss and the class-imbalance problem

Imagine the object you want to outline is a tiny tumor filling 1% of a scan. A model that lazily predicts “background everywhere” is 99% pixel-accurate while being completely useless. This is the class-imbalance trap of segmentation, and it’s why per-pixel accuracy is a misleading score and plain cross-entropy can be a misleading loss.
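The trap is easy to reproduce. The sketch below builds a synthetic \(100 \times 100\) mask with a 1% foreground region (the sizes are chosen only to match the example above) and scores a predictor that outputs background everywhere.

```python
import numpy as np

truth = np.zeros((100, 100), dtype=bool)
truth[45:55, 45:55] = True                 # a 10x10 "tumor": 100 of 10,000 pixels (1%)
lazy_pred = np.zeros_like(truth)           # predicts background everywhere

accuracy = (lazy_pred == truth).mean()
inter = np.logical_and(lazy_pred, truth).sum()
union = np.logical_or(lazy_pred, truth).sum()
iou = inter / union                        # foreground IoU

print(f"pixel accuracy = {accuracy:.2f}")  # 0.99
print(f"foreground IoU = {iou:.2f}")       # 0.00
```

An overlap-based score exposes the failure that accuracy hides.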

The fix is to optimize overlap directly. The Dice coefficient is twice the intersection over the sum of the two areas (closely related to IoU), and Dice loss is simply one minus it:

\[\mathcal{L}_{\text{Dice}} = 1 - \frac{2\sum_i p_i\,g_i}{\sum_i p_i + \sum_i g_i}\]

In words: reward the model for the pixels where its predicted mask \(p\) and the ground-truth mask \(g\) both light up, scaled so that a perfect overlap drives the loss to 0 — and crucially, empty background pixels don’t inflate the score.

Also written: \(\mathcal{L}_{\text{Dice}} = 1 - \dfrac{2|P \cap G|}{|P| + |G|}\), with \(P,G\) the predicted and true pixel sets; the Dice score itself relates to IoU as \(\text{Dice} = \dfrac{2\,\text{IoU}}{1 + \text{IoU}}\).

import torch
def dice_loss(pred, target, eps=1e-6):       # pred: probabilities in [0,1]
    pred, target = pred.flatten(), target.flatten()
    inter = (pred * target).sum()
    return 1 - (2*inter + eps) / (pred.sum() + target.sum() + eps)

p = torch.tensor([0.9, 0.8, 0.1, 0.05])      # predicted foreground probs
g = torch.tensor([1.0, 1.0, 0.0, 0.0])       # ground truth
print(dice_loss(p, g).item())                 # small loss: good overlap

In medical and other imbalanced segmentation, Dice loss (often combined with cross-entropy) is the standard choice precisely because it ignores the easy ocean of background and focuses on the overlap you actually care about.

19.5 — OCR, Pose Estimation & Tracking

Three workhorse tasks turn the core three into real applications.

OCR (Optical Character Recognition) converts text in an image — a scanned page, a street sign, a receipt — into machine-readable characters. Modern OCR is a two-step pipeline: a text-detection stage (often a segmentation/detection model) finds where the text regions are, then a text-recognition stage reads each region into a string, usually with a CNN that extracts features followed by a sequence model decoding characters left-to-right. The standard training loss for the unsegmented “image strip → string” mapping is CTC (Connectionist Temporal Classification), which handles the fact that you don’t know in advance which pixel column produced which character — it lets the model emit a character at any column, then collapses repeats and blanks into the final string.

The CTC idea in one line: the loss sums the probability of every alignment that collapses to the target string.

\[p(\mathbf{y}\mid X) = \sum_{\pi \in \mathcal{B}^{-1}(\mathbf{y})} \prod_t p(\pi_t \mid X)\]

In words: to read “CAT” from six pixel columns, add up the probabilities of all the per-column emission paths (such as CC-AAT, -CAAT-, and C-A-TT, with “-” the blank) that collapse to “CAT” once repeats are merged and blanks are then removed.

Also written: \(\mathcal{L}_{\text{CTC}} = -\log p(\mathbf{y}\mid X)\), where \(\mathcal{B}\) is the collapse function mapping a per-frame path \(\pi\) to a deduplicated string.
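For a tiny case the sum over alignments can be checked by brute force. The sketch below enumerates every path, which is exponential in the number of frames and only feasible for toy inputs; practical CTC implementations use a dynamic-programming forward pass instead. The three-symbol alphabet, the uniform per-frame probabilities, and the helper names are assumptions for illustration.

```python
from itertools import product

BLANK = "-"

def collapse(path):
    """CTC collapse: merge runs of repeated symbols, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return "".join(out)

def ctc_prob(probs, alphabet, target):
    """Brute-force p(target | X): sum over every path that collapses to target."""
    total = 0.0
    for idx in product(range(len(alphabet)), repeat=len(probs)):
        path = "".join(alphabet[i] for i in idx)
        if collapse(path) == target:
            p = 1.0
            for t, i in enumerate(idx):
                p *= probs[t][i]       # per-frame outputs are treated as independent
            total += p
    return total

print(collapse("CC-AAT"), collapse("-CAAT-"), collapse("C-CAT"))   # CAT CAT CCAT
alphabet = [BLANK, "A", "B"]
uniform = [[1 / 3] * 3 for _ in range(3)]       # 3 frames, hypothetical uniform outputs
print(ctc_prob(uniform, alphabet, "AB"))        # 5 of 27 paths collapse to "AB"
```

Note how the blank in `C-CAT` keeps the two C’s separate: that is how CTC represents genuinely doubled letters.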

Pose estimation locates an object’s keypoints — for a human body, joints like wrists, elbows, knees — and connects them into a skeleton. The dominant approach predicts a heatmap per keypoint: a low-resolution probability map whose peak marks where that joint most likely is. Top-down methods detect each person first, then find keypoints inside each box; bottom-up methods find all keypoints in the image first, then group them into individuals. Pose powers fitness apps, motion capture, gesture interfaces, and sports analytics.
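Reading a joint location off a heatmap is an argmax plus a rescale. The sketch below assumes a heatmap at one quarter of the input resolution (`stride=4`) and maps the peak cell to its center in input pixels; both conventions vary between models, and the synthetic Gaussian heatmap and the `heatmap_to_keypoint` name are illustrative.

```python
import numpy as np

def heatmap_to_keypoint(heatmap, stride=4):
    """Return (x, y, confidence) in input-image pixels from one keypoint heatmap."""
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    # Map the heatmap cell back to the input image (cell center, given the stride).
    x = float((col + 0.5) * stride)
    y = float((row + 0.5) * stride)
    return x, y, float(heatmap[row, col])

# Hypothetical 16x16 heatmap for one joint, with a Gaussian bump at row 5, col 9.
rows, cols = np.mgrid[0:16, 0:16]
heatmap = np.exp(-((rows - 5) ** 2 + (cols - 9) ** 2) / (2 * 1.5 ** 2))
print(heatmap_to_keypoint(heatmap))   # (38.0, 22.0, 1.0)
```

A full pose model emits one such map per joint, and the peak value doubles as a per-joint confidence.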

Object tracking extends detection across time in a video: it assigns a persistent ID to each object so “car #7” stays “car #7” frame after frame, even through brief occlusion. The classic recipe is tracking-by-detection: run a detector every frame, then a tracker (e.g. SORT / DeepSORT) links this frame’s boxes to the previous frame’s tracks. Linking uses motion prediction (a Kalman filter guesses where each track moved) plus appearance matching (do the pixels/embeddings look like the same object?), solved as an assignment problem via IoU and feature similarity.

flowchart LR
    V[Video frames] --> D[Detect objects<br/>per frame]
    D --> P[Predict track motion<br/>Kalman filter]
    P --> M[Match detections↔tracks<br/>IoU + appearance]
    M --> U[Update IDs<br/>car#7 stays car#7]
    U --> D

The matching step at the heart of a tracker is exactly the assignment problem — link each detection to the existing track it best continues, minimizing total cost:

import numpy as np
from scipy.optimize import linear_sum_assignment
# cost[t, d] = 1 - IoU(track t's predicted box, detection d's box)
cost = np.array([[0.1, 0.8, 0.9],   # track 0 best matches detection 0
                 [0.7, 0.2, 0.85],  # track 1 best matches detection 1
                 [0.9, 0.75, 0.15]])# track 2 best matches detection 2
tracks, dets = linear_sum_assignment(cost)   # optimal one-to-one matching ("Hungarian matching"), as in DETR
for t, d in zip(tracks, dets):
    print(f"track {t} -> detection {d}  (cost {cost[t, d]:.2f})")

Warning

Tracking-by-detection inherits the detector’s mistakes. A single missed detection can break a track in two and hand the same object a brand-new ID — an ID switch — and crossing objects often swap IDs when only motion (IoU) is used to match. This is why appearance embeddings (DeepSORT) help through occlusions: when two boxes are equally plausible by position, “which one looks like car #7?” breaks the tie. If your track IDs flicker, the fix is usually a stronger detector or better appearance features, not a fancier matcher.

19.6 — Data Augmentation for Vision

Vision models are hungry for labelled data, which is expensive. Data augmentation manufactures extra training variety for free by applying label-preserving transformations to existing images — a flipped, brightened, slightly rotated cat is still a cat. This teaches the model invariance to nuisances it shouldn’t care about (lighting, position, scale) and is one of the cheapest, most reliable ways to cut overfitting.

The standard toolbox, from tame to aggressive:

Transform	What it does	Why it helps
Horizontal flip	Mirror left↔︎right	Object identity is flip-invariant (usually)
Random crop / resize	Zoom into a sub-region	Robustness to scale and framing
Rotation / shift	Small angle / translation	Position invariance
Color jitter	Perturb brightness, contrast, hue	Robustness to lighting/camera
Cutout	Erase a random patch	Forces use of multiple cues, not one
Mixup	Blend two images and their labels	Smoother decision boundaries
CutMix	Paste a patch of image B into image A	Combines cutout + mixup benefits

A subtle but critical rule: augment only the training set, never validation/test — those must reflect real, untouched data. And the transform must respect the label. For classification a flip is free; but for detection and segmentation you must transform the boxes/masks too — flip the image and the bounding-box coordinates have to flip with it, or you’ve just taught the model garbage.

import numpy as np
def hflip(img, boxes):
    img = img[:, ::-1]                          # mirror pixels horizontally
    W = img.shape[1]
    flipped = boxes.copy().astype(float)
    flipped[:, [0, 2]] = W - boxes[:, [2, 0]]   # x1,x2 -> W-x2, W-x1
    return img, flipped

img = np.zeros((4, 6))                           # 4 rows x 6 cols
b = np.array([[1, 1, 3, 3]])                     # box x1,y1,x2,y2
_, fb = hflip(img, b)
assert (fb == np.array([[3, 1, 5, 3]])).all()    # x flips: 1->5, 3->3; y unchanged

In practice, use a library so image and labels transform together and you don’t hand-roll geometry. Albumentations is the de facto standard and handles boxes/masks automatically:

import numpy as np
import albumentations as A
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
    A.ShiftScaleRotate(shift_limit=0.06, scale_limit=0.1, rotate_limit=15, p=0.5),
], bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]))

img = np.zeros((100, 100, 3), dtype=np.uint8)    # stand-in image; pass your own uint8 RGB array
out = transform(image=img, bboxes=[[10, 10, 30, 30]], labels=["cat"])
print(out["image"].shape, out["bboxes"], out["labels"])   # image AND boxes are transformed consistently

The Mixup trick is worth seeing as a formula, because it blends labels too — not just pixels:

\[\tilde{x} = \lambda x_a + (1-\lambda) x_b, \qquad \tilde{y} = \lambda y_a + (1-\lambda) y_b\]

In words: take a weighted average of two images and the same weighted average of their one-hot labels, so a 70%-cat/30%-dog blend is trained toward a 70/30 label rather than a hard choice.

Also written: with mixing weight \(\lambda \sim \text{Beta}(\alpha, \alpha)\), both the input and the target are convex combinations of the two examples.
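The formula translates almost line for line into code. This is a sketch for one pair of examples with random arrays standing in for images; `alpha=0.2` is an assumed hyperparameter, the `mixup` name is illustrative, and real pipelines usually mix a whole batch with a shuffled copy of itself.

```python
import numpy as np

def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=None):
    """Blend two examples and their one-hot labels with weight lam ~ Beta(alpha, alpha)."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x = lam * x_a + (1 - lam) * x_b
    y = lam * y_a + (1 - lam) * y_b
    return x, y, lam

rng = np.random.default_rng(0)
cat_img, dog_img = rng.random((8, 8, 3)), rng.random((8, 8, 3))   # stand-in images
cat_lbl, dog_lbl = np.array([1.0, 0.0]), np.array([0.0, 1.0])     # one-hot labels

x, y, lam = mixup(cat_img, cat_lbl, dog_img, dog_lbl, rng=rng)
print(lam, y)    # the label is [lam, 1 - lam], matching the pixel blend
```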

Warning

The classic augmentation bug is flipping the image but forgetting the label geometry. A horizontally flipped photo with un-flipped bounding boxes trains the detector to point at empty space. Always run augmentation through a library (Albumentations, torchvision transforms v2) that transforms image and targets together — or carefully pair them yourself, as above.

19.7 — Camera Models and Multi-View Geometry

Every image is a flattening. A camera takes a three-dimensional world and squashes it onto a flat sensor, and in doing so it throws away depth. Multi-view geometry is the study of how that flattening works mathematically — and how, by combining several flattened views, we can claw the lost third dimension back.

The pinhole camera

The simplest useful model is the pinhole camera: imagine a sealed box with a tiny hole on one face. Light from a point in the world travels in a straight line through the hole and lands on the opposite wall, the image plane. Because rays travel straight, the geometry is just similar triangles.

Place the pinhole (the camera center) at the origin, looking down the \(z\)-axis. A world point \((X, Y, Z)\) projects to the image plane at distance \(f\) (the focal length) as

\[x = f\frac{X}{Z}, \qquad y = f\frac{Y}{Z}.\]

In words: an object’s image position is its world position scaled down by how far away it is — twice as far means half as big on the sensor.

Also written: in homogeneous form \(\begin{bmatrix} x \\ y \\ 1\end{bmatrix} \sim \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1\end{bmatrix}\begin{bmatrix} X \\ Y \\ Z\end{bmatrix}\), where \(\sim\) means “equal after dividing by the last entry.”

That division by \(Z\) is the entire reason depth is lost: double the distance and double the size, and the projected point is identical. A toy car at \(1\text{ m}\) and a real car 20 times larger at \(20\text{ m}\) can paint the exact same pixels.
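The ambiguity takes two lines to demonstrate. The sketch below applies the projection formula with an assumed focal length of 1 to two made-up points, one of which is 20 times farther and 20 times larger than the other.

```python
def project(point, f=1.0):
    """Pinhole projection of a camera-frame point (X, Y, Z) with Z > 0."""
    X, Y, Z = point
    return (f * X / Z, f * Y / Z)

near_small = (0.5, 0.25, 1.0)      # a point 1 m away
far_large  = (10.0, 5.0, 20.0)     # 20x farther, with 20x larger offsets

print(project(near_small))   # (0.5, 0.25)
print(project(far_large))    # (0.5, 0.25): same image point, depth is gone
```

Every point along the same ray through the pinhole lands on the same image coordinate, so a single view cannot tell them apart.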

Intrinsics: from metric to pixels

The projection above lands in metric units on a plane centered on the optical axis. Real sensors index pixels from a corner, with possibly non-square pixels. The intrinsic matrix \(K\) bundles these conversions:

\[K = \begin{bmatrix} f_x & s & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}.\]

In words: \(K\) converts a metric ray direction into a pixel coordinate — scaling by the focal length and shifting the origin from the image center to the corner.

Also written: a pixel is \(u = f_x\,(X/Z) + c_x\) and \(v = f_y\,(Y/Z) + c_y\) (taking skew \(s = 0\)), the per-coordinate expansion of \(K\).

Here \(f_x, f_y\) are focal lengths in pixels (different if pixels aren’t square), \((c_x, c_y)\) is the principal point where the optical axis pierces the sensor (near the image center), and \(s\) is a skew term that is essentially always zero on modern cameras. These parameters are internal to the camera — they don’t change when the camera moves.

Extrinsics: placing the camera in the world

A camera also sits somewhere, pointed somehow. The extrinsics are a rotation \(R\) (a \(3\times3\) orientation) and translation \(t\) (a \(3\)-vector position) that map world coordinates into the camera’s own frame. Stacking intrinsics and extrinsics gives the full \(3 \times 4\) projection matrix, acting on homogeneous coordinates:

\[\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K\,[\,R \mid t\,] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}.\]

In words: first move the world point into the camera’s frame (\(R\) and \(t\)), then turn that into a pixel (\(K\)); the scalar \(\lambda\) is the depth you divide out at the end.

Also written: \(\mathbf{x} \sim P\,\mathbf{X}\) with the single \(3\times4\) projection matrix \(P = K\,[R \mid t]\) and \(\sim\) meaning “up to the scale \(\lambda\).”

The scalar \(\lambda\) is the depth that gets divided out. Homogeneous coordinates are the trick that turns the nonlinear “divide by \(Z\)” into a clean matrix multiply followed by a final normalization (divide the vector by its third entry).

import numpy as np
# tiny projection: a 1m cube corner at (0.5, 0.5, 5) meters
K = np.array([[800, 0, 320],   # fx, 0, cx
              [0, 800, 240],    # 0, fy, cy
              [0,   0,   1]])
R = np.eye(3)                   # camera aligned with world
t = np.zeros(3)                 # camera at origin
Xw = np.array([0.5, 0.5, 5.0])  # world point
x_cam = R @ Xw + t              # into camera frame
uvw = K @ x_cam                 # apply intrinsics
uv = uvw[:2] / uvw[2]           # normalize by depth
print(uv)                       # -> [400. 320.] pixels

The point at \(5\text{ m}\) lands at pixel \((400, 320)\): \(800 \times 0.5/5 = 80\) pixels right of the principal point \(c_x = 320\), exactly as similar triangles predict.

Calibration and lens distortion

Real lenses bend straight lines, especially wide-angle ones — the radial distortion that makes door frames bow outward (barrel) or pinch inward (pincushion). The standard model adds polynomial correction terms in the squared radius \(r^2 = x^2 + y^2\) from the principal point:

\[x_{\text{dist}} = x\,(1 + k_1 r^2 + k_2 r^4 + \cdots).\]

In words: push each point outward (or inward) by an amount that grows with its distance from the image center, which is why distortion is worst at the edges and zero at the middle.

Also written: \(r_{\text{dist}} = r\,(1 + k_1 r^2 + k_2 r^4 + \cdots)\) applied to the radius \(r\) of the undistorted point, equivalently to both \(x\) and \(y\) components.
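The radial model above can be sketched in a few lines. This is an illustration only, assuming points are given in normalized coordinates measured from the principal point and using made-up coefficients; the function name `apply_radial_distortion` is hypothetical:

```python
import numpy as np

def apply_radial_distortion(xy, k1, k2=0.0):
    """Distort points (N, 2) measured from the principal point."""
    xy = np.asarray(xy, dtype=float)
    r2 = np.sum(xy**2, axis=1, keepdims=True)   # squared radius per point
    factor = 1.0 + k1 * r2 + k2 * r2**2         # 1 + k1 r^2 + k2 r^4
    return xy * factor

pts = np.array([[0.0, 0.0],    # image center
                [0.1, 0.0],    # near the center
                [0.5, 0.5]])   # toward a corner
distorted = apply_radial_distortion(pts, k1=-0.2)   # negative k1 pulls points inward
shift = np.linalg.norm(distorted - pts, axis=1)
print(shift)   # zero at the center, growing toward the edge
```

The center point does not move at all, while the corner point moves by far the most, which matches the "worst at the edges" behavior described above.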

Calibration is the process of recovering \(K\) and the distortion coefficients \(k_1, k_2, \dots\). The classic recipe (Zhang’s method) photographs a checkerboard of known square size from many angles; because we know the true geometry of the board, we can solve for the camera parameters that best explain where its corners landed. OpenCV’s calibrateCamera does exactly this in a few lines.

import cv2, numpy as np
# object points: the checkerboard's true 3D corner grid (z=0 plane), one set per image
objp = np.zeros((6*9, 3), np.float32)
objp[:, :2] = np.mgrid[0:9, 0:6].T.reshape(-1, 2)
obj_points, img_points = [], []
for gray in calibration_images_grayscale:          # many views of the board
    found, corners = cv2.findChessboardCorners(gray, (9, 6))
    if found:
        obj_points.append(objp)
        img_points.append(corners)
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
print("intrinsics K =\n", K)        # recovered fx, fy, cx, cy
print("distortion  =", dist.ravel())# recovered k1, k2, p1, p2, k3

Warning

Skipping calibration silently poisons every downstream 3D task. Uncorrected radial distortion can shift a corner by tens of pixels at the image edge, which translates into large depth and pose errors in stereo and SfM. Calibrate once per lens/zoom setting and reuse it.

Two views: epipolar geometry

Point one camera at a scene, then a second camera from a different spot. A single point in the left image could have come from anywhere along its viewing ray — but that whole ray, seen by the right camera, projects to a line. This is the epipolar constraint: the match for a left-image point must lie on a specific line in the right image, never the whole 2D plane. It collapses matching from a 2D search to a 1D one.

Algebraically, for calibrated cameras the constraint is captured by the essential matrix \(E\), and for uncalibrated ones by the fundamental matrix \(F\):

\[x'^\top F\, x = 0,\]

In words: a true pair of matching points, plugged into this equation, gives exactly zero — which is just the algebra of “the match lies on the epipolar line \(F x\).”

Also written: equivalently \(x'^\top (F x) = 0\), i.e. the right point \(x'\) lies on the line \(\ell' = F x\); for calibrated cameras the same holds with \(E\) on normalized coordinates.

where \(x\) and \(x'\) are corresponding points in homogeneous pixel coordinates. Given eight or more matches you can solve for \(F\) linearly (the eight-point algorithm), then form the essential matrix \(E = K'^\top F K\) and decompose it to recover the relative rotation and translation between the two cameras (the translation only up to scale, since two views alone cannot fix absolute size).
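To make the constraint concrete, here is a synthetic check under stated assumptions: both cameras share the same made-up intrinsics, the second camera's frame is related to the first by \(X' = R X + t\), and the essential matrix is built as \(E = [t]_\times R\). The pose values are invented for illustration:

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])           # shared intrinsics (assumed)
theta = np.deg2rad(5.0)                    # small rotation about the y-axis
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([-0.1, 0.0, 0.0])             # relative translation (assumed)

E = skew(t) @ R                            # essential matrix
K_inv = np.linalg.inv(K)
F = K_inv.T @ E @ K_inv                    # fundamental matrix

X = np.array([0.3, -0.2, 4.0])             # 3D point in the left camera frame
x_left = K @ X                             # left camera:  K [I | 0]
x_right = K @ (R @ X + t)                  # right camera: K [R | t]
x_left /= x_left[2]                        # homogeneous pixel coordinates
x_right /= x_right[2]

residual = x_right @ F @ x_left            # should vanish for a true match
print(residual)
```

The residual is zero up to floating-point error, and `K.T @ F @ K` gives back `E`, the relation used in the text with \(K' = K\).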

flowchart LR
  A[World point] --> B[Left image: point x]
  A --> C[Right image: point x']
  B -. "epipolar line in right view" .-> C
  B --> D["x'ᵀ F x = 0<br/>(must hold)"]
  C --> D

Stereo and depth

If the two cameras are mounted side by side with parallel optical axes — a rectified stereo rig — the epipolar lines become horizontal scanlines, and a matching point simply shifts left or right by an amount called the disparity \(d\). Disparity is inversely proportional to depth:

\[Z = \frac{f \cdot B}{d},\]

In words: the more a point jumps between the left and right views, the closer it is; far-away points barely shift, so disparity and depth trade off inversely.

Also written: \(d = \dfrac{f B}{Z}\) — the same relation solved for disparity instead of depth.

where \(B\) is the baseline (distance between the cameras) and \(f\) the focal length in pixels. Near objects shift a lot between the two views; far objects barely move — exactly why your two eyes sense depth, and why the moon seems to follow the car.

Worked example: cameras with \(f = 800\text{ px}\) and baseline \(B = 0.1\text{ m}\). A point with disparity \(d = 16\text{ px}\) sits at \(Z = 800 \times 0.1 / 16 = 5\text{ m}\). Halve the disparity to \(8\text{ px}\) and the depth doubles to \(10\text{ m}\) — note that the same one-pixel matching error costs far more depth accuracy far away, the fundamental weakness of stereo at range.

Disparity \(d\) (px)	Depth \(Z\) (m)	Depth error per ±1 px (m)
80	1.0	±0.01
16	5.0	±0.31
8	10.0	±1.25
4	20.0	±5.0
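The table's rows can be reproduced with a short sketch. It assumes the error column is the first-order sensitivity \(|\partial Z / \partial d| = Z^2 / (f B)\), which matches the listed values; the function names are illustrative:

```python
def depth_from_disparity(d_px, f_px=800.0, baseline_m=0.1):
    """Z = f * B / d for a rectified stereo pair."""
    return f_px * baseline_m / d_px

def depth_error_per_pixel(d_px, f_px=800.0, baseline_m=0.1):
    """First-order depth change for a 1 px disparity error: Z^2 / (f * B)."""
    z = depth_from_disparity(d_px, f_px, baseline_m)
    return z * z / (f_px * baseline_m)

for d in (80, 16, 8, 4):
    z = depth_from_disparity(d)
    err = depth_error_per_pixel(d)
    print(f"d = {d:>2} px   Z = {z:5.1f} m   error per px = {err:.2f} m")
```

Because the error grows with \(Z^2\), doubling the depth quadruples the cost of the same one-pixel mistake.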

Structure from Motion

Structure from Motion (SfM) generalizes stereo to many uncalibrated images — a tourist’s photos of a cathedral, a drone’s overflight — and recovers both the 3D scene and every camera pose at once. The pipeline:

flowchart LR
  A[Many images] --> B[Detect + match<br/>features SIFT]
  B --> C[Estimate pairwise<br/>geometry F/E]
  C --> D[Triangulate points,<br/>chain camera poses]
  D --> E[Bundle adjustment:<br/>jointly refine all]
  E --> F[Sparse 3D point cloud<br/>+ camera poses]

The crucial final stage is bundle adjustment: a big nonlinear least-squares optimization that jiggles every 3D point and every camera pose simultaneously to minimize reprojection error — the pixel gap between where each 3D point lands when projected and where its feature was actually observed. Formally it minimizes

\[\sum_{i,j} \left\lVert\, \text{proj}(C_i, P_j) - x_{ij} \,\right\rVert^2\]

In words: for every camera \(i\) and every 3D point \(j\) it can see, project the point and measure the pixel distance to where the feature was actually detected; add up all those squared gaps and shrink the total.

Also written: \(\min_{\{C_i\},\{P_j\}} \sum_{i,j} v_{ij}\,\lVert \pi(C_i, P_j) - x_{ij}\rVert^2\), where \(\pi\) is the projection function and \(v_{ij}\in\{0,1\}\) flags whether point \(j\) is visible in camera \(i\).

over all cameras \(C_i\) and points \(P_j\). It is the workhorse behind tools like COLMAP and photogrammetry apps, and the front end that feeds the neural reconstruction methods in the next section.
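As a rough sketch, the cost that bundle adjustment minimizes can be evaluated like this. The code only computes the objective for a tiny made-up scene (two cameras, two points, shared intrinsics); a real system would hand the residuals to a nonlinear least-squares solver. All names and values here are illustrative assumptions:

```python
import numpy as np

def project(K, R, t, X):
    """Project world points X (N, 3) to pixels (N, 2) with camera (K, R, t)."""
    Xc = X @ R.T + t                 # world frame -> camera frame
    uvw = Xc @ K.T                   # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]  # divide out depth

def reprojection_cost(K, cameras, points, observations):
    """Sum of squared pixel residuals over (camera i, point j, observed uv)."""
    total = 0.0
    for i, j, uv in observations:
        R, t = cameras[i]
        pred = project(K, R, t, points[j:j + 1])[0]
        total += float(np.sum((pred - np.asarray(uv)) ** 2))
    return total

K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
cameras = [(np.eye(3), np.zeros(3)),
           (np.eye(3), np.array([-0.1, 0.0, 0.0]))]
points = np.array([[0.5, 0.5, 5.0], [-0.2, 0.1, 4.0]])

# Synthetic observations generated from the true scene, so the true cost is zero.
observations = [(i, j, project(K, *cameras[i], points[j:j + 1])[0])
                for i in range(2) for j in range(2)]

cost_exact = reprojection_cost(K, cameras, points, observations)
cost_perturbed = reprojection_cost(
    K, cameras, points + np.array([0.0, 0.0, 0.2]), observations)
print(cost_exact, cost_perturbed)
```

Pushing every point 20 cm deeper raises the cost from zero, which is the signal the optimizer follows back toward the correct structure.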

Tip

Reprojection error is the universal sanity check in geometric vision. If your calibration, pose, or triangulation is good, projected points land within a pixel or two of their observations. A reprojection error of tens of pixels means something upstream is wrong — bad matches, bad calibration, or a mirror-image pose from an \(E\)-decomposition sign flip.

19.8 — 3D and Video Understanding

The previous section recovered geometry as points — sparse clouds and camera poses. But we often want something richer: a model you can render from a brand-new viewpoint, or an understanding of how a scene changes through time. This section covers two frontiers: neural 3D scene representations, and models that reason about video.

From point clouds to neural fields

Classical reconstruction gives you discrete primitives — points, meshes, voxels. The neural approach instead represents a scene as a continuous function: feed in a 3D location (and maybe a viewing direction), get back what’s there. The scene lives in the weights of a small neural network. This is a neural implicit representation or neural field.

NeRF: scenes as radiance fields

A Neural Radiance Field (NeRF) is the breakout example. It trains a small MLP to map a 5D input — position \((x, y, z)\) plus viewing direction \((\theta, \phi)\) — to a color \((r, g, b)\) and a volume density \(\sigma\) (how much light that point blocks or emits):

\[F_\Theta : (x, y, z, \theta, \phi) \;\longmapsto\; (r, g, b, \sigma).\]

In words: a single small network answers, for any point in space seen from any direction, “what color is here and how solid is it?”

Also written: \(F_\Theta(\mathbf{p}, \mathbf{d}) = (\mathbf{c}, \sigma)\) with position \(\mathbf{p}\in\mathbb{R}^3\), unit view direction \(\mathbf{d}\), color \(\mathbf{c}=(r,g,b)\), and density \(\sigma\).

To render one pixel, picture looking into fog along a straight line of sight. You shoot a ray from the camera through that pixel and take a series of small steps along it (ray marching), asking the network at each step “what color and how solid is it here?” Then you blend those steps front-to-back: the nearest solid stuff you hit gets the most say, and anything hidden behind it counts for almost nothing — exactly how the foreground of fog hides the background. That blend is volume rendering. Each step’s contribution is its color, dimmed by two things: how opaque the step itself is, and how much light still made it that far without being blocked (the transmittance \(T_i\)):

\[C(\mathbf{r}) = \sum_i T_i\,\big(1 - e^{-\sigma_i \delta_i}\big)\,c_i, \qquad T_i = e^{-\sum_{j<i}\sigma_j \delta_j},\]

In words: the pixel’s color is the sum of each sample’s color, weighted by how opaque that sample is and by how much light survived all the stuff in front of it.

Also written: with per-sample opacity \(\alpha_i = 1 - e^{-\sigma_i\delta_i}\) and accumulated transmittance \(T_i = \prod_{j<i}(1-\alpha_j)\), this is the standard front-to-back alpha-compositing sum \(C = \sum_i T_i \alpha_i c_i\).

where \(\delta_i\) is the spacing between samples. The whole pipeline is differentiable, so the only training signal needed is a set of photos with known poses: render each pixel, compare to the real photo, backpropagate into the MLP weights.
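The compositing sum can be sketched for a single ray in NumPy. The densities and colors below are made up (two empty samples, then a dense red sample in front of a dense blue one), and `composite_ray` is an illustrative name rather than any library's API:

```python
import numpy as np

def composite_ray(sigmas, deltas, colors):
    """Front-to-back compositing along one ray.

    sigmas: (N,) densities, deltas: (N,) step sizes, colors: (N, 3).
    Returns the pixel color and the per-sample weights T_i * alpha_i.
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)                  # per-sample opacity
    # T_i = prod_{j<i} (1 - alpha_j); the first sample sees all the light.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas
    return weights @ colors, weights

sigmas = np.array([0.0, 0.0, 50.0, 50.0])
deltas = np.full(4, 0.25)
colors = np.array([[0.0, 1.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 0.0, 0.0],    # dense red sample
                   [0.0, 0.0, 1.0]])   # dense blue sample hidden behind it
pixel, weights = composite_ray(sigmas, deltas, colors)
print(pixel, weights)
```

The empty samples get zero weight, the red sample takes almost all of it, and the blue sample behind it contributes almost nothing, so the pixel comes out essentially red.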

flowchart LR
  A[Camera ray<br/>per pixel] --> B[Sample N points<br/>along ray]
  B --> C["MLP F_Θ:<br/>(x,y,z,dir) → (rgb, σ)"]
  C --> D[Volume render:<br/>composite samples]
  D --> E[Predicted pixel color]
  E --> F[Loss vs. real photo]
  F -. backprop .-> C

Two details make NeRF actually work. Positional encoding maps the raw coordinates through high-frequency sinusoids before the MLP, because a plain network is biased toward smooth functions and would blur fine detail. And conditioning color on viewing direction lets NeRF capture view-dependent effects like specular highlights and reflections — the glint that moves as you walk around a teapot.
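A minimal sketch of a sinusoidal positional encoding, assuming frequencies \(2^k \pi\) for \(k = 0, \dots, L-1\) and a small \(L = 4\) for readability. The ordering of the output features is an arbitrary choice here and may differ from any particular implementation:

```python
import numpy as np

def positional_encoding(p, num_freqs=4):
    """Map coordinates p (..., D) to sin/cos features at frequencies 2^k * pi."""
    p = np.asarray(p, dtype=float)
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi            # (L,)
    angles = p[..., None] * freqs                            # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*p.shape[:-1], -1)                    # (..., D * 2L)

a = positional_encoding(np.array([0.500, 0.2, 0.1]))
b = positional_encoding(np.array([0.505, 0.2, 0.1]))
print(a.shape, np.abs(a - b).max())
```

Two positions that differ by only 0.005 in raw coordinates end up much further apart in the encoded features, which gives the MLP a chance to represent fine detail.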

Warning

Vanilla NeRF is slow and rigid: training can take hours per scene, rendering is far from real-time, and the model captures one static scene under fixed lighting. It does not generalize across scenes out of the box. Follow-ups (Instant-NGP’s hash grids, Mip-NeRF, dynamic and generalizable variants) chip away at each limitation, but “a NeRF” usually means “retrained per scene.”

3D Gaussian Splatting

3D Gaussian Splatting (3DGS, 2023) attacks NeRF’s speed problem by ditching the implicit MLP for an explicit set of primitives. The scene is represented as millions of tiny 3D Gaussian blobs, each with a position, a covariance (its shape and orientation as a fuzzy ellipsoid), an opacity, and a view-dependent color. Rendering doesn’t march rays — it splats: each Gaussian is projected onto the image and the overlapping blobs are alpha-blended, an operation GPUs do extremely fast.

The payoff is dramatic: comparable or better visual quality than NeRF, but rendering at real-time frame rates and much faster training. Because the primitives are explicit, they’re also easier to edit and animate than weights buried in an MLP.

	NeRF	3D Gaussian Splatting
Representation	Implicit MLP weights	Explicit 3D Gaussians
Rendering	Ray marching (slow)	Rasterize/splat (real-time)
Training speed	Hours	Minutes
Editing	Hard (opaque weights)	Easier (explicit blobs)
Memory	Tiny network	Large blob set

Both NeRF and 3DGS typically take the camera poses from an SfM run (§19.7) as their starting point: geometry feeds the neural reconstruction, exactly as foreshadowed.

Understanding video: adding the time axis

A video is a stack of frames — a \((T \times H \times W \times 3)\) tensor — and the central new challenge is motion. A single frame can tell a “sitting” from a “standing” pose, but distinguishing “sitting down” from “standing up” needs the temporal order. Three families of architectures handle this.

3D CNNs extend convolution into time. Where a 2D kernel slides over \((H, W)\), a 3D convolution kernel slides over \((T, H, W)\), learning spatiotemporal patterns (an edge that moves, a hand that opens) in one operation. C3D and I3D are the classics; I3D “inflates” pretrained 2D ImageNet filters into 3D. The cost is real: a \(3\times3\times3\) kernel has \(27\) weights versus \(9\) for a \(3\times3\) one, and the input is a whole clip, so compute and memory balloon.

Two-stream networks split appearance from motion explicitly. One CNN stream sees the raw RGB frames (what things look like); a second stream sees pre-computed optical flow (the per-pixel motion field — how things move). Their predictions are fused at the end. This cleanly separates the two cues and was a strong early approach, at the cost of computing optical flow as a preprocessing step.

flowchart TB
  subgraph TwoStream[Two-stream network]
    A[RGB frame] --> B[Spatial CNN<br/>appearance]
    C[Optical flow stack] --> D[Temporal CNN<br/>motion]
    B --> E[Fuse]
    D --> E
    E --> F[Action label]
  end

Video transformers (ViViT, TimeSformer, VideoMAE) tokenize the clip into spatiotemporal patches, little cubes of pixels across a few frames, and apply self-attention. The key trick in several of them is factorized attention: attending over all space-time patches at once is quadratic in the number of tokens and ruinously expensive, so models such as TimeSformer and the factorized ViViT variants alternate spatial attention (within a frame) and temporal attention (across frames at the same location), slashing the cost while still mixing information across the whole clip. Video transformers now dominate action-recognition benchmarks, mirroring the image-classification story from earlier in the chapter.
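A back-of-the-envelope count shows why factorizing helps. The sketch below counts query-key pairs only, assuming a hypothetical clip of 8 temporal positions with 196 patches each (a 14 x 14 grid); real models differ in tokenization and in other costs:

```python
def attention_pairs(frames, patches_per_frame):
    """Count query-key pairs for joint vs. factorized space-time attention."""
    tokens = frames * patches_per_frame
    joint = tokens ** 2                              # every token attends to every token
    spatial = frames * patches_per_frame ** 2        # within each frame
    temporal = patches_per_frame * frames ** 2       # same location across frames
    return joint, spatial + temporal

joint, factorized = attention_pairs(frames=8, patches_per_frame=196)
print(joint, factorized, round(joint / factorized, 2))
```

For this configuration the factorized scheme needs roughly 7.7 times fewer pairs, and the gap widens as the clip gets longer.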

Tip

The recurring tension in video is the accuracy–compute tradeoff. Processing every frame at full resolution is wasteful because adjacent frames are nearly identical. Practical systems sample sparse frames or clips, use lower spatial resolution, or factorize attention. When a video model feels too slow, the first lever is almost always how many frames you actually feed it.

19.9 — Transformer-Based Detection and Panoptic Segmentation

The detection methods earlier in this chapter (R-CNN, YOLO, SSD) all lean on hand-designed machinery: region proposals or a dense grid of anchor boxes, plus non-maximum suppression (NMS) afterward to delete the flood of duplicate detections each object attracts. It works, but it’s a pile of heuristics with knobs to tune. Transformer-based detection asks: what if the network just outputs the final set of objects directly, no anchors, no NMS?

DETR: detection as set prediction

DETR (DEtection TRansformer, 2020) reframes detection as set prediction. A CNN backbone extracts image features; a transformer encoder-decoder processes them; and a fixed number of learned object queries (say 100) each emerge from the decoder as one prediction — a box and a class, where “no object” (\(\varnothing\)) is a valid class. The output is the final answer. No grid of anchors, no NMS.

flowchart LR
  A[Image] --> B[CNN backbone]
  B --> C[Transformer<br/>encoder]
  C --> D[Transformer decoder<br/>+ N object queries]
  D --> E[N predictions:<br/>box + class each]
  E --> F[Bipartite matching<br/>to ground truth]
  F --> G[Loss]

The key idea: bipartite matching

How do you train a model that emits an unordered set of 100 predictions against, say, 3 ground-truth objects? You can’t compare prediction #7 to “the dog” by position, because there’s no fixed ordering. DETR’s answer is bipartite matching: find the one-to-one assignment between predictions and ground-truth objects that minimizes total cost, using the Hungarian algorithm. Each ground-truth object gets matched to exactly one prediction; every unmatched prediction is trained to say “no object.”

The matching objective is a single argmin over all possible one-to-one pairings:

\[\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^{N} \mathcal{L}_{\text{match}}\big(y_i,\ \hat{y}_{\sigma(i)}\big)\]

In words: consider every way to pair each ground-truth object with one prediction, and keep the pairing whose total mismatch cost (wrong class plus badly-placed box) is smallest.

Also written: \(\hat\sigma\) is the optimal permutation, found by the Hungarian algorithm on the \(N\times N\) cost matrix in polynomial time rather than checking all \(N!\) pairings.

This matching is why DETR needs no NMS. Because the loss permits only one prediction per object, the model is actively trained not to produce duplicates — the deduplication that NMS used to do by hand is now baked into the objective.

Tiny worked example. Suppose the model emits 3 predictions \(P_1, P_2, P_3\) and there are 2 ground-truth boxes \(G_1, G_2\) (plus \(\varnothing\)). The matcher builds a cost matrix combining class mismatch and box error:

	\(G_1\) (cat)	\(G_2\) (dog)
\(P_1\)	0.2	0.9
\(P_2\)	0.8	0.3
\(P_3\)	0.7	0.6

The Hungarian algorithm picks the minimum-total assignment: \(P_1 \to G_1\) (0.2) and \(P_2 \to G_2\) (0.3), total cost \(0.5\). \(P_3\) is left over and supervised toward \(\varnothing\). The matching cost typically blends a class term with a box term (an \(L_1\) distance plus the generalized IoU introduced earlier), so a prediction is rewarded for being both the right class and well-localized.

import numpy as np
from scipy.optimize import linear_sum_assignment
# cost[i,j] = cost of matching prediction i to ground-truth j
cost = np.array([[0.2, 0.9],
                 [0.8, 0.3],
                 [0.7, 0.6]])
rows, cols = linear_sum_assignment(cost)   # minimum-cost one-to-one assignment
for r, c in zip(rows, cols):
    print(f"pred {r} -> gt {c}  (cost {cost[r,c]})")
# matches found: pred 0 -> gt 0 (0.2) and pred 1 -> gt 1 (0.3)
# pred 2 never appears in rows: it is unmatched and supervised toward ∅

In practice you rarely implement DETR by hand — a pretrained one runs in a few lines with Hugging Face Transformers, NMS-free out of the box:

from transformers import DetrImageProcessor, DetrForObjectDetection
from PIL import Image
import torch
proc = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50").eval()
image = Image.open("street.jpg").convert("RGB")   # force 3 channels (handles grayscale/RGBA files)
inputs = proc(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
# post-process turns 100 object queries into final boxes; no NMS step needed
results = proc.post_process_object_detection(
    outputs, target_sizes=[image.size[::-1]], threshold=0.9)[0]
for label, box in zip(results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], box.tolist())

DETR’s tradeoffs are honest: it is beautifully simple and end-to-end, but the original was slow to converge (hundreds of training epochs) and weak on small objects. Follow-ups — Deformable DETR (attend to a few key points instead of all pixels), DINO, and others — fixed convergence and accuracy, and the query-based paradigm has since spread well beyond boxes.

Panoptic segmentation: one map for everything

Earlier the chapter drew a line between two segmentation tasks. Semantic segmentation labels every pixel with a class but doesn’t separate instances — all cars share one “car” blob. Instance segmentation (Mask R-CNN) separates individual objects but only the countable “things,” ignoring amorphous background “stuff” like sky, road, or grass.

Panoptic segmentation unifies them: every pixel gets both a class and, for countable things, an instance id. “Stuff” classes get a class label with no instance distinction; “thing” classes get separated into individual instances. One coherent, complete labeling of the image — nothing left unexplained, no pixel double-counted.

Task	Stuff (sky, road)	Things (cars, people)	Per-pixel?
Semantic	classed	classed, not separated	yes
Instance	ignored	separated	no (only things)
Panoptic	classed	separated	yes

Mask2Former: one architecture, all three tasks

The transformer set-prediction idea turns out to be the natural way to do this. MaskFormer and its successor Mask2Former (2022) recast all segmentation as mask classification: instead of labeling each pixel independently, the model predicts a set of \(N\) binary masks, each paired with a class label — exactly the set-prediction framing DETR used for boxes, now producing masks.

Each object query decodes into one mask plus its class. Semantic, instance, and panoptic segmentation then differ only in how you interpret and merge those mask-class pairs at inference — the network and training objective stay identical. Mask2Former’s key efficiency trick is masked attention: each query’s cross-attention is restricted to the region of its own predicted mask rather than the whole image, which both speeds training and sharpens localization.
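One common way to read a semantic map out of mask-class pairs is to weight each query's mask by its class probabilities and take the best class per pixel. The sketch below illustrates that idea with made-up numbers; it is not a reproduction of Mask2Former's actual inference code, and the function name is hypothetical:

```python
import numpy as np

def semantic_from_mask_queries(class_probs, mask_probs):
    """Merge N (mask, class) pairs into one semantic map.

    class_probs: (N, C) per-query class probabilities (no-object column dropped).
    mask_probs:  (N, H, W) per-query mask probabilities.
    Returns an (H, W) array of class indices.
    """
    scores = np.einsum("nc,nhw->chw", class_probs, mask_probs)  # per-class evidence
    return scores.argmax(axis=0)

class_probs = np.array([[0.9, 0.1],     # query 0 is mostly class 0
                        [0.2, 0.8]])    # query 1 is mostly class 1
mask_probs = np.array([
    [[0.9, 0.9, 0.1], [0.9, 0.9, 0.1]],   # query 0 covers the left two columns
    [[0.1, 0.1, 0.9], [0.1, 0.1, 0.9]],   # query 1 covers the right column
])
semantic = semantic_from_mask_queries(class_probs, mask_probs)
print(semantic)
```

Instance and panoptic outputs would instead keep the queries separate and assign each pixel to a single query, but the predicted mask-class pairs are the same.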

Tip

The throughline of this whole section: set prediction with bipartite matching is a general recipe, not a detection-only trick. Swap “box” for “mask” and the same machinery does panoptic segmentation; the same idea extends to pose, tracking, and more. One conceptual hammer, many nails — which is exactly why the query-based transformer view reshaped so much of vision after 2020.

19.10 — Promptable Segmentation and Vision Foundation Models

For most of this chapter, a vision model has been a specialist: you train it on your classes, and it knows only those. The recent shift is toward foundation models — single models trained on internet-scale data that handle open-ended inputs, including categories they were never explicitly labeled with. Two ideas dominate.

Open-vocabulary recognition with CLIP

Think of teaching a child by showing pictures next to their captions, never with a fixed multiple-choice answer sheet. CLIP (Contrastive Language–Image Pretraining) learns exactly this way: it trains an image encoder and a text encoder together on hundreds of millions of (image, caption) pairs, so that an image and its true caption land close together in a shared embedding space, while mismatched pairs are pushed apart.

The payoff is zero-shot classification: to classify into any set of categories — even ones invented at test time — you embed the candidate label strings (“a photo of a cat”, “a photo of a dog”), embed the image, and pick the nearest. No retraining, no fixed class list.

import torch, clip
from PIL import Image
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open("pet.jpg")).unsqueeze(0).to(device)   # inputs must sit on the model's device
labels = ["a photo of a cat", "a photo of a dog", "a photo of a rabbit"]
text = clip.tokenize(labels).to(device)
with torch.no_grad():
    logits_per_image, _ = model(image, text)     # similarity image↔each label
    probs = logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))       # zero-shot class probabilities

The training objective is a contrastive loss: across a batch of \(N\) pairs, the correct image–text pairing should score highest among all \(N\) candidates for each image (and each text).

\[\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{\,\text{sim}(I_i, T_i)/\tau}}{\sum_{j=1}^{N} e^{\,\text{sim}(I_i, T_j)/\tau}}\]

In words: for each image, treat its own caption as the right answer among all captions in the batch, and push the model to score that pairing highest — a softmax over similarities, sharpened by temperature \(\tau\).

Also written: this is exactly cross-entropy with the matched index as the label, computed symmetrically over rows (image→text) and columns (text→image) of the \(N\times N\) similarity matrix.

A tiny worked example of the contrastive idea. Say a batch has 2 pairs and the model’s image-to-text similarities (after temperature scaling), listed against [caption 1, caption 2], are \([3.0, 0.5]\) for image 1 and \([0.2, 2.4]\) for image 2, so each image scores its own caption highest. For image 1 the softmax over \([3.0, 0.5]\) puts \(\approx 0.92\) on the correct caption, so its loss is \(-\log 0.92 \approx 0.08\): small, because the right pairing already wins. If image 2 had instead scored \([1.9, 2.0]\) (correct caption barely ahead), its softmax would give only \(\approx 0.525\) to the correct caption and \(\approx 0.475\) to the nearly tied wrong one, for a much larger loss of \(-\log 0.525 \approx 0.64\) that drags the true pair closer together on the next step. The loss is large exactly when the matching pair is not clearly the most similar.
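The arithmetic in this example can be checked with a short sketch. It computes one row of the contrastive loss from the correct pair's score and the competing scores; the helper name `row_loss` is illustrative:

```python
import numpy as np

def row_loss(correct_score, other_scores):
    """Softmax probability of the correct pairing and its cross-entropy loss."""
    scores = np.array([correct_score, *other_scores], dtype=float)
    shifted = scores - scores.max()                  # stabilize the exponentials
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return probs[0], -np.log(probs[0])

p_clear, loss_clear = row_loss(3.0, [0.5])   # correct caption clearly ahead
p_tie, loss_tie = row_loss(2.0, [1.9])       # correct caption barely ahead
print(f"clear win: p = {p_clear:.3f}, loss = {loss_clear:.3f}")
print(f"near tie:  p = {p_tie:.3f}, loss = {loss_tie:.3f}")
```

A clear win gives a probability near 0.92 and a loss near 0.08, while the near tie gives a probability near 0.525 and a loss near 0.64, about eight times larger.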

CLIP embeddings underpin much of modern multimodal AI (the multimodal AI chapter covers the full story) and are the text-grounding behind open-vocabulary detectors and segmenters.

Promptable segmentation with SAM

The Segment Anything Model (SAM, 2023) does for segmentation what CLIP did for recognition: it is a promptable model trained on a billion masks. Instead of a fixed class list, you give SAM a prompt (a click point, a box, or a rough mask) and it returns a high-quality mask for whatever object that prompt indicates, including objects and categories it was never told the names of.

flowchart LR
  A[Image] --> B[Heavy image encoder<br/>run once]
  P[Prompt: point / box] --> C[Light mask decoder]
  B --> C
  C --> D[Object mask<br/>any object, no class list]

Its design is what makes it practical: a heavy image encoder runs once per image to produce an embedding, then a lightweight mask decoder turns each prompt into a mask in milliseconds. So an interactive tool can let a user click around an image and get instant masks. SAM is class-agnostic — it outlines the object but doesn’t name it — which is why it’s often paired with CLIP or a detector that supplies the label.

from segment_anything import SamPredictor, sam_model_registry
import numpy as np
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)
predictor.set_image(image_rgb)                 # heavy encoder runs once
masks, scores, _ = predictor.predict(          # cheap per-prompt decode
    point_coords=np.array([[420, 300]]),       # a single click on the object
    point_labels=np.array([1]),                # 1 = foreground
    multimask_output=True)                      # returns a few candidate masks
best = masks[scores.argmax()]                   # boolean (H, W) mask

Tip

The pattern uniting CLIP and SAM is prompting a general model instead of training a narrow one. It mirrors what happened in NLP: a few huge, broadly-trained models, steered at inference by text or clicks, increasingly replace a zoo of task-specific networks. For a new vision problem in 2024+, the lazy-but-strong first move is often “can a foundation model already do this zero-shot?” before you label a single image.

19.11 — Self-Supervised Pretraining for Vision

Transfer learning (§19.1) assumes you have a big labelled dataset like ImageNet to pretrain on. But labels are the expensive part — the internet has billions of images and almost no labels. Self-supervised learning (SSL) sidesteps this: it invents a “pretext” task whose answer is hidden in the image itself, so the model learns rich features from raw pixels with zero human labels, then transfers them to your real task with only a small labelled set.

The intuition: if a model can solve a hard puzzle about an image — “which crop of this photo matches which other crop?” or “what was under this erased patch?” — it must have learned what objects, textures, and parts look like. Those learned features are exactly what a downstream classifier or detector wants.

Two families dominate modern vision SSL.

Contrastive / joint-embedding methods (SimCLR, MoCo, DINO) take one image, make two random augmented views of it (crop, flip, color-jitter), and train the network so the two views of the same image land close in embedding space while views of different images land far apart. It is the same geometric idea as the triplet loss in §19.1, scaled to whole batches. The pretext label — “these two crops are the same photo” — is free.

Masked image modeling (MAE, “Masked Autoencoder”) borrows the trick that made BERT work in NLP: hide a large fraction (MAE masks ~75%) of the image patches, and train the model to reconstruct the missing pixels from the few that remain. To inpaint a cat’s hidden ear, the network must understand cats. It is the visual analogue of fill-in-the-blank.
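The masking step itself is simple to sketch. Assuming a hypothetical 14 x 14 grid of 196 patches and uniform random sampling, the code below picks which patches stay visible and which are hidden; only the visible ones would be passed to the encoder:

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio=0.75, seed=0):
    """Choose which patches stay visible; returns (keep_idx, mask), mask True = hidden."""
    rng = np.random.default_rng(seed)
    num_keep = int(round(num_patches * (1.0 - mask_ratio)))
    perm = rng.permutation(num_patches)          # random order of patch indices
    keep_idx = np.sort(perm[:num_keep])          # visible patches, in image order
    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False
    return keep_idx, mask

keep_idx, mask = random_patch_mask(196)
print(len(keep_idx), int(mask.sum()))            # 49 visible, 147 hidden
```

With three quarters of the patches removed, the encoder processes only 49 tokens per image, which is a large part of why this style of pretraining is cheap per step.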

flowchart LR
  subgraph Contrastive[Contrastive: SimCLR / DINO]
    A[One image] --> V1[View 1: crop+jitter]
    A --> V2[View 2: crop+jitter]
    V1 --> E1[Encoder]
    V2 --> E2[Encoder]
    E1 -.->|pull together| E2
  end
  subgraph Masked[Masked: MAE]
    M[Image] --> MM[Mask 75% of patches]
    MM --> ENC[Encode visible]
    ENC --> DEC[Decode → reconstruct pixels]
  end

The SimCLR contrastive objective for a positive pair \((i, j)\) is the NT-Xent (normalized temperature-scaled cross-entropy) loss — pick the true partner out of all other views in the batch:

\[\mathcal{L}_{i,j} = -\log \frac{e^{\,\text{sim}(z_i, z_j)/\tau}}{\sum_{k \ne i} e^{\,\text{sim}(z_i, z_k)/\tau}}\]

In words: out of every other embedding in the batch, the model must score this view’s true augmentation-partner highest — a softmax over cosine similarities, sharpened by temperature \(\tau\).

Also written: with \(\text{sim}(u,v) = u^\top v / (\lVert u\rVert\lVert v\rVert)\) the cosine similarity; this is the same form as CLIP’s loss (§19.10), except the “positive pair” comes from two augmentations of one image rather than an image and its caption.

import torch, torch.nn.functional as F
def nt_xent(z, temperature=0.5):
    # z: (2N, d) — rows 0..N-1 are view-1, rows N..2N-1 are view-2 of the same images
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature                 # (2N, 2N) cosine-sim matrix
    N = z.shape[0] // 2
    targets = torch.arange(2*N, device=z.device).roll(N)  # partner of i is i±N (same device as z)
    sim.fill_diagonal_(float("-inf"))             # never match a view to itself
    return F.cross_entropy(sim, targets)

z = torch.randn(8, 128)                            # 4 images × 2 views
print(nt_xent(z).item())                           # one scalar loss to minimize
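Random embeddings give a loss near its chance level, so a quick sanity check helps build intuition for what training is aiming at. The sketch below re-implements the same formula in NumPy (the name `nt_xent_np` and the synthetic embeddings are assumptions for illustration) and compares a batch whose two views nearly coincide with a batch of unrelated vectors.

```python
import numpy as np

def nt_xent_np(z, temperature=0.5):
    """NT-Xent over 2N embeddings; rows i and i+N are the two views of image i."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit length, so dot = cosine
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # exclude k = i from the softmax
    n = z.shape[0] // 2
    partner = np.roll(np.arange(2 * n), n)             # i -> i+N (first half), i-N (second half)
    # numerically stable log-softmax over each row, then pick the partner's entry
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), partner].mean()

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 128))                       # 4 hypothetical image embeddings
aligned = np.concatenate([base, base + 0.05 * rng.normal(size=base.shape)])
unrelated = rng.normal(size=(8, 128))                  # views that share nothing

loss_aligned = nt_xent_np(aligned)
loss_unrelated = nt_xent_np(unrelated)
loss_sharp = nt_xent_np(aligned, temperature=0.1)
print(loss_aligned, loss_unrelated, loss_sharp)
```

The aligned batch scores well below the unrelated one, and lowering the temperature drives the aligned loss further toward zero, because a sharper softmax rewards the already-correct partner more strongly.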

Tip

SSL is why foundation models like DINOv2 and CLIP exist — they are pretrained on web-scale unlabelled (or weakly-labelled) data and then used off-the-shelf. The practical takeaway: before you label thousands of images, check whether a self-supervised backbone (DINOv2, MAE-pretrained ViT) already gives features strong enough that a tiny linear head on top solves your task.
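The "tiny linear head" check can be sketched as follows. Everything here is a stand-in: `fake_frozen_features` replaces a real call to a pretrained backbone with synthetic clustered vectors, and a closed-form least-squares fit replaces the logistic-regression head one would more commonly train. The point is only the workflow: freeze the features, fit a linear map on a few labels, and measure held-out accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, dim = 3, 16
centers = 5.0 * rng.normal(size=(n_classes, dim))   # synthetic class clusters

def fake_frozen_features(n_per_class):
    """Stand-in for `backbone(images)`: clustered features plus labels."""
    labels = np.repeat(np.arange(n_classes), n_per_class)
    feats = centers[labels] + rng.normal(size=(labels.size, dim))
    return feats, labels

def fit_linear_head(feats, labels):
    X = np.hstack([feats, np.ones((len(feats), 1))])   # append a bias column
    Y = np.eye(n_classes)[labels]                      # one-hot targets
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)          # closed-form linear fit
    return W

def predict(W, feats):
    X = np.hstack([feats, np.ones((len(feats), 1))])
    return (X @ W).argmax(axis=1)

train_x, train_y = fake_frozen_features(20)    # only 20 labels per class
test_x, test_y = fake_frozen_features(100)
W = fit_linear_head(train_x, train_y)
acc = (predict(W, test_x) == test_y).mean()
print(f"linear-probe accuracy: {acc:.2f}")
```

If a probe like this already reaches the accuracy you need on real backbone features, a large labelling effort is probably unnecessary; if it does not, that is the signal to label more data or fine-tune.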

19.12 — Robustness, Adversarial Examples & Explainability

A vision model can hit 99% test accuracy and still be brittle and inscrutable in ways that matter the moment it leaves the lab. Two practical concerns close the loop on this chapter: can the model be fooled? and can we see why it decided what it did?

Adversarial examples

The unsettling fact: for most trained vision models you can take a correctly classified image, add a perturbation too small for a human to notice, and flip the prediction, often to a class of the attacker’s choosing. These are adversarial examples: the image looks identical, but a “panda” becomes a “gibbon” with high confidence.

The intuition is geometric. A neural net carves the input space into decision regions, and in hundreds of thousands of pixel dimensions the nearest boundary can be astonishingly close to an image even though it is invisible along any direction you would naturally look, such as random noise of the same size. The attacker does not search blindly: following the loss gradient moves the image straight toward that boundary.

The simplest attack, FGSM (Fast Gradient Sign Method), nudges every pixel by a tiny fixed step in the direction that increases the loss:

\[x_{\text{adv}} = x + \epsilon \cdot \text{sign}\big(\nabla_x \mathcal{L}(\theta, x, y)\big)\]

In words: for each pixel, ask “does brightening or darkening it raise the model’s error?” and move it that way by a tiny amount \(\epsilon\) — a coordinated, imperceptible shove that piles up into a wrong answer.

Also written: the perturbation \(\delta = \epsilon\,\text{sign}(\nabla_x \mathcal{L})\) is the step that maximizes the first-order (linearized) loss inside an \(\ell_\infty\) ball of radius \(\epsilon\) (no pixel changes by more than \(\epsilon\)); iterating it in smaller steps, projecting back into the ball after each one, gives the stronger PGD attack.

import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.03):
    # assumes pixel values are scaled to [0, 1]
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    (grad,) = torch.autograd.grad(loss, x)           # gradient w.r.t. the pixels only,
                                                     # leaves the model's .grad buffers untouched
    x_adv = x + eps * grad.sign()                    # step that raises the loss
    return x_adv.clamp(0, 1).detach()                # keep it a valid image
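Why a per-pixel change this small can matter is easiest to see on a purely linear score \(w^\top x\), where the gradient with respect to \(x\) is just \(w\). The sketch below uses made-up random weights (an assumption, not a trained model) and shows that the sign step shifts the score by exactly \(\epsilon\lVert w\rVert_1\), which grows with the number of pixels even though no single pixel moves by more than \(\epsilon\).

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01                                   # max change per pixel (pixels in [0, 1])

results = []
for d in (100, 10_000, 150_528):             # 150_528 = 224 * 224 * 3
    w = rng.normal(scale=0.01, size=d)       # hypothetical linear-classifier weights
    x = rng.random(d)                        # a random "image"
    delta = eps * np.sign(w)                 # sign-of-gradient step (gradient of w.x is w)
    shift = w @ (x + delta) - w @ x          # resulting change in the score
    bound = eps * np.abs(w).sum()            # eps * ||w||_1
    results.append((d, shift, bound))
    print(f"d={d:>7}  score shift={shift:.3f}  eps*||w||_1={bound:.3f}")
```

With these toy weights the shift is negligible at 100 dimensions and large at image scale: many tiny, coordinated nudges add up, which is the linear version of the geometric argument above.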

The standard defense is adversarial training: generate adversarial examples on the fly during training and include them in the loss, so the model learns boundaries that don’t bend under small shoves. It is among the most reliable defenses known, at the cost of more compute (each training step runs an attack first) and usually a small drop in clean accuracy. For safety-critical vision (autonomous driving, medical, security), assume an adversary exists and test against one.
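The iterated attack and the adversarial-training recipe can be sketched together. This is a minimal illustration under stated assumptions: the function names `pgd_attack` and `adversarial_training_step`, the step size and iteration count, and the toy linear classifier on random data are all hypothetical, and the step trains on adversarial examples only, whereas many recipes mix clean and adversarial batches.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=0.03, step=0.01, iters=10):
    """Iterated sign-gradient attack, projected back into the l-inf ball around x."""
    x = x.detach()
    x_adv = x.clone()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        (grad,) = torch.autograd.grad(loss, x_adv)      # gradient w.r.t. pixels only
        x_adv = x_adv.detach() + step * grad.sign()     # one small FGSM-style step
        x_adv = x + (x_adv - x).clamp(-eps, eps)        # project: no pixel moves more than eps
        x_adv = x_adv.clamp(0, 1)                       # stay a valid image
    return x_adv.detach()

def adversarial_training_step(model, optimizer, x, y, eps=0.03):
    """One optimizer step on adversarial versions of the batch."""
    model.eval()                                        # fixed BN/dropout while attacking
    x_adv = pgd_attack(model, x, y, eps=eps)
    model.train()
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: a hypothetical linear classifier on random 8x8 "images".
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 3))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.rand(16, 1, 8, 8), torch.randint(0, 3, (16,))

x_adv = pgd_attack(model, x, y)
clean_loss = F.cross_entropy(model(x), y).item()
adv_loss = F.cross_entropy(model(x_adv), y).item()
max_change = (x_adv - x).abs().max().item()
print(f"clean loss {clean_loss:.3f}, adversarial loss {adv_loss:.3f}, max pixel change {max_change:.3f}")

train_loss = adversarial_training_step(model, optimizer, x, y)
print(f"adversarial training loss this step: {train_loss:.3f}")
```

Switching to `model.eval()` during the attack is a design choice so that batch-norm statistics and dropout do not change while the perturbation is being computed; the attack roughly multiplies the cost of each training step by the number of PGD iterations.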

Warning

Adversarial examples are not a lab curiosity. Printed stickers on a stop sign, patterned glasses that fool face recognition, and subtle audio noise have all been demonstrated in the physical world. Robustness is a security property (see AI Ethics, Fairness & Safety), not just an accuracy one — and high clean accuracy tells you nothing about it.

Explainability: seeing what the model looked at

When a model says “tumor” or “stop sign,” you often need to know which pixels drove that call — for trust, for debugging, and for catching the classic failure where a model “detects boats” by actually detecting water. Saliency and class-activation methods (part of the broader toolkit in Explainable AI) produce a heatmap over the image showing where the evidence for a class came from.

Grad-CAM (Gradient-weighted Class Activation Mapping) is the workhorse. The intuition: the last convolutional layer holds feature maps that still have spatial layout (“there’s fur here, an ear there”). Grad-CAM asks how much each feature map raised the score for the target class, then sums the feature maps weighted by that importance — producing a coarse heatmap that lights up the regions the class relied on.

\[L^c_{\text{Grad-CAM}} = \text{ReLU}\!\Big(\sum_k \alpha^c_k\, A^k\Big), \qquad \alpha^c_k = \frac{1}{Z}\sum_{u,v} \frac{\partial y^c}{\partial A^k_{uv}}\]

In words: weight each feature map \(A^k\) by how strongly bumping it up increases the class score \(y^c\) (that weight \(\alpha^c_k\) is the averaged gradient), add them up, and keep only the positive evidence — that’s the heatmap.

Also written: \(\alpha^c_k\) is the global-average-pooled gradient of the class score with respect to feature map \(k\); the \(\text{ReLU}\) discards regions that argue against the class, leaving only what supports it.
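A hand-sized example makes the formula concrete. The two 2×2 feature maps and their gradients below are made-up numbers chosen for easy arithmetic, not values from a real network.

```python
import numpy as np

# Two hypothetical 2x2 feature maps A^k from the last conv layer
A = np.array([[[1.0, 0.0],
               [0.0, 2.0]],       # map 0: fires top-left and bottom-right
              [[0.0, 3.0],
               [1.0, 0.0]]])      # map 1: fires top-right and bottom-left

# Hypothetical gradients d y^c / d A^k_uv, same shape as A
dY_dA = np.array([[[0.4, 0.4],
                   [0.4, 0.4]],     # raising map 0 raises the class score
                  [[-0.2, -0.2],
                   [-0.2, -0.2]]])  # raising map 1 lowers it

alpha = dY_dA.mean(axis=(1, 2))                     # global-average-pooled gradients
weighted = (alpha[:, None, None] * A).sum(axis=0)   # sum_k alpha_k * A^k
cam = np.maximum(weighted, 0.0)                     # ReLU keeps positive evidence only

print(alpha)      # approx [ 0.4 -0.2]
print(weighted)   # approx [[ 0.4 -0.6] [-0.2  0.8]]
print(cam)        # approx [[0.4 0. ] [0.  0.8]]
```

Map 1 argues against the class, so the cells where only it fires go negative and are zeroed by the ReLU; the heatmap lights up only where map 0, the supporting evidence, is active.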

import torch, torch.nn.functional as F
# assumes `model` is a torchvision-style ResNet and `x` is one preprocessed image, shape (1,3,H,W)
# register hooks to grab activations + gradients of the last conv block
acts, grads = {}, {}
layer = model.layer4[-1]                              # last conv block of a ResNet
layer.register_forward_hook(lambda m, i, o: acts.__setitem__("a", o))
layer.register_full_backward_hook(lambda m, gi, go: grads.__setitem__("g", go[0]))

model.eval()                                          # fixed BatchNorm/dropout behavior
logits = model(x)                                     # x: (1,3,H,W)
logits[0, logits.argmax()].backward()                 # backprop the top class
weights = grads["g"].mean(dim=(2, 3), keepdim=True)   # α: pooled gradients
cam = F.relu((weights * acts["a"]).sum(1)).detach()   # weighted sum + ReLU
cam = F.interpolate(cam[None], size=x.shape[2:], mode="bilinear")[0, 0]
cam = cam - cam.min()                                 # shift so the minimum is 0
cam = cam / (cam.max() + 1e-8)                        # scale to a [0,1] heatmap

Overlaying cam on the input shows, in warm colors, the pixels that justified the prediction. It is the fastest way to catch a model that learned the wrong cue — the snow behind the husky, the watermark on the stock photo, the ruler beside the skin lesion.

Tip

Run Grad-CAM (or a saliency map) on a handful of correct and incorrect predictions before trusting any classifier in production. Right answers for the wrong reasons are the bugs that pass every accuracy check and then fail catastrophically on data that lacks the spurious cue.

19.13 — Quick reference

Which vision task should I use? — start from what you actually need out of the model:

flowchart TD
  A[What do you need?] --> B{Just whether/which<br/>category is present?}
  B -->|yes| C[Classification<br/>top-1/top-5]
  B -->|no| D{Need to count or<br/>locate objects?}
  D -->|box is enough| E[Detection<br/>YOLO / Faster R-CNN / DETR]
  D -->|need exact pixels| F{Separate touching<br/>instances?}
  F -->|no, class per pixel| G[Semantic seg<br/>U-Net]
  F -->|yes, per object| H[Instance seg<br/>Mask R-CNN]
  F -->|things + stuff| I[Panoptic seg<br/>Mask2Former]
  A --> J{New task, no labels yet?}
  J -->|yes| K[Try a foundation model<br/>CLIP / SAM zero-shot first]

Term / formula	Meaning	When / why it matters
Softmax \(p_i = e^{z_i}/\sum_j e^{z_j}\)	Turns logits into a probability distribution	Final layer of any classifier; top-1/top-5 read off it
Top-1 / Top-5 accuracy	Top guess correct / true label in top 5	Top-5 forgives fine-grained classes (ImageNet)
Transfer learning	Reuse a pretrained backbone, retrain the head	The default when you have few labels
Triplet loss \([\,d(a,p)-d(a,n)+\alpha\,]_+\)	Pull same-identity close, push others apart	Face recognition, image retrieval embeddings
Bounding box \((x,y,w,h)\)	Rectangle + class + score per object	Output unit of detection
Anchors	Pre-set box templates the model nudges	Lets a grid predict variably-shaped objects
IoU \(=\lvert A\cap B\rvert/\lvert A\cup B\rvert\)	Overlap / union of two boxes or masks	TP if ≥ 0.5; clamp negative overlap to 0
GIoU	IoU minus wasted-gap penalty	Gives a gradient even for non-overlapping boxes
NMS	Drop duplicate overlapping boxes by IoU	Cleanup step for anchor/grid detectors
mAP \(=\frac{1}{C}\sum_c \text{AP}_c\)	Mean area under precision–recall curve	The standard detection metric
Semantic / Instance / Panoptic	Class-per-pixel / per-object / both	Pick by whether you must separate instances + stuff
Dice loss \(1-\frac{2\lvert P\cap G\rvert}{\lvert P\rvert+\lvert G\rvert}\)	Optimize overlap directly	Beats accuracy/CE under class imbalance
CTC loss	Sum over all alignments to the target string	OCR text recognition without per-char alignment
Pinhole \(x=fX/Z\)	World point scaled by inverse depth	Why a flat image loses depth
Stereo \(Z=fB/d\)	Depth from disparity between two views	Closer objects shift more; error grows with range
Bipartite matching	Hungarian assignment of preds ↔︎ ground truth	DETR/Mask2Former: replaces anchors + NMS
CLIP contrastive loss	Match image to its caption in a batch	Zero-shot, open-vocabulary recognition
SAM	Promptable, class-agnostic mask from a click/box	Segment any object without a fixed class list
NT-Xent (SimCLR)	Pull two augmented views of one image together	Self-supervised pretraining, no labels
FGSM \(x+\epsilon\,\text{sign}(\nabla_x\mathcal{L})\)	One-step gradient-sign adversarial nudge	Tests robustness; defend with adversarial training
Grad-CAM	Heatmap of pixels that raised the class score	Catch “right answer, wrong reason” before deploy

19.14 — Key takeaways

A digital image is just a grid of numbers (color = three stacked grids); every vision model is a function from that grid to a structured answer.
Vision tasks form a ladder of spatial precision: classification (one label/image) → detection (box + label/object) → segmentation (label/pixel). Pick the lightest one that solves your problem.
CNNs bring a strong locality inductive bias and are data-efficient; ViTs drop that bias and win with large data. Classification is scored by top-1 and top-5 accuracy.
Transfer learning is the practical default: freeze a pretrained backbone and train a new head when data is scarce, fine-tune at a small learning rate when you have more.
Face recognition maps faces to embeddings trained with triplet loss, enabling open-set recognition by distance comparison.
Detection = what + where. Know the vocabulary: bounding boxes, anchors, IoU, NMS, mAP, two-stage (R-CNN family) vs single-stage (YOLO, SSD).
IoU = intersection / union; ≥ 0.5 is the usual true-positive bar. Compute the intersection from the max of the two top-left corners and the min of the two bottom-right corners, and clamp its width and height to zero so disjoint boxes score 0. Our worked box pair scored only 0.286. GIoU extends it to non-overlapping boxes for use as a loss.
Segmentation comes in semantic (class/pixel), instance (separate objects), and panoptic (both). U-Net for semantic, Mask R-CNN for instance; quality is measured by mIoU, and Dice loss copes with class imbalance better than plain per-pixel cross-entropy, where pixel accuracy can look high while small classes are missed.
OCR (CTC-trained text recognition), pose (keypoint heatmaps), and tracking (detection + Kalman/appearance association) are the applied workhorses.
Data augmentation cheaply buys invariance and fights overfitting — apply it to training only, and transform the boxes/masks alongside the image.
Multi-view geometry (pinhole projection, intrinsics/extrinsics, calibration, epipolar geometry, stereo, SfM) recovers 3D from 2D; NeRF and 3D Gaussian Splatting turn posed photos into renderable scenes.
Set prediction with bipartite matching (DETR, Mask2Former) replaces anchors and NMS with one general recipe spanning detection and panoptic segmentation.
Foundation models — CLIP for open-vocabulary recognition, SAM for promptable segmentation — increasingly let you solve new tasks zero-shot instead of training a narrow model.
Self-supervised learning (contrastive: SimCLR/DINO; masked: MAE) learns strong features from unlabelled images, powering modern foundation backbones — try one before labelling a large dataset.
Adversarial examples flip predictions with imperceptible, gradient-aligned noise (FGSM/PGD); high clean accuracy implies nothing about robustness — defend with adversarial training for safety-critical use.
Explainability with Grad-CAM heatmaps reveals which pixels drove a decision — the fastest way to catch “right answer, wrong reason” before deployment.

19.15 — See also

Convolutional Neural Networks — the backbone mechanics (filters, pooling, receptive fields) behind every model here.
Attention & Transformers — how Vision Transformers tokenize and attend over image patches.
Generative Models — image synthesis, GANs and diffusion, and the segmentation-mask-to-image direction.
Neural Networks (Core) — softmax, cross-entropy loss, and the training loop underlying classification.
Recurrent & Sequence Models — the sequence decoders and CTC loss used in OCR text recognition.
Multimodal AI — connecting vision with language (captioning, visual question answering, CLIP).
Model Evaluation & Tuning — precision–recall, the curves that mAP is built on, and threshold selection.

↪ The thread continues → Chapter 20 · 💬 Natural Language Processing

Machines can see; now they must read and write. NLP takes the same deep-learning toolkit to the messy, ambiguous medium of human language.
