🔥 Deep Learning with PyTorch · Lesson 6 — Convolutional Networks: Teaching Models to See

Lesson 6 — Convolutional Networks: Teaching Models to See

In the previous lesson you built a full classification pipeline — model, data, training loop, evaluation — and it worked. But if you fed real images into that MLP, you’d hit a wall: a fully connected layer treats pixel (0, 0) and pixel (31, 31) as unrelated inputs, learns nothing it can reuse across positions, and burns millions of parameters doing it. In this lesson you’ll fix all three problems with one idea — convolution — and train a small CNN on CIFAR-10 that beats any MLP of similar size. Along the way you’ll master the shape arithmetic that trips up everyone (RuntimeError: mat1 and mat2 shapes cannot be multiplied is the CNN rite of passage), wire up torchvision datasets with modern v2 transforms, add data augmentation, and learn to diagnose your training curves like an engineer rather than a spectator.

🎯 In this lesson you will: understand why convolution beats dense layers for images, master the Conv2d/MaxPool2d output-size math, build a compact CNN with correct shapes end to end, load CIFAR-10 with torchvision transforms and augmentation, and train the network while reading the loss/accuracy curves for overfitting signals.

Why convolution — three problems, one operator

If the theory of convolution is new to you, the encyclopedia’s entries on convolutional neural networks and translation equivariance cover the derivation in depth. Here is the engineering version.

Take a modest 32×32 RGB image: that’s \(32 \times 32 \times 3 = 3072\) input values. A single fully connected hidden layer with 512 units needs \(3072 \times 512 \approx 1.57\)M weights — and it has three structural flaws:

It ignores locality. Nearby pixels are correlated (edges, textures, corners are local patterns), but a dense layer connects every pixel to every unit with an independent weight. The spatial structure is destroyed the moment you flatten.
It can’t reuse knowledge across positions. An edge detector useful at the top-left of the image must be re-learned, from scratch, for every other location.
It scales terribly. Double the image resolution and the parameter count quadruples.

Convolution solves all three by sliding one small learnable filter (say 3×3) across the image and computing a dot product at every position. The same weights fire everywhere — that’s weight sharing — so a cat’s ear is detected whether it appears left or right (translation equivariance), and the parameter count depends on the kernel size and channel counts, not the image size. A 3×3 filter over 3 input channels costs \(3 \times 3 \times 3 + 1 = 28\) parameters, whether the image is 32×32 or 4096×4096.
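
The parameter arithmetic above can be checked directly. This is a minimal sketch rather than part of the lesson’s pipeline: the helper `n_params` and the 64×64 comparison image are illustrative assumptions.

```python
import torch
import torch.nn as nn

dense = nn.Linear(32 * 32 * 3, 512)          # fully connected layer on a flattened image
one_filter = nn.Conv2d(3, 1, kernel_size=3)  # a single 3x3 filter over 3 input channels

def n_params(module: nn.Module) -> int:
    """Total number of learnable values in a module."""
    return sum(p.numel() for p in module.parameters())

dense_params = n_params(dense)        # 3072*512 weights + 512 biases
filter_params = n_params(one_filter)  # 3*3*3 weights + 1 bias
print(dense_params, filter_params)    # 1573376 28

# The same 28 parameters apply to any image size; only the output grid changes.
small = one_filter(torch.randn(1, 3, 32, 32))
large = one_filter(torch.randn(1, 3, 64, 64))
print(small.shape, large.shape)
```

The dense count includes 512 biases on top of the ~1.57M weights quoted above, which is why it prints slightly more than \(3072 \times 512\).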

The geometry of one convolution step: a 3×3 kernel slides over a 5×5 input and produces a 3×3 output (no padding, stride 1), because the kernel fits at \(5 - 3 + 1 = 3\) positions along each axis.

One filter produces one feature map — a 2D grid of “how strongly did my pattern match here?” scores. A Conv2d layer learns many filters in parallel, producing one feature map per filter. Stack layers, and later filters see combinations of earlier feature maps: edges → textures → parts → objects.
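
The sliding-window description can be written out by hand and compared against PyTorch’s own operator. This is a sketch on random data, assuming a single-channel input for readability; the variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
image = torch.randn(5, 5)    # single-channel 5x5 input
kernel = torch.randn(3, 3)   # one 3x3 filter

# Slide the kernel over every valid position: (5 - 3) / 1 + 1 = 3 positions per axis.
manual = torch.zeros(3, 3)
for i in range(3):
    for j in range(3):
        patch = image[i:i + 3, j:j + 3]
        manual[i, j] = (patch * kernel).sum()   # dot product of patch and kernel

# F.conv2d expects (N, C, H, W) input and (C_out, C_in, kH, kW) weights.
builtin = F.conv2d(image[None, None], kernel[None, None])[0, 0]
print(torch.allclose(manual, builtin, atol=1e-5))   # True

# A layer with 8 filters yields 8 feature maps, one per filter.
layer = nn.Conv2d(1, 8, kernel_size=3)
maps = layer(image[None, None])
print(maps.shape)   # torch.Size([1, 8, 3, 3])
```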

The shape math you must be able to do in your head

Every conv/pool layer transforms a tensor of shape \((N, C_{in}, H_{in}, W_{in})\) into \((N, C_{out}, H_{out}, W_{out})\). PyTorch uses channels-first layout — get this wrong (e.g. loading images as H×W×C from NumPy) and Conv2d will either crash or silently treat your image height as channels.
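
A short sketch of the layout conversion, assuming a synthetic H×W×C uint8 array in place of a real image file. It also reproduces the silent failure mode, where a layer whose `in_channels` happens to equal the image height accepts the wrong layout without complaint.

```python
import numpy as np
import torch
import torch.nn as nn

# A synthetic image in the H x W x C layout that NumPy-based loaders typically return.
img_hwc = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)

x = torch.from_numpy(img_hwc)      # (32, 32, 3), uint8
x = x.permute(2, 0, 1)             # (3, 32, 32): channels first
x = x.float().div(255.0)           # Conv2d needs a floating-point input
x = x.unsqueeze(0)                 # (1, 3, 32, 32): add the batch dimension

conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
out = conv(x)
print(out.shape)   # torch.Size([1, 16, 32, 32])

# The failure mode: a 32x32x3 image left in HWC order is read as 32 channels.
# With in_channels=32 this runs without error and produces nonsense features.
wrong = nn.Conv2d(32, 16, kernel_size=3, padding=1)
silently_wrong = wrong(torch.from_numpy(img_hwc).float().unsqueeze(0))
print(silently_wrong.shape)   # torch.Size([1, 16, 32, 3])
```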

The single formula that governs everything, for kernel size \(k\), padding \(p\), stride \(s\):

\[ H_{out} = \left\lfloor \frac{H_{in} + 2p - k}{s} \right\rfloor + 1 \]

(and identically for width). Three configurations cover 95% of real networks:

Configuration	Formula result	Effect
`kernel_size=3, padding=1, stride=1`	\(H_{out} = H_{in}\)	“same” conv — extract features, keep resolution
`kernel_size=3, padding=1, stride=2`	\(H_{out} = \lceil H_{in}/2 \rceil\)	strided downsampling
`MaxPool2d(kernel_size=2)` (stride defaults to 2)	\(H_{out} = \lfloor H_{in}/2 \rfloor\)	halve resolution, keep the strongest activation in each window
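
The formula is small enough to keep as a helper. The function name `conv_out_size` is a hypothetical one chosen for this sketch; it reproduces the rows of the table plus an odd-sized input.

```python
def conv_out_size(h_in: int, k: int, p: int = 0, s: int = 1) -> int:
    """Output height (or width) for kernel size k, padding p, stride s."""
    return (h_in + 2 * p - k) // s + 1   # integer division implements the floor

same    = conv_out_size(32, k=3, p=1, s=1)   # 32
strided = conv_out_size(32, k=3, p=1, s=2)   # 16
odd     = conv_out_size(33, k=3, p=1, s=2)   # 17, i.e. ceil(33 / 2)
pooled  = conv_out_size(32, k=2, p=0, s=2)   # 16 (MaxPool2d(2) follows the same formula)
valid   = conv_out_size(32, k=3)             # 30
print(same, strided, odd, pooled, valid)     # 32 16 17 16 30
```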

Let’s verify with real tensors rather than trusting the table:

import torch
import torch.nn as nn

x = torch.randn(8, 3, 32, 32)          # batch of 8 RGB images, 32x32

same_conv    = nn.Conv2d(3, 16, kernel_size=3, padding=1)            # "same"
strided_conv = nn.Conv2d(3, 16, kernel_size=3, padding=1, stride=2)  # halves H,W
valid_conv   = nn.Conv2d(3, 16, kernel_size=3)                       # no padding
pool         = nn.MaxPool2d(kernel_size=2)                           # stride=2 implied

print(same_conv(x).shape)     # torch.Size([8, 16, 32, 32])
print(strided_conv(x).shape)  # torch.Size([8, 16, 16, 16])
print(valid_conv(x).shape)    # torch.Size([8, 16, 30, 30])  <- (32-3)/1 + 1 = 30
print(pool(x).shape)          # torch.Size([8, 3, 16, 16])   <- channels untouched

Three things to internalize from this output:

Conv2d(in_channels, out_channels, ...) — the first argument must match the incoming tensor’s channel dim, the second is your choice (how many filters to learn). Each filter spans all input channels: a Conv2d(3, 16, 3) filter is really a 3×3×3 volume, and there are 16 of them.
Pooling has no parameters and doesn’t touch channels. It downsamples each feature map independently. That’s why pool(x) still has 3 channels.
Padding exists to stop your image from shrinking. Without it, every 3×3 conv eats a 1-pixel border; ten of them in a row would strip 10 pixels from every side, shrinking a 32×32 CIFAR image to 12×12.
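
Each of these three points can be inspected on live modules. The sketch below assumes an arbitrary stack of ten unpadded convs with 8 channels, chosen only to show the shrinkage.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(3, 16, kernel_size=3)
print(conv.weight.shape)   # torch.Size([16, 3, 3, 3]): 16 filters, each a 3x3x3 volume
print(conv.bias.shape)     # torch.Size([16]): one bias per filter

# Ten unpadded 3x3 convs in a row: each one removes 2 pixels from H and W.
layers = [nn.Conv2d(3, 8, kernel_size=3)]
layers += [nn.Conv2d(8, 8, kernel_size=3) for _ in range(9)]
stack = nn.Sequential(*layers)

x = torch.randn(1, 3, 32, 32)
sizes = [x.shape[-1]]
for layer in stack:
    x = layer(x)
    sizes.append(x.shape[-1])
print(sizes)   # [32, 30, 28, 26, 24, 22, 20, 18, 16, 14, 12]

pool = nn.MaxPool2d(2)
n_pool_params = len(list(pool.parameters()))
print(n_pool_params)   # 0: pooling learns nothing
```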

Parameter count for a conv layer is \(C_{out} \times (C_{in} \times k \times k + 1)\):

n_params = sum(p.numel() for p in same_conv.parameters())
print(n_params)  # 448  = 16 * (3*3*3 + 1)

Compare: 448 parameters versus 1.57M for the dense layer we discussed. That ratio is the whole argument for convolution, in one number.

Finally, nn.Flatten is the bridge from the convolutional world to the linear-classifier world — it collapses \((N, C, H, W)\) into \((N, C \cdot H \cdot W)\), keeping the batch dimension:

feat = torch.randn(8, 64, 8, 8)
flat = nn.Flatten()(feat)
print(flat.shape)  # torch.Size([8, 4096])  <- 64*8*8; this number feeds nn.Linear

If your Linear layer’s in_features doesn’t equal \(C \cdot H \cdot W\) at that point, you get the classic shape mismatch error. In a moment I’ll show you the trick that makes computing it by hand unnecessary.
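
To see the mismatch happen on purpose, here is a small sketch that wires a deliberately wrong head (sized for a 4×4 map instead of 8×8) and catches the resulting exception. The names `wrong_head` and `right_head` are illustrative.

```python
import torch
import torch.nn as nn

feat = torch.randn(8, 64, 8, 8)
flat = nn.Flatten()(feat)                # (8, 4096)

wrong_head = nn.Linear(64 * 4 * 4, 10)   # in_features=1024, but flat has 4096 columns
try:
    wrong_head(flat)
    error_message = None
except RuntimeError as err:
    error_message = str(err)             # the shape-mismatch message from the matmul
print(error_message)

right_head = nn.Linear(flat.shape[1], 10)   # size the layer from the actual tensor
logits = right_head(flat)
print(logits.shape)   # torch.Size([8, 10])
```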

Building the CNN

Here is the architecture we’ll train: two conv blocks (conv, conv, pool) followed by a small classifier head. It follows the pattern of classic CNNs from LeNet to VGG: resolution shrinks as you go deeper and channels grow, so each layer trades where for what.

flowchart LR
    A["input<br/>3×32×32"] --> B["Conv 3→32<br/>+ ReLU"] --> C["Conv 32→32<br/>+ ReLU"] --> D["MaxPool /2<br/>32×16×16"]
    D --> E["Conv 32→64<br/>+ ReLU"] --> F["Conv 64→64<br/>+ ReLU"] --> G["MaxPool /2<br/>64×8×8"]
    G --> H["Flatten<br/>4096"] --> I["Linear 4096→256<br/>+ ReLU + Dropout"] --> J["Linear 256→10<br/>logits"]

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 3x32x32 -> 32x16x16
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            # Block 2: 32x16x16 -> 64x8x8
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),   # logits — no softmax (Lesson 5 rule)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

Design decisions worth spelling out:

Two convs per block before pooling. Two stacked 3×3 convs have a 5×5 receptive field but fewer parameters and an extra nonlinearity compared to one 5×5 conv. This is the VGG insight and it’s still the default.
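Channels double when resolution halves (32 → 64 across the pool). Each block quarters the spatial area but only halves the activation volume (32×16×16 = 8192 values after block 1, 64×8×8 = 4096 after block 2), and deeper layers get more filters to represent more abstract, more varied patterns.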
64 * 8 * 8 in the first Linear is the load-bearing number. It comes from tracing shapes: 32×32 → pool → 16×16 → pool → 8×8, with 64 channels at the end. Change anything upstream (input size, a pool, padding) and this breaks.
Dropout only in the classifier head. Dense layers are where this model can memorize; the conv layers are already regularized by weight sharing. (Lesson 7 covers regularization properly.)
No softmax. As established in Lesson 5, nn.CrossEntropyLoss takes raw logits.
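
The claim about two stacked 3×3 convs can be verified numerically. This is a sketch under two assumptions: 32 channels for the parameter comparison, and an all-ones, single-channel probe (no ReLU, no bias) for the receptive-field check, where the input gradient marks which pixels feed the centre output.

```python
import torch
import torch.nn as nn

C = 32
two_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1),
)
one_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2)

def count(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

params_3x3, params_5x5 = count(two_3x3), count(one_5x5)
print(params_3x3, params_5x5)   # 18496 25632

# Receptive field check on a single channel: set all weights to 1, then ask which
# input pixels influence the centre output pixel by reading the input gradient.
probe = nn.Sequential(
    nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False),
)
for conv in probe:
    nn.init.ones_(conv.weight)

x = torch.ones(1, 1, 11, 11, requires_grad=True)
probe(x)[0, 0, 5, 5].backward()
touched = x.grad[0, 0] != 0
rows = touched.any(dim=1).sum().item()
cols = touched.any(dim=0).sum().item()
print(rows, cols)   # 5 5
```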

Never hand-compute the flatten size when you can ask the network:

model = TinyCNN()

with torch.no_grad():
    dummy = torch.zeros(1, 3, 32, 32)
    print(model.features(dummy).shape)          # torch.Size([1, 64, 8, 8]) -> 4096 ✓

print(sum(p.numel() for p in model.parameters() if p.requires_grad))
# 1_116_970

Also useful whenever a Sequential misbehaves — trace shapes layer by layer:

x = torch.zeros(1, 3, 32, 32)
for layer in model.features:
    x = layer(x)
    print(f"{layer.__class__.__name__:<10} -> {tuple(x.shape)}")

Conv2d     -> (1, 32, 32, 32)
ReLU       -> (1, 32, 32, 32)
Conv2d     -> (1, 32, 32, 32)
ReLU       -> (1, 32, 32, 32)
MaxPool2d  -> (1, 32, 16, 16)
Conv2d     -> (1, 64, 16, 16)
ReLU       -> (1, 64, 16, 16)
Conv2d     -> (1, 64, 16, 16)
ReLU       -> (1, 64, 16, 16)
MaxPool2d  -> (1, 64, 8, 8)

Note where the parameters live: the conv stack, which does all the actual vision, holds about 66K parameters. The single Linear(4096, 256) holds ~1.05M, 94% of the total. Dense layers on spatial data are expensive even when demoted to a head; that’s why modern architectures replace this head with global average pooling (you’ll meet it in Lesson 8 inside ResNet).
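
The breakdown can be reproduced by counting the two halves of the model separately. The sketch rebuilds TinyCNN’s layers inline so it runs on its own; `conv_relu` is a hypothetical helper, and `gap_head` is only an assumed illustration of a global-average-pooling head, not the ResNet design covered later.

```python
import torch
import torch.nn as nn

def conv_relu(c_in: int, c_out: int) -> list:
    return [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU()]

# Same layers as TinyCNN, rebuilt inline so this snippet is self-contained.
features = nn.Sequential(
    *conv_relu(3, 32), *conv_relu(32, 32), nn.MaxPool2d(2),
    *conv_relu(32, 64), *conv_relu(64, 64), nn.MaxPool2d(2),
)
dense_head = nn.Sequential(
    nn.Flatten(), nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
    nn.Dropout(0.5), nn.Linear(256, 10),
)
# Assumed alternative: average each 8x8 feature map to one number, then classify.
gap_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 10))

def count(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

n_features, n_dense, n_gap = count(features), count(dense_head), count(gap_head)
print(n_features, n_dense, n_gap)   # 65568 1051402 650
print(f"whole dense head share: {n_dense / (n_features + n_dense):.1%}")

x = torch.zeros(1, 3, 32, 32)
gap_logits = gap_head(features(x))
print(gap_logits.shape)   # torch.Size([1, 10])
```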

CIFAR-10 with torchvision: transforms and augmentation

Lesson 3 taught you Dataset and DataLoader from scratch. torchvision.datasets gives you the standard benchmarks pre-wrapped in that same interface — CIFAR10 downloads, verifies, and serves 50,000 training + 10,000 test images (32×32 RGB, 10 classes) as PIL images plus integer labels. Your job is only the transform pipeline. We use the modern transforms.v2 API:

import torch
from torchvision import datasets
from torchvision.transforms import v2
from torch.utils.data import DataLoader

CIFAR_MEAN = (0.4914, 0.4822, 0.4465)   # per-channel stats of the train split
CIFAR_STD  = (0.2470, 0.2435, 0.2616)

train_tf = v2.Compose([
    v2.RandomCrop(32, padding=4),        # augmentation: pad to 40x40, crop 32x32
    v2.RandomHorizontalFlip(p=0.5),      # augmentation: mirror half the time
    v2.ToImage(),                        # PIL -> tensor image (uint8, CHW)
    v2.ToDtype(torch.float32, scale=True),  # uint8 [0,255] -> float32 [0,1]
    v2.Normalize(CIFAR_MEAN, CIFAR_STD),
])

test_tf = v2.Compose([                   # NO augmentation at eval time
    v2.ToImage(),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(CIFAR_MEAN, CIFAR_STD),
])

train_ds = datasets.CIFAR10("data", train=True,  download=True, transform=train_tf)
test_ds  = datasets.CIFAR10("data", train=False, download=True, transform=test_tf)

train_loader = DataLoader(train_ds, batch_size=128, shuffle=True,
                          num_workers=2, pin_memory=True)
test_loader  = DataLoader(test_ds,  batch_size=256, shuffle=False,
                          num_workers=2, pin_memory=True)

xb, yb = next(iter(train_loader))
print(xb.shape, xb.dtype, yb.shape)   # torch.Size([128, 3, 32, 32]) torch.float32 torch.Size([128])
print(xb.mean().item(), xb.std().item())  # roughly 0 and 1; the black padding from RandomCrop pulls the mean somewhat below 0

The methodology, line by line:

Order matters. Geometric augmentations (RandomCrop, RandomHorizontalFlip) run on the image first; conversion to float and Normalize come last. Put Normalize before ToDtype(float32) and it will fail on uint8 input.
RandomCrop(32, padding=4) and RandomHorizontalFlip are the canonical CIFAR augmentation pair. Each epoch, the model sees a slightly shifted, possibly mirrored version of every image — effectively a much larger dataset, for free. Augmentation is a label-preserving transformation: a flipped cat is still a cat. (A flipped “6” is not a “6” — don’t blindly copy augmentations across domains; MNIST digits, for instance, must not be flipped.)
Two different pipelines. Augmentation is training-only noise injection. Evaluating on randomly cropped images would make your test accuracy a noisy underestimate. This mirrors the model.train() / model.eval() split from Lesson 4 — same principle, applied to data.
Normalize uses the train-split statistics — computed once, hard-coded, and applied identically to test data. Recomputing stats on the test set is subtle leakage.
Because transforms run inside Dataset.__getitem__, augmentation is re-randomized every epoch — the same index yields a different crop each time it’s drawn. That’s exactly what you want.

Train it and read the curves

The training loop is Lesson 4’s in all but packaging, which is the payoff of writing it properly once. We train twice, with and without augmentation, because that comparison is the real point of this lesson.

device = torch.device(
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)

def run_epoch(model, loader, criterion, optimizer=None):
    training = optimizer is not None
    model.train(training)
    total_loss, correct, seen = 0.0, 0, 0
    with torch.set_grad_enabled(training):
        for xb, yb in loader:
            xb, yb = xb.to(device), yb.to(device)
            logits = model(xb)
            loss = criterion(logits, yb)
            if training:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            total_loss += loss.item() * xb.size(0)
            correct += (logits.argmax(dim=1) == yb).sum().item()
            seen += xb.size(0)
    return total_loss / seen, correct / seen

One function for both phases: passing an optimizer means “train”, omitting it means “evaluate”. model.train(training) toggles Dropout correctly in both directions, and torch.set_grad_enabled(training) replaces the torch.no_grad() context from Lesson 4 with a switchable version. Losses are accumulated weighted by batch size, so the last (smaller) batch doesn’t skew the average.
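
The batch-size weighting is easy to underestimate, so here is a toy calculation. The per-sample losses are made up for illustration: five samples served as one batch of 4 and one batch of 1.

```python
import torch

# Hypothetical per-sample losses for a 5-sample dataset served as batches of 4 and 1.
per_sample = torch.tensor([1.0, 1.0, 1.0, 1.0, 6.0])
batches = [per_sample[:4], per_sample[4:]]

# Naive: average the per-batch means, giving the 1-sample batch the weight of a full batch.
naive = sum(b.mean().item() for b in batches) / len(batches)

# run_epoch's approach: weight each batch mean by its size, then divide by samples seen.
total, seen = 0.0, 0
for b in batches:
    total += b.mean().item() * b.size(0)
    seen += b.size(0)
weighted = total / seen

print(naive)                      # 3.5
print(weighted)                   # 2.0
print(per_sample.mean().item())   # 2.0, the true dataset average
```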

model = TinyCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)

history = {"train_loss": [], "train_acc": [], "val_loss": [], "val_acc": []}

for epoch in range(1, 16):
    tl, ta = run_epoch(model, train_loader, criterion, optimizer)
    vl, va = run_epoch(model, test_loader, criterion)
    for k, v in zip(history, (tl, ta, vl, va)):
        history[k].append(v)
    print(f"epoch {epoch:2d} | train loss {tl:.3f} acc {ta:.3f} "
          f"| val loss {vl:.3f} acc {va:.3f}")

Expected output (yours will vary a little — augmentation and init are random):

epoch  1 | train loss 1.560 acc 0.428 | val loss 1.221 acc 0.559
epoch  2 | train loss 1.147 acc 0.590 | val loss 1.014 acc 0.641
epoch  3 | train loss 0.972 acc 0.656 | val loss 0.868 acc 0.694
epoch  5 | train loss 0.795 acc 0.720 | val loss 0.729 acc 0.746
epoch  8 | train loss 0.657 acc 0.770 | val loss 0.622 acc 0.786
epoch 12 | train loss 0.564 acc 0.803 | val loss 0.568 acc 0.805
epoch 15 | train loss 0.517 acc 0.820 | val loss 0.545 acc 0.815

Around 81–82% validation accuracy in 15 epochs. For calibration: a strong MLP on CIFAR-10 plateaus near 55–60%; random guessing is 10%. The architecture change alone is worth 20+ points.

Now plot the curves — and more importantly, learn what shapes to look for:

import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))
epochs = range(1, len(history["train_loss"]) + 1)
ax1.plot(epochs, history["train_loss"], label="train")
ax1.plot(epochs, history["val_loss"], label="val")
ax1.set(title="Loss", xlabel="epoch"); ax1.legend()
ax2.plot(epochs, history["train_acc"], label="train")
ax2.plot(epochs, history["val_acc"], label="val")
ax2.set(title="Accuracy", xlabel="epoch"); ax2.legend()
plt.tight_layout(); plt.show()

How to read what you got:

Healthy (what you should see today): both losses fall together and stay close. A hallmark of augmentation and dropout is that train loss may sit above val loss, and train accuracy at or below val accuracy, for the first epochs, because the model is graded on harder (cropped, flipped) images with dropout active during training, but on clean images at eval time. This is normal and good.
Overfitting: train loss keeps falling while val loss bottoms out and climbs. The gap between the curves is memorization. Try the ablation: rebuild train_ds with test_tf (no augmentation) and retrain — by epoch 15 you’ll typically see train accuracy near 97% while val stalls around 74–76% with val loss rising from epoch ~8. Same model, same data, ~6 points worse — augmentation is doing real work.
Underfitting: both curves plateau at mediocre values with no gap. The model lacks capacity or training time. More epochs, higher learning rate, or a bigger model.

The gap between the curves is your compass for the rest of this course: Lesson 7 exists almost entirely to shrink it.
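
The same reading can be roughed out in code. This is a heuristic sketch, not a diagnostic standard: `summarize_curves` is a hypothetical helper, the first call uses the logged epochs from the expected output above, and the second uses made-up numbers with the overfitting shape.

```python
def summarize_curves(epochs, train_loss, val_loss):
    """Report the best validation epoch, the final val-train gap, and whether val loss has turned up."""
    best = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        "best_epoch": epochs[best],
        "final_gap": round(val_loss[-1] - train_loss[-1], 3),
        "val_rising": val_loss[-1] > val_loss[best],   # val loss has climbed off its minimum
    }

# Logged epochs from the augmented run shown earlier.
healthy = summarize_curves(
    epochs=[1, 2, 3, 5, 8, 12, 15],
    train_loss=[1.560, 1.147, 0.972, 0.795, 0.657, 0.564, 0.517],
    val_loss=[1.221, 1.014, 0.868, 0.729, 0.622, 0.568, 0.545],
)
print(healthy)   # {'best_epoch': 15, 'final_gap': 0.028, 'val_rising': False}

# Made-up numbers with the overfitting shape: train keeps falling, val turns upward.
overfit = summarize_curves(
    epochs=[1, 2, 3, 4, 5, 6],
    train_loss=[1.40, 0.90, 0.55, 0.30, 0.15, 0.08],
    val_loss=[1.20, 0.95, 0.85, 0.88, 0.97, 1.10],
)
print(overfit)   # {'best_epoch': 3, 'final_gap': 1.02, 'val_rising': True}
```

A best epoch well before the last one, combined with a large positive gap, is the numeric signature of the overfitting curve described above.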

🧪 Your task

Extend TinyCNN with a third conv block — Conv2d(64, 128) twice, each followed by ReLU, then a MaxPool2d(2) — and fix the classifier so the shapes line up. Before running anything, predict on paper: the spatial size after the third pool, the new Flatten output size, and the new in_features of the first Linear. Then verify with a dummy tensor, print the parameter counts of the old and new models, and train for 15 epochs. You should land around 84–86% validation accuracy — and, surprisingly, with fewer total parameters. Explain why.

Hint: each pool halves the spatial size — 32 → 16 → 8 → 4. Verify your Flatten size with model.features(torch.zeros(1, 3, 32, 32)).shape before wiring up the Linear. For the parameter puzzle, remember which single layer held 94% of TinyCNN’s weights, and what happens to its input size when H and W halve again.

Solution

import torch
import torch.nn as nn

class TinyCNN3(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            # Block 1: 3x32x32 -> 32x16x16
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            # Block 2: 32x16x16 -> 64x8x8
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            # Block 3 (new): 64x8x8 -> 128x4x4
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 256),   # 2048, was 4096
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# 1) Verify the shape prediction
model3 = TinyCNN3()
with torch.no_grad():
    print(model3.features(torch.zeros(1, 3, 32, 32)).shape)
# torch.Size([1, 128, 4, 4])  -> Flatten gives 2048 ✓

# 2) Parameter counts
count = lambda m: sum(p.numel() for p in m.parameters() if p.requires_grad)
print(f"TinyCNN : {count(TinyCNN()):,}")   # 1,116,970
print(f"TinyCNN3: {count(model3):,}")      #   814,122

# 3) Train — identical loop to the lesson
model3 = model3.to(device)
optimizer = torch.optim.AdamW(model3.parameters(), lr=1e-3, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()

for epoch in range(1, 16):
    tl, ta = run_epoch(model3, train_loader, criterion, optimizer)
    vl, va = run_epoch(model3, test_loader, criterion)
    print(f"epoch {epoch:2d} | train loss {tl:.3f} acc {ta:.3f} "
          f"| val loss {vl:.3f} acc {va:.3f}")
# epoch 15 | train loss 0.412 acc 0.857 | val loss 0.451 acc 0.849  (roughly)

Why deeper yet smaller? The new block adds two convs costing \(128 \times (64 \cdot 9 + 1) + 128 \times (128 \cdot 9 + 1) \approx 221\)K parameters, but the extra pool shrinks the feature map from 8×8 to 4×4, so the first Linear drops from \(4096 \times 256 \approx 1.05\)M weights to \(2048 \times 256 \approx 0.52\)M. The ~524K saved in the head outweighs the ~221K spent on convs, for a net reduction of about 303K. Depth bought more representational power and a smaller model, because dense layers pay per pixel while convs don’t. Accuracy improves to ~85% because the third block sees a larger receptive field and composes more abstract features.
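
The accounting can be replayed with plain arithmetic. The helpers `conv_params` and `linear_params` are hypothetical names for the two counting formulas used in this lesson.

```python
def conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """C_out * (C_in * k * k + 1): weights plus one bias per filter."""
    return c_out * (c_in * k * k + 1)

def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

added_convs = conv_params(64, 128) + conv_params(128, 128)   # the new block 3
head_before = linear_params(64 * 8 * 8, 256)                 # Linear(4096, 256)
head_after = linear_params(128 * 4 * 4, 256)                 # Linear(2048, 256)
saved_in_head = head_before - head_after

print(added_convs)                  # 221440
print(saved_in_head)                # 524288
print(saved_in_head - added_convs)  # 302848 fewer parameters overall
```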

Key takeaways

Convolution wins on images through locality, weight sharing, and translation equivariance — parameter cost scales with kernel and channel sizes, not image size.
Memorize the output-size formula \(\lfloor (H + 2p - k)/s \rfloor + 1\); in practice you mostly need k=3, p=1 (same size) and MaxPool2d(2) (halve it).
Tensors are channels-first (N, C, H, W); Flatten bridges to Linear, and the safe way to size that Linear is a dummy forward pass through features, not arithmetic on paper.
The canonical CNN shape: resolution down, channels up; two 3×3 convs per block beat one big kernel; dense heads hold most of the parameters, which is why deeper-with-more-pooling can mean fewer weights.
Augmentation lives only in the train transform pipeline and must be label-preserving; the canonical CIFAR pair is random crop with padding plus horizontal flip.
Read curves, not final numbers: falling-together is healthy, a widening train/val gap is overfitting, and twin plateaus at mediocre values are underfitting.

In the next lesson: your networks get deeper — and promptly stop training. Lesson 7 brings the stabilizers (batch norm, better initialization, schedulers) and the regularizers that keep deep models honest.
