Kader Mohideen

In this chapter

  • 15.1 — Convolution & filters
  • 15.2 — Pooling (max vs average)
  • 15.3 — Padding & stride (and the output-size formula)
  • 15.4 — The receptive field
  • 15.5 — CNN architectures (LeNet, AlexNet, VGG, ResNet)
  • 15.6 — Batch normalization
  • 15.7 — Convolutional neural networks in general
  • 15.8 — Transfer learning (reusing pretrained backbones)
  • 15.9 — Worked size calculation (end to end)
  • 15.10 — Quick reference
  • 15.11 — Key takeaways
  • 15.12 — See also

Chapter 15 — 🖼️ Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are the family of deep models built to understand data laid out on a grid — above all, images. They replace the dense “everything connects to everything” wiring of ordinary neural networks with small sliding filters that look for local patterns, which makes them dramatically more efficient and far better at vision. CNNs sit squarely inside Deep Learning and are the workhorse behind most computer vision systems.

🧭 In context: Deep Learning · used for images, video, audio spectrograms, and any grid-structured signal · the one key idea is learning small reusable filters that slide across the input to detect local patterns.

💡 Remember this: a CNN learns one small filter and slides it across the whole image, so the same pattern is detected everywhere using far fewer weights than a dense network.

15.1 — Convolution & filters

The core problem with feeding an image into a plain neural network is size. A modest 224×224 colour image has 224 × 224 × 3 ≈ 150,000 numbers. Connect that to even one hidden layer of 1,000 neurons and you already have 150 million weights — for one layer. That is wasteful, because a cat in the top-left corner and a cat in the bottom-right corner are the same pattern, yet a dense network has to learn them twice with separate weights.

The everyday analogy. Think of looking for a friend’s face in a crowded group photo. You don’t memorise the whole picture at once; you slide your attention across it with a small mental template (“does this little patch look like an eye?”), reusing the same template everywhere you look. A convolution is exactly that: one small template, slid across the whole image, asking the same question at every spot.

A convolution fixes the size problem with two ideas: local connectivity (a neuron only looks at a small patch, not the whole image) and parameter sharing (the same small patch-detector is reused at every position). The patch-detector is a tiny grid of weights called a kernel or filter — typically 3×3 or 5×5.
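To put rough numbers on the saving, the sketch below counts weights for the dense layer described earlier and for a small convolutional layer. The 64-filter, 3×3 conv layer is an assumption chosen for illustration, not a figure from the text.

```python
# Weight counts for one layer on a 224x224 RGB image (biases ignored).
H, W, C = 224, 224, 3
inputs = H * W * C                      # 150,528 input numbers

dense_weights = inputs * 1000           # every input wired to each of 1,000 neurons

# Hypothetical conv layer for comparison: 64 filters, each 3x3 across 3 channels.
conv_weights = 3 * 3 * C * 64

print(f"inputs        = {inputs:,}")
print(f"dense weights = {dense_weights:,}")
print(f"conv weights  = {conv_weights:,}")
```

The dense layer needs about 150 million weights; the conv layer needs 1,728, because each filter is reused at every position.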

The operation itself is simple. Slide the kernel over the image; at each position, multiply the overlapping numbers element-wise, add them up, and write the single result into an output grid called a feature map (or activation map). That sum is the dot product of the filter with the local patch.

\[ \text{out}[i,j] = \sum_{m}\sum_{n} \text{input}[i+m,\; j+n]\cdot \text{kernel}[m,n] \]

In words: the value at output position \((i,j)\) is what you get by laying the kernel down with its top-left corner at \((i,j)\), multiplying each overlapping pair of numbers, and adding everything up.

Also written: \(\;(\text{input} * \text{kernel})[i,j] = \sum_{m,n} \text{input}[i+m,\,j+n]\,\text{kernel}[m,n]\), where \(*\) denotes the operator that deep-learning libraries call convolution. Strictly speaking it is cross-correlation, because the kernel is not flipped before sliding.

[Animation: an amber kernel steps across the input grid, writing one green output cell at each stop.]

[Figure: a 3×3 kernel sliding over a 5×5 input, producing one cell of the output.
Highlighted input patch: 1 0 1 / 0 1 1 / 1 0 0.
Kernel: 1 0 1 / 0 1 0 / 1 0 1.
Dot product → 4, written into one output cell.]

Worked example. Take the highlighted patch and kernel above (the “X-shape” detector with 1s on the four corners and the centre):

\[ 1{\cdot}1 + 0{\cdot}0 + 1{\cdot}1 \;+\; 0{\cdot}0 + 1{\cdot}1 + 1{\cdot}0 \;+\; 1{\cdot}1 + 0{\cdot}0 + 0{\cdot}1 = 4 \]

Slide one step right and repeat for every position. Early filters in a trained CNN learn to detect edges, corners and colour blobs; deeper filters combine those into textures, then object parts, then whole objects.

import numpy as np
def conv2d(img, k):                       # valid convolution, no padding
    kh, kw = k.shape
    H, W = img.shape
    out = np.zeros((H-kh+1, W-kw+1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i,j] = (img[i:i+kh, j:j+kw] * k).sum()   # patch ⊙ kernel
    return out

img = np.array([[1,0,1,2,1],[0,1,1,0,2],[1,0,0,1,1],[2,1,0,1,0],[1,1,2,0,1]])
ker = np.array([[1,0,1],[0,1,0],[1,0,1]])
o = conv2d(img, ker)
assert o[0,0] == 4                        # matches the hand calc above
print(o)

The same operation in a real framework is a single layer. PyTorch’s Conv2d handles the channel bookkeeping (§15.7), the sliding, and the learnable weights for you:

import torch, torch.nn as nn
x = torch.randn(1, 3, 32, 32)             # batch=1, 3 channels, 32×32
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
y = conv(x)
print(y.shape)                            # torch.Size([1, 16, 32, 32])
print(conv.weight.shape)                  # [16, 3, 3, 3]  = out×in×Kh×Kw

Where this shows up. A blur is just a 3×3 kernel of all \(1/9\) (averaging neighbours); a vertical-edge detector is the Sobel kernel \(\begin{smallmatrix}-1&0&1\\-2&0&2\\-1&0&1\end{smallmatrix}\). For decades, image editors hand-designed these filters. A CNN’s only new trick is to learn the numbers in the kernel from data instead of hand-picking them.
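As a rough illustration of those two hand-designed filters, the sketch below applies them to a toy image with a single vertical edge. The image and the helper name `cross_correlate2d` are assumptions made for this example; the loop is the same sliding dot product as above.

```python
import numpy as np

def cross_correlate2d(img, k):
    """Valid (no padding) sliding dot product, the operation conv layers compute."""
    kh, kw = k.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + kh, j:j + kw] * k).sum()
    return out

# Toy image (an assumption for illustration): dark left half, bright right half.
img = np.zeros((5, 6))
img[:, 3:] = 1.0

sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
blur = np.full((3, 3), 1 / 9)

edges = cross_correlate2d(img, sobel_x)    # large only where the window straddles the edge
smooth = cross_correlate2d(img, blur)      # the hard step becomes a gradual ramp

# A textbook (flipped-kernel) convolution only changes the sign for this kernel.
true_conv = cross_correlate2d(img, np.flip(sobel_x))

print(edges)
print(smooth)
```

Every row of `edges` is `[0, 4, 4, 0]`: the filter stays silent on flat regions and fires where brightness changes from left to right. Flipping the Sobel kernel negates the response, which is why the distinction between convolution and cross-correlation rarely matters once the weights are learned.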

Tip

Intuition: a convolution layer is a stack of pattern-stamps. Each filter stamps across the whole image asking “is my pattern here?” — and parameter sharing means it learns that pattern once and detects it everywhere.

15.2 — Pooling (max vs average)

After a convolution, neighbouring feature-map values are often redundant: an edge detected at pixel (10,10) usually also fires at (10,11). Pooling shrinks the feature map by summarising each small region with a single number. This cuts computation and gives a degree of local translation invariance: nudge the input a pixel or two and the pooled output often barely changes.

The two common kinds:

  • Max pooling takes the largest value in each window. It keeps the strongest activation — “was the feature present anywhere in this region?” — and is the default in most vision CNNs.
  • Average pooling takes the mean. It smooths rather than picks; it shows up mainly as global average pooling at the end of modern networks, collapsing each whole feature map to one number.

Pooling has no learnable parameters — it is a fixed rule. A 2×2 window with stride 2 halves both height and width.

[Figure: 2×2 pooling on a 4×4 input.
Input: 1 3 2 1 / 4 2 0 1 / 0 1 5 2 / 1 2 3 0.
Max pooling output: 4 2 / 2 5.
Average pooling output: 2.5 1.0 / 1.0 2.5.]

Worked example. Top-left 2×2 window is \(\{1,3,4,2\}\): max pooling outputs \(4\), average pooling outputs \((1+3+4+2)/4 = 2.5\). Repeat across the four windows to get the 2×2 results shown.
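The worked example can be reproduced from scratch in a few lines. This is a minimal sketch that assumes non-overlapping windows that tile the input exactly; `pool2d` is a hypothetical helper, not a library function.

```python
import numpy as np

def pool2d(x, size=2, reduce=np.max):
    """Non-overlapping pooling (stride == window size) on a 2-D array."""
    H, W = x.shape
    assert H % size == 0 and W % size == 0, "sketch assumes the window tiles the input exactly"
    # View the array as (H/size, size, W/size, size) blocks, then reduce inside each block.
    blocks = x.reshape(H // size, size, W // size, size)
    return reduce(blocks, axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 2, 0, 1],
              [0, 1, 5, 2],
              [1, 2, 3, 0]], dtype=float)

max_out = pool2d(x, 2, np.max)     # [[4, 2], [2, 5]]
avg_out = pool2d(x, 2, np.mean)    # [[2.5, 1.0], [1.0, 2.5]]
print(max_out)
print(avg_out)
```

Nothing in `pool2d` is learned: swapping `np.max` for `np.mean` is the entire difference between the two kinds of pooling.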

| | Max pooling | Average pooling |
|---|---|---|
| Keeps | strongest activation | overall level |
| Good for | detecting presence of a feature | smoothing, final global pooling |
| Effect on noise | passes spikes through | dampens spikes |
| Where used | throughout VGG/AlexNet | global pool head in ResNet |

import torch, torch.nn as nn
x = torch.tensor([[[[1.,3,2,1],[4,2,0,1],[0,1,5,2],[1,2,3,0]]]])  # 1×1×4×4
print(nn.MaxPool2d(2)(x).squeeze())       # tensor([[4., 2.], [2., 5.]])
print(nn.AvgPool2d(2)(x).squeeze())       # tensor([[2.5, 1.0], [1.0, 2.5]])
print(nn.AdaptiveAvgPool2d(1)(x).item())  # global average → one number
Tip

Intuition: max pooling is a tolerance buffer for where a feature sits. As long as the strongest response lands somewhere in the window, the exact pixel stops mattering — that is what makes the later layers care about what is present, not precisely where.
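A small experiment makes the tolerance, and its limits, concrete. The 4×4 maps and the activation value 9 below are made up for illustration.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2-D array with even sides."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

a = np.zeros((4, 4)); a[0, 0] = 9.0   # strong activation at (0, 0)
b = np.zeros((4, 4)); b[1, 1] = 9.0   # nudged to (1, 1), still inside the same window
c = np.zeros((4, 4)); c[1, 2] = 9.0   # nudged across the window boundary

same_window = np.array_equal(max_pool_2x2(a), max_pool_2x2(b))
crossed = np.array_equal(max_pool_2x2(a), max_pool_2x2(c))
print(same_window, crossed)           # True False
```

A shift that stays inside one window leaves the pooled map untouched; a shift that crosses a window boundary does change it. The invariance is local and approximate, and it accumulates as pooling layers stack.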

15.3 — Padding & stride (and the output-size formula)

Two knobs control how a filter sweeps the input. Stride is the step size: stride 1 moves one pixel at a time; stride 2 jumps two pixels, roughly halving the output’s height and width (so about a quarter as many output positions, and a quarter of the compute). Padding adds a border of (usually zero) pixels around the input so the filter can sit on the edges.

Why pad? Without it, every convolution shrinks the image — a 3×3 filter trims one pixel off each side — and corner pixels get visited by fewer windows than centre pixels, so edge information leaks away. “Same” padding adds just enough border to keep the output the same size as the input; “valid” padding means no padding (the output shrinks).

[Figure: “same” padding. A border of zeros around the real pixels lets the filter reach the corners, so the output stays the same H×W as the input. Dropping the border gives “valid” padding, and the output shrinks.]

The output size along one dimension is:

\[ O = \left\lfloor \frac{W - K + 2P}{S} \right\rfloor + 1 \]

In words: start with the input width \(W\), subtract the kernel width \(K\), add the padding on both sides \(2P\), divide by the stride \(S\) to count how many full steps fit, round down, and add one for the starting position.

Also written: \(\;O = \left\lfloor \dfrac{W + 2P - K}{S} \right\rfloor + 1\) — the same expression with the terms reordered; some texts write \(W_{\text{in}}\) and \(W_{\text{out}}\) for clarity.

where \(W\) = input size, \(K\) = kernel size, \(P\) = padding, \(S\) = stride.

def out_size(W, K, P, S):
    return (W - K + 2*P)//S + 1

assert out_size(5, 3, 0, 1) == 3     # 5→3, valid, no padding shrinks
assert out_size(5, 3, 1, 1) == 5     # 'same' padding keeps size
assert out_size(7, 3, 0, 2) == 3     # stride 2 roughly halves

For “same” padding with stride 1, the border you need is \(P = (K-1)/2\) — so \(1\) for a 3×3 filter, \(2\) for a 5×5. That is exactly why odd kernel sizes are preferred: they have a clean centre and a symmetric pad.
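The padding rule is easy to check numerically. The sketch below reuses the output-size formula; `same_pad` is a hypothetical helper name, and the input width of 32 is arbitrary.

```python
def out_size(W, K, P, S):
    """Output length along one dimension: floor((W - K + 2P) / S) + 1."""
    return (W - K + 2 * P) // S + 1

def same_pad(K):
    """Padding that preserves size at stride 1; exact only for odd K."""
    return (K - 1) // 2

for K in (3, 5, 7):
    assert out_size(32, K, same_pad(K), 1) == 32   # odd kernels: size preserved

# Even kernel: no symmetric integer pad preserves the size.
print(out_size(32, 4, 1, 1), out_size(32, 4, 2, 1))   # 31 33
```

With a 4×4 kernel, a pad of 1 leaves the output one pixel short and a pad of 2 makes it one pixel too large, so frameworks would have to pad asymmetrically. Odd kernels avoid the problem.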

Warning

Common mistake: forgetting the floor. With \(W=8,\,K=3,\,P=0,\,S=2\) the formula gives \(\lfloor 5/2\rfloor + 1 = 3\), not \(3.5\) and not \(4\). When the stride does not divide \(W - K + 2P\) evenly, the rightmost pixels are simply dropped (here the eighth column is never read), a silent off-by-one that breaks shape assertions deep in a network.
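To see exactly which pixels a strided window reads, the sketch below lists the window positions for an unpadded input. The helper names and the widths 8 and 7 are assumptions picked to contrast an uneven fit with an even one.

```python
def window_starts(W, K, S):
    """Left edges of every window that fits entirely inside an unpadded input of length W."""
    return list(range(0, W - K + 1, S))

def dropped_pixels(W, K, S):
    """Input indices that no window ever covers."""
    covered = {i for s in window_starts(W, K, S) for i in range(s, s + K)}
    return sorted(set(range(W)) - covered)

print(window_starts(8, 3, 2), dropped_pixels(8, 3, 2))   # [0, 2, 4] [7]
print(window_starts(7, 3, 2), dropped_pixels(7, 3, 2))   # [0, 2, 4] []
```

Both inputs produce three outputs, but the 8-pixel input silently loses its last column because \((8-3)/2\) is not a whole number.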

15.4 — The receptive field

Intuition first. Picture a periscope made of stacked lenses. A neuron deep in the network can only “see” the input through the chain of filters below it — and each layer widens the cone of pixels that feed into it. The receptive field is the size of that cone: the patch of the original image that influences one value in a feature map. Early layers see a tiny window; deep layers see most of the picture, which is why they can recognise whole objects rather than just edges.

For a stack of convolutions, the receptive field grows by \((K-1)\) per layer, scaled by the product of all earlier strides. Starting from \(R_0 = 1\) (a single input pixel), the recurrence is:

\[ R_{\ell} = R_{\ell-1} + (K_{\ell}-1)\prod_{i=1}^{\ell-1} S_i \]

In words: the receptive field at layer \(\ell\) equals the previous layer’s receptive field plus the new kernel’s extra reach \((K_\ell-1)\), magnified by how much everything below has already downsampled (the product of earlier strides).

Also written: with all strides equal to 1, this collapses to the simple sum \(R_L = 1 + \sum_{\ell=1}^{L}(K_\ell - 1)\) — e.g. three stacked 3×3 convs give \(1 + 3\times2 = 7\), matching one 7×7 filter.
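The recurrence translates directly into a short loop. This is a minimal sketch that treats pooling as a layer with its own kernel and stride and ignores dilation; `receptive_field` is a hypothetical helper.

```python
def receptive_field(layers):
    """layers: list of (kernel, stride) pairs, ordered from input to output.

    Applies R_l = R_{l-1} + (K_l - 1) * prod(earlier strides), starting from R_0 = 1.
    """
    r, jump = 1, 1              # jump = product of the strides seen so far
    for k, s in layers:
        r += (k - 1) * jump
        jump *= s
    return r

assert receptive_field([(3, 1)] * 3) == 7        # three 3x3 convs, as in the text
assert receptive_field([(3, 1), (3, 1)]) == 5    # two 3x3 convs match one 5x5

# conv 3x3 -> 2x2 pool (stride 2) -> conv 3x3: the pool doubles the last conv's reach
print(receptive_field([(3, 1), (2, 2), (3, 1)]))   # 8
```

The last line shows why downsampling matters: after a stride-2 pool, each extra 3×3 conv adds 4 input pixels of reach instead of 2.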

[Figure: stacked 3×3 convs widen the receptive field. One output value of conv-3 (pink) has a 3×3 reach in conv-2 (amber), a 5×5 reach in conv-1 (green), and a 7×7 reach in the input (indigo).]

This is the deeper reason VGG stacks small filters (§15.5): two 3×3 layers reach as far as one 5×5 but with fewer parameters and an extra non-linearity. When a task needs a large receptive field cheaply (segmentation, audio), dilated (atrous) convolutions insert gaps between kernel taps. A 3×3 kernel with dilation \(D\) spans \(2D+1\) pixels, so stacking layers whose dilation doubles each time (1, 2, 4, ...) grows the field exponentially with depth, without adding weights or losing resolution.

import torch.nn as nn
# dilation=2 makes a 3×3 kernel reach across a 5×5 span, same 9 weights
nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)
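
The effect of dilation on reach can be worked out with plain arithmetic. The sketch below assumes stride-1 layers and a doubling dilation schedule, which is one common choice rather than the only one; both helper names are hypothetical.

```python
def effective_kernel(K, D):
    """Span covered by a K-tap kernel with dilation D."""
    return D * (K - 1) + 1

def stacked_rf(K, dilations):
    """Receptive field of stride-1 convs with the given per-layer dilations."""
    return 1 + sum(effective_kernel(K, d) - 1 for d in dilations)

assert effective_kernel(3, 2) == 5       # the 3x3, dilation-2 layer above spans 5x5
print(stacked_rf(3, [1, 1, 1, 1]))       # 9
print(stacked_rf(3, [1, 2, 4, 8]))       # 31
```

Four ordinary 3×3 layers see a 9×9 patch; the same four layers with doubling dilation see 31×31, using exactly the same number of weights.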
Tip

Rule of thumb: if your model can’t “see” a whole object — e.g. it misses large shapes but nails small textures — its receptive field is too small for the input. Add depth, increase stride/pooling, or use dilation, rather than blindly widening kernels.

15.5 — CNN architectures (LeNet, AlexNet, VGG, ResNet)

CNN design history is a story of going deeper while keeping training stable. Four landmarks tell most of it.

LeNet-5 (1998) was the original — a tiny network for reading handwritten digits, with two conv layers, pooling, and a couple of dense layers. It proved the convolution-then-pool recipe worked.

AlexNet (2012) was the same recipe scaled up and run on GPUs, and it crushed the ImageNet competition, kicking off the deep-learning boom. Its contributions were practical: ReLU activations (faster training than tanh), dropout for regularisation, and data augmentation.

VGG (2014) made the design uniform: stack many 3×3 convolutions, double the channel count after each pooling step. Its insight — two stacked 3×3 filters see the same region as one 5×5 filter but with fewer parameters and an extra non-linearity — is why 3×3 became the standard.
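The parameter claim can be checked with quick arithmetic. The channel width of 64 is an arbitrary assumption; the ratios hold for any width as long as input and output channels match.

```python
def conv_weights(K, C_in, C_out):
    """Weights (no bias) in one KxK conv layer."""
    return K * K * C_in * C_out

C = 64  # channel width, an arbitrary choice for illustration

two_3x3 = 2 * conv_weights(3, C, C)      # same 5x5 reach
one_5x5 = conv_weights(5, C, C)
three_3x3 = 3 * conv_weights(3, C, C)    # same 7x7 reach
one_7x7 = conv_weights(7, C, C)

print(two_3x3, one_5x5)      # 73728 102400
print(three_3x3, one_7x7)    # 110592 200704
```

Two stacked 3×3 layers use 18/25 of the weights of one 5×5 layer, and three use 27/49 of one 7×7 layer, while adding extra non-linearities along the way.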

ResNet (2015) cracked the depth barrier. Below is the trend and the key fix.

| Network | Year | Depth | Key idea |
|---|---|---|---|
| LeNet-5 | 1998 | 7 | conv+pool works |
| AlexNet | 2012 | 8 | ReLU, dropout, GPUs |
| VGG-16 | 2014 | 16 | uniform 3×3 stacks |
| ResNet-50 | 2015 | 50 | residual connections |

The depth jump over those 17 years is hard to feel from a table, so here it is to scale — each bar is one network’s layer count:

[Bar chart, depth in layers: LeNet-5 = 7, AlexNet = 8, VGG-16 = 16, ResNet-50 = 50.]

Residual connections. Naively stacking more layers made networks worse: beyond a certain depth even the training error went up, because very deep plain nets are hard to optimise and gradients struggle to reach the early layers. ResNet adds a skip connection that lets the input bypass a block and be added to its output:

\[ y = F(x) + x \]

In words: the block’s output is whatever transformation \(F\) computes from \(x\), plus the original \(x\) carried straight through — so the layers only have to learn the adjustment to \(x\), not rebuild it from scratch.

Also written: \(\;y = x + F(x;\,W)\), where \(F(x;W)\) is the stacked conv–ReLU–conv path with learnable weights \(W\); the two forms are identical since addition commutes.

Instead of forcing the block to learn the full mapping, it only has to learn the residual \(F(x)\), that is, the change to apply to \(x\). If the best thing a block can do is nothing, learning \(F(x)=0\) is trivial, so in principle extra blocks can fall back to the identity instead of degrading the network. The \(+x\) term also gives gradients a clean highway straight back through the network.
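The "doing nothing is trivial" argument can be seen in a toy block. This sketch makes two simplifying assumptions: dense layers stand in for the 3×3 convs, and the ReLU that normally follows the addition is left out so the identity is exact.

```python
import numpy as np

rng = np.random.default_rng(0)

def residual_block(x, W1, W2):
    """y = F(x) + x, with F = linear -> ReLU -> linear (dense stand-ins for convs)."""
    h = np.maximum(0.0, x @ W1)     # first layer + ReLU
    return h @ W2 + x               # second layer, then add the skip path

def plain_block(x, W1, W2):
    """The same two layers without the skip connection."""
    return np.maximum(0.0, x @ W1) @ W2

x = rng.normal(size=(4, 8))         # 4 samples, 8 features
zeros = np.zeros((8, 8))

identity_kept = np.allclose(residual_block(x, zeros, zeros), x)
signal_lost = np.allclose(plain_block(x, zeros, zeros), 0.0)
print(identity_kept, signal_lost)   # True True
```

With all-zero weights the residual block passes its input through unchanged, whereas the plain block erases it. A plain block would have to learn an identity mapping through its non-linearity to achieve the same thing.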

flowchart LR
    x(["x"]) --> C1["3×3 conv → ReLU"] --> C2["3×3 conv"] --> add((+))
    x -- "skip / identity" --> add
    add --> R["ReLU"] --> y(["y = F(x) + x"])

Which backbone should I reach for?

flowchart TD
    A["Need a CNN backbone?"] --> B{"Deep net,<br/>top accuracy?"}
    B -- yes --> R["ResNet (residual blocks)<br/>— the safe default"]
    B -- no --> C{"Tight compute /<br/>mobile?"}
    C -- yes --> M["MobileNet / 1×1 bottlenecks"]
    C -- no --> D{"Just learning<br/>or tiny task?"}
    D -- yes --> L["LeNet / small VGG"]
    D -- no --> V["VGG-style 3×3 stacks"]

15.6 — Batch normalization

Intuition first. Imagine a relay race where each runner is handed a baton whose weight keeps changing run to run — they can never settle into a rhythm. Deep layers face the same problem: as the weights below them update, the distribution of inputs each layer receives keeps shifting (often called internal covariate shift), so every layer is forever chasing a moving target. Batch normalization hands each layer a baton of standardised weight: it re-centres and re-scales each feature to roughly zero mean and unit variance, using the statistics of the current mini-batch.

The picture: BatchNorm takes a lopsided spread of activations and slides + squeezes it back to a tidy bell around zero, every batch.

[Figure: distribution of activations before BatchNorm (off-centre, wide) and after (centred on 0, unit width).]

For a feature \(x\) over a mini-batch \(\mathcal{B}\):

\[ \hat{x} = \frac{x - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}, \qquad y = \gamma\,\hat{x} + \beta \]

In words: subtract the batch mean and divide by the batch standard deviation to standardise the feature, then let the network rescale and re-shift it with two learnable knobs \(\gamma\) (scale) and \(\beta\) (shift) so it can undo the normalisation if that helps.

Also written: \(\;y = \gamma\,\dfrac{x-\mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2+\epsilon}} + \beta\), where \(\epsilon\) is a tiny constant (\(\sim10^{-5}\)) that prevents division by zero.

The payoffs are large: training tolerates higher learning rates, is less sensitive to weight initialisation, and BatchNorm acts as a mild regulariser (the batch noise nudges activations around). At inference time there is no batch, so the layer switches to running averages of \(\mu\) and \(\sigma^2\) collected during training — a frequent source of bugs if you forget to put the model in eval() mode.
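The train-versus-inference switch is easier to follow in a from-scratch version. The class below is a simplified sketch, not a framework implementation: it normalises each feature over axis 0, uses an exponential moving average with an assumed momentum of 0.1, and the name `BatchNormSketch` is hypothetical.

```python
import numpy as np

class BatchNormSketch:
    """Minimal batch norm over axis 0 (one statistic per feature). Illustrative only."""

    def __init__(self, n, eps=1e-5, momentum=0.1):
        self.gamma, self.beta = np.ones(n), np.zeros(n)
        self.run_mean, self.run_var = np.zeros(n), np.ones(n)
        self.eps, self.momentum, self.training = eps, momentum, True

    def __call__(self, x):
        if self.training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # Moving averages collected now, used later at inference time.
            self.run_mean = (1 - self.momentum) * self.run_mean + self.momentum * mu
            self.run_var = (1 - self.momentum) * self.run_var + self.momentum * var
        else:
            mu, var = self.run_mean, self.run_var
        return self.gamma * (x - mu) / np.sqrt(var + self.eps) + self.beta

rng = np.random.default_rng(0)
bn = BatchNormSketch(3)
for _ in range(200):                                   # "training" on shifted, scaled batches
    out = bn(rng.normal(loc=5.0, scale=2.0, size=(64, 3)))

bn.training = False                                    # the equivalent of eval()
single = bn(np.full((1, 3), 5.0))                      # one sample sitting at the data mean
print(out.mean(axis=0), out.std(axis=0))               # about 0 and about 1
print(single)                                          # close to 0
```

In training mode every batch comes out with zero mean and unit spread. In inference mode a single sample is normalised with the stored averages, so a value at the data mean maps to roughly zero. Left in training mode, that same single sample would be normalised against its own statistics and collapse to exactly zero regardless of its value.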

import torch.nn as nn
block = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1, bias=False),  # bias is redundant before BN
    nn.BatchNorm2d(64),                           # normalize the 64 channels
    nn.ReLU(),
)
# model.train() uses batch stats; model.eval() uses running averages
Note

Where it sits: the standard order is Conv → BatchNorm → ReLU. Because BatchNorm re-centres the output, the conv’s bias term becomes redundant, so it is usually disabled (bias=False). Related variants — LayerNorm, GroupNorm, InstanceNorm — normalise over different axes and are preferred when batches are tiny or the task is style transfer.
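The redundancy of the bias is a one-line consequence of subtracting the mean, and can be confirmed numerically. The random data and bias values below are placeholders for a conv layer's output.

```python
import numpy as np

def standardise(x, eps=1e-5):
    """Per-feature batch standardisation (the x-hat step of batch norm)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(1)
pre_activation = rng.normal(size=(32, 4))         # stand-in for a conv output, no bias
bias = np.array([3.0, -1.0, 0.5, 10.0])           # arbitrary per-channel bias

without_bias = standardise(pre_activation)
with_bias = standardise(pre_activation + bias)

bias_has_no_effect = np.allclose(without_bias, with_bias)
print(bias_has_no_effect)                          # True
```

Adding a constant to every value in a channel shifts the batch mean by the same constant, so the subtraction removes it again. The learnable \(\beta\) already provides whatever offset the layer needs.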

15.7 — Convolutional neural networks in general

Step back and look at the whole pipeline. A CNN processes an image in two phases: a feature extractor of stacked conv + activation + pooling blocks that turns raw pixels into increasingly abstract feature maps, followed by a classifier head (a global pool plus one or two dense layers) that maps those features to class scores, usually finished with a softmax.

The key bookkeeping is channels, and the one idea that trips people up is this: a convolution filter is not flat. A colour image has 3 stacked sheets — red, green, blue. A “3×3 filter” on it is really a little 3×3×3 cube of weights: it covers a 3×3 patch on all three sheets at once, multiplies, and adds everything into one number. So each filter, wherever it lands, outputs a single value — collapsing every input channel into one.

A conv layer is just many such cubes side by side. Put 64 filters in the layer and you get 64 output sheets — a feature map with 64 channels. Stack layers and a pattern emerges: spatial size shrinks (pooling and stride), while channel count grows. The network is steadily trading “where” (precise location) for “what” (rich pattern identity).
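The channel bookkeeping becomes concrete once the loops are written out. This is a deliberately slow sketch with valid padding, stride 1 and no bias; the 8×8 input size is chosen only to keep it fast, and `conv_layer` is a hypothetical helper.

```python
import numpy as np

def conv_layer(x, filters):
    """x: (C_in, H, W); filters: (C_out, C_in, K, K). Valid padding, stride 1, no bias."""
    C_out, C_in, K, _ = filters.shape
    assert x.shape[0] == C_in, "each filter must span every input channel"
    H_out, W_out = x.shape[1] - K + 1, x.shape[2] - K + 1
    out = np.zeros((C_out, H_out, W_out))
    for f in range(C_out):
        for i in range(H_out):
            for j in range(W_out):
                # One 3-D patch times one 3-D filter cube gives a single number.
                out[f, i, j] = np.sum(x[:, i:i + K, j:j + K] * filters[f])
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(3, 8, 8))          # small RGB-like input
filters = rng.normal(size=(64, 3, 3, 3))    # 64 filters, each a 3x3x3 cube
features = conv_layer(image, filters)
print(features.shape)                        # (64, 6, 6)
```

Three input channels go in and 64 come out: the input channel axis is summed away inside each filter, and the output channel axis is simply the number of filters.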

flowchart LR
    I["Image 224×224×3"] --> B1["Conv+ReLU+Pool 112×112×64"]
    B1 --> B2["Conv+ReLU+Pool 56×56×128"]
    B2 --> B3["Conv+ReLU+Pool 28×28×256"]
    B3 --> GAP["Global avg pool 1×1×256"]
    GAP --> FC["Dense → softmax 1000 classes"]

A single filter’s parameter count is \(K_h \times K_w \times C_{in} + 1\) (the \(+1\) is the bias); a layer with \(C_{out}\) filters has \(C_{out}\) times that.

\[ \#\text{params} = (K_h \times K_w \times C_{in} + 1)\times C_{out} \]

In words: each filter has one weight per cell of its \(K_h\times K_w\) window across every input channel \(C_{in}\), plus one bias; multiply that per-filter count by how many filters \(C_{out}\) the layer holds.

Also written: \(\;\#\text{params} = C_{out}\,(C_{in}K_hK_w + 1)\). When bias is disabled (e.g. before BatchNorm, §15.6) the count drops to \(C_{out}\,C_{in}K_hK_w\).

Note this is independent of image size — the same filter handles a 32×32 or a 512×512 input. That is the parameter-sharing payoff from §15.1, made concrete.
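The formula is short enough to keep around as a helper. The layer sizes below are illustrative examples, and `conv_params` is a hypothetical name.

```python
def conv_params(K_h, K_w, C_in, C_out, bias=True):
    """Learnable parameters in one conv layer: (K_h*K_w*C_in + 1) * C_out."""
    return (K_h * K_w * C_in + (1 if bias else 0)) * C_out

print(conv_params(3, 3, 3, 64))                 # 1792
print(conv_params(3, 3, 64, 128))               # 73856
print(conv_params(3, 3, 64, 64, bias=False))    # 36864
```

Image height and width never appear among the arguments, which is the size-independence described above.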

The 1×1 convolution. A filter of size 1×1 looks at a single pixel but all its channels — it is a tiny dense layer applied identically at every spatial location. Its job is to mix and re-project channels: turn 256 channels into 64 (cheap dimensionality reduction) or 64 into 256, without touching spatial size. This “bottleneck” trick is what lets ResNet-50 stay affordable, and it underpins the Inception and MobileNet families — and the same channel-reprojection idea reappears inside Vision Transformers.

import torch.nn as nn
nn.Conv2d(256, 64, kernel_size=1)   # squeeze 256 channels → 64, per pixel
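
Two quick checks illustrate the idea. The first treats a 1×1 convolution as one matrix multiply per pixel; the second compares weight counts for a direct 3×3 layer against a 1×1 → 3×3 → 1×1 bottleneck. The 7×7 spatial size and the 256/64 widths are assumptions for illustration, and biases are ignored.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 7, 7))            # (channels, H, W)
w = rng.normal(size=(64, 256))              # a 1x1 conv's weights: C_out x C_in

# 1x1 convolution: the same matrix multiply applied to the channel vector at every pixel.
y = np.einsum("oc,chw->ohw", w, x)
assert y.shape == (64, 7, 7)
assert np.allclose(y[:, 2, 5], w @ x[:, 2, 5])   # one pixel == one dense-layer call

# Weight counts: direct 3x3 at 256 channels vs squeeze (1x1), 3x3 at 64, expand (1x1).
direct = 3 * 3 * 256 * 256
bottleneck = 1 * 1 * 256 * 64 + 3 * 3 * 64 * 64 + 1 * 1 * 64 * 256
print(direct, bottleneck)                        # 589824 69632
```

Squeezing to 64 channels before the expensive 3×3 step cuts the weights by roughly a factor of eight in this example.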

15.8 — Transfer learning (reusing pretrained backbones)

Training a deep CNN from scratch needs millions of labelled images and a lot of GPU time. Almost nobody does it. The early-to-middle layers of any vision CNN learn generic features — edges, textures, shapes — that are useful for nearly every image task. Transfer learning reuses those layers.

The everyday analogy. Hiring someone who already speaks the language and just needs to learn your company’s jargon beats training a newborn from scratch. A pretrained backbone already “speaks vision”; you only teach it your handful of labels.

The recipe: take a network already trained on a huge dataset (ImageNet, 1.2M images), call its feature extractor the backbone, discard its original classifier head, and attach a fresh small head for your task. (This same backbone-and-head pattern is the basis of post-training and fine-tuning for large models.) Then either freeze the backbone and train only the head (fast, needs little data), or fine-tune — unfreeze some top layers and train them too at a small learning rate, for a bit more accuracy when you have more data.

[Figure: reuse the frozen backbone, swap only the head. Backbone (edges → textures → shapes): frozen, pretrained on ImageNet. New head (your 5 labels): trained on a few hundred images.]

# real PyTorch / torchvision: reuse a pretrained ResNet for a 5-class problem
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

backbone = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
for p in backbone.parameters():
    p.requires_grad = False                 # freeze generic features
backbone.fc = nn.Linear(2048, 5)            # new head, only this trains
# train head; optionally later unfreeze layer4 at lr=1e-4 to fine-tune

The TensorFlow / Keras equivalent follows the same shape (load weights, drop the top, add a head):

from tensorflow.keras.applications import ResNet50
from tensorflow.keras import layers, Model

base = ResNet50(weights="imagenet", include_top=False, pooling="avg")
base.trainable = False                       # freeze backbone
out = layers.Dense(5, activation="softmax")(base.output)
model = Model(base.input, out)

Rule of thumb for choosing: small dataset + similar to ImageNet → freeze, train head only. Large dataset or a very different domain (medical scans, satellite) → fine-tune more layers.

Tip

Intuition: a pretrained backbone has already learned to see. You are only teaching it the vocabulary of your specific labels — a few hundred examples often suffice, versus the millions needed to learn vision from zero.

15.9 — Worked size calculation (end to end)

Let us trace a tensor’s shape through a small VGG-style block so the formulas from §15.3 and the channel rules from §15.7 click together. Input: a 32×32 RGB image, shape 32×32×3.

| Layer | K, P, S, filters | Output H×W | Channels | Why |
|---|---|---|---|---|
| Input | - | 32×32 | 3 | RGB |
| Conv1 | 3, 1, 1, 16 | 32×32 | 16 | same pad: \((32-3+2)/1+1=32\) |
| MaxPool | 2, 0, 2 | 16×16 | 16 | \((32-2)/2+1=16\) |
| Conv2 | 3, 1, 1, 32 | 16×16 | 32 | same pad keeps 16 |
| MaxPool | 2, 0, 2 | 8×8 | 32 | halves to 8 |
| Flatten | - | - | 2048 | \(8\times8\times32\) |
| Dense | → 10 | - | 10 | class scores |

Each conv keeps spatial size (same padding), each pool halves it, and channels grow as filters increase. The final feature map is \(8\times8\times32 = 2048\) numbers, flattened and fed to a 10-way classifier.

def out_size(W, K, P, S): return (W - K + 2*P)//S + 1
W = 32
W = out_size(W,3,1,1); assert W == 32     # conv1, same padding
W = out_size(W,2,0,2); assert W == 16     # pool
W = out_size(W,3,1,1); assert W == 16     # conv2
W = out_size(W,2,0,2); assert W == 8      # pool
assert W*W*32 == 2048                     # flattened length

Now count Conv1’s parameters: \(3\times3\times3\) weights per filter \(+1\) bias \(= 28\), times \(16\) filters \(= 448\). Tiny — and that number would be identical whether the input were 32×32 or 4000×4000. That size-independence is the whole reason CNNs scale to real images.
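Extending the same count to the whole block shows where the parameters actually sit. The sketch assumes every layer has a bias, as in the Conv1 calculation; both helper names are hypothetical.

```python
def conv_params(K, C_in, C_out):
    return (K * K * C_in + 1) * C_out        # weights + one bias per filter

def dense_params(n_in, n_out):
    return n_in * n_out + n_out              # weights + one bias per output

counts = {
    "conv1": conv_params(3, 3, 16),
    "conv2": conv_params(3, 16, 32),
    "dense": dense_params(8 * 8 * 32, 10),
}
total = sum(counts.values())
print(counts)   # {'conv1': 448, 'conv2': 4640, 'dense': 20490}
print(total)    # 25578
```

The single dense layer holds about 80% of the parameters, and it is the only layer whose size depends on the input resolution.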

The very same block, written as a runnable PyTorch module, lets the framework verify every shape for you:

import torch, torch.nn as nn
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(), nn.Linear(2048, 10),
)
print(net(torch.randn(1, 3, 32, 32)).shape)   # torch.Size([1, 10])
Warning

Common mistake: mismatching the flattened length when you change image size or padding. If you swap Conv1 to “valid” padding, every downstream shape shifts and the Dense layer’s input dimension breaks. Re-run the size formula end-to-end after any change to K, P, S, or input size.
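A small helper makes that re-check mechanical. It hard-codes this section's block (conv 3×3, pool 2, conv 3×3 with same padding, pool 2, 32 final channels); `flattened_length` is a hypothetical name.

```python
def out_size(W, K, P, S):
    return (W - K + 2 * P) // S + 1

def flattened_length(W, conv1_pad):
    """Trace conv3x3 -> pool2 -> conv3x3 (same pad) -> pool2, with 32 final channels."""
    W = out_size(W, 3, conv1_pad, 1)    # Conv1
    W = out_size(W, 2, 0, 2)            # MaxPool
    W = out_size(W, 3, 1, 1)            # Conv2
    W = out_size(W, 2, 0, 2)            # MaxPool
    return W * W * 32

print(flattened_length(32, conv1_pad=1))   # 2048
print(flattened_length(32, conv1_pad=0))   # 1568
print(flattened_length(64, conv1_pad=1))   # 8192
```

Switching Conv1 to valid padding, or feeding 64×64 images, changes the flattened length, so a Dense layer sized for 2048 inputs would fail on its first forward pass.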

15.10 — Quick reference

| Term / formula | Meaning | When / why it matters |
|---|---|---|
| Convolution | Slide a small kernel over the input, dot-product each patch | Core layer; local connectivity + parameter sharing cut weights massively |
| Kernel / filter | Tiny grid of learnable weights (3×3, 5×5) | One filter = one pattern detector reused everywhere |
| Feature (activation) map | Output grid of one filter across the input | Each filter produces one channel of output |
| Max pooling | Take the largest value in each window | Default downsampler; keeps “is the feature present?” |
| Average / global pooling | Mean of a window; or collapse a whole map to one number | Smoothing; global-avg head in ResNet-style nets |
| Stride \(S\) | Step size of the sliding window | \(S{=}2\) roughly halves each output dimension (about a quarter of the compute) |
| Padding \(P\) | Border (usually zeros) added around input | “Same” keeps size, “valid” shrinks; protects edge pixels |
| Output size | \(O = \lfloor (W - K + 2P)/S \rfloor + 1\) | Compute every layer’s shape; the floor silently drops pixels |
| “Same” pad (stride 1) | \(P = (K-1)/2\) | Why odd kernel sizes are preferred (clean centre) |
| Receptive field | Input patch one neuron can “see” | Must cover the object; grows with depth, dilation enlarges cheaply |
| Dilated (atrous) conv | Insert gaps between kernel taps | Big receptive field, no extra weights; used in segmentation, audio |
| Residual block | \(y = F(x) + x\) | Skip connection; makes very deep nets trainable |
| Batch normalization | \(\hat{x}=(x-\mu_{\mathcal B})/\sqrt{\sigma_{\mathcal B}^2+\epsilon}\), then \(\gamma\hat x+\beta\) | Higher LRs, steadier training; use eval() at inference |
| Conv params | \((K_h K_w C_{in}+1)\,C_{out}\) | Independent of image size: the scaling payoff |
| 1×1 convolution | Per-pixel dense layer mixing channels | Cheap channel re-projection (bottlenecks) |
| Transfer learning | Reuse a pretrained backbone, swap the head | Standard practice; freeze for small data, fine-tune for large |

15.11 — Key takeaways

  • A convolution slides a small filter over the input; local connectivity + parameter sharing make it efficient and translation-aware — one pattern learned once, detected everywhere.
  • Pooling downsamples feature maps; max keeps the strongest activation (default), average smooths (used for global pooling heads). No learnable parameters.
  • Output size is \(O = \lfloor (W - K + 2P)/S \rfloor + 1\); “same” padding needs \(P=(K-1)/2\), and the floor silently drops leftover pixels.
  • The receptive field is the input patch one deep neuron sees; it grows with depth, and dilated convs enlarge it cheaply.
  • Architectures went deeper over time — LeNet → AlexNet → VGG → ResNet — and residual connections (\(y=F(x)+x\)) made very deep nets trainable.
  • Batch normalization standardises each layer’s inputs (\(\hat{x}=(x-\mu)/\sqrt{\sigma^2+\epsilon}\), then rescaled by \(\gamma,\beta\)), enabling higher learning rates and steadier training; remember eval() mode at inference.
  • A CNN = feature extractor (conv/pool blocks, spatial size shrinks while channels grow) + classifier head; filters span all input channels, and 1×1 convs re-project channels cheaply.
  • Transfer learning reuses a pretrained backbone: freeze for small/similar data, fine-tune for large/different data — the standard way to build vision models.

15.12 — See also

  • Neural Networks (Core) — activations, backprop, dropout, and the dense layers CNNs build on.
  • Computer Vision — detection, segmentation, and the tasks CNNs power.
  • Generative Models — convolutional GANs and autoencoders that run convolutions in reverse.
  • Attention & Transformers — Vision Transformers, the modern alternative to convolutional backbones.
  • Optimization — SGD, learning-rate schedules, and the gradient flow residual connections protect.
  • MLOps & Deployment — serving and compressing trained CNN backbones in production.

↪ The thread continues → Chapter 16 · 🔁 Recurrent & Sequence Models

CNNs see space; but language, audio, and time unfold in sequence. To handle order and memory we turn to recurrent networks — and meet the bottleneck the next breakthrough would shatter.



© Kader Mohideen