📊 Deep Learning with TensorFlow & Keras · Day 7 — Regularization & Tuning: Making the CNN Behave


Day 7 — Regularization & Tuning: Making the CNN Behave

Yesterday’s CNN learned CIFAR-10 — and then kept learning it a little too well. If you let the Day-6 model run for 30 epochs, you saw the classic signature: training accuracy marching toward 99% while validation accuracy stalled in the low 70s and then drifted down. The network stopped learning the dataset and started memorizing it. Today we fix that, and we stop guessing at hyperparameters while we’re at it. We’ll upgrade the Day-6 CNN with the four workhorses of practical deep learning — BatchNorm, Dropout, L2 weight decay, and a cosine learning-rate schedule — turn on mixed precision with a single line, and then let keras_tuner search the knobs we’d otherwise set by superstition. At the end we put the upgraded model head-to-head against the baseline.

🎯 Today you will: place Dropout and BatchNorm correctly inside a conv block, add L2 weight decay via kernel_regularizer, schedule the learning rate with warmup + CosineDecay, enable mixed precision in one line, and run a keras_tuner RandomSearch over units and learning rate

Baseline: where Day 6 left us

Let’s reconstruct a compact version of the Day-6 setup so today is self-contained. Same data, same tf.data pipeline from Day 3, same plain CNN.

import keras
from keras import layers
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
x_train = x_train.astype("float32") / 255.0
x_test  = x_test.astype("float32") / 255.0
y_train, y_test = y_train.squeeze(), y_test.squeeze()

# Hold out the last 5k training images for validation
x_val, y_val = x_train[45000:], y_train[45000:]
x_train, y_train = x_train[:45000], y_train[:45000]

BATCH = 128
train_ds = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
            .shuffle(10_000).batch(BATCH).prefetch(tf.data.AUTOTUNE))
val_ds = (tf.data.Dataset.from_tensor_slices((x_val, y_val))
          .batch(BATCH).prefetch(tf.data.AUTOTUNE))

Nothing new here — (45000, 32, 32, 3) flows in, batches of (128, 32, 32, 3) flow out. The baseline model is Day 6’s architecture, unregularized:

def make_baseline():
    return keras.Sequential([
        layers.Input(shape=(32, 32, 3)),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.Conv2D(32, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.Conv2D(64, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(10),                      # logits
    ], name="day6_baseline")

baseline = make_baseline()
baseline.compile(
    optimizer=keras.optimizers.Adam(1e-3),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
hist_base = baseline.fit(train_ds, validation_data=val_ds, epochs=30, verbose=2)

Typical result (yours will wobble a couple of points):

Epoch 30/30
352/352 - loss: 0.0421 - accuracy: 0.9871 - val_loss: 1.9843 - val_accuracy: 0.7188

Training accuracy 98.7%, validation 71.9%, and validation loss nearly quadrupled from its epoch-8 minimum. That 27-point gap is our enemy today. Everything that follows attacks it from a different angle: BatchNorm smooths optimization, Dropout and L2 penalize memorization, and the LR schedule stops the optimizer from bouncing around the minimum at the end.

Dropout & BatchNorm: placement is the whole game

Both layers are one-liners in Keras. The reason they deserve a section is that where you put them matters more than whether you use them.

BatchNorm normalizes each channel of its input to roughly zero mean and unit variance using batch statistics during training, then learns a scale \(\gamma\) and shift \(\beta\) to restore expressive power. The canonical placement in a conv block is Conv → BatchNorm → Activation. Two consequences fall out of that ordering:

The convolution’s bias is redundant — BatchNorm’s \(\beta\) immediately replaces any constant offset. So we pass use_bias=False to save parameters.
The activation sees normalized inputs, which helps keep ReLU units from dying and usually lets you use higher learning rates.
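To see why the bias is redundant, here is a minimal plain-Python sketch. It is not Keras code: the batch_norm helper and the sample numbers are made up for illustration, and the learned scale and shift are left out. It normalizes the same activations with and without a constant offset:

```python
import statistics

def batch_norm(values, eps=1e-5):
    """Normalize a list of activations to zero mean and unit variance (no gamma/beta)."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

# Hypothetical pre-activation outputs of one conv channel across a batch.
conv_out = [0.3, -1.2, 2.5, 0.9, -0.4]
bias = 7.0  # any constant offset the conv layer might have learned

without_bias = batch_norm(conv_out)
with_bias = batch_norm([v + bias for v in conv_out])

# The batch mean absorbs the bias, so both results agree up to float rounding.
max_diff = max(abs(a - b) for a, b in zip(without_bias, with_bias))
print(f"largest difference: {max_diff:.2e}")
```

The batch mean shifts by exactly the bias, so subtracting it cancels the offset. That is why use_bias=False costs nothing in a Conv → BatchNorm pair.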

Dropout randomly zeroes activations during training (and rescales the survivors by \(1/(1-p)\) so expected magnitude is unchanged). Placement rules of thumb:

In the dense head, after the activation — this is where CNNs overfit hardest, so rates of 0.3–0.5 are normal.
In conv stacks, standard Dropout is weak because neighboring pixels are correlated: zeroing one pixel of a feature map barely removes information. Use SpatialDropout2D, which drops entire channels, or place a modest Dropout after each pooling stage.
Never sandwich Dropout directly before BatchNorm. Dropout changes the activation variance between training and inference, so BatchNorm’s moving statistics (collected during training) mismatch what it sees at test time. Keep Dropout after the BN→activation pair, or confine it to the head.
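The rescaling rule and the variance problem can both be seen in a small plain-Python sketch. The inverted_dropout helper, the seed, and the all-ones input are assumptions chosen for illustration; this is not how Keras implements the layer internally.

```python
import random

def inverted_dropout(activations, p, rng):
    """Zero each activation with probability p and scale survivors by 1 / (1 - p)."""
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else a * keep_scale for a in activations]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
activations = [1.0] * 100_000   # toy layer output: all ones, so its variance is 0
dropped = inverted_dropout(activations, p=0.4, rng=rng)

mean_after = sum(dropped) / len(dropped)
zero_fraction = dropped.count(0.0) / len(dropped)
train_var = sum((d - mean_after) ** 2 for d in dropped) / len(dropped)

print(f"mean after dropout: {mean_after:.3f}")
print(f"fraction zeroed:    {zero_fraction:.3f}")
print(f"variance in training mode: {train_var:.3f} (inference mode: 0.000)")
```

The mean survives the rescaling, but the variance does not: in training mode it is close to p / (1 - p), while at inference the same input has variance 0. A BatchNorm layer placed directly after the Dropout would collect its moving statistics on the noisy version.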

flowchart LR
    subgraph block["One regularized conv block"]
        A["Conv2D<br/>use_bias=False<br/>+ L2 on kernel"] --> B["BatchNorm"]
        B --> C["ReLU"]
        C --> D["Conv2D → BN → ReLU"]
        D --> E["MaxPool2D"]
        E --> F["Dropout p=0.2"]
    end
    F --> G["next block /<br/>dense head"]
    style A fill:#6366f180
    style B fill:#22c55e80
    style C fill:#f59e0b80
    style F fill:#ec489980

One more thing Keras quietly handles for you: the training flag. In PyTorch you must call model.train() and model.eval() yourself, and forgetting eval() before validation is a rite of passage — BatchNorm keeps using batch statistics and Dropout keeps dropping, corrupting your metrics. In Keras, fit() calls layers with training=True and evaluate()/predict() with training=False automatically. You only manage the flag yourself in custom loops (Day 4), where you write model(x, training=True) explicitly.

Here is the regularized block as a reusable function, functional-API style (Day 2):

def conv_block(x, filters, l2, drop):
    reg = keras.regularizers.L2(l2)
    x = layers.Conv2D(filters, 3, padding="same", use_bias=False,
                      kernel_regularizer=reg)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", use_bias=False,
                      kernel_regularizer=reg)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Dropout(drop)(x)
    return x

Shapes through one block with filters=32: (128, 32, 32, 3) → Conv → (128, 32, 32, 32) → BN/ReLU (shape-preserving) → second Conv → (128, 32, 32, 32) → MaxPool → (128, 16, 16, 32) → Dropout (shape-preserving, only active in training).

L2 weight decay and the loss you don’t see

L2 regularization adds a penalty proportional to the squared weights, pulling every parameter gently toward zero unless the data insists otherwise:

\[\mathcal{L} = \mathcal{L}_{\text{data}} + \lambda \sum_{l} \lVert W_l \rVert_2^2\]

In Keras this is attached per layer, not per optimizer:

layers.Conv2D(64, 3, kernel_regularizer=keras.regularizers.L2(1e-4))

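This is a real difference from PyTorch, where you would typically write torch.optim.AdamW(params, weight_decay=1e-4) and decay everything the optimizer sees (unless you build parameter groups). The two are also not mathematically identical: kernel_regularizer adds the penalty to the loss, so Adam rescales its gradient like any other, whereas AdamW applies the decay directly to the weights, decoupled from the adaptive step. Keras offers that variant too (keras.optimizers.AdamW), but the per-layer route is more surgical: we set only kernel_regularizer, so the penalty touches the kernels and leaves biases and BatchNorm’s \(\gamma\)/\(\beta\) alone. Decaying BatchNorm parameters or biases usually hurts, and the per-layer API makes the right thing the default.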

Where does the penalty actually go? Each regularized layer appends a scalar tensor to model.losses. compile()/fit() sums them into the training loss automatically. You can watch it happen:

inp = layers.Input(shape=(32, 32, 3))
out = conv_block(inp, 32, l2=1e-4, drop=0.2)
probe = keras.Model(inp, out)

_ = probe(tf.zeros((1, 32, 32, 3)))   # build the layers
print([float(l) for l in probe.losses])

[~0.00055, ~0.0032]    # one scalar per regularized Conv2D; exact digits depend on the random init
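A rough way to predict the size of these scalars, assuming the layers use Keras’s default Glorot-uniform initializer and that the penalty is l2 times the sum of squared weights. The helper names below are hypothetical and the sketch uses only the standard library, so treat it as back-of-the-envelope arithmetic rather than a reproduction of Keras internals:

```python
import math
import random

def expected_l2_penalty(kernel_shape, l2):
    """Expected l2 * sum(w**2) for a Glorot-uniform kernel of shape (kh, kw, cin, cout)."""
    kh, kw, cin, cout = kernel_shape
    fan_in, fan_out = kh * kw * cin, kh * kw * cout
    variance = 2.0 / (fan_in + fan_out)      # variance of Glorot-uniform weights
    n_weights = kh * kw * cin * cout
    return l2 * n_weights * variance

def sampled_l2_penalty(kernel_shape, l2, rng):
    """Draw one Glorot-uniform kernel and compute its penalty directly."""
    kh, kw, cin, cout = kernel_shape
    limit = math.sqrt(6.0 / (kh * kw * (cin + cout)))
    n_weights = kh * kw * cin * cout
    return l2 * sum(rng.uniform(-limit, limit) ** 2 for _ in range(n_weights))

rng = random.Random(0)
for shape in [(3, 3, 3, 32), (3, 3, 32, 32)]:   # the two kernels in conv_block(inp, 32, ...)
    expected = expected_l2_penalty(shape, 1e-4)
    sampled = sampled_l2_penalty(shape, 1e-4, rng)
    print(shape, f"expected={expected:.6f}", f"sampled={sampled:.6f}")
```

The second kernel has roughly ten times as many weights as the first, so its penalty is correspondingly larger at initialization. Both shrink as training pulls the weights toward zero.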

Two things to burn into memory:

fit() adds model.losses for you; a custom training loop does not. If you write your own GradientTape loop (Day 4) and forget loss += tf.add_n(model.losses), your regularizer silently does nothing. No error, no warning — just quietly absent regularization. This is one of the most common “why doesn’t L2 change anything?” bugs.
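The regularization term is part of the reported training loss, so that number is no longer pure cross-entropy. Together with Dropout (active only during training) and the fact that the training loss is averaged over the whole epoch while validation runs at its end, this means a regularized model can show training loss above validation loss early on. That’s not a bug. To compare the data term alone, add keras.metrics.SparseCategoricalCrossentropy(from_logits=True) to metrics and read that instead.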
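The first point is easy to reproduce in miniature. The sketch below is a toy analogue in plain Python, not a GradientTape loop: the train_weight function, the three data points, and the l2 values are all invented for illustration. It fits a single weight by hand-written gradient descent, once without the penalty gradient and once with it:

```python
def train_weight(l2, steps=500, lr=0.05):
    """Fit y = w * x by hand-written gradient descent, optionally adding an L2 term."""
    xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]   # toy data, exact solution w = 2
    w = 0.0
    for _ in range(steps):
        # Gradient of the mean squared data loss with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        # Gradient of the penalty l2 * w**2. With l2 = 0 this adds nothing, which
        # mirrors a custom loop that never adds the regularization losses.
        grad += 2 * l2 * w
        w -= lr * grad
    return w

w_forgot_penalty = train_weight(l2=0.0)
w_with_penalty = train_weight(l2=1.0)
print(f"no penalty:   w = {w_forgot_penalty:.4f}")
print(f"with penalty: w = {w_with_penalty:.4f}")
```

Nothing fails in the first run; the weight simply lands on the unregularized solution. In a real custom loop the same thing happens when the sum of model.losses never reaches the loss that the tape differentiates.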

Cosine decay, warmup, and mixed precision in one line

A constant learning rate is a compromise: too small and early training crawls, too large and late training bounces around the minimum, never settling. The fix is a schedule — and the one that has quietly become the default across modern deep learning is cosine decay:

\[\eta(t) = \eta_{\min} + \tfrac{1}{2}\,(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)\]

It holds the rate high early (fast progress), then glides smoothly to near zero (fine convergence), with no cliff-edge drops to tune. It is usually combined with a short linear warmup: a few epochs ramping from 0 up to \(\eta_{\max}\), which protects the fresh BatchNorm statistics and Adam moment estimates from a violent first few steps. The resulting curve rises in a straight line to the peak, then follows half a cosine wave down to the floor \(\eta_{\min}\).
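The formula plus the warmup ramp fits in a few lines of plain Python. This is a sketch for building intuition, with \(\eta_{\max}\) called peak and \(\eta_{\min}\) expressed as alpha * peak; the function name and argument names are my own, and the step counts assume 45000 // 128 = 351 steps per epoch as in this lesson’s training split.

```python
import math

def warmup_cosine_lr(step, peak, warmup_steps, decay_steps, alpha=0.0, start=0.0):
    """Linear warmup from `start` to `peak`, then cosine decay to a floor of alpha * peak."""
    if step < warmup_steps:
        return start + (peak - start) * step / warmup_steps
    # Progress through the decay phase, clamped so the rate stays at the floor afterwards.
    progress = min(step - warmup_steps, decay_steps) / decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak * ((1.0 - alpha) * cosine + alpha)

steps_per_epoch = 45000 // 128   # 351, assuming the 45k-image training split
schedule = dict(peak=1e-3, warmup_steps=3 * steps_per_epoch,
                decay_steps=27 * steps_per_epoch, alpha=0.02)

for epoch in (0, 1, 3, 15, 29, 30):
    lr = warmup_cosine_lr(epoch * steps_per_epoch, **schedule)
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

Printing a handful of epochs like this is a cheap way to confirm that the peak lands where warmup ends and that the rate reaches the floor on the last epoch rather than long before it.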

Keras 3 bundles warmup into CosineDecay, so it’s one object:

EPOCHS = 30
steps_per_epoch = len(x_train) // BATCH          # 351 full batches (fit() also runs a 352nd, partial one)

lr_schedule = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.0,          # warmup starts here
    warmup_target=1e-3,                 # ... ramps linearly to here
    warmup_steps=3 * steps_per_epoch,   # over 3 epochs
    decay_steps=(EPOCHS - 3) * steps_per_epoch,  # then cosine over the rest
    alpha=0.02,                         # floor: final LR = 0.02 * 1e-3
)

optimizer = keras.optimizers.Adam(learning_rate=lr_schedule)

The subtlety everyone hits once: warmup_steps and decay_steps count optimizer steps (batches), not epochs, and the decay clock starts after warmup ends. If you pass raw epoch counts (warmup_steps=3, decay_steps=27, forgetting steps_per_epoch), the schedule ramps up and collapses to its floor within the first 30 batches, and your model trains the remaining 29.9 epochs at alpha * warmup_target, which is glacial. Always compute steps explicitly. You can sanity-check the schedule by calling it like a function:

for step in [0, 3 * steps_per_epoch, 15 * steps_per_epoch, 29 * steps_per_epoch]:
    print(f"step {step:6d}  →  lr = {float(lr_schedule(step)):.6f}")

step      0  →  lr = 0.000000
step   1053  →  lr = 0.001000
step   5265  →  lr = 0.000595
step  10179  →  lr = 0.000023

In PyTorch this is optim.lr_scheduler.CosineAnnealingLR plus a separate warmup scheduler plus remembering to call scheduler.step() at the right granularity. In Keras the schedule is the learning rate — the optimizer queries it with the current step at every update, and there is nothing to step manually.

Mixed precision is the promised one-liner. Put it at the top of your script, before any layers are built:

keras.mixed_precision.set_global_policy("mixed_float16")

Every layer now computes in float16 (fast, roughly half the activation memory, uses GPU Tensor Cores) while keeping its variables in float32 (stable). Keras also wraps your optimizer in loss scaling automatically, the trick that keeps small float16 gradients from underflowing to zero. In PyTorch this is the torch.autocast context manager plus a manually managed GradScaler; here the policy handles both.
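Why loss scaling matters is easy to show with the standard library alone, since struct can round a number to IEEE half precision. The gradient value and the scale factor below are hypothetical; real loss scaling multiplies the loss (and therefore every gradient) by a factor that is typically adjusted dynamically, so this is only a sketch of the idea.

```python
import struct

def to_float16(value):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

tiny_gradient = 1e-8    # hypothetical gradient, below float16's smallest subnormal (~6e-8)
loss_scale = 1024.0     # hypothetical scale factor

unscaled = to_float16(tiny_gradient)                 # underflows to zero
scaled = to_float16(tiny_gradient * loss_scale)      # survives the float16 round trip
recovered = scaled / loss_scale                      # unscale in higher precision

print(f"float16 without scaling: {unscaled}")
print(f"recovered after scaling: {recovered:.3e}")
```

Without scaling the update is exactly zero and the weight never moves. With scaling the gradient comes back within a fraction of a percent of its true value.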

One rule comes with it: the final layer’s output must be float32, because softmax/cross-entropy in float16 loses precision exactly where it matters. Override the dtype on the last layer only:

layers.Dense(10, dtype="float32")   # logits computed and emitted in float32

On a GPU with Tensor Cores (T4, V100, A100, RTX…) expect a 1.5–3× throughput bump for this model. On CPU it does nothing useful — skip it there.

Hyperparameter search with KerasTuner

We now have real knobs: dense units, dropout rate, L2 strength, peak learning rate. Guessing them by hand is folklore; searching them is engineering. Install the tuner (pip install keras-tuner) and express the model as a function of a hyperparameter object hp:

import keras_tuner as kt

def build_model(hp):
    units = hp.Int("units", min_value=64, max_value=256, step=64)
    drop  = hp.Float("dropout", 0.2, 0.5, step=0.1)
    lr    = hp.Float("lr", 1e-4, 1e-2, sampling="log")
    l2    = 1e-4   # fixed for the search; we tune it in the exercise

    inputs = layers.Input(shape=(32, 32, 3))
    x = conv_block(inputs, 32, l2=l2, drop=0.2)
    x = conv_block(x, 64, l2=l2, drop=0.2)
    x = layers.Flatten()(x)
    x = layers.Dense(units, use_bias=False,
                     kernel_regularizer=keras.regularizers.L2(l2))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Dropout(drop)(x)
    outputs = layers.Dense(10, dtype="float32")(x)

    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(lr),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    return model

Read the hp calls carefully — each one declares a search dimension and returns the value chosen for the current trial:

hp.Int("units", 64, 256, step=64) → tries 64, 128, 192, 256.
hp.Float("lr", 1e-4, 1e-2, sampling="log") → samples uniformly in log-space, so 3e-4 and 3e-3 are equally likely. Linear sampling here would put about 90% of trials above 1e-3 and barely explore the lower decade; learning rates live on a log scale, so sample them that way.
The names ("units", "lr") are the keys you’ll use to read the winners back out.
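The difference between linear and log sampling is easy to measure. The sketch below uses Python’s random module as a stand-in for the tuner’s sampler; the helper names and the sample count are my own, and KerasTuner’s internal implementation may differ in detail.

```python
import math
import random

def sample_linear(rng, low, high):
    """Uniform on the raw scale."""
    return rng.uniform(low, high)

def sample_log(rng, low, high):
    """Uniform in log10 space, then mapped back to the raw scale."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))

rng = random.Random(0)
n = 20_000
linear = [sample_linear(rng, 1e-4, 1e-2) for _ in range(n)]
logarithmic = [sample_log(rng, 1e-4, 1e-2) for _ in range(n)]

# How much of each sample falls in the upper decade, 1e-3 to 1e-2?
frac_linear_above = sum(v > 1e-3 for v in linear) / n
frac_log_above = sum(v > 1e-3 for v in logarithmic) / n
print(f"linear: {frac_linear_above:.1%} above 1e-3")
print(f"log:    {frac_log_above:.1%} above 1e-3")
```

Log sampling splits the trials evenly between the two decades, while linear sampling spends roughly nine out of ten trials in the upper one.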

Note we search with a constant LR per trial, not the cosine schedule — the schedule’s peak depends on the LR we’re searching for, so we find the right magnitude first and build the schedule around it afterward. Now the tuner:

tuner = kt.RandomSearch(
    build_model,
    objective="val_accuracy",
    max_trials=10,          # 10 random points in the search space
    executions_per_trial=1, # train each point once (2+ averages out seed noise)
    directory="tuning",
    project_name="cifar10_day7",
    overwrite=True,
)
tuner.search_space_summary()

Search space summary
Default search space size: 3
units (Int)   {'min_value': 64, 'max_value': 256, 'step': 64}
dropout (Float) {'min_value': 0.2, 'max_value': 0.5, 'step': 0.1}
lr (Float)    {'min_value': 0.0001, 'max_value': 0.01, 'sampling': 'log'}

search() has the exact signature of fit() — datasets, epochs, callbacks all pass straight through. Keep search epochs short: we’re ranking configurations, not producing the final model. Early stopping kills hopeless trials even sooner:

tuner.search(
    train_ds,
    validation_data=val_ds,
    epochs=8,
    callbacks=[keras.callbacks.EarlyStopping(monitor="val_loss", patience=2)],
    verbose=2,
)

best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
print(best_hp.values)

Trial 10 Complete [00h 01m 42s]
val_accuracy: 0.7862
Best val_accuracy So Far: 0.7994
Total elapsed time: 00h 17m 05s

{'units': 192, 'dropout': 0.4, 'lr': 0.0021387}

Everything is checkpointed under tuning/cifar10_day7/, so if you kill the process mid-search, rerunning the same script (with overwrite=False) resumes where it stopped. RandomSearch is the honest baseline tuner and surprisingly hard to beat with only ~10 trials; when budgets grow, swap in kt.Hyperband (aggressive early-killing of weak trials) or kt.BayesianOptimization (models the objective surface). The build_model function and the search()/get_best_hyperparameters() calls stay the same; only the constructor changes (Hyperband, for example, takes max_epochs instead of max_trials).

Why random search rather than a grid? With three hyperparameters and 10 trials, a grid gives you ~2 distinct values per axis; random search gives you up to 10 distinct values on every axis (all 10 on a continuous one such as lr; the stepped units and dropout axes only have four values to offer). Since usually one hyperparameter (here, LR) dominates the outcome, random search explores the important dimension about five times more finely for the same cost.
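The counting argument can be checked directly. This is a small sketch with hypothetical helper names; it treats every axis as continuous, which is the idealized case.

```python
import random

def grid_values_per_axis(n_trials, n_dims):
    """Largest k such that a full k**n_dims grid fits in the trial budget."""
    k = 1
    while (k + 1) ** n_dims <= n_trials:
        k += 1
    return k

def random_distinct_values(n_trials, n_dims, rng):
    """Distinct values seen on each axis when every trial draws fresh continuous values."""
    trials = [[rng.random() for _ in range(n_dims)] for _ in range(n_trials)]
    return [len({trial[d] for trial in trials}) for d in range(n_dims)]

rng = random.Random(0)
per_axis_grid = grid_values_per_axis(n_trials=10, n_dims=3)
per_axis_random = random_distinct_values(n_trials=10, n_dims=3, rng=rng)
print(f"grid:   {per_axis_grid} values per axis ({per_axis_grid ** 3} trials used)")
print(f"random: {per_axis_random} values per axis")
```

A 2 × 2 × 2 grid uses 8 of the 10 trials and tests each learning rate four times over, whereas random search never repeats a learning rate.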

The upgraded CNN, head to head

Time to collect the winnings. Take the best hyperparameters, wrap the discovered LR into the cosine schedule as its warmup_target, and train the full 30 epochs:

units = best_hp["units"]      # 192
drop  = best_hp["dropout"]    # 0.4
peak  = best_hp["lr"]         # 0.0021

lr_schedule = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.0,
    warmup_target=peak,
    warmup_steps=3 * steps_per_epoch,
    decay_steps=(EPOCHS - 3) * steps_per_epoch,
    alpha=0.02,
)

final = build_model(best_hp)                 # rebuild with winning hp
final.compile(                                # recompile to swap in the schedule
    optimizer=keras.optimizers.Adam(lr_schedule),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
hist_final = final.fit(train_ds, validation_data=val_ds, epochs=EPOCHS, verbose=2)

print("baseline :", baseline.evaluate(x_test, y_test, verbose=0))
print("upgraded :", final.evaluate(x_test, y_test, verbose=0))

Representative numbers from one run on a free Colab T4 (mixed precision is on only in the last row):

Model	Train acc	Val acc	Test acc	Train−val gap	Time / epoch
Day-6 baseline	98.7%	71.9%	71.2%	26.8 pts	11 s
+ BN, Dropout, L2	88.4%	80.1%	79.6%	8.3 pts	12 s
+ CosineDecay + tuned hp	91.2%	83.0%	82.5%	8.2 pts	12 s
+ mixed precision	91.0%	82.8%	82.4%	8.2 pts	7 s

Read the table like a story. Regularization lowered training accuracy by ten points and raised test accuracy by eight: the model traded memorization for generalization, which is exactly the trade we wanted. The schedule and tuned hyperparameters bought roughly three more points. Mixed precision changed accuracy by noise (±0.2) while cutting epoch time by ~40%. And the train–val gap collapsed from 27 points to 8: the same conv backbone, no longer memorizing.

If your absolute numbers differ by a couple of points, that’s expected — different GPU, different shuffle seed, different random-search draws. The shape of the result (gap collapses, val accuracy climbs, speed improves) is what should reproduce.

🧪 Your task

The search above kept the L2 coefficient frozen at 1e-4 and the conv-block filter counts frozen at 32/64. Extend the tuner: add l2 as a log-sampled hp.Float between 1e-5 and 1e-3, and add a width choice — hp.Choice("width", [32, 48, 64]) — that sets the first block’s filters (second block gets 2 * width). Run a fresh RandomSearch with max_trials=12, print the best hyperparameters, and report whether wider-but-more-regularized beats narrower-but-freer on your hardware.

Hint: every hp.* call must live inside build_model — the tuner re-invokes the function per trial, and hyperparameters declared outside it are invisible to the search. Use a new project_name (or overwrite=True) so old trial checkpoints don’t pollute the new search space.

Solution

import keras
from keras import layers
import keras_tuner as kt

def build_model_v2(hp):
    width = hp.Choice("width", [32, 48, 64])
    l2    = hp.Float("l2", 1e-5, 1e-3, sampling="log")
    units = hp.Int("units", 64, 256, step=64)
    drop  = hp.Float("dropout", 0.2, 0.5, step=0.1)
    lr    = hp.Float("lr", 1e-4, 1e-2, sampling="log")

    inputs = layers.Input(shape=(32, 32, 3))
    x = conv_block(inputs, width, l2=l2, drop=0.2)
    x = conv_block(x, 2 * width, l2=l2, drop=0.2)
    x = layers.Flatten()(x)
    x = layers.Dense(units, use_bias=False,
                     kernel_regularizer=keras.regularizers.L2(l2))(x)
    x = layers.BatchNormalization()(x)
    x = layers.Activation("relu")(x)
    x = layers.Dropout(drop)(x)
    outputs = layers.Dense(10, dtype="float32")(x)

    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(lr),
        loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )
    return model

tuner_v2 = kt.RandomSearch(
    build_model_v2,
    objective="val_accuracy",
    max_trials=12,
    directory="tuning",
    project_name="cifar10_day7_v2",   # fresh project → fresh search space
    overwrite=True,
)

tuner_v2.search(
    train_ds,
    validation_data=val_ds,
    epochs=8,
    callbacks=[keras.callbacks.EarlyStopping(monitor="val_loss", patience=2)],
    verbose=2,
)

best = tuner_v2.get_best_hyperparameters(1)[0]
print(best.values)
# Typical winner: {'width': 64, 'l2': 0.00021, 'units': 192,
#                  'dropout': 0.4, 'lr': 0.0018}

# Retrain the winner properly, with the cosine schedule around its LR:
schedule = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.0,
    warmup_target=best["lr"],
    warmup_steps=3 * steps_per_epoch,
    decay_steps=27 * steps_per_epoch,
    alpha=0.02,
)
winner = build_model_v2(best)
winner.compile(
    optimizer=keras.optimizers.Adam(schedule),
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
winner.fit(train_ds, validation_data=val_ds, epochs=30, verbose=2)
print(winner.evaluate(x_test, y_test, verbose=0))

In runs like the one above, the tuner tends to pick the widest network (width=64) paired with stronger L2 (~2e-4): capacity plus regularization beats a smaller, more lightly regularized model. That pairing is the deep-learning-era answer to the bias–variance trade-off: don’t shrink the model, constrain it.

Key takeaways

Overfitting shows up as a train–val gap; every tool today attacks that gap, not raw training loss.
Conv block ordering: Conv (use_bias=False) → BatchNorm → ReLU; Dropout goes after pooling or in the dense head, never directly before BatchNorm.
Keras manages the training=True/False flag inside fit/evaluate automatically — unlike PyTorch’s manual model.train()/model.eval().
L2 lives per layer via kernel_regularizer; the penalties accumulate in model.losses, which fit() sums for you but a custom loop must add explicitly.
CosineDecay with built-in warmup replaces LR folklore; decay_steps is in steps, not epochs — compute steps_per_epoch and multiply.
keras.mixed_precision.set_global_policy("mixed_float16") is the whole mixed-precision story, plus dtype="float32" on the final layer.
keras_tuner.RandomSearch turns hp.Int/hp.Float declarations into a resumable search; sample learning rates log-uniformly, search short, retrain the winner long.

Tomorrow: why train convolutional features from scratch at all — we bolt a pretrained ImageNet backbone onto our pipeline with Keras Applications and get better accuracy in a fraction of the epochs.
