📊 Deep Learning with TensorFlow & Keras · Lesson 1 — Tensors, Variables & GradientTape

Lesson 1 — Tensors, Variables & GradientTape

Welcome to Lesson 1. Before Keras gives you model.fit() and all its comforts, there is a small, sharp core underneath: tensors (immutable data), variables (mutable state), and GradientTape (automatic differentiation). Everything Keras does — every layer, every optimizer step — reduces to these three things. In this lesson you’ll learn them properly, and prove it to yourself by fitting a line to data using nothing but a tape and a subtraction. If you’ve been through the PyTorch mini-course on this site, you’ll feel the déjà vu — and we’ll point out exactly where TensorFlow makes different design choices, because those differences are where most cross-framework bugs come from.

🎯 In this lesson you will: manipulate tf.Tensor shapes and dtypes with confidence, hold trainable state in tf.Variable, compute exact gradients with tf.GradientTape (including persistent tapes and watch), understand when @tf.function traces and retraces, and fit \(y = wx + b\) by hand with gradient descent.

Tensors: immutable, typed, shaped

A tf.Tensor is a multidimensional array with two non-negotiable properties: a dtype and a shape. Unlike a NumPy array, it’s immutable — you never write into a tensor; every op produces a new one. Let’s start at the bottom.

import tensorflow as tf
import numpy as np

print(tf.__version__)          # 2.x — everything today assumes TF 2

scalar = tf.constant(7)                     # rank 0
vector = tf.constant([1.0, 2.0, 3.0])       # rank 1
matrix = tf.constant([[1, 2], [3, 4]])      # rank 2

print(scalar.shape, scalar.dtype)   # ()      <dtype: 'int32'>
print(vector.shape, vector.dtype)   # (3,)    <dtype: 'float32'>
print(matrix.shape, matrix.dtype)   # (2, 2)  <dtype: 'int32'>

Two inference rules worth memorizing, because they differ from NumPy (and, for integers, from PyTorch, which infers int64):

A Python int becomes int32 (NumPy would give you int64 on most platforms).
A Python float becomes float32 (NumPy would give you float64).

Float32 is the deep-learning default — half the memory of float64 and what GPUs are optimized for — so TF’s inference is actually doing you a favor. But it sets up the single most common beginner error:

a = tf.constant(1)      # int32
b = tf.constant(1.0)    # float32
try:
    a + b
except tf.errors.InvalidArgumentError as e:
    print(type(e).__name__)   # InvalidArgumentError

TensorFlow does not silently promote dtypes across an op. PyTorch would happily give you 2.0 here via type promotion; TF refuses. The fix is an explicit cast:

tf.cast(a, tf.float32) + b    # <tf.Tensor: shape=(), dtype=float32, numpy=2.0>

This strictness feels annoying for exactly one afternoon, then it starts catching real bugs — a stray int64 index tensor leaking into your loss computation, for instance.

Shape surgery is the other daily skill. Three tools cover most day-to-day usage:

x = tf.range(12)                     # shape (12,)   [0, 1, ..., 11]

r = tf.reshape(x, (3, 4))            # (3, 4) — total elements must match
r2 = tf.reshape(x, (3, -1))          # -1 means "infer this axis" → (3, 4)

col = x[:, tf.newaxis]               # (12, 1) — insert an axis, like None in NumPy
row = tf.expand_dims(x, axis=0)      # (1, 12) — same idea, function form

flat = tf.reshape(r, (-1,))          # back to (12,)
print(r.shape, col.shape, row.shape) # (3, 4) (12, 1) (1, 12)

tf.reshape never copies data if it can avoid it — it’s a view of the same buffer with new shape metadata. What it cannot do is reorder elements; for that you want tf.transpose. Confusing the two is a classic silent bug: tf.reshape(m, (4, 3)) on a (3, 4) matrix produces valid-looking garbage, while tf.transpose(m) gives you the actual transpose.
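
To make the difference concrete, here is a small sketch (the variable names are illustrative, not part of the lesson) that applies both ops to the same (3, 4) matrix and compares the first row of each result:

```python
import tensorflow as tf

m = tf.reshape(tf.range(12), (3, 4))   # rows: [0..3], [4..7], [8..11]

reshaped = tf.reshape(m, (4, 3))       # keeps element order, only re-chunks it
transposed = tf.transpose(m)           # swaps axes, so elements are reordered

print(reshaped.shape, transposed.shape)   # (4, 3) (4, 3)
print(reshaped[0].numpy())                # [0 1 2]
print(transposed[0].numpy())              # [0 4 8]
```

Both results have shape (4, 3), which is why a shape check alone will not catch the mix-up.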

Finally, crossing the NumPy boundary is cheap and explicit:

n = np.arange(6.0).reshape(2, 3)
t = tf.convert_to_tensor(n)     # NumPy → Tensor (note: inherits float64!)
back = t.numpy()                # Tensor → NumPy (eager mode only)
print(t.dtype)                  # <dtype: 'float64'>  ← cast if you care

That float64 inheritance bites people: data loaded via NumPy defaults to float64, your model is float32, and you get the InvalidArgumentError from above somewhere deep in your forward pass. Cast at the boundary: tf.convert_to_tensor(n, dtype=tf.float32).
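
One way to make the boundary cast hard to forget is to route all NumPy data through a single conversion function. The sketch below assumes a helper named to_float32_tensor, which is hypothetical and not a TensorFlow API:

```python
import numpy as np
import tensorflow as tf

def to_float32_tensor(array_like):
    """Hypothetical helper: convert NumPy data to a float32 tensor at the boundary."""
    return tf.convert_to_tensor(array_like, dtype=tf.float32)

raw = np.arange(6.0).reshape(2, 3)     # NumPy default: float64
features = to_float32_tensor(raw)      # float32 from here on
weights = tf.ones((3, 1))              # float32, like typical model weights

out = features @ weights               # dtypes agree, so the matmul succeeds
print(raw.dtype, features.dtype, out.shape)
```

Without the cast, the same matmul would mix float64 and float32 operands and fail.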

Variables: the state that training updates

Tensors are immutable, but training is all about mutation — nudging weights downhill every step. That’s what tf.Variable is for: a mutable wrapper around a tensor, with in-place update methods.

w = tf.Variable(3.0)
b = tf.Variable([1.0, 2.0], name="bias")

print(w)              # <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=3.0>
print(b.trainable)    # True — GradientTape will track it automatically

w.assign(5.0)         # replace the value
w.assign_add(1.0)     # w += 1  → 6.0
w.assign_sub(2.0)     # w -= 2  → 4.0

Three rules govern variables:

assign*, never =. Writing w = w - 0.1 * grad doesn’t update the variable — it rebinds the Python name w to a brand-new tensor, and your actual variable (the one the tape tracks, the one the optimizer knows about) is orphaned. w.assign_sub(0.1 * grad) mutates in place. This is the #1 way manual training loops silently stop learning.
Shape and dtype are fixed at creation. w.assign([1.0, 2.0]) on a scalar variable raises. Variables are allocated storage; you update contents, not structure.
trainable=True is the default, and it’s the hook the whole framework hangs on: GradientTape auto-watches trainable variables, and Keras collects them into model.trainable_variables. Set trainable=False for things like batch-norm moving averages that update by other means.
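
The first rule is easy to check for yourself. This sketch (names are illustrative) keeps a second handle on the variable, so you can see that rebinding leaves it untouched while assign_sub changes it:

```python
import tensorflow as tf

w = tf.Variable(1.0)
original = w                           # second handle to the same variable

w = w - 0.5                            # rebinding: w now names a plain tensor
print(isinstance(w, tf.Variable))      # False
print(original.numpy())                # 1.0, the variable never changed

original.assign_sub(0.5)               # in-place update of the variable
print(original.numpy())                # 0.5
```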

If you’re coming from PyTorch: tf.Variable plays the role of nn.Parameter + requires_grad=True, but it’s a first-class citizen you create directly, not a wrapper you register on a module. There is no .data back-door — mutation goes through assign, which keeps the bookkeeping consistent.

# A variable participates in ops just like a tensor:
y = w * 2.0 + b            # broadcasting: scalar*scalar + (2,) → (2,)
print(y.numpy())           # [ 9. 10.]

Math ops and broadcasting

Elementwise math looks exactly like NumPy — +, -, *, /, ** are all overloaded — and matrix multiplication is @ (or tf.matmul). The interesting part is broadcasting: how TF combines tensors of different shapes without copying data.

The rule, right-aligned: compare shapes from the last axis backwards; two axes are compatible if they’re equal or one of them is 1; size-1 axes stretch (virtually) to match. Missing leading axes are treated as 1.

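Figure: broadcasting a (3, 1) column against a (1, 4) row.

a: (3, 1), values [0, 10, 20]; axis 1 stretches →
b: (1, 4), values [0, 1, 2, 3]; axis 0 stretches ↓
result: (3, 4), rows [0 1 2 3], [10 11 12 13], [20 21 22 23]; no data was copied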

a = tf.constant([[0.0], [10.0], [20.0]])   # (3, 1)
b = tf.constant([[0.0, 1.0, 2.0, 3.0]])    # (1, 4)
print((a + b).shape)                       # (3, 4) — exactly the picture above
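
The right-aligned rule is short enough to write out as code. The function below is a hypothetical helper for fully known shapes only (it is not how TensorFlow implements broadcasting), and its answers are compared against what TensorFlow actually produces:

```python
import tensorflow as tf

def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: result shape under the right-aligned rule."""
    result = []
    rank = max(len(shape_a), len(shape_b))
    for i in range(1, rank + 1):                          # walk from the last axis backwards
        dim_a = shape_a[-i] if i <= len(shape_a) else 1   # missing leading axis counts as 1
        dim_b = shape_b[-i] if i <= len(shape_b) else 1
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            raise ValueError(f"axes {dim_a} and {dim_b} are not broadcastable")
        result.append(dim_b if dim_a == 1 else dim_a)     # the size-1 axis stretches
    return tuple(reversed(result))

for a, b in [((3, 1), (1, 4)), ((32, 16), (16,)), ((32,), (32, 1))]:
    actual = tuple((tf.zeros(a) + tf.zeros(b)).shape)     # what TensorFlow does
    print(a, b, "->", broadcast_shape(a, b), actual)
```

The last pair is worth a second look: (32,) against (32, 1) is perfectly legal under the rule and yields (32, 32).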

Broadcasting is why a bias vector of shape (units,) adds cleanly onto a batch of activations of shape (batch, units): right-aligned, units == units, and the missing batch axis of the bias is treated as 1. It’s also why the most insidious shape bug exists:

y_true = tf.random.normal((32,))       # (32,)
y_pred = tf.random.normal((32, 1))     # (32, 1) — a model output, say
err = y_true - y_pred
print(err.shape)                       # (32, 32)  ← !!! not (32,)

No error, no warning — (32,) vs (32, 1) broadcasts to a (32, 32) outer difference, and tf.reduce_mean(err**2) happily returns a scalar that means nothing. Your loss goes down; your model learns garbage. Habit to build now: print or assert shapes on both sides of every loss computation. We’ll respect this habit in the capstone.
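
One possible fix, sketched here with illustrative names, is to drop the trailing size-1 axis from the prediction and assert that both sides agree before subtracting:

```python
import tensorflow as tf

y_true = tf.random.normal((32,))                   # (32,)
y_pred = tf.random.normal((32, 1))                 # (32, 1), e.g. a model output

y_pred_flat = tf.squeeze(y_pred, axis=-1)          # (32, 1) -> (32,)
assert y_true.shape == y_pred_flat.shape, "loss inputs must have identical shapes"

err = y_true - y_pred_flat
loss = tf.reduce_mean(err ** 2)
print(err.shape, loss.shape)                       # (32,) ()
```

Passing axis=-1 to tf.squeeze is deliberate: it removes only the last axis, so a batch of size 1 is not squeezed away by accident.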

Reductions round out the toolkit:

m = tf.constant([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])          # (2, 3)
tf.reduce_mean(m)                # 3.5        — all elements, shape ()
tf.reduce_sum(m, axis=0)         # [5. 7. 9.] — collapse rows, shape (3,)
tf.reduce_max(m, axis=1)         # [3. 6.]    — collapse cols, shape (2,)

The mnemonic: axis=k means axis \(k\) disappears from the output shape.
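
When the reduced result has to broadcast back against the original tensor, the disappearing axis gets in the way. A minimal sketch using the keepdims argument, which keeps the reduced axis with size 1 (variable names are illustrative):

```python
import tensorflow as tf

m = tf.constant([[1.0, 2.0, 3.0],
                 [4.0, 5.0, 6.0]])                        # (2, 3)

row_mean = tf.reduce_mean(m, axis=1)                      # (2,)   axis 1 disappears
row_mean_kept = tf.reduce_mean(m, axis=1, keepdims=True)  # (2, 1) axis 1 kept as size 1

centered = m - row_mean_kept                              # (2, 3) - (2, 1) broadcasts per row
print(row_mean.shape, row_mean_kept.shape)                # (2,) (2, 1)
print(centered.numpy())                                   # each row becomes [-1. 0. 1.]
```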

GradientTape: autodiff on demand

Here is the philosophical fork between the frameworks. PyTorch builds a computation graph implicitly whenever a requires_grad tensor flows through an op, always. TensorFlow records operations only inside an explicit context — the tf.GradientTape. Nothing outside the with block is differentiable. Think of it literally as a tape recorder: ops executed inside the block get recorded; tape.gradient() plays the tape backwards applying the chain rule.

flowchart LR
    subgraph FWD["Forward pass — inside 'with tf.GradientTape() as tape:'"]
        X[("x = Variable(3.0)")] --> OP1["square"] --> Y["y = x² = 9"]
        OP1 -. "recorded on tape" .-> TAPE[("🎞️ tape")]
    end
    subgraph BWD["Backward pass — tape.gradient(y, x)"]
        TAPE --> REPLAY["replay in reverse,<br/>apply chain rule"] --> G["dy/dx = 2x = 6.0"]
    end
    Y --> REPLAY

The minimal example, worth running by hand once in your life:

x = tf.Variable(3.0)

with tf.GradientTape() as tape:
    y = x ** 2          # recorded

dy_dx = tape.gradient(y, x)
print(dy_dx)            # tf.Tensor(6.0, ...)  — d(x²)/dx = 2x = 6 at x=3

Key mechanics, in the order they’ll bite you:

1. Trainable variables are watched automatically; plain tensors are not.

c = tf.constant(3.0)
with tf.GradientTape() as tape:
    y = c ** 2
print(tape.gradient(y, c))     # None  ← not an error, just None

A silent None — TF assumes constants are data, not parameters. If you genuinely want the gradient with respect to a tensor (computing input gradients for adversarial examples, saliency maps, etc.), tell the tape explicitly:

with tf.GradientTape() as tape:
    tape.watch(c)              # "record ops involving c too"
    y = c ** 2
print(tape.gradient(y, c))     # tf.Tensor(6.0, ...)
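
The same mechanism extends to whole input tensors. As a toy stand-in for an input-gradient computation (the score function here is made up for illustration), the gradient of sum(x²) with respect to a watched constant is 2x:

```python
import tensorflow as tf

x = tf.constant([1.0, -2.0, 3.0])        # plain tensor, not a variable

with tf.GradientTape() as tape:
    tape.watch(x)                        # opt this tensor in to recording
    score = tf.reduce_sum(x ** 2)        # toy scalar "score" of the input

input_grad = tape.gradient(score, x)     # same shape as x
print(input_grad.numpy())                # [ 2. -4.  6.]
```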

2. A tape is consumed after one gradient() call. By default the recorded tape is freed the moment you differentiate — same memory-saving logic as PyTorch freeing the graph after .backward(). Need two gradient calls from one forward pass? Make it persistent, and delete it when done:

x = tf.Variable(2.0)
with tf.GradientTape(persistent=True) as tape:
    y = x ** 2                 # dy/dx = 2x
    z = x ** 3                 # dz/dx = 3x²

print(tape.gradient(y, x))     # 4.0
print(tape.gradient(z, x))     # 12.0  ← second call, fine because persistent
del tape                       # release the recorded ops promptly
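
For contrast, this is what happens without persistent=True. The sketch catches the RuntimeError that a second gradient() call raises on a default tape:

```python
import tensorflow as tf

x = tf.Variable(2.0)
with tf.GradientTape() as tape:          # default: non-persistent
    y = x ** 2

first = tape.gradient(y, x)              # fine, and releases the tape's resources
print(first.numpy())                     # 4.0

try:
    tape.gradient(y, x)                  # second call on the same tape
    second_call_failed = False
except RuntimeError as err:
    second_call_failed = True
    print("second call failed:", type(err).__name__)
```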

3. Differentiate with respect to many things at once. Pass a list (or any nested structure) of sources and get gradients back in the same structure — this is exactly the shape of every training loop you’ll ever write:

w = tf.Variable(tf.random.normal((3, 2)))
b = tf.Variable(tf.zeros(2))
x = tf.constant([[1.0, 2.0, 3.0]])            # (1, 3)

with tf.GradientTape() as tape:
    y = x @ w + b                              # (1, 2)
    loss = tf.reduce_mean(y ** 2)

grads = tape.gradient(loss, [w, b])
print(grads[0].shape, grads[1].shape)          # (3, 2) (2,) — same shapes as the sources

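Gradients always have the same shape as their source variable, which is exactly what assign_sub needs. That guarantee holds even when the loss itself is wrong, so matching gradient shapes will not catch the (32, 32) broadcasting trap above; check the shapes going into the loss instead.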
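
The "nested structure" part is worth seeing once. In this sketch the parameters live in a dict (the key names are arbitrary), and the gradients come back as a dict with the same keys:

```python
import tensorflow as tf

params = {
    "w": tf.Variable(tf.ones((3, 2))),
    "b": tf.Variable(tf.zeros(2)),
}
x = tf.constant([[1.0, 2.0, 3.0]])               # (1, 3)

with tf.GradientTape() as tape:
    y = x @ params["w"] + params["b"]            # (1, 2), both entries equal 6
    loss = tf.reduce_mean(y ** 2)

grads = tape.gradient(loss, params)              # dict in, dict out
for name in params:
    print(name, grads[name].shape, params[name].shape)   # shapes match per key
```

Keeping gradients keyed by name can make manual update loops easier to read than positional lists once a model has more than a handful of parameters.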

4. Higher-order gradients are just nested tapes. One line of trivia that occasionally matters (e.g., gradient-penalty losses): wrap a tape in a tape, and the inner gradient() call is itself differentiable.
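
A minimal sketch of that nesting, using y = x³ so both derivatives are easy to check by hand (the tape names inner and outer are illustrative):

```python
import tensorflow as tf

x = tf.Variable(3.0)

with tf.GradientTape() as outer:
    with tf.GradientTape() as inner:
        y = x ** 3
    # Computed inside the outer tape's context, so it is recorded too.
    dy_dx = inner.gradient(y, x)         # 3x² = 27 at x = 3

d2y_dx2 = outer.gradient(dy_dx, x)       # 6x = 18 at x = 3
print(dy_dx.numpy(), d2y_dx2.numpy())
```

The key detail is that inner.gradient() is called while the outer tape is still recording; calling it after the outer block closes would leave the outer tape with nothing to differentiate.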

Eager by default, graphs by decoration: @tf.function

Everything so far ran eagerly — each op executed immediately, Python-debuggable, print works, .numpy() works. That’s TF 2’s default and it’s identical in spirit to PyTorch. But TF keeps its TF 1 superpower behind one decorator: @tf.function compiles a Python function into a graph — a portable, optimizable dataflow program — via a process called tracing.

@tf.function
def dense(x, w, b):
    print("tracing!")               # Python side effect — trace time only
    return tf.nn.relu(x @ w + b)

x = tf.random.normal((4, 3)); w = tf.random.normal((3, 2)); b = tf.zeros(2)

dense(x, w, b)      # prints "tracing!" — first call traces, then runs the graph
dense(x, w, b)      # prints nothing   — cached graph reused

On the first call, TF runs your Python code once with symbolic tensors to record the ops into a graph; every later call with compatible inputs skips Python entirely and executes the graph (fused, pruned, potentially parallelized). That’s where the speed comes from — and where all the confusion comes from. The rules:

flowchart TD
    CALL["call f(args)"] --> SIG{"seen this input<br/>signature before?"}
    SIG -- "yes (same dtypes/shapes,<br/>same Python values)" --> RUN["run cached graph<br/>(fast, no Python)"]
    SIG -- no --> TRACE["TRACE: run Python once,<br/>record ops into new graph"]
    TRACE --> CACHE[("graph cache")]
    CACHE --> RUN
    TRACE -. "Python side effects<br/>(print, list.append)<br/>happen HERE only" .-> WARN["⚠️ once per trace,<br/>not per call"]

Rule 1 — one trace per input signature. For tensor arguments, the signature is (dtype, shape). Same dtype and shape → cached graph. New shape → new trace. Fine, usually.

Rule 2 — Python values are baked in as constants. This is the big one. A Python scalar argument isn’t a tensor — it becomes part of the signature itself, so every distinct value triggers a full retrace:

@tf.function
def scale(x, k):
    print(f"tracing for k={k}")
    return x * k

t = tf.constant([1.0, 2.0])
scale(t, 2)    # tracing for k=2
scale(t, 3)    # tracing for k=3   ← retrace!
scale(t, 4)    # tracing for k=4   ← retrace!  (imagine k = step counter... 💀)

scale(t, tf.constant(2.0))   # traces once for "float32 scalar tensor"
scale(t, tf.constant(3.0))   # cached — same signature, no retrace

Passing a loop counter or learning rate as a Python number into a @tf.function is the classic way to make “graph mode” slower than eager — you pay compilation on every single call. Pass tensors, not Python scalars, for anything that varies. You can also pin the signature explicitly, which turns surprise retraces into loud errors and allows variable batch sizes:

@tf.function(input_signature=[tf.TensorSpec(shape=[None, 3], dtype=tf.float32)])
def forward(x):
    return tf.reduce_sum(x, axis=1)

forward(tf.random.normal((4, 3)))    # OK — None matches any batch size, one trace
forward(tf.random.normal((7, 3)))    # OK — same graph, no retrace
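
To see the effect of pinning, you can count traces with a Python-side list, which is only touched while tracing. This sketch assumes the default tf.function settings and uses illustrative function names:

```python
import tensorflow as tf

unpinned_traces = []     # appended to only while tracing
pinned_traces = []

@tf.function
def row_sum(x):
    unpinned_traces.append(x.shape)      # Python side effect: once per trace
    return tf.reduce_sum(x, axis=1)

@tf.function(input_signature=[tf.TensorSpec(shape=[None, 3], dtype=tf.float32)])
def row_sum_pinned(x):
    pinned_traces.append(x.shape)        # Python side effect: once per trace
    return tf.reduce_sum(x, axis=1)

for batch in (4, 7, 9):
    data = tf.ones((batch, 3))
    row_sum(data)                        # new batch size, new trace
    row_sum_pinned(data)                 # the None axis covers every batch size

print(len(unpinned_traces), len(pinned_traces))   # 3 1
```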

Rule 3 — Python side effects run at trace time, not run time. print(), appending to a list, incrementing a Python counter: all happen once per trace, then never again. Use tf.print() for something that should execute every call, and tf.Variable for state that should mutate every call. Corollary: create variables outside the function — creating a tf.Variable inside a @tf.function on every call is an error by design.
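
Rule 3 in miniature: the sketch below keeps one piece of Python state and one tf.Variable (created outside the function, as the corollary requires) and calls the function three times. The names are illustrative:

```python
import tensorflow as tf

trace_log = []                       # Python state: touched only while tracing
call_count = tf.Variable(0)          # TensorFlow state: created outside the function

@tf.function
def step(x):
    trace_log.append("traced")       # Python side effect, runs at trace time only
    call_count.assign_add(1)         # graph op, runs on every call
    return x * 2.0

for _ in range(3):
    step(tf.constant(1.0))

print(len(trace_log), call_count.numpy())   # 1 3
```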

When to use it: wrap your training step (you’ll see this in the capstone and again in Lesson 4), not tiny utility functions. PyTorch users: this is the moral equivalent of torch.compile, except that tracing semantics, not bytecode analysis, define what gets captured, so data-dependent Python if statements on tensor values don’t work the way you’d hope (AutoGraph converts many of them to tf.cond, but that’s a Lesson 4 story).

Capstone: fitting y = wx + b with a bare tape

Time to assemble all four ideas — tensors for data, variables for parameters, a tape for gradients, assign_sub for updates — into the smallest possible learning system: linear regression by manual gradient descent. No Keras, no optimizer object. When this clicks, model.compile(optimizer='sgd', loss='mse') stops being magic forever.

Stage 1: synthetic data. We pick a ground truth (\(w^*=3\), \(b^*=2\)), then hide it under noise:

tf.random.set_seed(42)

TRUE_W, TRUE_B = 3.0, 2.0
N = 200

x = tf.random.normal((N,))                          # (200,) inputs
noise = tf.random.normal((N,), stddev=0.5)
y = TRUE_W * x + TRUE_B + noise                     # (200,) targets

Everything stays rank-1 on purpose — one feature, no batch axis gymnastics — so the gradient math is legible. (Your exercise below adds the second dimension.)

Stage 2 — parameters and loss. Two scalar variables, initialized wrong on purpose, and mean-squared error:

\[\mathcal{L}(w, b) = \frac{1}{N}\sum_{i=1}^{N}\bigl(w x_i + b - y_i\bigr)^2\]

w = tf.Variable(0.0)    # start far from 3.0
b = tf.Variable(0.0)    # start far from 2.0

def predict(x):
    return w * x + b                     # broadcasting: scalar*(200,) + scalar → (200,)

def mse(y_true, y_pred):
    tf.debugging.assert_shapes([(y_true, ('N',)), (y_pred, ('N',))])  # the habit!
    return tf.reduce_mean(tf.square(y_true - y_pred))

That assert_shapes line is our insurance against the (32, 32) broadcasting trap from earlier — if a stray axis ever sneaks in, we fail loudly instead of learning garbage.

Stage 3 — one gradient-descent step. Forward under the tape, gradients out, assign_sub down the slope:

LR = 0.1

def train_step():
    with tf.GradientTape() as tape:
        loss = mse(y, predict(x))                 # forward pass: recorded
    dw, db = tape.gradient(loss, [w, b])          # backward pass: chain rule
    w.assign_sub(LR * dw)                         # w ← w − lr·∂L/∂w
    b.assign_sub(LR * db)                         # b ← b − lr·∂L/∂b
    return loss

Note what’s inside the with block: only the forward computation. The gradient call and the updates live outside — putting assign_sub inside the tape wastes memory recording ops you’ll never differentiate through, and in more complex setups can record spurious dependencies.

For calibration, here’s what the tape is computing analytically — you could verify dw against this by hand:

\[\frac{\partial \mathcal{L}}{\partial w} = \frac{2}{N}\sum_i (w x_i + b - y_i)\,x_i, \qquad \frac{\partial \mathcal{L}}{\partial b} = \frac{2}{N}\sum_i (w x_i + b - y_i)\]
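
If you would rather let the machine do the verifying, here is one way to compare the tape against those two formulas. It is a self-contained sketch with its own data and arbitrary starting values for w and b, so the names do not collide with the capstone's:

```python
import tensorflow as tf

tf.random.set_seed(0)
x = tf.random.normal((200,))
y = 3.0 * x + 2.0 + tf.random.normal((200,), stddev=0.5)

w = tf.Variable(0.5)                     # arbitrary test point
b = tf.Variable(-1.0)

with tf.GradientTape() as tape:
    residual = w * x + b - y             # (w x_i + b - y_i), shape (200,)
    loss = tf.reduce_mean(tf.square(residual))
dw_tape, db_tape = tape.gradient(loss, [w, b])

# The analytic formulas: (2/N) * sum(residual * x) and (2/N) * sum(residual).
dw_manual = 2.0 * tf.reduce_mean(residual * x)
db_manual = 2.0 * tf.reduce_mean(residual)

tf.debugging.assert_near(dw_tape, dw_manual, atol=1e-3)
tf.debugging.assert_near(db_tape, db_manual, atol=1e-3)
print("tape and analytic gradients agree")
```

The tolerance is loose on purpose: both sides are float32 reductions over 200 terms, so the last few digits can differ.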

Stage 4 — the loop.

for epoch in range(51):
    loss = train_step()
    if epoch % 10 == 0:
        print(f"epoch {epoch:3d}  loss={loss:.4f}  w={w.numpy():.3f}  b={b.numpy():.3f}")

Expected output (with the seed above yours should be close; exact digits can vary with TensorFlow version and hardware):

epoch   0  loss=13.3462  w=0.611  b=0.404
epoch  10  loss=0.5251  w=2.687  b=1.783
epoch  20  loss=0.2637  w=2.951  b=1.966
epoch  30  loss=0.2531  w=2.985  b=1.990
epoch  40  loss=0.2527  w=2.990  b=1.994
epoch  50  loss=0.2526  w=2.990  b=1.994

The loss floors near \(0.25 = \sigma_{\text{noise}}^2\) — we can’t beat the noise we injected, which is exactly right — and \((w, b)\) lands on the hidden \((3, 2)\). You just trained a model with four concepts and zero framework machinery.
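
A useful cross-check is that gradient descent on MSE should converge to the ordinary least-squares line, which NumPy can compute in closed form. The sketch below regenerates the data, runs more steps than the lesson's loop so that convergence is not in question, and compares against np.polyfit; it is a sanity check under those assumptions, not part of the original capstone:

```python
import numpy as np
import tensorflow as tf

tf.random.set_seed(42)
x = tf.random.normal((200,))
y = 3.0 * x + 2.0 + tf.random.normal((200,), stddev=0.5)

w = tf.Variable(0.0)
b = tf.Variable(0.0)
for _ in range(200):                              # plain gradient descent, LR = 0.1
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(y - (w * x + b)))
    dw, db = tape.gradient(loss, [w, b])
    w.assign_sub(0.1 * dw)
    b.assign_sub(0.1 * db)

# Closed-form least squares on the same data (degree-1 polynomial fit).
slope, intercept = np.polyfit(x.numpy().astype(np.float64),
                              y.numpy().astype(np.float64), deg=1)

print(f"gradient descent: w={w.numpy():.4f}  b={b.numpy():.4f}")
print(f"least squares:    w={slope:.4f}  b={intercept:.4f}")
```

The two lines should agree to several decimal places. Neither will be exactly (3, 2), because both are fitting the noisy sample rather than the hidden ground truth.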

Stage 5 — graph-compile the step. One decorator, everything else unchanged:

@tf.function
def train_step_fast():
    with tf.GradientTape() as tape:
        loss = mse(y, predict(x))
    dw, db = tape.gradient(loss, [w, b])
    w.assign_sub(LR * dw)
    b.assign_sub(LR * db)
    return loss

GradientTape works inside @tf.function: the differentiation logic gets traced into the graph like any other op. Note the function takes no Python-scalar arguments that vary (rule 2), reads x, y by closure, and mutates w, b through variables (rule 3): a model citizen of tracing. On a toy this size the speedup is modest; on a real model the per-step Python overhead you’re deleting is substantial. This exact pattern, a tape inside a decorated step function, is the skeleton of every custom training loop you’ll write in Lesson 4.
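
If you want to measure the difference on your own machine, a rough timing harness could look like this. It is a sketch with noise-free toy data and illustrative names; the numbers depend heavily on hardware and TensorFlow version, so no particular speedup is implied:

```python
import timeit
import tensorflow as tf

x = tf.random.normal((200,))
y = 3.0 * x + 2.0                        # noise-free toy data
w = tf.Variable(0.0)
b = tf.Variable(0.0)

def step():
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(y - (w * x + b)))
    dw, db = tape.gradient(loss, [w, b])
    w.assign_sub(0.1 * dw)
    b.assign_sub(0.1 * db)
    return loss

graph_step = tf.function(step)           # same Python function, graph-compiled
graph_step()                             # warm-up call pays the one-time tracing cost

eager_s = timeit.timeit(step, number=200)
graph_s = timeit.timeit(graph_step, number=200)
print(f"eager: {eager_s:.3f}s  graph: {graph_s:.3f}s  w={w.numpy():.3f}  b={b.numpy():.3f}")
```

The warm-up call matters: without it, the one-time tracing cost would be counted against the graph version.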

🧪 Your task

Extend the capstone to two features: fit \(y = w_1 x_1 + w_2 x_2 + b\) with ground truth \(\mathbf{w}^* = (3, -2)\), \(b^* = 1\). Requirements:

Generate X of shape (200, 2) and targets y of shape (200,) (use tf.linalg.matvec(X, TRUE_W) + TRUE_B + noise).
Use a single variable w = tf.Variable(tf.zeros(2)), not two scalars, and tf.linalg.matvec(X, w) in the prediction. Unlike NumPy’s @, tf.matmul (and therefore X @ w) requires both operands to have rank 2 or higher, so multiplying a (200, 2) matrix by a (2,) vector with @ raises an error.
Wrap the training step in @tf.function and confirm (with a print inside) that it traces exactly once.
Recover the true parameters to within ~0.05.

Hint: tf.linalg.matvec(X, w) needs w of shape (2,) and produces shape (200,). Check that your prediction and target shapes match before the loop, or the broadcasting trap from the ops section will produce a (200, 200) error matrix and a loss that “converges” to nonsense.

Solution

import tensorflow as tf

tf.random.set_seed(0)

TRUE_W = tf.constant([3.0, -2.0])
TRUE_B = 1.0
N = 200

X = tf.random.normal((N, 2))                              # (200, 2)
noise = tf.random.normal((N,), stddev=0.5)
y = tf.linalg.matvec(X, TRUE_W) + TRUE_B + noise          # (200, 2) times (2,) → (200,)

w = tf.Variable(tf.zeros(2))                              # (2,)
b = tf.Variable(0.0)
LR = 0.1

@tf.function
def train_step():
    print("tracing!")                                     # must appear exactly once
    with tf.GradientTape() as tape:
        y_pred = tf.linalg.matvec(X, w) + b               # (200,); X @ w fails for a rank-1 w
        tf.debugging.assert_shapes([(y_pred, ('N',)), (y, ('N',))])
        loss = tf.reduce_mean(tf.square(y - y_pred))
    dw, db = tape.gradient(loss, [w, b])                  # dw: (2,), db: ()
    w.assign_sub(LR * dw)
    b.assign_sub(LR * db)
    return loss

for epoch in range(101):
    loss = train_step()
    if epoch % 25 == 0:
        print(f"epoch {epoch:3d}  loss={loss:.4f}  w={w.numpy()}  b={b.numpy():.3f}")

# sanity check — fails loudly if learning broke
assert abs(b.numpy() - TRUE_B) < 0.05
assert all(abs(w.numpy() - TRUE_W.numpy()) < 0.05)
print("recovered:", w.numpy(), b.numpy())

Expected: tracing! prints once (before epoch 0’s result), loss drops from ~14 to ~0.25 (the noise floor), and the final line reads approximately recovered: [ 2.99 -1.98] 0.99. If you see tracing! more than once, a Python value is leaking into the function’s signature; if the loss stalls high, print y_pred.shape — you’ve almost certainly got a (200, 1) vs (200,) mismatch.

Key takeaways

tf.Tensor is immutable, defaults to float32/int32, and never silently promotes dtypes — cast explicitly at boundaries (especially NumPy’s float64).
tf.Variable is trainable state: update with assign / assign_add / assign_sub, never by rebinding the Python name.
Broadcasting right-aligns shapes and stretches size-1 axes; the (N,) vs (N,1) mismatch in a loss is silent and deadly — assert shapes.
tf.GradientTape records ops only inside its context; it auto-watches trainable variables, needs tape.watch() for constants, and needs persistent=True for multiple gradient() calls.
@tf.function traces Python into a cached graph: one trace per input signature, Python scalars bake in (retrace per value — pass tensors), Python side effects run at trace time only.
Training = forward under a tape → tape.gradient → assign_sub. Everything Keras adds from here is convenience around that loop.

In the next lesson: the same model three ways — Sequential, Functional, and subclassed — and how Keras 3 decides which API you actually need.
