flowchart LR
A[("Raw data<br/>CSV / images / DB")] --> B["Dataset<br/>__len__ · __getitem__<br/>(one sample at a time)"]
B --> C["Sampler<br/>draws indices<br/>(shuffled each epoch)"]
C --> D["Worker processes<br/>num_workers × __getitem__"]
D --> E["collate_fn<br/>stack samples → batch tensors"]
E --> F["pin_memory<br/>page-locked staging"]
F --> G[["Training loop<br/>for xb, yb in loader"]]
🔥 Deep Learning with PyTorch · Lesson 3 — Feeding the Model: Dataset & DataLoader
Lesson 3 — Feeding the Model: Dataset & DataLoader
In the previous lesson you built an nn.Module and called it on tensors you created by hand. That works for a demo, but real training data doesn’t arrive as one convenient tensor sitting in memory — it lives in CSV files, image folders, databases, and it’s usually too big, too messy, or too slow to load all at once. In this lesson you build the pipe that connects raw data on disk to the forward() you wrote in the previous lesson: PyTorch’s Dataset and DataLoader. By the end you’ll have a complete tabular pipeline — CSV file → custom Dataset → train/val split → shuffled, batched, GPU-ready mini-batches — which is exactly what Lesson 4’s training loop will consume.
🎯 In this lesson you will: write a custom Dataset with __len__ and __getitem__, apply transforms without leaking validation statistics, split data with random_split, configure a DataLoader (batching, shuffling, num_workers, pin_memory), and assemble an end-to-end tabular pipeline that feeds the previous lesson’s model.
Why two classes? The division of labor
PyTorch splits data loading into two responsibilities, and the split is the whole design:
- Dataset answers exactly two questions: how many samples do you have? and give me sample number i. It knows nothing about batching, shuffling, or parallelism.
- DataLoader takes any Dataset and handles everything else: drawing indices (shuffled or not), fetching samples (possibly in parallel worker processes), stacking them into batches, and staging them for the GPU.
This separation is why the same DataLoader machinery works for images, text, audio, and tabular rows — your job is only ever to implement the two-method Dataset contract, and you get batching, shuffling, and multiprocessing for free.
Keep the flowchart at the top of this lesson in mind: everything in this lesson is one box in that diagram.
The Dataset contract: __len__ and __getitem__
A Dataset is any class that subclasses torch.utils.data.Dataset and implements two dunder methods. Here is the smallest one that could possibly work:
```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Maps i -> (i, i**2). Deliberately trivial to expose the contract."""

    def __init__(self, n: int):
        self.n = n

    def __len__(self) -> int:
        return self.n

    def __getitem__(self, idx: int):
        x = torch.tensor([float(idx)])
        y = torch.tensor([float(idx) ** 2])
        return x, y


ds = SquaresDataset(5)
print(len(ds))  # 5
print(ds[3])    # (tensor([3.]), tensor([9.]))
```

Three rules govern __getitem__, and violating any of them is a classic source of silent bugs:
- It must be a pure function of idx. Calling ds[3] twice must return the same thing (random augmentation is the sanctioned exception, covered below). If __getitem__ mutates state, parallel workers will each mutate their own copy of that state and you’ll get inconsistent, unreproducible behavior.
- It returns one sample, not a batch. The DataLoader stacks samples for you. If you return shape (1, 8) “to be safe”, your batches come out (batch, 1, 8) instead of (batch, 8). The previous lesson’s nn.Linear(8, ...) will not even crash on that: it applies to the last dimension and emits (batch, 1, 1), which then broadcasts against a (batch, 1) target in the loss and trains garbage.
- Return tensors (or things collatable into tensors). Floats, ints, numpy arrays, and tensors all collate fine; arbitrary Python objects need a custom collate_fn.
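The second rule is easy to check empirically. The following is a minimal sketch, not part of the lesson's pipeline: GoodRows and BadRows are hypothetical toy datasets over random data, and the only difference between them is an extra unsqueeze(0) in __getitem__.

```python
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader


class GoodRows(Dataset):
    """Hypothetical toy dataset: each sample has shape (8,)."""

    def __init__(self, n: int = 16):
        self.X = torch.randn(n, 8)

    def __len__(self) -> int:
        return self.X.shape[0]

    def __getitem__(self, idx: int):
        return self.X[idx]                # one sample, shape (8,)


class BadRows(GoodRows):
    """Same data, but each sample carries a stray leading dimension."""

    def __getitem__(self, idx: int):
        return self.X[idx].unsqueeze(0)   # shape (1, 8): already "batched"


good = next(iter(DataLoader(GoodRows(), batch_size=4)))
bad = next(iter(DataLoader(BadRows(), batch_size=4)))
out = nn.Linear(8, 1)(bad)                # no error: Linear acts on the last dim

print(good.shape)  # torch.Size([4, 8])
print(bad.shape)   # torch.Size([4, 1, 8])
print(out.shape)   # torch.Size([4, 1, 1])
```

Nothing raises here, which is the point: the extra dimension survives the forward pass and only causes trouble later, when the output meets a (batch, 1) target in the loss.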
The index-based contract is called a map-style dataset. (There is also IterableDataset for streaming sources with no known length — you won’t need it for anything in this course.)
A real pipeline: tabular data from CSV
Time to build the real thing. We’ll manufacture a realistic tabular regression problem — predicting a house price from 8 numeric features — write it to a CSV so the pipeline starts from disk like it would in practice, and then load it properly.
Stage 1: create the raw data file
```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
N = 5_000

# 8 features on wildly different scales -- on purpose.
df = pd.DataFrame({
    "sqft": rng.normal(1800, 600, N).clip(300),
    "bedrooms": rng.integers(1, 6, N).astype(float),
    "bathrooms": rng.integers(1, 4, N).astype(float),
    "age_years": rng.uniform(0, 90, N),
    "lot_sqft": rng.normal(9000, 3000, N).clip(500),
    "garage": rng.integers(0, 3, N).astype(float),
    "dist_center": rng.exponential(8.0, N),  # km to city center
    "crime_idx": rng.uniform(0, 100, N),
})

# Ground-truth relationship + noise (the model must rediscover this)
price = (
    150 * df["sqft"] + 12_000 * df["bathrooms"] + 8_000 * df["bedrooms"]
    - 900 * df["age_years"] + 2.5 * df["lot_sqft"] + 15_000 * df["garage"]
    - 4_000 * df["dist_center"] - 700 * df["crime_idx"]
    + rng.normal(0, 25_000, N)
)
df["price"] = price.clip(20_000)
df.to_csv("houses.csv", index=False)
print(df.head(3).round(1))
```

```
     sqft  bedrooms  bathrooms  age_years  lot_sqft  garage  dist_center  crime_idx     price
0  1982.9       4.0        1.0       67.9   10464.7     1.0          3.5       76.2  310509.4
1  1176.0       2.0        3.0       25.3    8532.7     2.0          6.6       54.9  270527.1
2  2250.3       5.0        2.0       48.2    5310.6     0.0          1.5       41.6  345254.6
```
Note the scales: sqft lives around 1800, crime_idx around 50, garage around 1. Feed these raw into a network and gradient descent has a miserable time — the loss surface is a stretched-out ravine because a weight on sqft gets gradients ~1800× larger than a weight on garage. The fix is standardization:
\[ z = \frac{x - \mu}{\sigma} \]
Every feature ends up with mean 0 and standard deviation 1, so all weights see gradients of comparable magnitude. We’ll standardize the target too (its values are in the hundreds of thousands, which would make the MSE loss astronomically large and the early gradients explosive).
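To make the formula concrete, here is a small worked sketch on made-up numbers (three hypothetical rows with a sqft-like column and a garage-like column; the values are chosen for easy arithmetic and are not taken from houses.csv). Note that torch.std uses the sample standard deviation (Bessel's correction) by default.

```python
import torch

# Toy matrix: column 0 is on a "sqft" scale, column 1 on a "garage" scale.
X = torch.tensor([[1800.0, 1.0],
                  [2400.0, 2.0],
                  [1200.0, 0.0]])

mu = X.mean(dim=0)       # per-column mean
sigma = X.std(dim=0)     # per-column sample standard deviation
Z = (X - mu) / sigma     # z = (x - mu) / sigma, broadcast over rows

print(mu)     # tensor([1800.,    1.])
print(sigma)  # tensor([600.,   1.])
print(Z)      # rows: [0, 0], [1, 1], [-1, -1]
```

After standardization the two columns are numerically identical even though the raw values differed by three orders of magnitude, which is what puts their weights on an equal footing.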
Stage 2: the custom Dataset
```python
import pandas as pd
import torch
from torch.utils.data import Dataset


class HousesDataset(Dataset):
    """Tabular dataset backed by a CSV. Holds raw tensors; normalization
    stats are injected AFTER the train/val split (see next section)."""

    def __init__(self, csv_path: str):
        df = pd.read_csv(csv_path)
        # float32, not float64: nn.Linear weights are float32 by default,
        # and float64 inputs would raise a dtype mismatch in forward().
        self.X = torch.tensor(df.drop(columns="price").values, dtype=torch.float32)
        self.y = torch.tensor(df["price"].values, dtype=torch.float32).unsqueeze(1)
        self.feature_names = list(df.columns[:-1])
        self.x_mean = self.x_std = None  # set later, from TRAIN data only
        self.y_mean = self.y_std = None

    def set_stats(self, x_mean, x_std, y_mean, y_std):
        self.x_mean, self.x_std = x_mean, x_std
        self.y_mean, self.y_std = y_mean, y_std

    def __len__(self) -> int:
        return self.X.shape[0]

    def __getitem__(self, idx: int):
        x, y = self.X[idx], self.y[idx]
        if self.x_mean is not None:  # transform-on-read
            x = (x - self.x_mean) / self.x_std
            y = (y - self.y_mean) / self.y_std
        return x, y


ds = HousesDataset("houses.csv")
x0, y0 = ds[0]
print(len(ds), x0.shape, x0.dtype, y0.shape)
# 5000 torch.Size([8]) torch.float32 torch.Size([1])
```

Design decisions worth pausing on:
- Load once, index cheaply. 5,000 × 9 floats is ~180 KB. It fits in RAM thousands of times over, so we parse the CSV once in __init__ and __getitem__ is just tensor indexing. If your table had 500 million rows you’d instead memory-map or read chunks lazily in __getitem__; the contract doesn’t change, only where the I/O happens.
- unsqueeze(1) on the target. df["price"].values gives shape (N,); the model’s output will be (batch, 1). If you leave the target as (batch,), MSE loss will broadcast (batch, 1) against (batch,) into a (batch, batch) matrix and quietly compute nonsense. This is the single most common tabular-PyTorch bug; PyTorch even warns about it, so don’t ignore that warning.
- Normalization applied in __getitem__, not baked into self.X. This “transform-on-read” pattern is exactly how torchvision transforms work for images, and it lets us set the stats after splitting, which brings us to the leak.
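The broadcasting bug from the second point is worth seeing once on numbers small enough to check by hand. This is an illustrative sketch with made-up tensors: the "predictions" are exactly equal to the targets, so a correct loss must be zero.

```python
import torch
from torch import nn

pred = torch.tensor([[1.0], [2.0], [3.0]])     # model output, shape (3, 1)
target_bad = torch.tensor([1.0, 2.0, 3.0])     # target left as shape (3,)
target_ok = target_bad.unsqueeze(1)            # shape (3, 1)

loss_fn = nn.MSELoss()

# (3, 1) - (3,) broadcasts to a 3x3 matrix of all pairwise differences.
print((pred - target_bad).shape)               # torch.Size([3, 3])

good = loss_fn(pred, target_ok)                # elementwise, as intended
bad = loss_fn(pred, target_bad)                # mean over the 3x3 matrix

print(good.item())   # 0.0
print(bad.item())    # about 1.3333 (= 12 / 9)
```

The mismatched call reports a loss of 12/9 for perfect predictions, because it averages the squared differences between every prediction and every target rather than matching them row by row.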
Stage 3: random_split — and the data leak everyone commits once
We need a validation set the model never trains on, so we can detect overfitting in Lesson 4. random_split partitions a dataset into random, non-overlapping Subsets:
```python
from torch.utils.data import random_split

train_ds, val_ds = random_split(
    ds, [0.8, 0.2],                                # fractions work directly
    generator=torch.Generator().manual_seed(42),   # reproducible split
)
print(len(train_ds), len(val_ds))  # 4000 1000
```

A Subset is a thin view: it stores a reference to the parent dataset plus a list of indices, and train_ds[i] simply calls ds[train_ds.indices[i]]. No data is copied.
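The "thin view" claim can be verified directly. The sketch below uses a hypothetical ten-element TensorDataset instead of the houses data, and integer split lengths so it also runs on older PyTorch versions that do not accept fractions.

```python
import torch
from torch.utils.data import TensorDataset, random_split

base = TensorDataset(torch.arange(10.0))       # toy parent dataset
part_a, part_b = random_split(
    base, [8, 2], generator=torch.Generator().manual_seed(0)
)

# A Subset keeps a reference to the parent plus a plain list of indices.
print(part_a.dataset is base)                  # True
print(type(part_a.indices))                    # <class 'list'>

# Indexing the subset forwards to the parent through that index list.
first_parent_idx = part_a.indices[0]
same = part_a[0][0].item() == base[first_parent_idx][0].item()
print(same)                                    # True

# Together the two subsets cover every parent index exactly once.
covered = sorted(part_a.indices + part_b.indices)
print(covered == list(range(10)))              # True
```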
Now the subtle part. Where do the normalization statistics \(\mu, \sigma\) come from? If you compute them over all 5,000 rows, information about the validation set leaks into the training pipeline — the train samples get scaled using numbers that partially describe validation data. For standardization the damage is usually small, but the habit is deadly (do the same with, say, target encoding and your validation score becomes fiction). The rule is absolute: fit statistics on train, apply everywhere.
```python
train_idx = torch.tensor(train_ds.indices)
X_train_raw = ds.X[train_idx]  # raw train features only
y_train_raw = ds.y[train_idx]

ds.set_stats(
    x_mean=X_train_raw.mean(dim=0),                # shape (8,)
    x_std=X_train_raw.std(dim=0).clamp_min(1e-8),  # guard: constant column -> std 0 -> div by zero -> NaN
    y_mean=y_train_raw.mean(),
    y_std=y_train_raw.std(),
)

x0, y0 = train_ds[0]
print(x0)  # values now roughly in [-3, 3]
```

```
tensor([ 0.3521, -1.1032, 1.2270, -0.6721, 0.4816, 1.1741, -0.2698, 0.1523])
```
Because both Subsets point at the same parent ds, setting stats once normalizes both — train stats applied to val samples, which is exactly what will happen at inference time with real unseen data. The clamp_min(1e-8) guard matters: one constant feature column gives \(\sigma = 0\), division produces inf/nan, and the first loss.backward() poisons every weight in the network. NaNs in the data pipeline are far more common than NaNs from the math.
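A tiny sketch of the failure the guard prevents, using a made-up matrix whose second column is constant. With an exactly constant column both the numerator and the denominator are zero, so the unguarded division yields NaN.

```python
import torch

# Toy features: column 1 is constant, so its standard deviation is 0.
X = torch.tensor([[1.0, 5.0],
                  [2.0, 5.0],
                  [3.0, 5.0]])

mu, sigma = X.mean(dim=0), X.std(dim=0)
print(sigma)                                  # tensor([1., 0.])

unsafe = (X - mu) / sigma                     # 0 / 0 in column 1 -> nan
safe = (X - mu) / sigma.clamp_min(1e-8)       # 0 / 1e-8 in column 1 -> 0

print(torch.isnan(unsafe).any().item())       # True
print(torch.isnan(safe).any().item())         # False
print(safe[:, 1])                             # tensor([0., 0., 0.])
```

With the clamp, the constant column simply becomes a column of zeros: it carries no information, but it no longer contaminates the loss.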
What about random augmentation — the sanctioned impurity mentioned earlier? For tabular data a common trick is adding small Gaussian noise to training inputs. The clean pattern is a wrapper dataset, so augmentation touches only the train subset:
```python
class NoisyDataset(Dataset):
    """Adds N(0, sigma) noise to features. Wrap the TRAIN subset only --
    augmenting validation data would corrupt your measurement."""

    def __init__(self, base: Dataset, sigma: float = 0.05):
        self.base, self.sigma = base, sigma

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        x, y = self.base[idx]
        return x + torch.randn_like(x) * self.sigma, y


train_aug = NoisyDataset(train_ds, sigma=0.05)
```

Composition over modification: NoisyDataset works on any dataset that returns (x, y), and the validation path is untouched.
DataLoader: from samples to mini-batches
The DataLoader is where single samples become the (batch, features) tensors your model actually consumes:
```python
from torch.utils.data import DataLoader

train_loader = DataLoader(
    train_aug,
    batch_size=64,
    shuffle=True,     # reshuffle indices every epoch -- train only!
    drop_last=True,   # discard the final short batch
)
val_loader = DataLoader(
    val_ds,
    batch_size=256,   # no backward pass -> no activation memory -> go bigger
    shuffle=False,    # order doesn't matter for evaluation; keep it deterministic
)

xb, yb = next(iter(train_loader))
print(xb.shape, yb.shape)
```

```
torch.Size([64, 8]) torch.Size([64, 1])
```
Under the hood, one batch is produced like this: the sampler draws 64 indices (a fresh permutation each epoch because shuffle=True), __getitem__ is called 64 times yielding 64 (x, y) pairs, and the default collate function stacks them. 64 tensors of shape (8,) become one tensor of shape (64, 8), and 64 targets of shape (1,) become (64, 1).
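Those three steps can be reproduced by hand. The sketch below is a simplification for illustration, not the DataLoader's actual implementation: it uses a hypothetical 100-row TensorDataset of random values, draws indices with torch.randperm in place of the sampler, and compares a manual torch.stack against PyTorch's default_collate.

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.dataloader import default_collate

# Toy stand-in for the houses data: 100 samples, 8 features, 1 target.
toy = TensorDataset(torch.randn(100, 8), torch.randn(100, 1))

# 1. Sampler step: a fresh permutation, of which the first 64 form a batch.
g = torch.Generator().manual_seed(0)
indices = torch.randperm(len(toy), generator=g)[:64].tolist()

# 2. Fetch step: one __getitem__ call per index -> list of (x, y) pairs.
samples = [toy[i] for i in indices]

# 3. Collate step: stack the xs together and the ys together.
xs, ys = zip(*samples)
xb_manual, yb_manual = torch.stack(xs), torch.stack(ys)

# The default collate function produces the same tensors.
xb_auto, yb_auto = default_collate(samples)

print(xb_manual.shape, yb_manual.shape)   # torch.Size([64, 8]) torch.Size([64, 1])
print(torch.equal(xb_manual, xb_auto))    # True
```

A custom collate_fn is just a replacement for step 3: any callable that takes the list of samples and returns the batch.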
The number of batches per epoch is \(\lceil N/B \rceil\) normally, or \(\lfloor N/B \rfloor\) with drop_last=True:
```python
print(len(train_loader))  # 62  (floor(4000/64); last 32 samples dropped this epoch)
print(len(val_loader))    # 4   (ceil(1000/256); short final batch kept)
```

Why the different settings per loader?
| Setting | Train loader | Val loader | Why |
|---|---|---|---|
| shuffle | True | False | SGD needs a different sample order each epoch to decorrelate consecutive gradient steps; evaluation is order-independent, and determinism aids debugging |
| batch_size | 64 | 256 | Training stores activations for backward; eval under no_grad doesn’t, so memory allows much larger batches |
| drop_last | True | False | A tiny trailing batch (e.g. 3 samples) gives a noisy gradient, and a trailing batch of size 1 crashes BatchNorm in Lesson 7. In eval, dropping samples would bias your metric |
Note that drop_last does not systematically exclude any sample: with shuffle=True the permutation changes every epoch, so the “last 32” rows discarded by drop_last are different rows each time.
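One way to convince yourself is to record which rows go missing in each epoch. This is a small sketch on a hypothetical ten-row dataset with batch_size=4, so exactly two rows are dropped per epoch; which two depends on the shuffle, so no specific output is shown.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

toy = TensorDataset(torch.arange(10))          # rows are just their own ids
loader = DataLoader(
    toy, batch_size=4, shuffle=True, drop_last=True,
    generator=torch.Generator().manual_seed(0),
)

dropped_per_epoch = []
for epoch in range(3):
    # Each batch is a one-element list [tensor_of_ids]; gather all ids seen.
    seen = torch.cat([ids for (ids,) in loader]).tolist()
    dropped = sorted(set(range(10)) - set(seen))
    dropped_per_epoch.append(dropped)
    print(f"epoch {epoch}: dropped rows {dropped}")

print(len(loader))  # 2  (floor(10 / 4))
```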
Feeding the GPU: num_workers and pin_memory
Two flags exist purely for throughput, and they matter once your GPU is fast enough to starve:
```python
train_loader = DataLoader(
    train_aug,
    batch_size=64,
    shuffle=True,
    drop_last=True,
    num_workers=4,            # 4 subprocesses call __getitem__ in parallel
    pin_memory=True,          # allocate batches in page-locked RAM
    persistent_workers=True,  # keep workers alive across epochs
)
```

- num_workers=0 (the default) means the main process fetches every sample: the GPU sits idle while Python decodes the next batch, then Python sits idle while the GPU computes. num_workers=4 starts four worker processes that prefetch batches ahead of consumption, overlapping data prep with GPU compute. For our in-RAM tabular data the gain is negligible (indexing a tensor already takes only microseconds, so here num_workers=0 is genuinely the right choice); for image datasets doing JPEG decode + augmentation per sample it’s routinely a 3–5× epoch speedup. Start at 4, tune by watching GPU utilization.
- pin_memory=True allocates batch tensors in page-locked (“pinned”) host RAM. CUDA can DMA from pinned memory asynchronously, which unlocks xb.to(device, non_blocking=True): the copy overlaps with computation instead of blocking it. It does nothing useful on CPU-only runs.
- persistent_workers=True avoids paying worker startup cost at every epoch boundary, which is noticeable when epochs are short.
Two platform gotchas that will bite you if unmentioned: on macOS and Windows, workers start via spawn, which re-imports your script, so any script using num_workers > 0 must guard its entry point with if __name__ == "__main__":. Without the guard, every worker re-runs the script’s top-level code (including the code that creates the workers), and Python typically aborts with a RuntimeError about starting a new process before bootstrapping has finished. And each worker holds its own copy of the dataset object, so a dataset that loads 10 GB into RAM in __init__ can cost 10 GB per worker, which is another reason big datasets read lazily in __getitem__.
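These flags interact: persistent_workers=True is rejected by DataLoader when num_workers is 0, and pinning is only useful when CUDA is present. One way to keep the settings consistent is a small factory. The make_loader helper below is a hypothetical convenience for illustration, not a PyTorch API, and the dataset is random toy data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(dataset, batch_size: int = 64, num_workers: int = 0) -> DataLoader:
    """Hypothetical helper: build a train loader with consistent throughput flags."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        drop_last=True,
        num_workers=num_workers,
        pin_memory=torch.cuda.is_available(),   # pinning only helps with a GPU
        persistent_workers=num_workers > 0,     # must be False when num_workers == 0
    )


toy = TensorDataset(torch.randn(256, 8), torch.randn(256, 1))
device = "cuda" if torch.cuda.is_available() else "cpu"

loader = make_loader(toy)                       # num_workers=0: safe on every platform
xb, yb = next(iter(loader))
xb = xb.to(device, non_blocking=True)           # async copy when the source is pinned
yb = yb.to(device, non_blocking=True)
print(xb.shape, yb.shape)                       # torch.Size([64, 8]) torch.Size([64, 1])
```

When calling make_loader with num_workers > 0 from a script, the call belongs under the if __name__ == "__main__": guard described above.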
End to end: the pipeline meets the previous lesson’s model
Everything assembled, plus a minimal consumption loop proving the pipe delivers. (The proper training loop — metrics, checkpointing, schedulers — is Lesson 4’s whole agenda; this is deliberately bare.)
```python
import torch
from torch import nn
from torch.utils.data import DataLoader, random_split


def build_pipeline(csv_path: str, batch_size: int = 64, seed: int = 42):
    ds = HousesDataset(csv_path)
    train_ds, val_ds = random_split(
        ds, [0.8, 0.2], generator=torch.Generator().manual_seed(seed)
    )
    # Fit normalization on train rows only, apply globally.
    idx = torch.tensor(train_ds.indices)
    ds.set_stats(
        ds.X[idx].mean(0), ds.X[idx].std(0).clamp_min(1e-8),
        ds.y[idx].mean(), ds.y[idx].std(),
    )
    train_loader = DataLoader(train_ds, batch_size=batch_size,
                              shuffle=True, drop_last=True)
    val_loader = DataLoader(val_ds, batch_size=4 * batch_size, shuffle=False)
    return train_loader, val_loader, ds


if __name__ == "__main__":  # required if you ever set num_workers > 0
    device = "cuda" if torch.cuda.is_available() else "cpu"
    train_loader, val_loader, ds = build_pipeline("houses.csv")

    model = nn.Sequential(  # Lesson 2's MLP, sized for 8 features
        nn.Linear(8, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 1),
    ).to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    for epoch in range(3):
        model.train()
        for xb, yb in train_loader:  # <- the pipeline, consumed
            xb = xb.to(device, non_blocking=True)
            yb = yb.to(device, non_blocking=True)
            loss = loss_fn(model(xb), yb)
            opt.zero_grad()
            loss.backward()
            opt.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(
                loss_fn(model(xb.to(device)), yb.to(device)).item() * len(xb)
                for xb, yb in val_loader
            ) / len(val_loader.dataset)
        print(f"epoch {epoch}: train_loss={loss.item():.4f} val_loss={val_loss:.4f}")
```

```
epoch 0: train_loss=0.0511 val_loss=0.0623
epoch 1: train_loss=0.0343 val_loss=0.0489
epoch 2: train_loss=0.0338 val_loss=0.0462
```
Read the losses knowing the target is standardized: an MSE of 0.046 in \(z\)-space means the RMSE is \(\sqrt{0.046} \approx 0.21\) standard deviations of price. Multiply back by ds.y_std (~$95k) and the model’s typical error is roughly $20k after three epochs. Note that RMSE is a typical error size, not a hard bound. The pipeline works; the model learns.
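The conversion back to dollars is plain arithmetic. The sketch below assumes the two numbers quoted in the text: the epoch-2 validation loss of 0.0462 and a target standard deviation of about $95k (in a real run you would read ds.y_std instead of hard-coding it). The to_dollars helper and its example y_mean are hypothetical and only illustrate undoing the standardization for a single prediction.

```python
import math

val_mse_z = 0.0462        # validation MSE in standardized units (from the log above)
y_std = 95_000.0          # ASSUMPTION: approximate ds.y_std quoted in the text

rmse_z = math.sqrt(val_mse_z)      # typical error, in standard deviations
rmse_dollars = rmse_z * y_std      # same error, converted back to dollars

print(f"RMSE = {rmse_z:.3f} sd = ${rmse_dollars:,.0f}")


def to_dollars(pred_z: float, y_mean: float, y_std: float) -> float:
    """Invert z = (y - mean) / std for one standardized prediction."""
    return pred_z * y_std + y_mean


# Hypothetical example: a prediction half a standard deviation above the mean,
# with a made-up mean price of $300k.
print(to_dollars(0.5, y_mean=300_000.0, y_std=y_std))   # 347500.0
```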
One last structural point: notice that the training loop never mentions CSVs, normalization, splits, or shuffling. It sees only for xb, yb in loader. Swap HousesDataset for an image dataset in the next lesson and the loop doesn’t change a character. That interface stability is the payoff of this lesson’s work.
🧪 Your task
Time-series data breaks the “one row = one sample” assumption: a sample is a window of consecutive rows, and windows overlap. Write WindowDataset, a map-style dataset that wraps a 1-D tensor series and, for a window length w, returns (series[i : i+w], series[i+w]) — the window as input, the next value as the target. Concretely:
- WindowDataset(torch.arange(10.0), w=3) must have len == 7
- ds[0] returns (tensor([0., 1., 2.]), tensor(3.))
- ds[6] returns (tensor([6., 7., 8.]), tensor(9.)), and ds[7] raises IndexError
Then batch it with a DataLoader(batch_size=4) and confirm the batch shapes are (4, 3) and (4,).
Hint: the entire exercise is getting __len__ right. A series of length \(N\) with window \(w\) yields \(N - w\) valid windows — the target for the last window must still exist inside the series. Off-by-one here is invisible until ds[len(ds)-1] reads past the end.
Solution
```python
import torch
from torch.utils.data import Dataset, DataLoader


class WindowDataset(Dataset):
    def __init__(self, series: torch.Tensor, w: int):
        assert series.ndim == 1, "expected a 1-D series"
        assert len(series) > w, "series too short for even one window"
        self.series, self.w = series, w

    def __len__(self) -> int:
        # N - w windows: window i covers [i, i+w), target at i+w,
        # so the largest valid i is N - w - 1.
        return len(self.series) - self.w

    def __getitem__(self, idx: int):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        x = self.series[idx : idx + self.w]  # shape (w,)
        y = self.series[idx + self.w]        # scalar tensor, shape ()
        return x, y


# --- checks ---
ds = WindowDataset(torch.arange(10.0), w=3)
assert len(ds) == 7
x, y = ds[0]
assert torch.equal(x, torch.tensor([0., 1., 2.])) and y.item() == 3.0
x, y = ds[6]
assert torch.equal(x, torch.tensor([6., 7., 8.])) and y.item() == 9.0
try:
    ds[7]
    raise AssertionError("ds[7] should have raised IndexError")
except IndexError:
    pass

loader = DataLoader(ds, batch_size=4, shuffle=True)
xb, yb = next(iter(loader))
print(xb.shape, yb.shape)  # torch.Size([4, 3]) torch.Size([4])
assert xb.shape == (4, 3) and yb.shape == (4,)
print("all checks passed")
```

Note that the default collate turned 4 scalar targets of shape () into a batch of shape (4,). For training you’d unsqueeze to (4, 1) to match a model’s (batch, 1) output, exactly as we did with price above. Also note what we didn’t do: no copies. Each __getitem__ returns a view into the original series, so 100,000 overlapping windows cost no more memory than the series itself.
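The "no copies" claim follows from how basic slicing works on tensors, and it can be checked with data_ptr(), which returns the memory address of a tensor's first element. A minimal sketch on a toy series:

```python
import torch

series = torch.arange(10.0)       # float32, 4 bytes per element
window = series[2:5]              # basic slicing returns a view, not a copy

# The window starts exactly two elements into the series' own storage.
expected_ptr = series.data_ptr() + 2 * series.element_size()
print(window.data_ptr() == expected_ptr)   # True

# Because storage is shared, writing to the series shows up in the window.
series[3] = -1.0
print(window)                     # tensor([ 2., -1.,  4.])
```

The flip side of sharing storage is that in-place edits leak in both directions. If a sample must be modified in place, call .clone() on it first; the batches produced by the DataLoader are already fresh copies, since collation stacks samples into new tensors.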
Key takeaways
- A Dataset is just __len__ + __getitem__ returning one sample; the DataLoader owns batching, shuffling, parallel fetching, and memory pinning.
- random_split gives cheap index-based Subset views; seed the generator for a reproducible split.
- Fit normalization statistics on the train split only. Computing \(\mu, \sigma\) over all data leaks validation information. Also clamp \(\sigma\) to dodge division-by-zero NaNs.
- Shapes are the contract: samples (8,) collate to batches (64, 8); a target of (batch,) vs (batch, 1) silently broadcasts MSE into nonsense, so unsqueeze(1) and heed the warning.
- Train loader: shuffle=True, drop_last=True; val loader: shuffle=False, bigger batches.
- num_workers overlaps data prep with compute; pin_memory=True + non_blocking=True overlaps the host→GPU copy; and num_workers > 0 demands an if __name__ == "__main__" guard on macOS/Windows.
- Random augmentation belongs in a wrapper around the train subset only, never the validation path.
In the next lesson: the training loop, properly — train/eval modes, gradient accumulation, metric tracking, early stopping, and turning this lesson’s three-line loop into machinery you can trust.