Chapter 06 — 🧹 Data Preprocessing

Raw data is almost never ready for a model. It arrives with holes, wildly different scales, text where numbers belong, typos masquerading as outliers, and classes that show up a thousand times more often than others. Data preprocessing is the craft of turning that mess into clean numeric tensors a learning algorithm can actually use — and it routinely decides more of a project’s success than the choice of model. This chapter sits right after you understand the learning process (Chapter 05) and right before you start fitting real models (Regression onward).

🧭 In context: The ML Workflow · turning raw, dirty data into clean numeric features a model can learn from · garbage in, garbage out — the model is only as good as what you feed it.

💡 Remember this: Fit every transformation on the training set only, then apply it to validation and test. That single discipline, enforced by a pipeline, prevents the leakage that quietly inflates evaluation scores.

Think of preprocessing like prepping ingredients before cooking. You wash, peel, chop, and measure before anything hits the pan. A great chef with dirty, unmeasured ingredients still cooks a bad meal; a modest cook with everything prepped properly turns out something solid. The model is the pan; this chapter is the prep.

The whole chapter follows one pipeline. Keep this map in mind; every section is a box in it.

flowchart LR
  A[Raw data] --> B[Clean & impute<br/>missing values]
  B --> C[Handle outliers]
  C --> D[Transform / discretize]
  D --> E[Encode categoricals]
  E --> F[Scale / normalize]
  F --> G[Fix class imbalance]
  G --> H[Extract & select<br/>features]
  H --> I[Model-ready matrix]

One rule governs the whole thing, so state it up front: fit every transformation on the training set only, then apply it to validation and test. Computing a mean, a scaler, or a SMOTE resampling using rows you will later evaluate on is data leakage — it leaks test information into training and inflates your score. We will repeat this warning where it bites hardest.

6.1 — Data cleaning & missing values

Real datasets have gaps: a sensor dropped out, a survey question was skipped, a join didn’t match. A missing value is an empty cell, and almost no model accepts one — you must either remove it or fill it. The right choice depends on why it is missing, so before touching a single cell, ask what produced the gap.

Statisticians name three mechanisms, and the names matter because each demands a different response. MCAR (Missing Completely At Random) means the gap is unrelated to anything — a random glitch, a dropped packet. MAR (Missing At Random) means missingness depends on other observed columns — for example, older users skip the income field, but since you can see their age, the pattern is explainable from what you have. MNAR (Missing Not At Random) means missingness depends on the missing value itself — high earners hide income because it is high. MCAR is safe to impute with a simple statistic. MNAR is dangerous, because the very fact that a value is missing carries information you should preserve rather than paper over; the standard defence is to add a “was-missing” indicator column so the model can learn from the absence itself.

A quick way to keep the three straight: ask “does the chance of the cell being empty depend on nothing, on something I can see, or on the hidden value itself?”

flowchart TD
  Q{Does missingness depend on...} --> A[nothing at all<br/>→ MCAR]
  Q --> B[other observed columns<br/>→ MAR]
  Q --> C[the missing value itself<br/>→ MNAR]
  A --> A2[simple impute is safe]
  B --> B2[impute using the related columns]
  C --> C2[keep a 'was-missing' flag;<br/>imputing alone hides signal]
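The three mechanisms can be made tangible with a small simulation. The sketch below is illustrative only: the age-to-income rule, the missingness probabilities, and all variable names are assumptions invented for the demonstration, not taken from any real dataset. It deletes income values under each mechanism and compares the mean of what remains with the true mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.uniform(20, 70, n)
income = 1_000 * age + rng.normal(0, 5_000, n)   # toy rule: income rises with age

# Probability that the income cell is empty under each mechanism (assumed values)
p_mcar = np.full(n, 0.3)                                  # depends on nothing
p_mar = np.where(age > 50, 0.6, 0.1)                      # depends on observed age
p_mnar = np.where(income > np.median(income), 0.6, 0.1)   # depends on income itself

true_mean = income.mean()
observed_mean = {}
for name, p in [("MCAR", p_mcar), ("MAR", p_mar), ("MNAR", p_mnar)]:
    missing = rng.random(n) < p
    observed_mean[name] = income[~missing].mean()
    print(f"{name}: observed mean {observed_mean[name]:,.0f} vs true {true_mean:,.0f}")
```

Under MCAR the surviving rows are a fair sample, so their mean stays close to the truth. Under MAR and MNAR the surviving rows are skewed toward lower incomes. The difference is that the MAR bias can be explained from the observed age column, while under MNAR nothing you can see accounts for it.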

The first fork in the road is deletion versus imputation. Deletion drops rows that have gaps (listwise deletion) or drops whole columns that are mostly empty. It is honest — you never invent data — but wasteful: drop a column that is only 5% missing and you throw away 95% of perfectly good values. Imputation fills each gap with a principled guess. The common guesses, and when each fits, are laid out below.

Method	Fills with	Good for	Pitfall
Mean	column average	numeric, symmetric	distorted by outliers; shrinks variance
Median	column middle	numeric, skewed	ignores relationships
Mode	most frequent	categorical	over-represents majority
KNN	average of k nearest rows	correlated features	slow; needs scaling first
Model-based	a model’s prediction	rich structure	can leak; complex

To see why the choice matters, work a tiny example. Suppose a column of ages reads [25, 30, NaN, 45, 1000]. The non-missing values are [25, 30, 45, 1000]. Their mean is \((25+30+45+1000)/4 = 275\) — an absurd “average age” dragged sky-high by the single 1000 typo. The median, by contrast, is the middle of the sorted list [25, 30, 45, 1000], namely \((30+45)/2 = 37.5\) — a sensible fill. This single comparison is why median is the default for any numeric column you suspect is messy.

The lone outlier yanks the mean far to the right of the bulk of the data, while the median stays planted among the typical values. In NumPy the median fill, plus a was-missing flag, looks like this:

import numpy as np
x = np.array([25, 30, np.nan, 45, 1000.])
fill = np.nanmedian(x)             # 37.5, ignores the NaN
x_imp = np.where(np.isnan(x), fill, x)
miss_flag = np.isnan(x).astype(int) # keep "was missing" as a feature

In real pipelines you almost never roll imputation by hand — scikit-learn’s SimpleImputer (and KNNImputer / IterativeImputer) learn the fill value on training data and reapply it on test, which is exactly the leakage-safe behaviour you want:

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# the imputer learns the median from X_train only when fit inside a Pipeline
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median", add_indicator=True)),  # adds was-missing flags
    ("clf", LogisticRegression()),
])
pipe.fit(X_train, y_train)   # median computed on train, reused on test automatically

Warning

Compute the imputation value (mean/median/mode) on the training set, then reuse that exact number on test data. Imputing with a statistic computed over the full dataset leaks information from test rows into training.

6.2 — Scaling, normalization & standardization

Features routinely arrive on incompatible scales: age lives in \([0,100]\) while salary lives in \([0,200000]\). Any algorithm that measures distance between points or sums weighted inputs will let salary dominate the calculation purely because its raw numbers are larger — not because it matters more. Scaling rewrites each feature onto a comparable range so that no feature wins by its unit of measurement alone.

The intuition: imagine judging which of two runners is “more unusual,” one timed in seconds and one in milliseconds. Unless you put both on the same footing, the millisecond runner’s numbers look enormous and drown out the other. Scaling is putting every feature on the same footing before they get compared.

There are three workhorses, and the intuition for each is worth holding separately. Min-max normalization squeezes a feature into the interval \([0,1]\): \[x' = \frac{x - \min}{\max - \min}\]

In words: subtract the smallest value, then divide by the full range, so the minimum maps to 0 and the maximum maps to 1. Also written: \(x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}\).

It is bounded and intuitive, but fragile: a single huge outlier stretches the max, which crushes every other value toward 0.

Z-score standardization instead recenters the feature to mean 0 and rescales it to standard deviation 1: \[z = \frac{x - \mu}{\sigma}\]

In words: measure how many standard deviations each value sits above or below the mean. Also written: \(z = (x - \bar{x})\,/\,s\), where \(\bar{x}\) is the sample mean and \(s\) the sample standard deviation.

The result is unbounded, but every feature ends up with the same mean and spread regardless of its original units, which makes it the sensible default for most models.

Robust scaling swaps in the median and the IQR (interquartile range — the spread of the middle 50% of the data, \(Q_3 - Q_1\)) in place of mean and standard deviation, so a few extreme points barely move it: \[x' = \frac{x - \text{median}}{\text{IQR}}\]

In words: center on the median and divide by the middle-50% spread, so extreme tails barely shift the result. Also written: \(x' = \dfrac{x - Q_2}{Q_3 - Q_1}\), where \(Q_2\) is the median.

Run the same salaries [40, 50, 60, 200] (in thousands) through all three to feel the difference. Min-max gives \((40-40)/(200-40)=0\), then \(0.0625\), \(0.125\), and \(1.0\): the lone 200 pins the three real values into a tiny corner near zero. Z-score uses \(\mu = 87.5\) and the population standard deviation \(\sigma \approx 65.3\), producing \(-0.73, -0.57, -0.42, 1.72\), a much healthier spread. Robust scaling uses the median \(55\) and, with quartiles computed by linear interpolation (the NumPy and scikit-learn default), \(\text{IQR} = Q_3 - Q_1 = 95 - 47.5 = 47.5\), producing \(-0.32, -0.11, 0.11, 3.05\). Notice that the three typical salaries stay tightly clustered around zero while the outlier sits visibly far out at \(3.05\), which is exactly what you want when an outlier is genuinely anomalous.
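The salary example can be checked in a few lines of NumPy. This is a minimal sketch, assuming the population standard deviation (ddof=0) for the z-score and NumPy's default linearly interpolated quartiles for the IQR; other quartile conventions give slightly different robust-scaled values.

```python
import numpy as np

salary = np.array([40.0, 50.0, 60.0, 200.0])   # thousands

min_max = (salary - salary.min()) / (salary.max() - salary.min())
z_score = (salary - salary.mean()) / salary.std()        # population std (ddof=0)
q1, median, q3 = np.percentile(salary, [25, 50, 75])     # linear interpolation
robust = (salary - median) / (q3 - q1)

for name, values in [("min-max", min_max), ("z-score", z_score), ("robust", robust)]:
    print(f"{name:8s}", np.round(values, 3))
```

The min-max row shows the squeeze toward zero, while the robust row keeps the three ordinary salaries near zero and leaves the 200 clearly separated.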

Which models actually need this? Distance-based and gradient-based models do, because they combine feature magnitudes directly. Tree-based models do not, because a tree only ever asks “is \(x < t\)?”, and the answer to that question is unchanged by any monotonic rescaling — the sort order of the values stays the same.

Needs scaling	Indifferent to scaling
KNN, K-means (distances)	Decision trees
SVM, logistic/linear regression	Random forests
Neural nets, PCA, gradient descent	Gradient-boosted trees
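To see why distance-based models sit in the left column, here is a small sketch with four hypothetical people described by age and salary. The numbers are invented for illustration. It finds the nearest neighbour of the first person before and after z-scoring the columns.

```python
import numpy as np

# columns: age (years), salary (currency units); all values are hypothetical
people = np.array([
    [25.0, 50_000.0],   # query person
    [60.0, 50_500.0],   # very different age, almost the same salary
    [26.0, 55_000.0],   # almost the same age, somewhat higher salary
    [45.0, 80_000.0],
])

def nearest_to_first(X):
    """Index of the row closest to row 0 in Euclidean distance."""
    d = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(d)) + 1

raw_neighbour = nearest_to_first(people)

# z-score each column; on real data these statistics come from training rows only
scaled = (people - people.mean(axis=0)) / people.std(axis=0)
scaled_neighbour = nearest_to_first(scaled)

print(raw_neighbour, scaled_neighbour)
```

On raw values the 60-year-old wins because a salary gap of 500 looks tiny next to a gap of 5,000, and the 35-year age gap barely registers. After scaling, the 26-year-old becomes the nearest neighbour, which matches intuition.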

In practice scikit-learn gives you each scaler as a one-liner, and the pipeline guarantees the statistics are learned from training data only:

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

scaler = StandardScaler()            # or MinMaxScaler(), RobustScaler()
X_train_s = scaler.fit_transform(X_train)  # learns mu, sigma from TRAIN
X_test_s  = scaler.transform(X_test)       # reuses the SAME mu, sigma

Tip

Default to z-score. Switch to robust scaling when the column has heavy outliers, and min-max when you specifically need a bounded \([0,1]\) range (e.g., image pixels, or a sigmoid output target).

6.3 — Categorical encoding

Models consume numbers, but a column like city = {Riyadh, Jeddah, Mecca} holds text. Encoding maps categories to numbers, and the whole art is doing so without inventing a fake order that the model will mistake for real structure.

Ordinal encoding simply assigns integers — Riyadh→0, Jeddah→1, Mecca→2. This is correct only when the categories genuinely rank, as with low < medium < high. On unordered cities it lies: it quietly tells the model that Mecca (2) is “twice” Jeddah (1) and that Jeddah sits “between” Riyadh and Mecca, relationships that do not exist.

One-hot encoding sidesteps the false order by creating one 0/1 column per category, so Jeddah → [0,1,0]. No category is numerically larger than another, which makes it the standard choice for low-cardinality nominal features (few distinct values). Its cost is width: a column of 10,000 distinct zip codes becomes 10,000 columns, which is wasteful and slow.

The contrast in one line: ordinal encoding smears the three cities along one misleading axis (so 2 looks “bigger” than 0), while one-hot gives each its own clean switch with no false ranking.

Target (mean) encoding stays compact even at high cardinality by replacing each category with the mean of the target for that category — city = Riyadh becomes the average churn rate among Riyadh customers. It is powerful but exposes the label directly, so it overfits badly unless you compute the means out-of-fold (using cross-validation folds) and smooth small categories toward the global mean.

Two more options handle very high cardinality. Hashing runs each category string through a hash function into a fixed number of buckets — constant width, no dictionary to store, and unseen categories are handled automatically — at the price of occasional collisions that merge unrelated categories. Embeddings learn a short dense vector per category inside a neural net (Chapter 14), capturing genuine similarity between categories, and are the go-to when cardinality is very high and a deep model is already in play.
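The hashing idea fits in a few lines. The sketch below is an illustration rather than a library implementation: the helper names, the choice of MD5, and the bucket count of 8 are all assumptions made for the example. It uses hashlib instead of Python's built-in hash() because the built-in is salted per process for strings and would give different buckets on every run.

```python
import hashlib

def hash_bucket(category: str, n_buckets: int = 8) -> int:
    """Map a category string to a stable bucket index in [0, n_buckets)."""
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hash_encode(category: str, n_buckets: int = 8) -> list[int]:
    """One-hot vector over buckets instead of over the raw vocabulary."""
    vec = [0] * n_buckets
    vec[hash_bucket(category, n_buckets)] = 1
    return vec

# "Medina" stands in for a city that never appeared during training
cities = ["Riyadh", "Jeddah", "Mecca", "Medina"]
encoded = {c: hash_encode(c) for c in cities}
for city, vec in encoded.items():
    print(city, vec)
```

No vocabulary is stored anywhere, so an unseen city is encoded the same way as a known one. The trade-off is that two different cities can land in the same bucket, and a larger bucket count makes that rarer.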

# one-hot from scratch
cats = ['Riyadh','Jeddah','Mecca','Riyadh']
vocab = sorted(set(cats))                       # ['Jeddah','Mecca','Riyadh']
onehot = [[int(c==v) for v in vocab] for c in cats]
# Riyadh -> [0,0,1]

# target encoding (smoothed), y = churn
import numpy as np
y = np.array([1,0,1,0]); m = 1.0               # smoothing strength
glob = y.mean()
def tgt(cat):
    mask = np.array([c==cat for c in cats]); n = mask.sum()
    return (y[mask].sum() + m*glob) / (n + m)   # shrink small groups to global

The smoothing term deserves a second look. Mecca appears in a single row of the snippet above, and that row's churn is 1, so raw mean encoding would assign Mecca a confident 1.0. The smoothed formula gives \((1 + 1.0 \cdot 0.5)/(1 + 1) = 0.75\) instead, pulling that thinly supported estimate back toward the global rate of 0.5. The fewer examples a category has, the harder it is shrunk, which is exactly the right behaviour.

That smoothed formula is worth stating in general: \[\hat{x}_{\text{cat}} = \frac{n_{\text{cat}}\,\bar{y}_{\text{cat}} + m\,\bar{y}}{n_{\text{cat}} + m}\]

In words: blend the category’s own target mean with the global mean, weighting the global mean by a smoothing strength \(m\) so tiny categories lean on the global rate. Also written: \(\hat{x}_{\text{cat}} = w\,\bar{y}_{\text{cat}} + (1-w)\,\bar{y}\) with the blend weight \(w = \dfrac{n_{\text{cat}}}{n_{\text{cat}} + m}\).

Encoding	Width	Cardinality fit	Risk
Ordinal	1	any	fake order
One-hot	C	low	column explosion
Target/mean	1	high	label leakage
Hashing	fixed	very high	collisions
Embedding	d (small)	very high	needs a net + data

The everyday tool is scikit-learn’s OneHotEncoder, with handle_unknown="ignore" so a category unseen at training time does not crash inference:

from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
enc.fit(X_train[["city"]])          # learns the vocabulary from TRAIN
X_test_oh = enc.transform(X_test[["city"]])  # unseen cities -> all zeros

Warning

Target encoding is the classic leakage trap. Computing category means over the whole dataset (including the rows you’ll predict) hands the model the answer. Always fit it out-of-fold.
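One way to fit target encoding out-of-fold is sketched below. It is a minimal illustration, not a library API: the function name, the fold assignment, and the toy data are assumptions. Each row is encoded using only the labels of rows in the other folds, so its own label never reaches its own feature.

```python
import numpy as np

def oof_target_encode(cats, y, n_folds=4, m=1.0, seed=0):
    """Smoothed target encoding where each row is encoded without its own label."""
    cats = np.asarray(cats)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    fold_of_row = rng.permutation(len(y)) % n_folds    # random, near-equal folds
    encoded = np.empty(len(y))
    for k in range(n_folds):
        held_out = fold_of_row == k
        fit_cats, fit_y = cats[~held_out], y[~held_out]
        global_mean = fit_y.mean()
        for i in np.flatnonzero(held_out):
            same = fit_cats == cats[i]
            # a category unseen in the fitting folds falls back to the global mean
            encoded[i] = (fit_y[same].sum() + m * global_mean) / (same.sum() + m)
    return encoded

cats = ["Riyadh", "Jeddah", "Mecca", "Riyadh", "Jeddah", "Riyadh", "Mecca", "Jeddah"]
y = [1, 0, 1, 0, 0, 1, 1, 0]
enc = oof_target_encode(cats, y)
print(np.round(enc, 3))
```

A quick sanity check on the design: flipping the label of any single row leaves that row's encoded value unchanged, because the row is always held out when its own encoding is computed.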

6.4 — Outlier detection & handling

An outlier is a point that sits far from the rest of the data — sometimes a data-entry error like age = 1000, sometimes a genuine rare event like an actual fraudulent transaction. Because those two cases call for opposite responses, the discipline is always to detect first, then decide — never blindly delete, since the rare real points are frequently the very thing you are trying to predict.

Three detectors cover most needs. The z-score method flags any point more than \(k\) (usually 3) standard deviations from the mean, i.e. \(|z| > 3\). It is simple but circular: the mean and standard deviation it relies on are themselves distorted by the very outliers you are hunting, so a wild point can inflate \(\sigma\) enough to hide itself.
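The self-hiding effect is easy to reproduce. The sketch below applies the \(|z| > 3\) rule to a six-value sample with one wild point, assuming the population standard deviation.

```python
import numpy as np

x = np.array([10, 12, 11, 13, 9, 100.0])
z = (x - x.mean()) / x.std()     # population standard deviation
flagged = np.abs(z) > 3

print(np.round(z, 2))            # the value 100 scores about 2.23
print(flagged.any())             # False: nothing crosses the threshold
```

The value 100 inflates \(\sigma\) so much that its own z-score stays below 3. With a sample this small the rule cannot fire at all: among \(n\) points the largest possible population z-score is \(\sqrt{n-1}\), which is about 2.24 for \(n = 6\).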

The IQR rule — the logic behind a boxplot — avoids that trap by using quartiles, which ignore the tails. You compute \(Q_1\), \(Q_3\), and \(\text{IQR} = Q_3 - Q_1\), then flag anything below \(Q_1 - 1.5\,\text{IQR}\) or above \(Q_3 + 1.5\,\text{IQR}\).

The fences themselves are worth writing once: \[\text{lower} = Q_1 - 1.5\,\text{IQR}, \qquad \text{upper} = Q_3 + 1.5\,\text{IQR}\]

In words: mark a point as an outlier if it falls more than one-and-a-half box-widths below the bottom of the box or above the top. Also written: flag \(x\) when \(x < Q_1 - 1.5(Q_3 - Q_1)\) or \(x > Q_3 + 1.5(Q_3 - Q_1)\).

The Isolation Forest is model-based and shines in many dimensions where the other two (which look one column at a time) fail. The plain-language idea: a normal point is hidden deep inside a dense crowd, so it takes many random cuts to fence it off on its own; an outlier sits alone out in the open, so just one or two random cuts already isolate it. Fewer cuts to isolate ⇒ more likely an outlier.

In a touch more detail: the algorithm builds random trees, each splitting on a random feature at a random threshold. It records how many splits — the path length from the root — it takes to separate each point. A short average path length across the trees is the anomaly signal. (No mean or standard deviation anywhere, which is why it copes with many dimensions at once.)
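The "fewer cuts to isolate" idea can be sketched in one dimension with the standard library alone. This is a deliberately simplified illustration, not the scikit-learn algorithm: it skips the random feature choice and the subsampling, and the function names are assumptions for the example.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Number of random cuts needed before `point` is alone in its partition."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    cut = rng.uniform(lo, hi)
    # keep only the values that fall on the same side of the cut as the point
    same_side = [v for v in data if (v < cut) == (point < cut)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def mean_depth(point, data, n_trees=200, seed=0):
    """Average isolation depth over many independent random cut sequences."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees

values = [9, 10, 11, 12, 13, 100]
depth_inlier = mean_depth(11, values)
depth_outlier = mean_depth(100, values)
print(depth_inlier, depth_outlier)
```

Almost any first cut lands in the wide gap between 13 and 100, so the value 100 is usually alone after a single split, while 11 needs several more cuts to be separated from its close neighbours.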

Work the IQR rule on [10, 12, 11, 13, 9, 100]. Sorted, that is [9, 10, 11, 12, 13, 100]; with linearly interpolated quartiles (NumPy's default) this gives \(Q_1 = 10.25\), \(Q_3 = 12.75\), and \(\text{IQR} = 2.5\). The fences are \(10.25 - 1.5 \times 2.5 = 6.5\) and \(12.75 + 1.5 \times 2.5 = 16.5\). The value 100 sails far past the upper fence, so it is flagged, while the healthy values 9 through 13 all sit comfortably inside the fences.

Once a point is flagged, you must handle it deliberately: drop it if it is clearly an error, cap or winsorize it by clipping it to the fence value so its influence is bounded without losing the row, or transform the whole column (the next section) so the long tail compresses naturally.

import numpy as np
x = np.array([10,12,11,13,9,100.])
q1,q3 = np.percentile(x,[25,75]); iqr = q3-q1
lo,hi = q1-1.5*iqr, q3+1.5*iqr
x_capped = np.clip(x, lo, hi)      # winsorize: 100 -> upper fence

For the multivariate case, the Isolation Forest is a one-liner in scikit-learn — -1 marks the points it judges anomalous:

from sklearn.ensemble import IsolationForest
iso = IsolationForest(contamination=0.05, random_state=0)
flags = iso.fit_predict(X_train)   # -1 = outlier, +1 = inlier
X_clean = X_train[flags == 1]

Warning

Don’t reflexively delete outliers. In fraud, fault, and disease detection the outliers are the signal — deleting them throws away the only examples of the thing you care about. Detect, understand the cause, then decide.

6.5 — Data transformation & discretization

Sometimes the problem is not a feature’s scale but its shape. Income, city population, and word counts are typically right-skewed: most values are modest, but a long tail of enormous values stretches off to the right. Many models behave better on roughly symmetric, bell-shaped inputs, and the way to get there is a nonlinear transform that squashes that tail.

The log transform \(x' = \log(x)\) — or \(\log(1+x)\) when the data contains zeros — compresses large values aggressively while leaving small ones nearly alone, which pulls the tail inward. Incomes [1k, 10k, 100k, 1M] become [3, 4, 5, 6] in \(\log_{10}\): evenly spaced and tame, with the millionaire no longer a thousand times louder than everyone else.

In words (the log transform): replace each value by its logarithm, so a value ten times larger only moves one unit further along. Also written: \(x' = \log_{10}(x)\), or \(x' = \ln(1+x)\) (NumPy’s log1p) when zeros are present.

Box-Cox generalizes the log into a family indexed by a tunable power \(\lambda\), and picks the exponent that makes the data most normal: \[x^{(\lambda)} = \begin{cases} \dfrac{x^\lambda - 1}{\lambda} & \lambda \neq 0 \\[4pt] \ln x & \lambda = 0 \end{cases}\]

In words: raise the value to a tunable power, shift and rescale it, and choose the power that makes the column look most bell-shaped; the power 0 falls back to the plain log. Also written: fit \(\lambda\) by maximizing the normality (log-likelihood) of the transformed column; \(\lambda=1\) is “no change,” \(\lambda=0.5\) is a square root, \(\lambda=0\) is the log.

Box-Cox requires strictly positive data; its cousin Yeo-Johnson extends the same idea to zeros and negatives.
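The Box-Cox family can be written directly from its definition. The sketch below applies the transform for a fixed, hand-picked \(\lambda\); it does not fit \(\lambda\), which is the part a library routine handles. The function name is an assumption for the example.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform for strictly positive x and a given, fixed lambda."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox needs strictly positive data")
    if lam == 0:
        return np.log(x)
    return (x**lam - 1.0) / lam

x = np.array([1.0, 10.0, 100.0, 1000.0])
identity_like = box_cox(x, 1.0)    # equals x - 1: the shape is unchanged
sqrt_like = box_cox(x, 0.5)        # equals 2 * (sqrt(x) - 1)
log_limit = box_cox(x, 1e-6)       # approaches ln(x) as lambda goes to 0
print(np.round(log_limit, 3), np.round(np.log(x), 3))
```

The last line shows why the \(\lambda = 0\) case is defined as the logarithm: it is the limit of the power formula, so the family is continuous in \(\lambda\).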

Discretization (binning) runs in the opposite direction — it turns a continuous feature into a handful of categories. Equal-width binning cuts the value range into slices of equal size; equal-frequency (quantile) binning instead places the same number of points in each bin. Binning age into child / adult / senior can let a simple linear model capture a non-linear effect it otherwise could not, at the cost of discarding the fine detail within each bin.
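The two binning strategies behave quite differently on uneven data. The sketch below bins twelve hypothetical ages into three bins both ways; the ages and the bin count are assumptions chosen for illustration.

```python
import numpy as np

ages = np.array([3, 8, 15, 22, 25, 31, 38, 44, 52, 67, 80, 90])
n_bins = 3

# Equal-width: slice the range [min, max] into equally wide intervals
width_edges = np.linspace(ages.min(), ages.max(), n_bins + 1)
width_bins = np.digitize(ages, width_edges[1:-1])

# Equal-frequency: put the cut points at quantiles so bins hold similar counts
freq_edges = np.quantile(ages, np.linspace(0, 1, n_bins + 1))
freq_bins = np.digitize(ages, freq_edges[1:-1])

width_counts = np.bincount(width_bins, minlength=n_bins)
freq_counts = np.bincount(freq_bins, minlength=n_bins)
print(width_counts, freq_counts)
```

Equal-width bins keep the intervals easy to read but can leave some bins crowded and others thin. Equal-frequency bins balance the counts at the price of intervals with uneven widths.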

The log transform eases a lopsided, long-tailed distribution toward a tidier, more symmetric shape. The snippet below applies it to the incomes from earlier and then bins the same values by quantile:

import numpy as np
inc = np.array([1e3,1e4,1e5,1e6])
log_inc = np.log10(inc)            # [3,4,5,6] — tame, evenly spaced
# equal-frequency binning into 2 bins
edges = np.quantile(inc,[0,.5,1.]); bins = np.digitize(inc, edges[1:-1])

For Box-Cox / Yeo-Johnson, scikit-learn’s PowerTransformer fits \(\lambda\) on the training data automatically:

from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method="yeo-johnson")   # handles zeros and negatives
X_train_t = pt.fit_transform(X_train)         # learns lambda per column on TRAIN
X_test_t  = pt.transform(X_test)

Tip

Reach for a log transform whenever a feature spans several orders of magnitude or its histogram has a long right tail. It is the single highest-value transform for messy real-world numeric data.

6.6 — Handling imbalanced data

In fraud, disease, and defect detection the interesting class is rare — perhaps 99% legitimate and 1% fraud. A model can then score 99% accuracy by predicting “legitimate” every single time while catching exactly zero fraud, which is useless. Imbalanced data breaks naive training because the loss is dominated by the abundant majority class, so the model is rewarded for ignoring the minority. Three families of fixes address this.
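The accuracy trap takes only a few lines to reproduce. The sketch below scores a hypothetical "model" that always predicts the majority class on 99 legitimate cases and 1 fraud case.

```python
y_true = [0] * 99 + [1]        # 99 legitimate, 1 fraud
y_pred = [0] * 100             # a "model" that always says legitimate

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = tp / (tp + fn)        # share of real fraud that was caught

print(accuracy, recall)        # 0.99 0.0
```

Accuracy looks excellent while recall on the class that matters is zero, which is why accuracy alone is a misleading yardstick on imbalanced data.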

Resampling rebalances the training set directly. Oversampling duplicates minority rows until the classes are even; undersampling discards majority rows instead. Duplication risks overfitting the handful of minority points (the model memorizes the same rows seen many times), while undersampling throws away real, informative majority data.

SMOTE (Synthetic Minority Over-sampling Technique) is a smarter form of oversampling. Rather than copying a minority point, it interpolates: it picks a minority point, picks one of its minority nearest neighbors, and creates a brand-new synthetic point somewhere on the straight line between them. This populates the minority region with plausible new examples instead of stacking exact duplicates. \[x_{\text{new}} = x_i + \lambda\,(x_{nn} - x_i), \qquad \lambda \sim U(0,1)\]

In words: stand at a minority point, walk a random fraction of the way toward one of its minority neighbours, and drop a new synthetic example there. Also written: \(x_{\text{new}} = (1-\lambda)\,x_i + \lambda\,x_{nn}\) — a random convex combination (weighted average) of the point and its neighbour.

Concretely, if a minority point sits at \(x_i = (2, 4)\) and its chosen neighbor at \(x_{nn} = (4, 8)\), then with a random \(\lambda = 0.5\) the synthetic point lands at \((2,4) + 0.5\cdot(2,4) = (3, 6)\) — squarely between the two, a new minority example the model has never literally seen.
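The interpolation step is short enough to write out. The sketch below is a simplified illustration of the SMOTE idea, not the imbalanced-learn implementation: the function names, the brute-force neighbour search, and the four toy minority points are assumptions for the example.

```python
import numpy as np

def smote_point(x_i, x_nn, lam):
    """Synthetic point a fraction `lam` of the way from x_i to x_nn."""
    x_i, x_nn = np.asarray(x_i, float), np.asarray(x_nn, float)
    return x_i + lam * (x_nn - x_i)

def smote_sample(minority, n_new, k=2, seed=0):
    """Draw n_new synthetic points from a small minority set."""
    minority = np.asarray(minority, float)
    rng = np.random.default_rng(seed)
    dists = np.linalg.norm(minority[:, None] - minority[None, :], axis=2)
    np.fill_diagonal(dists, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]   # k nearest minority rows
    out = []
    for _ in range(n_new):
        i = rng.integers(len(minority))             # pick a minority point
        j = rng.choice(neighbours[i])               # pick one of its neighbours
        out.append(smote_point(minority[i], minority[j], rng.random()))
    return np.array(out)

worked = smote_point((2, 4), (4, 8), 0.5)           # the worked example: (3, 6)
minority = [(2, 4), (4, 8), (3, 5), (5, 9)]
synthetic = smote_sample(minority, n_new=5)
print(worked, synthetic.shape)
```

Because every synthetic point is a convex combination of two real minority points, it always lies on the segment between them and never outside the region the minority class already occupies.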

As \(\lambda\) sweeps from 0 to 1, the synthetic point slides along the segment joining the minority point to its neighbour, and every position on that segment is a valid new training example.

Class weights leave the data untouched and instead tell the loss function to penalize minority mistakes more heavily — weighting each class by roughly \(\tfrac{1}{\text{frequency}}\), so that one misclassified fraud costs about as much as the ~99 misclassified legitimate cases it is outnumbered by. This is often the cleanest fix, since it invents no synthetic data and is usually a single class_weight='balanced' flag in a library.

The balanced weight has a tidy closed form: \[w_c = \frac{N}{K \cdot n_c}\]

In words: each class’s weight is the total number of samples divided by (the number of classes times how many samples that class has), so rarer classes get proportionally heavier weights. Also written: \(w_c \propto \dfrac{1}{n_c}\), normalized so the average weight is 1 (\(N\) = total samples, \(K\) = number of classes, \(n_c\) = count of class \(c\)).

To make those weights concrete: with \(N = 100\) samples, \(K = 2\) classes, 99 majority and 1 minority, the majority weight is \(100/(2 \cdot 99) \approx 0.505\) and the minority weight is \(100/(2 \cdot 1) = 50\). One misclassified fraud now costs about 99× what one misclassified legitimate case does — which is exactly the imbalance ratio, rebalanced.

# class weights, inversely proportional to frequency
import numpy as np
y = np.array([0]*99 + [1])          # 99 majority, 1 minority
counts = np.bincount(y)
w = len(y) / (len(counts) * counts) # [~0.505, ~50.0]
# pass w[label] as each sample's loss weight

In practice, the two clean library paths are scikit-learn’s built-in class weighting and imbalanced-learn’s SMOTE — and SMOTE belongs inside an imblearn Pipeline so it only ever sees the training fold:

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(class_weight="balanced")   # weights handled for you

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
pipe = Pipeline([("smote", SMOTE(random_state=0)),  # resamples train fold ONLY
                 ("clf", LogisticRegression())])
pipe.fit(X_train, y_train)

Warning

Resample and SMOTE only on the training fold, after splitting — never before. Synthesizing or duplicating points and then splitting leaks near-copies into the test set and gives a gloriously fake score. And stop trusting plain accuracy here; use precision/recall/F1 or AUC (Chapter 12).

6.7 — Feature extraction

Feature extraction derives new, more informative variables from raw inputs that are not usable as they stand — free text, pixels, timestamps. It is worth contrasting with selection (the next section) right away: extraction creates new features, whereas selection merely keeps a subset of features that already exist.

The richest everyday source is datetime. A raw timestamp like 2026-06-25 14:30 means little to most models, but its parts are gold: hour, day-of-week, month, is_weekend, is_holiday. The subtlety is that cyclical features such as hour and month wrap around — hour 23 and hour 0 are adjacent in time, yet the integers 23 and 0 look maximally far apart. The fix is to place the value on a circle with sine and cosine so that midnight meets midnight: \[\text{hour}_{\sin}=\sin\!\Big(\tfrac{2\pi\,h}{24}\Big),\quad \text{hour}_{\cos}=\cos\!\Big(\tfrac{2\pi\,h}{24}\Big)\]

In words: map the hour onto a clock face — two coordinates (sine and cosine) — so that 23:00 and 00:00 sit right next to each other instead of at opposite ends of a number line. Also written: with the angle \(\theta = 2\pi h / 24\), the pair \((\cos\theta, \sin\theta)\) are the point’s coordinates on the unit circle.

Picture a clock hand sweeping the hours around a circle: 23:00 and 00:00 land as neighbours at the top rather than at opposite ends of a line.
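A short check confirms that the sine and cosine pair repairs the wrap-around. The sketch below measures straight-line distance between hours after mapping them onto the unit circle; the helper names are assumptions for the example.

```python
import math

def hour_to_circle(h):
    """(sin, cos) coordinates of hour h on a 24-hour clock face."""
    angle = 2 * math.pi * h / 24
    return math.sin(angle), math.cos(angle)

def circle_distance(h1, h2):
    """Euclidean distance between two hours in (sin, cos) space."""
    return math.dist(hour_to_circle(h1), hour_to_circle(h2))

d_23_0 = circle_distance(23, 0)     # one hour apart across midnight
d_0_1 = circle_distance(0, 1)       # one hour apart, no wrap
d_0_12 = circle_distance(0, 12)     # opposite sides of the clock

print(round(d_23_0, 3), round(d_0_1, 3), round(d_0_12, 3))
```

Any two hours that are one hour apart get the same small distance, whether or not they straddle midnight, and the largest distance belongs to hours twelve apart.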

For text, the job is to turn documents into vectors. Bag-of-words simply counts how often each word appears; TF-IDF (term frequency–inverse document frequency) refines that by down-weighting words common across all documents — so “the” counts for almost nothing — and up-weighting words that are distinctive to a document. The TF-IDF weight of a term is a product: \[\text{tfidf}(t,d) = \text{tf}(t,d)\cdot\log\frac{N}{1+\text{df}(t)}\]

In words: how often a word appears in this document, multiplied by how rare it is across all documents, so common-everywhere words are pushed toward zero. Also written: \(\text{tfidf} = \text{tf} \times \text{idf}\), where \(\text{idf}(t) = \log\frac{N}{1+\text{df}(t)}\) (\(N\) = number of documents, \(\text{df}(t)\) = documents containing \(t\)).

Modern pipelines increasingly replace both with learned embeddings (Natural Language Processing).
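The TF-IDF product above can be computed by hand on a toy corpus. The sketch below follows the formula exactly as written in this section, with raw counts as term frequency; the three documents are invented for illustration, and library implementations typically use a smoothed variant, so their numbers will differ.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell sharply",
]
tokenised = [d.split() for d in docs]
N = len(tokenised)
# document frequency: in how many documents does each term appear?
df = Counter(term for doc in tokenised for term in set(doc))

def tfidf(term, doc_tokens):
    tf = doc_tokens.count(term)              # raw count as term frequency
    idf = math.log(N / (1 + df[term]))       # the idf variant given above
    return tf * idf

w_the = tfidf("the", tokenised[0])   # appears in every document
w_cat = tfidf("cat", tokenised[0])   # appears in two of three documents
w_mat = tfidf("mat", tokenised[0])   # appears only in this document
print(round(w_the, 3), round(w_cat, 3), round(w_mat, 3))
```

The distinctive word "mat" gets the largest weight. On a corpus this tiny the \(1 + \text{df}\) term pushes the weight of "the" slightly below zero, which is one reason production libraries add further smoothing.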

For images, classic pipelines hand-built edge and corner descriptors, while modern ones pass the image through a pretrained CNN and read out an intermediate layer as the feature vector (Computer Vision). And when you have many correlated numeric features, PCA extracts a few uncorrelated combinations that retain most of the variance — covered in full in Dimensionality Reduction.

import numpy as np, datetime as dt
t = dt.datetime(2026,6,25,14,30)
feats = {
  'hour': t.hour, 'dow': t.weekday(), 'is_weekend': int(t.weekday()>=5),
  'hour_sin': np.sin(2*np.pi*t.hour/24),   # midnight wraps correctly
  'hour_cos': np.cos(2*np.pi*t.hour/24),
}

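Turning a column of raw documents into a TF-IDF matrix is a two-line job with scikit-learn, and again the vocabulary and IDF weights are learned on training text only. Note that TfidfVectorizer's default IDF is a smoothed variant of the formula above and each row is L2-normalized, so its numbers will not match a hand calculation exactly: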

from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(stop_words="english", max_features=5000)
Xtr = vec.fit_transform(train_docs)   # learns vocab + idf from TRAIN
Xte = vec.transform(test_docs)        # reuses them; new words ignored

Tip

A single well-chosen extracted feature (the right datetime part, a ratio of two columns) often beats weeks of model tuning. Domain knowledge pays its highest dividend here.

6.8 — Feature selection

More features are not better. Irrelevant and redundant columns add noise, slow training, invite overfitting, and muddy interpretation, and the damage grows with the curse of dimensionality: as columns multiply, the same number of rows covers the feature space ever more sparsely. Feature selection keeps only the useful subset. Note the clean contrast with Chapter 07’s dimensionality reduction: selection keeps original columns, so the survivors remain interpretable, whereas reduction builds new combined ones. Three families trade off speed against accuracy.

Filter methods score each feature on its own, independent of any model — by correlation with the target, mutual information, or a chi-square test — and keep the top scorers. They are fast and model-agnostic, but blind to interactions: two columns that are useless alone yet powerful together will both be dropped.

Wrapper methods instead train an actual model on different feature subsets and keep whichever subset scores best. Recursive Feature Elimination is the classic: fit a model, drop the weakest feature, refit, and repeat. Because it measures real model performance it is accurate, but the repeated refitting makes it expensive.

Embedded methods select during training and strike the best cost-quality balance, which is why they are the usual default. L1 (Lasso) regularization adds a penalty that drives useless coefficients exactly to zero, performing selection for free as the model fits (Regression). Tree-based feature importances rank features by how much each one reduces impurity across the trees.

flowchart TD
  A[All features] --> F[Filter: score each alone<br/>fast, ignores interactions]
  A --> W[Wrapper: train on subsets<br/>accurate, slow]
  A --> E[Embedded: select while training<br/>L1 / tree importance]
  F --> S[Selected subset]
  W --> S
  E --> S

A small filter example shows both the appeal and the blind spot. Suppose three features have correlations f1 = 0.82, f2 = 0.05, f3 = 0.78 with the target. A filter keeps f1 and f3 and drops f2. The catch: if f2 were useless on its own but doubled f1’s signal in combination, the filter would never notice — and only a wrapper or an embedded method, which see features together rather than one at a time, would catch it.

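import numpy as np

rng = np.random.default_rng(0)                      # seeded so the run is repeatable
X = rng.random((100, 3))                            # columns f1, f2, f3
y = 2 * X[:, 0] + X[:, 2] + 0.01 * rng.random(100)  # f2 plays no part in y
# filter score: absolute Pearson correlation of each column with the target
corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(3)]
keep = [j for j, c in enumerate(corr) if c > 0.3]   # f2 should fall below the threshold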
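The blind spot is just as easy to reproduce. In the sketch below (an assumed toy setup, not data from this chapter) the label is the XOR of two binary features: each column alone is nearly uncorrelated with the target, so a correlation filter would discard both, yet a small decision tree that sees them together should separate the classes.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 2))   # two binary features
y = X[:, 0] ^ X[:, 1]                    # label depends only on their combination

# Filter view: each feature scored alone against the target.
solo_corr = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(2)]

# Model view: both features seen together.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_tr, y_tr)
joint_acc = tree.score(X_te, y_te)

print("solo |corr|:", np.round(solo_corr, 3))
print("accuracy using both:", joint_acc)
```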

scikit-learn packages one tool per family — SelectKBest (filter), RFE (wrapper), and SelectFromModel over an L1 model (embedded):

from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

filt = SelectKBest(f_classif, k=10).fit(X_train, y_train)             # filter
wrap = RFE(LogisticRegression(), n_features_to_select=10).fit(X_train, y_train)  # wrapper
emb  = SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear")
                       ).fit(X_train, y_train)                        # embedded (L1)

Warning

Run feature selection inside cross-validation, not once on the full dataset beforehand. Picking features using the whole dataset — test rows included — is leakage and silently inflates your reported score.
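A small experiment makes the inflation visible. The sketch below is an assumed setup: pure-noise features and labels that have nothing to do with them, so an honest accuracy should hover around chance. It selects features once on all rows, then again inside a Pipeline so that selection is refit on each training fold.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5000))     # pure noise: no feature is truly useful
y = np.array([0, 1] * 50)                # balanced labels unrelated to X

# Leaky: pick the 20 "best" features using every row, then cross-validate.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()

# Honest: selection lives inside the pipeline, so it is refit on each training fold.
pipe = Pipeline([("select", SelectKBest(f_classif, k=20)),
                 ("clf", LogisticRegression(max_iter=1000))])
honest = cross_val_score(pipe, X, y, cv=5).mean()

print(f"leaky CV accuracy:  {leaky:.2f}")
print(f"honest CV accuracy: {honest:.2f}")
```

Because the labels carry no information about the features, any accuracy well above 0.5 in the leaky variant is produced by the selection step having seen the held-out rows.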

6.9 — Building a leak-proof pipeline

Up to here each step has been shown in isolation, but the warnings all share one root cause and one cure. The root cause: any statistic learned from data — a median, a scaler’s mean, an encoder’s category means, a TF-IDF vocabulary, a SMOTE neighbourhood, a selected feature set — is a parameter fit to data, and if it sees the test rows it has leaked. The cure is to bundle every such step into a single object that exposes one fit (which only ever touches training data) and one transform, so the discipline is enforced by construction rather than by remembering.

The intuition: a pipeline is a sealed assembly line. Training data goes in one end and every station calibrates itself; at scoring time new data runs through the already-calibrated stations, which are never recalibrated. No station can peek at the test data, because each one was bolted shut before the test data ever arrived.

flowchart LR
  subgraph FIT [fit — TRAIN ONLY]
    A1[impute] --> B1[scale] --> C1[encode] --> D1[model]
  end
  subgraph APPLY [transform — VALID / TEST]
    A2[reuse medians] --> B2[reuse mu, sigma] --> C2[reuse vocab] --> D2[predict]
  end
  FIT -.learned params.-> APPLY

scikit-learn’s ColumnTransformer plus Pipeline is the standard way to express this. Numeric and categorical columns get different treatment; the whole object is fit once on the training split, and cross-validation refits it from scratch on the training portion of each fold:

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

num = ["age", "salary"]; cat = ["city"]
pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("sc",  StandardScaler())]), num),
    ("cat", Pipeline([("imp", SimpleImputer(strategy="most_frequent")),
                      ("oh",  OneHotEncoder(handle_unknown="ignore"))]), cat),
])
model = Pipeline([("pre", pre), ("clf", LogisticRegression())])

# every transform is re-fit on each training fold only — no leakage
scores = cross_val_score(model, X, y, cv=5, scoring="f1")
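To see the fit and apply halves separately, here is a self-contained sketch on a handful of hypothetical rows (the values, the labels, and the unseen city "Oslo" are invented for illustration; the column names follow the example above). It fits the pipeline on training rows, reads back the medians the imputer learned, and then scores test rows without refitting anything. The categorical branch is simplified to a bare one-hot encoder because the toy data has no missing cities.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows, invented for this illustration.
train = pd.DataFrame({
    "age":    [25, 32, np.nan, 51, 46, 38],
    "salary": [40_000, 52_000, 61_000, np.nan, 90_000, 75_000],
    "city":   ["Paris", "Lyon", "Paris", "Nice", "Lyon", "Paris"],
})
y_train = [0, 0, 0, 1, 1, 1]
test = pd.DataFrame({
    "age":    [np.nan, 29],
    "salary": [58_000, np.nan],
    "city":   ["Oslo", "Paris"],          # "Oslo" never appears in training
})

pre = ColumnTransformer([
    ("num", Pipeline([("imp", SimpleImputer(strategy="median")),
                      ("sc",  StandardScaler())]), ["age", "salary"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
model = Pipeline([("pre", pre), ("clf", LogisticRegression())])

model.fit(train, y_train)                 # every statistic is learned here, from train only
num_pipe = model.named_steps["pre"].named_transformers_["num"]
medians = num_pipe.named_steps["imp"].statistics_
pred = model.predict(test)                # test rows are transformed and scored, never fit

print("medians learned from train:", medians)
print("predictions for test rows:", pred)
```

The missing test values are filled with the training medians, and the unseen city is encoded as all zeros because of handle_unknown="ignore", so scoring new data never changes what the pipeline learned.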

A worked sanity check makes the leakage concrete. Imagine 1000 rows, scaled by a mean computed over all 1000, then split 800/200 for training and test. That global mean was nudged, however slightly, by the 200 test rows, so the test score can come out a touch optimistic; on a small or skewed dataset that “touch” can be the difference between a model that ships and one that fails in production. Re-running the same code with the scaler inside the pipeline, so that it is fit on the 800 training rows only, removes the nudge, and the score settles at its honest value. If that number is slightly lower, it is still the one to trust.
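The size of the nudge follows from simple arithmetic: with an 800/200 split the global mean is a weighted blend, \(\mu_{\text{all}} = 0.8\,\mu_{\text{train}} + 0.2\,\mu_{\text{test}}\), so it sits \(0.2\,(\mu_{\text{test}} - \mu_{\text{train}})\) away from the honest training mean. A minimal numeric sketch, assuming a skewed synthetic feature drawn from a log-normal distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=1000)   # assumed skewed feature
x_train, x_test = x[:800], x[800:]                  # 800/200 split

mu_all = x.mean()            # leaky: computed before the split
mu_train = x_train.mean()    # honest: computed on training rows only
mu_test = x_test.mean()

# The global mean is a weighted blend, so the test rows own 20% of it.
blend = 0.8 * mu_train + 0.2 * mu_test
nudge = mu_all - mu_train    # how far the test rows moved the scaler's mean

print(f"mu_all={mu_all:.4f}  mu_train={mu_train:.4f}  nudge={nudge:+.4f}")
```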

Tip

If you remember one habit from this chapter, make it this: never call fit (or fit_transform) on data that contains validation or test rows. Use fit_transform on the training split and transform on everything else, or better, let a Pipeline do both so you cannot get it wrong.

6.10 — Quick reference

Term / method	Meaning in one line	When / why to reach for it
Fit on train only	Learn every statistic from the training split, apply to test	Always — the one rule that prevents leakage
MCAR / MAR / MNAR	Missingness depends on nothing / observed cols / the hidden value	Diagnose before imputing; MNAR needs a was-missing flag
Median imputation	Fill gaps with the column median	Default for numeric columns you suspect are skewed or noisy
Was-missing flag	Extra 0/1 column marking imputed cells	Whenever absence itself carries signal (esp. MNAR)
Z-score \(z=(x-\mu)/\sigma\)	Recenter to mean 0, sd 1	Default scaler for distance/gradient models
Robust scaling	Center on median, divide by IQR	Heavy outliers present; resistant to extreme tails
Min-max \([0,1]\)	Map min→0, max→1	Need a bounded range (pixels, sigmoid targets)
One-hot encoding	One 0/1 column per category	Low-cardinality nominal features; no fake order
Target encoding	Replace category with its target mean	High cardinality — but fit out-of-fold or it leaks
IQR rule	Flag outside \(Q_1-1.5\,\text{IQR}\) … \(Q_3+1.5\,\text{IQR}\)	Quick, robust 1-D outlier detection (the boxplot)
Isolation Forest	Anomaly = isolated in few random cuts	Multivariate outliers across many features
Log / Box-Cox	Compress a long right tail toward bell shape	Features spanning several orders of magnitude
SMOTE	Synthesize minority points between neighbours	Imbalanced classes; train fold only, inside an imblearn pipeline
Class weights \(w_c=N/(K\,n_c)\)	Penalize minority errors more in the loss	Cleanest imbalance fix; invents no data
Cyclical sin/cos	Encode hour/month on a circle	So 23:00 and 00:00 read as adjacent
TF-IDF	Term frequency × inverse document frequency	Turn documents into vectors, down-weight common words
Filter / wrapper / embedded	Score alone / train on subsets / select while fitting	Speed→accuracy trade for feature selection; L1 default
Pipeline / ColumnTransformer	Bundle fit+transform into one sealed object	Enforces fit-on-train-only by construction

6.11 — Key takeaways

Preprocessing often decides a project more than model choice — clean inputs beat clever algorithms.
Fit on train, apply to test. Every statistic — imputation value, scaler, encoder means, SMOTE, feature selection — must be learned from training data only, or you leak. A Pipeline / ColumnTransformer enforces this by construction.
Missing values: prefer median for skewed numeric, mode for categorical; keep a “was-missing” flag, and respect the missingness mechanism (MCAR/MAR/MNAR).
Scale (z-score by default, robust for outliers, min-max for bounded) for distance- and gradient-based models; trees don’t care.
Encode without inventing order: one-hot for low cardinality, target/hashing/embeddings for high — and fit target encoding out-of-fold.
Detect outliers with IQR or Isolation Forest; cap or transform rather than blindly delete, since rare real points may be the signal.
Log/Box-Cox tame skewed, multi-order-of-magnitude features; binning trades detail for non-linear capture.
Fight imbalance with class weights or SMOTE (training fold only) and judge with precision/recall/F1/AUC, never plain accuracy.
Extraction creates features (datetime parts, TF-IDF, CNN/embedding vectors); selection keeps the useful originals (filter / wrapper / embedded).

6.12 — See also

Dimensionality Reduction — PCA, t-SNE, UMAP: building new compressed features rather than selecting existing ones.
AI, ML & the Learning Process — train/validation/test splits and the leakage discipline this chapter depends on.
Model Evaluation & Tuning — precision/recall/F1/AUC and cross-validation, the right metrics for imbalanced data.
Regression — L1/Lasso regularization as embedded feature selection.
Probability & Statistics — distributions, skew, and the mean/median/IQR machinery behind scaling and outliers.
Natural Language Processing — text feature extraction beyond bag-of-words and TF-IDF.
Anomaly & Fraud Detection — Isolation Forest and imbalanced learning applied end to end.

↪ The thread continues → Chapter 07 · 🗜️ Dimensionality Reduction

Clean data is often too wide: hundreds of features, most redundant. Before modeling, we compress it down to what matters — the job of dimensionality reduction.
