flowchart TD
E[Total error] --> B["Bias²<br/>too simple → underfit"]
E --> V["Variance<br/>too sensitive → overfit"]
E --> N["Noise σ²<br/>irreducible"]
B -.attacked by.-> Boost[Boosting]
V -.attacked by.-> Bag[Bagging / Random Forest]
Chapter 10 — 🌳 Ensemble Methods
📖 All chapters | ← 09 · 📐 Classification Algorithms | 11 · 🔮 Clustering & Unsupervised Learning →
📚 Jump to any chapter
🧮 Mathematical Foundations
- 01 · 🧮 Linear Algebra
- 02 · ∂ Calculus & Differentiation
- 03 · 📉 Optimization
- 04 · 🎲 Probability & Statistics
🧭 The ML Workflow
🧩 Classical Machine Learning
- 08 · 📈 Regression
- 09 · 📐 Classification Algorithms
- 10 · 🌳 Ensemble Methods
- 11 · 🔮 Clustering & Unsupervised Learning
- 12 · 🎯 Model Evaluation & Tuning
🎲 Probabilistic Models
🧠 Deep Learning
- 14 · 🧠 Neural Networks (Core)
- 15 · 🖼️ Convolutional Neural Networks
- 16 · 🔁 Recurrent & Sequence Models
- 17 · ⚡ Attention & Transformers
- 18 · 🎨 Generative Models
🗣️ Applied AI: Vision, Language, Audio & Time
- 19 · 👁️ Computer Vision
- 20 · 💬 Natural Language Processing
- 21 · 🔊 Speech & Audio Processing
- 22 · ⏳ Time Series & Forecasting
- 23 · 📚 Large Language Models
- 24 · 🌈 Multimodal AI
🕹️ Reinforcement Learning
🛠️ Applied ML Systems & Industries
🚀 Production, Tooling & Infrastructure
📚 Classical & Symbolic AI
- 32 · 🧭 Search & Problem Solving
- 33 · 📖 Knowledge Representation & Reasoning
- 34 · 🗺️ Planning, Constraint Satisfaction & Game Playing
- 35 · 🧬 Evolutionary Computation & Metaheuristics
⚖️ Responsible AI & Frontier
- 36 · 🔍 Explainable AI & Interpretability
- 37 · 🧷 Causal Inference
- 38 · ⚖️ AI Ethics, Fairness & Safety
- 39 · 🌠 Frontier & Emerging Directions
🎓 Advanced & Specialized Topics
- 40 · 🔗 Graph Machine Learning
- 41 · 🤖 Robotics & Autonomy
- 42 · 📐 Learning Theory
- 43 · 🔎 Information Retrieval & Data Mining
- 44 · 🏗️ LLM Systems: Building LLMs from Scratch
🎚️ Post-Training & Fine-Tuning
- 45 · 🎚️ Post-Training I — Transfer, Fine-Tuning & PEFT
- 46 · 🏅 Post-Training II — Alignment & Evaluation
🚢 Model Serving & Deployment
A single decision tree is a weak student: it overfits, it’s jumpy, and its answer can flip if you nudge the data. Ensemble methods are the trick that turns a crowd of mediocre models into one strong predictor — and they are still, in 2026, the algorithms that win most competitions and power most production systems on tabular data. This chapter sits in Classical Machine Learning and builds directly on decision trees: it is the bridge from “one model” to “many models that vote, average, or correct each other.”
🧭 In context: Classical ML, post-trees · used to squeeze maximum accuracy out of tabular/structured data · the one key idea: many imperfect models combined beat one perfect-looking model, because their errors cancel.
💡 Remember this: combining many models works only when their mistakes are uncorrelated — bagging makes trees disagree to cancel variance, boosting chains them to cancel bias, and lowering correlation is the lever behind every method in this chapter.
10.1 — Base classifiers / weak learners (and why combining helps)
A base classifier (or weak learner) is a simple model used as a building block — most often a shallow decision tree, sometimes a single split called a decision stump. The word “weak” has a precise meaning here: the learner only needs to do slightly better than random guessing. That sounds like a low bar, and it is — but the surprising result at the heart of this whole chapter is that you can stack many such weak learners into one strong learner whose accuracy is far higher than any single member’s.
The cleanest way to feel why this works is a coin-flip analogy. Imagine 100 friends each guessing a yes/no answer, and each is right 60% of the time — barely better than a coin. If their mistakes are independent, the majority vote of the crowd is right far more than 60% of the time, because for the crowd to be wrong, an unlucky majority of them all have to err at once, which is rare. The members are weak; the committee is strong. Ensembles are exactly this committee, and the rest of the chapter is about how to build members whose mistakes don’t all happen together.
Here is that crowd effect as a real number. With 100 independent voters each 60% accurate, the chance the majority is wrong is the chance that 50 or more of them err — which works out to under 2%. So a committee of 60%-accurate coin-flippers votes correctly over 98% of the time. The math is just the binomial tail:
\[P(\text{crowd wrong}) = \sum_{k=\lceil M/2\rceil}^{M} \binom{M}{k}(1-p)^k\,p^{\,M-k}\]
In words: the crowd is wrong only when at least half of its members happen to be wrong at the same time; sum the probability of every such unlucky split.
Also written: \(P(\text{crowd wrong}) = P\!\left(\text{Binomial}(M, 1-p) \ge \tfrac{M}{2}\right)\), the upper tail of a binomial distribution.
from scipy.stats import binom
M, p = 100, 0.60
print("crowd error:", round(binom.sf(M//2 - 1, M, 1-p), 4)) # ~0.0176 -> 98%+ correctThe catch hidden in “if their mistakes are independent” is the whole game: real models trained on the same data make correlated mistakes, so the crowd never reaches this idealized 98%. Every technique in this chapter is, at bottom, a different way to make the members disagree more.
The picture below makes the crowd effect visible: as you add more independent 60%-accurate voters (left to right), the probability the majority is wrong collapses toward zero, even though no single voter ever improved.
To make this precise we need the bias–variance decomposition (its full probability and statistics derivation lives elsewhere). The expected prediction error of a model splits into three pieces:
\[\text{Error} = \underbrace{\text{Bias}^2}_{\text{wrong on average}} + \underbrace{\text{Variance}}_{\text{jumpy across datasets}} + \underbrace{\sigma^2}_{\text{irreducible noise}}\]
In words: total error is how far off you are on average (bias), plus how much your answer jiggles when the training data changes (variance), plus noise no model can ever remove.
Also written: for squared loss, \(\mathbb{E}[(y-\hat f)^2] = (\,\mathbb{E}[\hat f]-f\,)^2 + \mathbb{E}[(\hat f-\mathbb{E}[\hat f])^2] + \sigma^2\).
- Bias is systematic error: the model is too simple to capture the truth (a straight line trying to fit a curve). High bias means underfitting.
- Variance is sensitivity: retrain on a slightly different sample and the predictions swing wildly. High variance means overfitting.
- The irreducible noise \(\sigma^2\) is the randomness in the data itself — no model can beat it.
A useful mental picture is a dartboard. Low bias, low variance is a tight cluster on the bullseye. High bias, low variance is a tight cluster, but in the wrong corner — consistently off. Low bias, high variance is scattered all around the bullseye — right on average but wild any single throw. High bias, high variance is scattered and off-center.
A deep decision tree is the classic high-variance, low-bias model: it can fit almost anything, but it memorizes noise. A stump is the opposite — high bias, low variance. The two families of ensembles exploit exactly this split, each attacking a different term:
- Bagging (and random forests) attacks variance: average many high-variance, low-bias trees so their random wobbles cancel.
- Boosting attacks bias: chain many high-bias weak learners so each one fixes what the last got wrong.
Here is the variance-cancellation effect with real numbers. Imagine \(M\) models, each with prediction variance \(\sigma^2\). If their errors were perfectly independent, averaging them gives variance \(\sigma^2/M\) — averaging 25 independent trees cuts variance 25-fold. But in reality trees are correlated (they see overlapping data and share features), so with pairwise correlation \(\rho\) the averaged variance becomes:
\[\text{Var}_{\text{avg}} = \rho\,\sigma^2 + \frac{1-\rho}{M}\,\sigma^2\]
In words: averaging knocks down only the independent slice of each tree’s variance; the slice the trees share (their correlation) survives no matter how many you average.
Also written: \(\text{Var}_{\text{avg}} = \sigma^2\!\left[\rho + \dfrac{1-\rho}{M}\right]\), which \(\to \rho\sigma^2\) as \(M\to\infty\).
import numpy as np
sigma2 = 1.0
for rho in [0.0, 0.3, 0.6]: # correlation between trees
for M in [1, 10, 100]:
v = rho*sigma2 + (1-rho)/M*sigma2
print(f"rho={rho} M={M:3d} var={v:.3f}")
# rho=0: var -> 0 as M grows. rho=0.6: var floors at 0.6 no matter how many trees.The lesson hiding in that second term is the single most important idea in the chapter. As \(M\) grows, the right-hand term vanishes and the variance floors at \(\rho\sigma^2\). So once you have many trees, piling on more trees buys nothing — the only way to keep improving is to lower the correlation \(\rho\) between them. That is precisely the job random forests take on in §10.3.
Rule of thumb: bag things that overfit (deep trees), boost things that underfit (stumps). If your single model already has low variance, bagging it buys you almost nothing — you’d just be averaging many copies of the same stable answer.
10.2 — Bagging & bootstrap aggregation
Bagging — short for Bootstrap AGGregating — is the simplest variance-reduction recipe there is. Train many copies of the same model on different random resamples of the data, then combine them: average their outputs for regression, or take a majority vote for classification.
The resampling trick is the bootstrap: from a dataset of \(n\) rows, draw \(n\) rows with replacement. Because you put each drawn row back before drawing the next, some rows get picked twice or thrice and some never get picked at all. Each bootstrap sample is therefore a slightly different “parallel universe” version of your data, so each tree learns slightly different quirks — and, exactly as §10.1 promised, those quirks partly cancel when you average.
Here is the bootstrap on a tiny dataset of five rows so you can see it concretely. Drawing five times with replacement from rows [A B C D E] might give [C A C E B] for one tree and [D D A E C] for another — notice C appears twice in the first and B is missing entirely.
A neat and useful fact: each bootstrap sample leaves out about 37% of the original rows. The probability that one specific row is never chosen across \(n\) independent draws is \((1 - 1/n)^n\), which converges to \(e^{-1} \approx 0.368\) as \(n\) grows. Those left-out rows have a name — out-of-bag (OOB) rows — and we will turn them into free validation data in §10.3.
In words: each of the \(n\) draws independently misses a given row with probability \(1-\frac1n\); missing it on all \(n\) draws multiplies those together.
Also written: \(P(\text{row left out}) = \left(1-\frac1n\right)^n \xrightarrow{\,n\to\infty\,} e^{-1} \approx 0.368\).
import numpy as np
from sklearn.tree import DecisionTreeRegressor
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 80))[:, None]
y = np.sin(X).ravel() + rng.normal(0, 0.3, 80) # noisy sine
def bag(X, y, n_models=50):
trees, n = [], len(X)
for _ in range(n_models):
idx = rng.integers(0, n, n) # bootstrap: sample WITH replacement
t = DecisionTreeRegressor(max_depth=6).fit(X[idx], y[idx])
trees.append(t)
return trees
trees = bag(X, y)
pred = np.mean([t.predict(X) for t in trees], axis=0) # average the herdIn practice you rarely hand-roll the bootstrap loop — scikit-learn’s BaggingRegressor/BaggingClassifier wraps any base estimator and does the resampling, OOB scoring, and parallelism for you:
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
bag = BaggingRegressor(
estimator=DecisionTreeRegressor(max_depth=6),
n_estimators=50, oob_score=True, n_jobs=-1, random_state=0
).fit(X, y)
pred = bag.predict(X)
print("OOB R²:", round(bag.oob_score_, 3)) # free validation from the left-out 37%flowchart TD
D[(Training data)] --> B1[Bootstrap sample 1] --> T1[Tree 1]
D --> B2[Bootstrap sample 2] --> T2[Tree 2]
D --> B3[Bootstrap sample 3] --> T3[Tree 3]
T1 --> A{Average / vote}
T2 --> A
T3 --> A
A --> P[Final prediction]
The SVG below shows the payoff. The faint grey lines are individual bagged trees — each one jagged and overfit, chasing the noise. The bold blue line is their average: smooth, stable, and far closer to the true underlying signal. Averaging didn’t make any single tree better; it made the combination better by letting the wobbles cancel.
Bagging only helps when the base model is unstable — when small data changes produce big prediction changes. Deep trees, yes. A linear regression, almost never: it barely moves between bootstrap samples, so averaging copies of it changes nothing.
10.3 — Random forests (feature randomness, OOB error, importance)
Bagging trees alone runs straight into the correlation floor from §10.1. Left free to look at every feature, each tree tends to pick the same dominant feature for its very first, most important split — so the trees come out looking alike, \(\rho\) stays high, and the variance can’t drop below \(\rho\sigma^2\) no matter how many trees you add. Random forests break this logjam with one decisive twist on top of bagging: feature randomness, also called the random subspace method.
The idea is to handicap every split. At each node, instead of considering all \(p\) features, the tree is only allowed to choose its split from a fresh random subset of \(m < p\) of them (typical defaults: \(m=\sqrt{p}\) for classification, \(m=p/3\) for regression). Sometimes the strongest feature simply isn’t in the allowed subset, so the tree is forced to use a different one. Across the forest this means different trees rely on different features, which decorrelates them — lowering \(\rho\), lowering the variance floor, and lifting accuracy. That is the entire recipe: random forest = bagging + per-split feature sampling.
A concrete picture: with \(p=16\) features the classification default is \(m=\sqrt{16}=4\). So at every single node the tree rolls the dice, sees only 4 of the 16 columns, and must split on the best of those — even if its favorite global feature is among the 12 it can’t see this time. That forced variety is what keeps the trees from being clones of each other.
flowchart LR
subgraph Forest
T1[Tree: at each split<br/>pick best of random m features]
T2[Tree: at each split<br/>pick best of random m features]
T3[Tree: at each split<br/>pick best of random m features]
end
T1 --> V{Majority vote / average}
T2 --> V
T3 --> V
V --> O[Prediction]
T1 -. rows it never saw .-> OOB[OOB error estimate]
T2 -. rows it never saw .-> OOB
Out-of-bag (OOB) error is the random forest’s free lunch, and it pays off the ~37% of rows each tree skips. To score a single row \(i\) honestly, gather only the trees that did not train on row \(i\) — to them, that row is genuinely unseen test data — and average just their predictions for it. Repeat for every row and you get a validation estimate that needs no separate test set and no cross-validation loop. You trained one forest and got an honest accuracy estimate for free.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_wine
X, y = load_wine(return_X_y=True)
rf = RandomForestClassifier(n_estimators=300, oob_score=True,
max_features="sqrt", random_state=0).fit(X, y)
print("OOB accuracy:", round(rf.oob_score_, 3)) # ~0.98, no test split needed
print("top feature:", load_wine().feature_names[rf.feature_importances_.argmax()])Feature importance falls out of the same machinery. The classic measure is mean decrease in impurity (MDI): every time a feature is chosen for a split it reduces that node’s impurity (Gini or entropy); sum those reductions across every split in every tree and you get a score for how much each feature “helped.” It is computed essentially for free during training. The catch is that MDI is biased toward features with many distinct values — a high-cardinality column (or even a random ID) gets many chances to look useful. The more trustworthy alternative is permutation importance: take the trained forest, randomly shuffle one feature’s column, and measure how much accuracy drops. A big drop means the model genuinely relied on that feature; almost no drop means it didn’t.
from sklearn.inspection import permutation_importance
r = permutation_importance(rf, X, y, n_repeats=20, random_state=0)
for i in r.importances_mean.argsort()[::-1][:3]:
print(f"{load_wine().feature_names[i]:25s} {r.importances_mean[i]:.3f}")
# drop in accuracy when each feature's column is shuffled -> trustworthy rankingDon’t trust MDI importances blindly — they inflate the apparent importance of continuous and high-cardinality features and can even rank a pure-noise ID column highly. Prefer permutation importance for anything you’ll act on, and never read forest importances as causal (a feature can score high just by correlating with the real driver — see Causal Inference).
10.4 — Extra Trees (extremely randomized trees)
If feature randomness decorrelates trees by limiting which columns each split may consider, Extremely Randomized Trees (Extra Trees) push the same idea one step further: they also randomize where on a column to cut. A random forest still searches, among its allowed features, for the single best threshold. Extra Trees skip that search — for each candidate feature they pick a random threshold and keep the best of those random cuts.
The intuition is a deliberately careless surveyor. A random forest measures carefully and draws the optimal fence line on each plot; Extra Trees throw the fence down at a random spot and only compare a few random throws. Any single tree is therefore sloppier (higher bias), but the trees end up far less alike (lower correlation \(\rho\)) — and since §10.1 told us the variance floor is \(\rho\sigma^2\), dropping \(\rho\) can win back more than the added bias costs. Two more knock-on effects: training is much faster because no threshold search runs at every node, and by default Extra Trees fit each tree on the whole dataset (no bootstrap), since the split randomness alone already supplies the diversity.
| Random Forest | Extra Trees | |
|---|---|---|
| Feature subset per split | random \(m\) of \(p\) | random \(m\) of \(p\) |
| Threshold on a feature | best found by search | random, then best of those |
| Bootstrap sampling | yes (OOB available) | no by default (whole dataset) |
| Single-tree bias | lower | higher |
| Tree-to-tree correlation \(\rho\) | higher | lower |
| Training speed | baseline | faster (no split search) |
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.datasets import load_wine
X, y = load_wine(return_X_y=True)
et = ExtraTreesClassifier(n_estimators=300, max_features="sqrt",
random_state=0).fit(X, y)
print("Extra Trees train acc:", round(et.score(X, y), 3))
# same API as RandomForestClassifier; usually trains noticeably fasterThe picture: a random forest scans a feature and places the fence at the best threshold it can find; Extra Trees just throw the fence down at a random spot. Sloppier per tree, but no two trees place their fences alike — and that disagreement is exactly what the averaging cashes in.
In words: Extra Trees trade a little accuracy per tree for a lot more disagreement between trees, and the averaging cashes that disagreement back in as lower variance.
Extra Trees are a cheap thing to try alongside a random forest: same API, often comparable accuracy, and faster to fit. They tend to shine when features are noisy, since not over-searching for the “best” split keeps any single tree from latching onto noise.
10.5 — Boosting (the general idea of sequential error-correction)
Bagging builds its trees in parallel, and they never communicate — each one is grown in its own bootstrap universe, oblivious to the others. Boosting flips this completely: it builds weak learners one at a time, in sequence, and each new learner is trained specifically to fix the mistakes the ensemble has made so far. Where bagging fights variance, boosting fights bias — it takes deliberately high-bias weak learners (often stumps) and chains them into a low-bias whole.
The intuition is a study group drilling a weak student. In round 1 the student answers all the questions and gets some wrong. In round 2 you don’t re-test everything — you make them focus only on the questions they missed. Round 3, you zero in on what is still wrong after that. Each round’s effort is mediocre on its own, but the running total of all their attempts converges to nearly perfect. Boosting is this loop, automated.
flowchart LR
D[(Data)] --> W1[Weak learner 1]
W1 --> E1[look at its errors]
E1 --> W2[Weak learner 2<br/>focus on those errors]
W2 --> E2[look at remaining errors]
E2 --> W3[Weak learner 3<br/>focus on those]
W3 --> S[Weighted sum<br/>of all learners]
S --> P[Strong prediction]
The animation below is the same idea as a ball easing down a slope: each boosting round is one step downhill on the error landscape. Early steps are big and clumsy; later ones are tiny, settling into the valley. Boosting never jumps to the bottom — it walks there, one weak learner at a time.
Two flavors dominate, and they differ only in how “focus on the errors” is implemented:
- AdaBoost (§10.6): re-weight the data points — misclassified rows get heavier so the next learner is pulled toward them.
- Gradient boosting (§10.7): fit the next learner directly to the residual errors, which turns out to be gradient descent performed in function space.
Because boosting chases errors relentlessly, given enough rounds it will start fitting the noise — it can overfit if you use too many rounds or too-deep trees. Bagging, by contrast, essentially cannot overfit by adding more trees. Boosting therefore needs brakes: a small learning rate and early stopping.
10.6 — AdaBoost (reweighting)
AdaBoost (Adaptive Boosting, 1995) was the first boosting algorithm that actually worked in practice, and it implements “focus on the errors” by reweighting the data. It keeps a weight on every training row — all equal at the start — and after fitting each weak learner it raises the weight of every row that learner got wrong and lowers the rest. The next stump is trained on this reweighted data, so it is compelled to pay attention to the hard cases the previous stumps kept missing. Each stump also earns its own say-so weight \(\alpha\) based on how accurate it was, and the final prediction is the weighted vote of all the stumps.
The mechanics, for a stump \(t\) with weighted error rate \(\varepsilon_t\) (labels \(y_i \in \{-1,+1\}\), stump output \(h_t(x_i)\in\{-1,+1\}\)):
\[\alpha_t = \tfrac{1}{2}\ln\!\frac{1-\varepsilon_t}{\varepsilon_t}, \qquad w_i \leftarrow w_i \, e^{-\alpha_t y_i h_t(x_i)}\]
In words: trust a stump in proportion to the log-odds of it being right; then make every row it got wrong heavier and every row it got right lighter, going into the next round.
Also written: the update is \(w_i \leftarrow w_i \cdot e^{+\alpha_t}\) if the stump is wrong on row \(i\), and \(w_i \leftarrow w_i \cdot e^{-\alpha_t}\) if it is right (since \(y_i h_t(x_i)=\mp 1\)), followed by renormalizing \(w\) to sum to 1.
Read the first formula as a “how much should we trust this stump” dial: a stump that is barely better than chance (\(\varepsilon_t \to 0.5\)) gets \(\alpha_t \to 0\) and almost no vote, while a near-perfect stump (\(\varepsilon_t \to 0\)) gets a huge vote. The second formula does the reweighting: when a point is classified correctly the product \(y_i h_t(x_i) = +1\) so its weight is multiplied by \(e^{-\alpha_t} < 1\) and shrinks; when it’s wrong the product is \(-1\) so the weight is multiplied by \(e^{+\alpha_t} > 1\) and grows.
Here is a full worked round on four points, each starting at weight \(0.25\). Suppose stump 1 misclassifies exactly one point and has weighted error \(\varepsilon_1 = 0.25\):
import numpy as np
eps = 0.25
alpha = 0.5*np.log((1-eps)/eps) # 0.549 -> this stump's vote
w = np.array([.25,.25,.25,.25])
y_h = np.array([+1,+1,-1,+1]) # 3rd point misclassified (y*h = -1)
w = w*np.exp(-alpha*y_h) # wrong point grows, others shrink
w /= w.sum() # renormalize so weights sum to 1
print(np.round(w,3)) # [0.167 0.167 0.5 0.167] -> hard point now carries half the weightThe misclassified point jumped from a quarter of the weight to a full half — so when stump 2 is trained, it simply cannot afford to ignore that point. AdaBoost was elegant and genuinely revolutionary, but it has a well-known Achilles’ heel: because misclassified points get exponentially heavier, a handful of mislabeled rows or genuine outliers can hijack training, since the algorithm keeps doubling down on points it “can’t” get right. That sensitivity to noise is one of the main reasons gradient boosting eventually displaced it.
There is a deeper unifying fact worth knowing: AdaBoost is not a one-off recipe but a special case of forward stagewise additive modeling under the exponential loss \(L(y, F) = e^{-yF(x)}\). Minimizing that loss one weak learner at a time derives both the \(\alpha_t\) formula and the reweighting rule above — which is exactly why AdaBoost slots so naturally next to gradient boosting in §10.7, where you swap the exponential loss for any differentiable loss you like.
In real code you would reach for the library version, which runs the whole loop for you:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
ada = AdaBoostClassifier(
estimator=DecisionTreeClassifier(max_depth=1), # stumps
n_estimators=200, learning_rate=0.5, random_state=0
).fit(X, y)If AdaBoost is behaving erratically, suspect label noise first. A few wrong labels act like magnets that the algorithm chases ever harder. Gradient boosting with a small learning rate, or robust losses, handles dirty data far better.
10.7 — Gradient boosting (fitting residuals)
Here is the whole idea in one sentence: each new tree is trained to predict the leftover error of all the trees so far. Predict, see what you got wrong, fit a tree to that mistake, add a small slice of it, repeat. That leftover error has a name — the residual \(y - \hat y\) — and chasing it round after round is gradient boosting.
The fancier framing (skip it on a first read) is that this is gradient descent, but in an unusual place. Normal gradient descent nudges numbers (weights) downhill on the loss. Gradient boosting instead adds whole functions (trees) — it descends in “function space.” The link is a small piece of calculus: for squared-error loss \(\tfrac12(y-\hat y)^2\), the negative gradient with respect to the prediction works out to be exactly the residual \(y-\hat y\). So “fit the next tree to the negative gradient” and “fit the next tree to the residual” are the same instruction whenever the loss is squared error.
The model grows additively:
\[F_{m}(x) = F_{m-1}(x) + \nu \cdot h_m(x), \qquad h_m \text{ fit to } r_i = y_i - F_{m-1}(x_i)\]
In words: the new ensemble is the old one plus a small fraction \(\nu\) of a fresh tree, and that fresh tree is trained to predict the errors the old ensemble is still making.
Also written: \(F_M(x) = F_0(x) + \nu \sum_{m=1}^{M} h_m(x)\) — the final model is the starting guess plus a shrunken sum of every correction tree.
Here \(\nu\) is the learning rate (also called shrinkage), typically between 0.01 and 0.1. A small \(\nu\) means each tree only corrects a little of the residual, so you need many more trees to finish the job — but the result generalizes better, because no single tree is allowed to dominate. This learning-rate-versus-number-of-trees tradeoff is the central knob of every boosting library.
Here is gradient boosting built from scratch in a dozen lines, fitting a noisy curve. Watch what each line does: start by predicting the mean, then repeatedly fit a small tree to the current residuals and add a fraction of it.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(0, 6, 120))[:, None]
y = np.sin(X).ravel() + rng.normal(0, 0.2, 120)
F = np.full_like(y, y.mean()) # round 0: predict the mean
nu, trees = 0.3, []
for m in range(50):
resid = y - F # negative gradient for squared loss
h = DecisionTreeRegressor(max_depth=3).fit(X, resid) # fit a tree TO THE ERRORS
F = F + nu * h.predict(X) # add a shrunken correction
trees.append(h)
def predict(Xn): # start at the mean, add every tree's slice
return y.mean() + nu*sum(t.predict(Xn) for t in trees)
print("train MSE:", round(np.mean((y-F)**2), 3)) # shrinks each roundThe same model in scikit-learn is a single class, with the modern histogram-based variant being the fast one to prefer:
from sklearn.ensemble import HistGradientBoostingRegressor
gb = HistGradientBoostingRegressor(
learning_rate=0.05, max_iter=500,
early_stopping=True, validation_fraction=0.1 # let the data pick #trees
).fit(X, y.ravel())
print("stopped at", gb.n_iter_, "trees")The bars below show the residual magnitude deflating round by round — that shrinking is literally the algorithm working, each tree shaving off another slice of the error its predecessors left behind.
Although squared error makes the gradient equal the residual, the same machinery works for any differentiable loss — log-loss for classification, absolute error or the Huber loss for robust regression. You just fit each new tree to whatever the negative gradient of your loss happens to be; “fit the residuals” is the special case everyone learns first.
Lower the learning rate and raise the number of trees together — they trade off directly. A reliable winning combo is \(\nu = 0.05\) with early stopping on a validation set, which lets the data decide the right number of trees instead of you guessing it.
10.8 — XGBoost / LightGBM / CatBoost (why they dominate tabular)
Plain gradient boosting, as written in §10.7, is slow and prone to overfitting. Three engineered libraries fixed both problems and now win the overwhelming majority of tabular ML competitions while quietly running countless production models. They all share the gradient-boosting core from the previous section — fit trees to gradients, add shrunken slices — and differ in the clever optimizations layered on top.
XGBoost (“eXtreme Gradient Boosting”) earned its dominance with three additions. First, a regularized objective: it adds an \(L_2\) penalty on the leaf output values plus a per-leaf complexity cost \(\gamma\), so the trees are explicitly discouraged from growing reckless, over-confident leaves. Second, a second-order Taylor expansion of the loss: where plain gradient boosting uses only the gradient, XGBoost uses both the gradient and the Hessian (the curvature), which yields better, more stable split decisions. Third, a cache-aware, parallelized, sparsity-aware engine that made all of this genuinely fast on real hardware, including built-in handling of missing values.
That second-order objective is worth seeing once. XGBoost approximates the loss with a Taylor expansion and, writing \(g_i\) for the gradient and \(h_i\) for the Hessian of each row, the optimal value of a leaf \(j\) and the gain of a candidate split come out in closed form:
\[w_j^* = -\frac{\sum_{i\in j} g_i}{\sum_{i\in j} h_i + \lambda}, \qquad \text{Gain} = \tfrac{1}{2}\!\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma\]
In words: a leaf’s best output is its rows’ total gradient divided by their total curvature (plus a smoothing \(\lambda\)); a split is worth making only if separating the rows into left and right buys more than the per-split toll \(\gamma\).
Also written: with \(G=\sum g_i\) and \(H=\sum h_i\) over a node, the leaf weight is \(w^*=-G/(H+\lambda)\) and the split gain compares the children’s \(\tfrac12 G^2/(H+\lambda)\) scores against the parent’s.
A quick worked split makes it concrete. Say a node holds rows whose gradients sum to \(G_L=4, H_L=2\) on the left and \(G_R=-6, H_R=3\) on the right, with \(\lambda=1\) and \(\gamma=0.5\). The gain is \(\tfrac12\big[\tfrac{16}{3}+\tfrac{36}{4}-\tfrac{4}{6}\big]-0.5 = \tfrac12[5.33+9.0-0.67]-0.5 \approx 6.3\). Positive, and well above the \(\gamma\) toll — so XGBoost makes this split. Flip the numbers so the two sides have similar gradients and the bracket nearly cancels, the gain dips below \(\gamma\), and the split is rejected. That single number is how every candidate split in the library is judged.
LightGBM chased raw speed on large datasets with two key ideas. Histogram binning buckets each continuous feature into roughly 255 discrete bins, so finding the best split means scanning a few hundred bins instead of every raw value — a massive speedup. GOSS (Gradient-based One-Side Sampling) keeps all the high-gradient “hard” examples but randomly subsamples the low-gradient “easy” ones, focusing compute where the error still lives. Crucially, LightGBM grows trees leaf-wise rather than level-wise (explained below), which is why it is both faster and somewhat more prone to overfitting.
CatBoost set out to solve the categorical feature headache that the others handle awkwardly. Rather than making you one-hot encode, it uses ordered target statistics: it encodes a category by the average target value of the rows that came before the current one in a random permutation. That “only look at earlier rows” trick is what prevents the target leakage that naive mean-encoding suffers, where a category secretly encodes its own row’s label. CatBoost also uses symmetric (oblivious) trees, where every node at a given depth applies the same split test — this acts as built-in regularization and makes inference extremely fast. It is the natural first choice when your data is rich in categorical columns.
The key architectural fork between these libraries is leaf-wise versus level-wise tree growth:
- Level-wise (depth-wise) growth, XGBoost’s traditional default, expands the tree one full level at a time, splitting every node on the current level before going deeper. The trees stay balanced and the approach is more conservative.
- Leaf-wise (best-first) growth, LightGBM’s default, repeatedly splits whichever single leaf anywhere in the tree promises the largest loss reduction. This produces deep, lopsided trees that reach lower loss for the same number of leaves — but it overfits faster, so you cap it with
num_leavesandmax_depth.
| Library | Headline trick | Tree growth | Best when |
|---|---|---|---|
| XGBoost | Regularized loss + 2nd-order gradients | Level-wise (default) | Robust default, medium data, you want control |
| LightGBM | Histogram bins + GOSS | Leaf-wise | Large datasets, need speed |
| CatBoost | Ordered target encoding + oblivious trees | Symmetric | Many categorical features, less tuning |
# All three share a near-identical sklearn-style API:
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
model = XGBClassifier(n_estimators=400, learning_rate=0.05, max_depth=5)
# model.fit(X_train, y_train); model.predict(X_test)In practice you almost always pair these with early stopping on a held-out set so the library, not you, decides how many trees to grow:
from lightgbm import LGBMClassifier, early_stopping
from sklearn.model_selection import train_test_split
Xtr, Xval, ytr, yval = train_test_split(X, y, test_size=0.2, random_state=0)
model = LGBMClassifier(n_estimators=5000, learning_rate=0.03, num_leaves=31)
model.fit(Xtr, ytr, eval_set=[(Xval, yval)],
callbacks=[early_stopping(stopping_rounds=50)])
print("best #trees:", model.best_iteration_) # stops when val loss stallsFor most tabular problems a tuned gradient-boosting library beats a neural network and trains in seconds. Start with LightGBM or XGBoost defaults; tune learning rate, number of trees, and max depth first; and always use early stopping. Reach for CatBoost when categorical columns dominate.
10.9 — Stacking & voting (hard/soft, meta-learner)
Every ensemble so far has combined many copies of one model type. Voting and stacking do something different: they combine different kinds of models — say a random forest, a boosted tree, a logistic regression, and an SVM — to exploit their complementary strengths. Where one model is blind, another often sees, and blending them can beat any single one.
Voting is the simple version. Hard voting takes the majority predicted class label across the models — one model, one vote, count them up. Soft voting instead averages the predicted probabilities across models and picks the class with the highest average. Soft voting usually wins because it uses each model’s confidence: a model that is 90% sure can outweigh two models that are barely past 50%. Here is exactly that situation on a binary problem with three classifiers:
import numpy as np
# predicted P(class=1) from 3 models for one sample:
probs = np.array([0.9, 0.45, 0.40])
hard = int((probs > 0.5).sum() > 1.5) # votes for class 1: only 1 of 3 -> class 0
soft = int(probs.mean() > 0.5) # mean p = 0.583 -> class 1
print("hard vote ->", hard, " soft vote ->", soft)
# hard and soft DISAGREE: under soft voting the confident 0.9 model carries the day.Stacking (stacked generalization) is the powerful version: instead of a fixed averaging rule, it learns how to combine the base models from data. You train several base models, collect their predictions, and feed those predictions as the input features to a final meta-learner (often a simple logistic or linear model) that learns the best blend — perhaps “trust the boosted tree most, but lean on logistic regression when the forest and the tree disagree.”
flowchart TD
D[(Training data)] --> M1[Base: Random Forest]
D --> M2[Base: XGBoost]
D --> M3[Base: Logistic Reg]
M1 --> OOF[Out-of-fold predictions]
M2 --> OOF
M3 --> OOF
OOF --> Meta[Meta-learner<br/>e.g. logistic regression]
Meta --> P[Final prediction]
The animation traces one prediction flowing through the stack: data fans out to several different base models, their answers stream into the meta-learner, which fuses them into one final call. The pulses moving along the wires are the predictions in transit.
The one rule that makes or breaks stacking is how you generate the predictions you feed the meta-learner: they must come from cross-validation, not from the same rows the base models were trained on. If the meta-learner sees in-sample predictions, the base models look artificially flawless on those rows — they have effectively memorized them — and the meta-learner learns to trust an illusion that evaporates in production. This is a classic case of data leakage. Using out-of-fold predictions (each row predicted only by base models that did not train on it) keeps the whole stack honest.
scikit-learn’s StackingClassifier does the out-of-fold cross-validation for you — you just list the base models and the meta-learner:
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from lightgbm import LGBMClassifier
stack = StackingClassifier(
estimators=[("rf", RandomForestClassifier(n_estimators=300)),
("gb", LGBMClassifier(n_estimators=300))],
final_estimator=LogisticRegression(),
cv=5, # out-of-fold predictions -> no leakage
).fit(X, y)The single most common stacking bug is feeding the meta-learner the predictions the base models made on their own training rows. Always use out-of-fold (cross-validated) predictions as the meta-features, or the stack overfits silently — it will look spectacular in development and quietly fall apart on live data.
10.10 — Calibrating ensemble probabilities
A model can rank cases perfectly and still lie about how sure it is. Calibration asks a separate question from accuracy: when the model says “70% chance of fraud,” do roughly 70 out of 100 such cases actually turn out to be fraud? If yes, the probabilities are calibrated and you can act on the numbers themselves — set a threshold, compute expected cost, feed them into a downstream decision. If not, the model may still classify well while its probabilities are systematically wrong.
This matters for ensembles specifically because the two big families distort probabilities in opposite, well-known ways:
- Random forests push probabilities toward the middle. A vote of, say, 950-of-1000 trees rarely hits 1.0, because individual trees are noisy and almost always leave a few dissenters — so confident cases get under-stated (you see 0.85 where the truth is 0.99).
- Boosting (AdaBoost especially) pushes probabilities toward the extremes. Driving the exponential/log loss down hard makes the raw scores pile up near 0 and 1, so the model sounds far more certain than it is.
The standard diagnostic is the reliability diagram: bucket predictions by their stated probability and plot, for each bucket, the actual fraction of positives. Perfect calibration is the 45° diagonal; the forest’s curve sags below it in the middle, boosting’s hugs the corners.
The fix is to fit a tiny one-input correction on a held-out set that maps raw scores to honest probabilities. Two methods dominate:
- Platt scaling — fit a logistic regression \(\hat p = \sigma(a\,s + b)\) on the raw score \(s\). Cheap, smooth, great when the distortion is roughly sigmoidal (the classic AdaBoost case).
- Isotonic regression — fit a free-form, non-decreasing step function from score to probability. More flexible, needs more data, can overfit on small sets.
\[\hat p = \frac{1}{1 + e^{-(a\,s + b)}}\]
In words: squash the raw score through a logistic curve whose slope \(a\) and offset \(b\) are tuned so the stated probabilities match observed frequencies.
Also written: \(\hat p = \sigma(a s + b)\) with \(\sigma\) the logistic/sigmoid function — the same link as logistic regression, but applied to the model’s own output instead of to features.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
Xtr, Xcal, ytr, ycal = train_test_split(X, y, test_size=0.25, random_state=0)
rf = RandomForestClassifier(n_estimators=300).fit(Xtr, ytr)
cal = CalibratedClassifierCV(rf, method="sigmoid", cv="prefit") # Platt; "isotonic" for the step fn
cal.fit(Xcal, ycal) # learn the correction on held-out data, NOT the training rows
p = cal.predict_proba(Xcal)[:, 1]Always calibrate on data the base model did not train on — reusing training rows makes the scores look perfectly calibrated in-sample and silently wrong in production, the same leakage trap as stacking (§10.9). And remember calibration changes the probabilities, not the ranking: it leaves AUC untouched while fixing metrics like log-loss and Brier score (see Model Evaluation & Tuning) that actually depend on the numbers.
10.11 — Isolation forests (ensembles for anomaly detection)
Every ensemble so far has predicted something — a class or a number. Isolation forests repurpose the random-tree machinery for a different job entirely: unsupervised anomaly detection, finding the weird rows when nobody told you which rows are weird. The reframing is clever. Instead of asking “what class is this point,” it asks “how hard is this point to isolate” with random splits.
The intuition is a game of cut-the-data. Repeatedly pick a random feature and a random cut value, slicing the dataset into smaller and smaller boxes, until each point sits alone in its own box. An outlier — sitting far out in empty space — gets fenced off after just a few cuts: there is nothing around it, so almost any random slice separates it. A normal point, buried in a dense crowd, needs many cuts before it is finally alone, because every slice still leaves neighbors in its box. So the average number of cuts to isolate a point is itself the anomaly score — short path means anomaly, long path means normal. No labels, no distance metric, no density estimate required.
The score is normalized against the expected path length \(c(n)\) in a random binary tree, so it lands in \([0,1]\) regardless of dataset size:
\[s(x) = 2^{-\,\dfrac{\mathbb{E}[h(x)]}{c(n)}}\]
In words: take a point’s average isolation depth across the trees, divide by the depth you’d expect for a typical point, and flip it so that shallow isolation (easy to cut off) scores near 1 = anomaly and deep isolation scores near 0 = normal.
Also written: \(s(x)=2^{-\bar h(x)/c(n)}\), where \(\bar h(x)\) is the mean path length over all trees and \(c(n)\approx 2\ln(n-1)+0.577 - 2(n-1)/n\) is the average path length of an unsuccessful binary-search-tree lookup.
from sklearn.ensemble import IsolationForest
import numpy as np
rng = np.random.default_rng(0)
X = np.r_[rng.normal(0, 1, (200, 2)), # 200 normal points
rng.uniform(-6, 6, (10, 2))] # 10 scattered outliers
iso = IsolationForest(n_estimators=200, contamination=0.05, random_state=0).fit(X)
pred = iso.predict(X) # -1 = anomaly, +1 = normal
score = iso.score_samples(X) # lower = more anomalous
print("flagged anomalies:", int((pred == -1).sum()))Because it only does a handful of random cuts per tree and never computes distances, an isolation forest is fast and scales to high dimensions — exactly where distance-based outlier methods (k-NN, LOF) bog down. It is a workhorse for fraud screening, network-intrusion detection, sensor fault-finding, and data-quality checks before training.
The one knob you must set honestly is contamination — your estimate of the fraction of rows that are anomalous — because it sets the cutoff between the −1 and +1 labels. If you only need a ranking of weirdness rather than a hard flag, ignore it and sort by score_samples instead.
10.12 — Bagging vs boosting: the head-to-head
These two are the load-bearing pillars of the whole chapter, and choosing between them comes down to one question: does your base model overfit or underfit? Bagging is the cure for overfitting (high variance); boosting is the cure for underfitting (high bias). Everything else in the table follows from that single distinction.
| Aspect | Bagging (Random Forest) | Boosting (XGBoost / LGBM) |
|---|---|---|
| Trees trained | In parallel, independent | Sequentially, each fixes the last |
| Primarily reduces | Variance | Bias (and some variance) |
| Base learner | Deep, low-bias, overfit trees | Shallow, high-bias stumps |
| Overfitting w/ more trees | Essentially safe | Can overfit — needs learning rate + early stopping |
| Weighting | All trees equal | Trees & data points weighted |
| Parallelizable | Trivially (great for many cores) | Harder (sequential dependency) |
| Tuning effort | Low — robust out of the box | Higher — but higher ceiling |
| Typical accuracy ceiling | Strong | Usually the highest on tabular |
The flowchart below turns this into a decision you can actually follow: diagnose whether a single tree overfits or underfits, pick the matching family, then decide whether you want maximum accuracy (reach for a gradient-boosting library) or want to blend several different model types (reach for stacking).
flowchart TD
Start{Single tree:<br/>overfits or underfits?} -->|overfits = high variance| Bag[Bagging / Random Forest<br/>average independent deep trees]
Start -->|underfits = high bias| Boost[Boosting<br/>sequentially correct errors]
Bag --> Need{Want max accuracy<br/>& willing to tune?}
Boost --> Need
Need -->|yes| GB[Gradient boosting lib<br/>XGBoost / LightGBM / CatBoost]
Need -->|combine diverse models| Stack[Stacking / Voting]
A practical default: start with a random forest to get a strong, no-fuss baseline and a free OOB score, then move to a tuned LightGBM/XGBoost when you need to squeeze out the last few points of accuracy. Use stacking only when you already have several genuinely different good models to blend.
10.13 — Where ensembles win (and where they don’t)
It helps to ground all this theory in the places trees-of-trees actually run. The thread running through every example below is the same: the data lives in a table of rows and columns, the columns mean different things and are on different scales, and the relationship between them is full of sharp thresholds and interactions — exactly what axis-aligned tree splits capture and what a neural net struggles to learn from scratch.
- Credit scoring and fraud detection. Banks score loan default and flag fraudulent transactions with gradient-boosted trees on features like income, balance history, and transaction velocity. Trees handle the mixed numeric/categorical columns natively, shrug off outliers, and — important for regulators — expose feature importances you can audit.
- Click-through-rate and ranking. Ad and search systems rank candidates with boosted trees (LightGBM ships a
LambdaRankobjective built for exactly this). Millions of sparse features, sub-millisecond inference, retrained nightly. - Insurance pricing and risk. Actuarial models on claim frequency and severity lean on boosting for accuracy plus monotonic constraints (force “older car ⇒ higher premium” to hold).
- Scientific tabular data. Genomics, particle physics (the Higgs boson challenge was won with XGBoost), and clinical risk scores — wherever the signal is in a structured table rather than raw pixels or audio.
And the honest flip side — where you should not reach for an ensemble of trees first:
- Images, audio, and raw text. These have spatial or sequential structure that convolutional and transformer networks exploit; trees see only a flat vector and lose. Use deep learning (Chapters 14–18).
- Smooth extrapolation. Trees predict a constant outside the range they saw — they cannot extrapolate a trend. For a clean linear or periodic signal that must extend beyond the training range, a parametric model wins.
- Tiny data with a known functional form. If you have 20 points and physics tells you it’s a straight line, fit the line; an ensemble will just overfit the noise.
The fast heuristic: is your data a spreadsheet? If rows-and-columns with meaningful, heterogeneous features — start with a gradient-boosting library and expect it to be hard to beat. If pixels, waveforms, or token sequences — start with a neural network.
10.14 — Quick reference
| Term / formula | Meaning in one line | When / why it matters |
|---|---|---|
| Weak learner | Model only slightly better than chance (often a stump) | The building block every ensemble stacks into a strong learner |
| Bias–variance split | Error = bias² + variance + noise | Tells you which knob an ensemble can move (bagging→variance, boosting→bias) |
| Crowd-vote tail \(\sum\binom{M}{k}(1-p)^k p^{M-k}\) | Chance a majority of independent voters errs | Shows why many weak voters beat one — if their errors are independent |
| Bagging | Train on bootstrap resamples, then average/vote | Cuts variance of unstable (overfitting) base models only |
| Bootstrap / OOB | Sample \(n\) rows with replacement; ~37% left out | Left-out rows (\(e^{-1}\approx0.368\)) give a free validation score |
| Var floor \(\rho\sigma^2 + \frac{1-\rho}{M}\sigma^2\) | Averaging only kills the uncorrelated slice of variance | Past many trees, lower correlation \(\rho\) — not more trees — is the only lever |
| Random forest | Bagging + random \(m\) features per split | Decorrelates trees (\(m=\sqrt p\) clf, \(p/3\) reg) to drop the variance floor |
| MDI vs permutation importance | Impurity-sum (free, biased) vs shuffle-and-measure (honest) | Use permutation for anything you act on; neither is causal |
| Extra Trees | Random forest + random split thresholds, no bootstrap | More disagreement, faster fit; shines on noisy features |
| Boosting | Sequential learners, each fixes the last’s errors | Cuts bias; needs learning rate + early stopping or it overfits |
| AdaBoost \(\alpha_t=\tfrac12\ln\frac{1-\varepsilon_t}{\varepsilon_t}\) | Reweight rows; trust each stump by its log-odds | First practical booster; fragile to label noise / outliers |
| Gradient boosting \(F_m=F_{m-1}+\nu h_m\) | Fit each tree to the residual (negative gradient) | The dominant tabular method; \(\nu\) trades off vs number of trees |
| Learning rate \(\nu\) | Fraction of each correction tree that is added | Small \(\nu\) + early stopping generalizes best (try 0.05) |
| XGBoost / LightGBM / CatBoost | Regularized 2nd-order / histogram+GOSS / ordered-target boosters | Win most tabular tasks; pick by data size & categorical count |
| Leaf-wise vs level-wise | Split best leaf anywhere vs whole level at a time | Leaf-wise (LGBM) lower loss but overfits faster than level-wise (XGB) |
| Voting (hard/soft) | Majority label vs averaged probability | Soft usually wins — it uses each model’s confidence |
| Stacking | Meta-learner learns to blend base models | Powerful, but must use out-of-fold preds or it leaks |
| Calibration (Platt / isotonic) | Map raw scores to honest probabilities | Forests under-confident, boosters over-confident; fixes log-loss/Brier, not AUC |
| Isolation forest \(s(x)=2^{-\bar h(x)/c(n)}\) | Short random-cut path = anomaly | Fast, high-dim unsupervised outlier detection, no labels needed |
10.15 — Key takeaways
- Ensembles combine many weak learners into one strong predictor; their power comes from the bias–variance decomposition — different ensemble families attack different error terms.
- Bagging trains independent trees on bootstrap samples and averages them, cutting variance. It only helps for unstable (overfitting) base models. Random forests add per-split feature randomness to decorrelate trees — pushing past the \(\rho\sigma^2\) variance floor — and throw in free OOB error and feature importances. Extra Trees randomize the split threshold too, dropping correlation further and training faster.
- Boosting trains learners sequentially, each correcting the last, cutting bias. AdaBoost reweights the data points (and is sensitive to label noise; it’s forward stagewise fitting under exponential loss); gradient boosting fits each tree to the residuals (the negative gradient of the loss).
- XGBoost, LightGBM, CatBoost are the production-grade gradient boosters that dominate tabular ML — via regularization, histogram bins, second-order gradients, and smart categorical handling; leaf-wise growth (LightGBM) is faster but overfits more than level-wise (XGBoost).
- Voting combines diverse models by majority label (hard) or averaged probability (soft); stacking learns the blend with a meta-learner — and must use out-of-fold predictions to avoid leakage.
- Ensemble probabilities need calibration: forests are under-confident, boosters over-confident; fix the numbers (not the ranking) with Platt scaling or isotonic regression on held-out data. Isolation forests reuse random trees for unsupervised anomaly detection — short isolation paths mark the outliers.
- Ensembles of trees dominate tabular/structured data (credit, fraud, ranking, science) but lose to neural networks on images, audio, and text, and cannot extrapolate beyond their training range.
- Rule of thumb: bag what overfits, boost what underfits; control boosting with a small learning rate + early stopping.
10.16 — See also
- Chapter 09 — Classification Algorithms — decision trees, the base learner every ensemble in this chapter is built from.
- Chapter 08 — Regression — the residual-fitting view of gradient boosting builds directly on regression loss.
- Chapter 12 — Model Evaluation & Tuning — cross-validation, early stopping, calibration metrics (Brier, log-loss), and the hyperparameter search these ensembles depend on.
- Chapter 03 — Optimization — gradient descent, the engine gradient boosting runs in function space.
- Chapter 01 — Linear Algebra & Chapter 04 — Probability & Statistics — the bias–variance math and the bootstrap that underpin the whole chapter.
- Chapter 36 — Explainable AI & Interpretability — SHAP and permutation importance for opening up these black-box ensembles.
- Chapter 37 — Causal Inference — why feature importance is not causal effect.
↪ The thread continues → Chapter 11 · 🔮 Clustering & Unsupervised Learning
So far every model learned from labeled answers. Strip the labels away and a new challenge appears — finding hidden structure on your own — the world of clustering and unsupervised learning.
📖 All chapters | ← 09 · 📐 Classification Algorithms | 11 · 🔮 Clustering & Unsupervised Learning →