Kader Mohideen
  • About
  • Blog
  • Projects
  • Health
  • AI & ML Encyclopedia
  • Extra
    • Interview Guide
    • AI Interview Prep
    • Book References
    • Quest for AGI
    • AI Papers
    • Lupus

In this chapter

  • 22.1 — What forecasting is (horizons, baselines, the golden rule)
  • 22.2 — Time series components and stationarity/differencing
  • 22.3 — Autocorrelation (ACF / PACF)
  • 22.4 — Classical models (moving average, exponential smoothing / Holt-Winters, ARIMA/SARIMA)
  • 22.5 — Feature-based ML for time series (lag/window features with gradient boosting)
  • 22.6 — Deep forecasting (RNN/LSTM, temporal CNNs, Transformers, Prophet, DeepAR)
  • 22.7 — State-space models and the Kalman filter
  • 22.8 — Multivariate and probabilistic forecasting
  • 22.9 — Proper time-series cross-validation (no shuffling)
  • 22.10 — Evaluation (MAE, MAPE, RMSE)
  • 22.11 — Hierarchical forecasting and reconciliation
  • 22.12 — Anomalies, missing data, and intermittent demand
  • 22.13 — Foundation models for time series
  • 22.14 — Quick reference
  • 22.15 — Key takeaways
  • 22.16 — See also

Chapter 22 — ⏳ Time Series & Forecasting

📖 All chapters  |  ← 21 · 🔊 Speech & Audio Processing  |  23 · 📚 Large Language Models →

📚 Jump to any chapter

🧮 Mathematical Foundations

  • 01 · 🧮 Linear Algebra
  • 02 · ∂ Calculus & Differentiation
  • 03 · 📉 Optimization
  • 04 · 🎲 Probability & Statistics

🧭 The ML Workflow

  • 05 · 🌐 AI, ML & the Learning Process
  • 06 · 🧹 Data Preprocessing
  • 07 · 🗜️ Dimensionality Reduction

🧩 Classical Machine Learning

  • 08 · 📈 Regression
  • 09 · 📐 Classification Algorithms
  • 10 · 🌳 Ensemble Methods
  • 11 · 🔮 Clustering & Unsupervised Learning
  • 12 · 🎯 Model Evaluation & Tuning

🎲 Probabilistic Models

  • 13 · 🕸️ Probabilistic Graphical Models

🧠 Deep Learning

  • 14 · 🧠 Neural Networks (Core)
  • 15 · 🖼️ Convolutional Neural Networks
  • 16 · 🔁 Recurrent & Sequence Models
  • 17 · ⚡ Attention & Transformers
  • 18 · 🎨 Generative Models

🗣️ Applied AI: Vision, Language, Audio & Time

  • 19 · 👁️ Computer Vision
  • 20 · 💬 Natural Language Processing
  • 21 · 🔊 Speech & Audio Processing
  • 22 · ⏳ Time Series & Forecasting
  • 23 · 📚 Large Language Models
  • 24 · 🌈 Multimodal AI

🕹️ Reinforcement Learning

  • 25 · 🕹️ Reinforcement Learning

🛠️ Applied ML Systems & Industries

  • 26 · 🛒 Recommender Systems
  • 27 · 🚨 Anomaly & Fraud Detection
  • 28 · 🏦 ML Across Industries

🚀 Production, Tooling & Infrastructure

  • 29 · 🔧 MLOps & Deployment
  • 30 · 🚀 AI Infrastructure & Efficient Inference
  • 31 · 🧰 Tools & Frameworks

📚 Classical & Symbolic AI

  • 32 · 🧭 Search & Problem Solving
  • 33 · 📖 Knowledge Representation & Reasoning
  • 34 · 🗺️ Planning, Constraint Satisfaction & Game Playing
  • 35 · 🧬 Evolutionary Computation & Metaheuristics

⚖️ Responsible AI & Frontier

  • 36 · 🔍 Explainable AI & Interpretability
  • 37 · 🧷 Causal Inference
  • 38 · ⚖️ AI Ethics, Fairness & Safety
  • 39 · 🌠 Frontier & Emerging Directions

🎓 Advanced & Specialized Topics

  • 40 · 🔗 Graph Machine Learning
  • 41 · 🤖 Robotics & Autonomy
  • 42 · 📐 Learning Theory
  • 43 · 🔎 Information Retrieval & Data Mining
  • 44 · 🏗️ LLM Systems: Building LLMs from Scratch

🎚️ Post-Training & Fine-Tuning

  • 45 · 🎚️ Post-Training I — Transfer, Fine-Tuning & PEFT
  • 46 · 🏅 Post-Training II — Alignment & Evaluation

🚢 Model Serving & Deployment

  • 47 · 🚢 Model Serving & Deployment in Production

A time series is data indexed by time: a sequence of observations \(y_1, y_2, \dots, y_T\) recorded at successive, usually evenly spaced, instants. Forecasting asks the natural next question — given the past, what comes next? This subfield sits inside Applied AI but borrows heavily from statistics: its defining feature is that order matters and observations are not independent, which breaks almost every assumption that ordinary supervised learning relies on. Sales, electricity demand, server load, heart rate, stock prices, weather — anything that unfolds over time lives here.

🧭 In context: Applied AI built on probability & statistics · used to predict future values from ordered past observations · the one key idea: exploit temporal structure (trend, season, autocorrelation) and never let the future leak into the past.

💡 Remember this: Exploit the temporal structure (trend, season, autocorrelation) but never let any information from after time \(t\) touch a forecast made at \(t\) — and always make your model beat a naive baseline before you trust it.

22.1 — What forecasting is (horizons, baselines, the golden rule)

Before any model, get the vocabulary and the mindset straight. Picture standing on a riverbank watching the water flow past: everything upstream has already happened (your history), you are standing at now (the forecast origin), and you are trying to call what the water will do downstream (the horizon). You can only use water you have already seen — you cannot reach downstream and peek. That single picture is the whole discipline.

A few terms recur in every section, so pin them down once:

  • Forecast origin (\(t\)) — “now,” the last point you are allowed to know.
  • Horizon (\(h\)) — how far ahead you predict. \(\hat y_{t+1}\) is a one-step forecast; \(\hat y_{t+h}\) is an \(h\)-step forecast. Error almost always grows with \(h\).
  • In-sample vs. out-of-sample — fit on history (in-sample), judge on data the model never saw (out-of-sample). Only the second tells you anything about the future.
  • Point vs. probabilistic — a single number versus a whole distribution or interval (Section 22.7).

The intuition for why error grows with the horizon: a one-step forecast leans on a value you actually observed; a ten-step forecast must stack guess upon guess, and each guess feeds the next, so uncertainty compounds. This is why every honest long-horizon forecast comes with a widening band, never a confident single line.

Uncertainty compounds: the forecast band fans out with the horizon observed history forecast origin → horizon h

There are two ways to push a model multiple steps ahead, and they trade off accuracy against speed:

  • Recursive (iterated) — forecast one step, feed that forecast back in as if it were real, forecast the next, and so on. Simple, but errors compound as fake inputs pile up.
  • Direct — train a separate model for each horizon (\(h=1,2,\dots\)), each predicting that far ahead in one shot. No compounding, but more models to fit.

flowchart LR
  subgraph Recursive
    A1["ŷ₍t+1₎"] --> A2["feed back"] --> A3["ŷ₍t+2₎"] --> A4["ŷ₍t+3₎"]
  end
  subgraph Direct
    B0["xₜ"] --> B1["model h=1 → ŷ₍t+1₎"]
    B0 --> B2["model h=2 → ŷ₍t+2₎"]
    B0 --> B3["model h=3 → ŷ₍t+3₎"]
  end

The golden rule, stated once for the whole chapter: never let information from time \(> t\) touch a forecast made at time \(t\). Every leakage bug, every too-good-to-be-true validation score, every model that dazzles in the notebook and dies in production traces back to breaking this one rule. It governs how you build features (Section 22.5), how you validate (Section 22.8), and how you read every plot.

Always start with a naive baseline

The most important model in forecasting is the one that does almost nothing — because it is the bar everything else must clear. Two naive forecasts:

  • Naive (random-walk): tomorrow equals today, \(\hat y_{t+1} = y_t\).
  • Seasonal naive: this period equals the same period one season ago, \(\hat y_{t+h} = y_{t+h-m}\).

In words: the cheapest honest guess is “nothing changes” (or “this December looks like last December”). Also written: the naive forecast is the special case of an \(h\)-step random walk, \(\hat y_{t+h} = y_t\) for all \(h\), with forecast variance growing like \(h\,\sigma^2\).

import numpy as np
y = np.array([10, 12, 11, 13, 15, 14, 16, 18], dtype=float)
naive_next     = y[-1]            # "tomorrow = today"
m = 4
seasonal_naive = y[-m]           # "= same point one season ago"
print(naive_next, seasonal_naive)   # 18.0  14.0

If your gradient-boosted, hyper-tuned, GPU-trained forecaster cannot beat “tomorrow equals today,” it is not adding value — it is adding cost. This baseline reappears as the denominator of MASE in Section 22.9, where beating naive becomes a single number below 1.

Tip

Workflow in one breath: plot the series → compute a naive baseline → decompose and check stationarity → fit a candidate model → validate walk-forward → compare to the baseline on a held-out tail. Skip the baseline and you have no idea whether your model is good or merely complicated.

22.2 — Time series components and stationarity/differencing

The first move in any time-series analysis is to mentally decompose the series into a few interpretable pieces. Think of a song: there’s the slow swell of the whole track (trend), a chorus that comes back every so many bars (seasonality), longer mood shifts that don’t keep a strict beat (cycles), and random crackle (noise). Pulling these apart is what makes the rest of the chapter possible. The classic decomposition splits an observation into four ingredients:

  • Trend (\(T_t\)) — the slow, long-run direction (sales rising year over year).
  • Seasonality (\(S_t\)) — a fixed-period repeating pattern (ice-cream sales peak every summer; traffic peaks every weekday morning).
  • Cyclic (\(C_t\)) — wave-like swings with no fixed period (business cycles, lasting a variable number of years). The key contrast with seasonality is the fixed-versus-variable period.
  • Noise / residual (\(\varepsilon_t\)) — the irregular leftover.

Two combination rules dominate. An additive model assumes the pieces add up, \(y_t = T_t + S_t + \varepsilon_t\), appropriate when the seasonal swing is roughly constant in size. A multiplicative model, \(y_t = T_t \times S_t \times \varepsilon_t\), fits when the swing grows with the level (a 10% holiday bump, not a fixed +500 units). Taking logs turns multiplicative into additive: \(\log y_t = \log T_t + \log S_t + \log \varepsilon_t\).

In words: an observation is the trend at that moment, plus (or times) the season’s push, plus the irregular wobble. Also written: in expectation-free residual form, \(\varepsilon_t = y_t - T_t - S_t\) (additive) or \(\varepsilon_t = y_t / (T_t S_t)\) (multiplicative).

A quick worked feel for the difference: suppose the underlying level climbs \(100 \to 200\) and the seasonal effect is “December is high.” In an additive world December always adds \(+20\), so the peaks read \(120\) then \(220\) — same absolute bump. In a multiplicative world December multiplies by \(1.2\), so the peaks read \(120\) then \(240\) — the bump grows with the level. Seeing seasonal swings fan out like that is your cue to log-transform first.

Here is a small additive series and its decomposition drawn out:

Observed yₜ = Tₜ + Sₜ + εₜ Trend Tₜ (slow rise) Seasonality Sₜ (fixed period = 3) Noise εₜ (irregular)

In practice you rarely peel these apart by hand. The standard tool is STL (Seasonal-Trend decomposition using Loess), which robustly estimates the three components even when the seasonal shape drifts slowly over time:

import numpy as np, pandas as pd
from statsmodels.tsa.seasonal import STL

idx = pd.date_range("2021-01-01", periods=48, freq="MS")
y = pd.Series(np.arange(48) * 0.5                       # trend
              + 5 * np.sin(2 * np.pi * idx.month / 12)  # yearly season
              + np.random.default_rng(0).normal(0, 1, 48), index=idx)

res = STL(y, period=12, robust=True).fit()
# res.trend, res.seasonal, res.resid are aligned Series you can plot or subtract
print(res.seasonal.head(3).round(2))

Stationarity is the property that makes a series predictable in a stable way. Plain version: a stationary series is one whose “personality” never changes — its typical level, how much it bounces around, and how today relates to yesterday all stay the same no matter which stretch of time you look at. Formally, a series is (weakly) stationary if its mean, variance, and autocorrelation do not change over time — the statistical “rules” stay put. A series with a trend is non-stationary (mean drifts); one with growing volatility is non-stationary (variance drifts). Most classical models assume stationarity, so the workflow is: transform the series until it looks stationary, model the stationary version, then invert the transform for the final forecast.

A picture is worth a definition here — stationary on the left, three flavors of non-stationary on the right:

Stationary vs. non-stationary stationary (stable mean & spread) mean drifts (trend) variance grows seasonal (period-dependent rules) → difference / log until the right column looks like the left one

The workhorse transform is differencing: replace each value by its change, \(\nabla y_t = y_t - y_{t-1}\). Differencing kills a linear trend the way a derivative flattens a line. For seasonality of period \(m\), use seasonal differencing \(\nabla_m y_t = y_t - y_{t-m}\). Variance that grows with level is tamed by a log first.

In words: instead of modeling the value, model how much it changed since last step — a straight-line trend becomes a flat constant once you look at the changes. Also written: with the lag (backshift) operator \(B\) where \(B y_t = y_{t-1}\), a first difference is \(\nabla y_t = (1-B)y_t\), and \(d\) differences is \((1-B)^d y_t\); seasonal differencing is \((1-B^m)y_t\).

import numpy as np
# tiny series with a clear upward trend
y = np.array([10, 12, 13, 16, 18, 21, 22, 25], dtype=float)
d1 = np.diff(y)                       # first difference removes the trend
print("first diff:", d1)              # [2 1 3 2 3 1 3]  -> roughly constant, no trend
# seasonal differencing, period m=4, on a different series:
ys = np.array([5, 8, 6, 9, 7, 10, 8, 11], dtype=float)
seasonal = ys[4:] - ys[:-4]           # yₜ - yₜ₋₄
print("seasonal diff:", seasonal)     # [2 2 2 2] -> season removed, flat

A close cousin worth knowing: the Box-Cox transform generalizes the “take a log to tame variance” trick into a tunable knob. It applies \(y^{(\lambda)} = (y^\lambda - 1)/\lambda\) for \(\lambda \ne 0\) and \(\log y\) for \(\lambda = 0\), then a search picks the \(\lambda\) that most stabilizes the variance. Logging is just the \(\lambda = 0\) special case; the general transform handles series whose spread grows in a milder-than-multiplicative way.

The Augmented Dickey-Fuller (ADF) test formalizes the eyeball check: its null hypothesis is “the series has a unit root (is non-stationary),” so a small p-value (say < 0.05) lets you reject non-stationarity and proceed. A useful companion is the KPSS test, whose null is the opposite (“the series is stationary”); running both guards against being fooled — if ADF says non-stationary and KPSS says non-stationary, you almost certainly need to difference.

from statsmodels.tsa.stattools import adfuller, kpss
import numpy as np
rng = np.random.default_rng(0)
trend = np.cumsum(rng.normal(0.3, 1, 200))         # random walk + drift (non-stationary)
print("ADF p =", round(adfuller(trend)[1], 3))     # large p -> can't reject unit root
print("ADF p (differenced) =", round(adfuller(np.diff(trend))[1], 3))  # small p -> stationary
Tip

Rule of thumb: difference once and re-check before differencing again. One regular difference removes a linear trend; one seasonal difference removes a season. Over-differencing injects artificial negative autocorrelation and inflates variance — more is not better.

22.3 — Autocorrelation (ACF / PACF)

Because time-series observations are correlated with their own past, the central diagnostic is the autocorrelation function (ACF): the correlation between the series and a lagged copy of itself. Intuition: slide a tracing-paper copy of the series sideways by \(k\) steps and ask “how well does the shifted copy still line up with the original?” At lag \(k\),

\[\rho_k = \frac{\sum_{t=k+1}^{T}(y_t-\bar y)(y_{t-k}-\bar y)}{\sum_{t=1}^{T}(y_t-\bar y)^2}.\]

In words: the fraction of the series’ total wiggle that today’s value shares with the value \(k\) steps ago. Also written: as a ratio of covariance to variance, \(\rho_k = \gamma_k / \gamma_0\), where \(\gamma_k = \tfrac{1}{T}\sum (y_t-\bar y)(y_{t-k}-\bar y)\) is the lag-\(k\) autocovariance.

\(\rho_0 = 1\) always (a series correlates perfectly with itself). A slow, gradual decay of \(\rho_k\) signals a trend; a spike at lag \(m\) and its multiples signals seasonality of period \(m\).

The partial autocorrelation function (PACF) at lag \(k\) measures the correlation between \(y_t\) and \(y_{t-k}\) after removing the influence of the intermediate lags \(y_{t-1},\dots,y_{t-k+1}\). The intuition: lag-2 correlation might exist only because lag-1 carries it forward; PACF strips out that pass-through and shows the direct link. (Everyday version: your grandfather influences you mostly through your parent — PACF asks how much grandpa matters once you already know the parent.)

Why two functions? Together they fingerprint the model order. The textbook rule:

Pattern ACF PACF Suggests
AR(p) tails off gradually cuts off after lag \(p\) autoregressive, order \(p\)
MA(q) cuts off after lag \(q\) tails off gradually moving-average, order \(q\)
White noise all ~0 all ~0 nothing to model

A worked ACF by hand on a tiny series:

import numpy as np
y = np.array([2, 4, 6, 5, 7, 9, 8, 10], dtype=float)
y = y - y.mean()
def acf(y, k):
    return np.sum(y[k:] * y[:-k]) / np.sum(y**2)
for k in range(1, 4):
    print(f"lag {k}: rho = {acf(y, k):.3f}")
# lag 1: 0.486   lag 2: 0.150   lag 3: -0.131  -> positive lag-1, fading: trend-ish

In real work you call a library and read the plot rather than the numbers:

import numpy as np
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.stattools import acf, pacf

y = np.sin(np.arange(120) * 2 * np.pi / 12) + np.random.default_rng(0).normal(0, 0.3, 120)
print("ACF lags 0-4:", np.round(acf(y, nlags=4), 3))
print("PACF lags 0-4:", np.round(pacf(y, nlags=4), 3))
# plot_acf(y); plot_pacf(y)  # adds the ±1.96/√T significance bands automatically

Significance is judged against confidence bands at roughly \(\pm 1.96/\sqrt{T}\); spikes poking outside the band are “real.” With \(T=8\) above, the band sits near \(\pm 0.69\), so even the lag-1 value of \(0.49\) is inside it — a healthy reminder that short series rarely give statistically firm autocorrelations, and you should not over-read a wiggly ACF on a handful of points.

ACF — gradual tail-off (AR signature) 95% band 123 456 7 lag k
Warning

ACF and PACF assume a roughly stationary series. Run them on a trending series and you get a slowly decaying ACF that tells you only “there’s a trend” — difference first, then read the plots for model order.

22.4 — Classical models (moving average, exponential smoothing / Holt-Winters, ARIMA/SARIMA)

Moving average smoothing

The simplest forecast is a simple moving average (SMA): predict the next value as the mean of the last \(w\) observations, \(\hat y_{t+1} = \frac{1}{w}\sum_{i=0}^{w-1} y_{t-i}\). It is a low-pass filter — it smooths noise but lags behind trends and treats a point from \(w\) steps ago exactly like yesterday’s.

In words: guess that tomorrow looks like the average of the last few days. Also written: as a rolling update, \(\hat y_{t+1} = \hat y_t + \tfrac{1}{w}(y_t - y_{t-w})\) — drop the oldest point, add the newest.

import numpy as np
y = np.array([10, 12, 11, 13, 15, 14, 16], dtype=float)
w = 3
sma_next = y[-w:].mean()              # forecast for t+1
print(sma_next)                       # (15+14+16)/3 = 15.0

Exponential smoothing and Holt-Winters

Exponential smoothing fixes the SMA’s flat memory by weighting recent points more, with weights that decay geometrically into the past. Think of it as a running impression that you nudge toward each new data point — recent surprises move it a lot, ancient history barely at all. Simple exponential smoothing (SES) maintains a level \(\ell_t\):

\[\ell_t = \alpha\, y_t + (1-\alpha)\,\ell_{t-1}, \qquad \hat y_{t+1} = \ell_t,\]

In words: new level = a slice \(\alpha\) of today’s actual blended with the rest \((1-\alpha)\) of yesterday’s belief. Also written: as an error-correction step, \(\ell_t = \ell_{t-1} + \alpha\,(y_t - \ell_{t-1})\) — move the old level a fraction \(\alpha\) of the way toward the new observation.

where the smoothing parameter \(\alpha \in (0,1)\) tunes reactivity — high \(\alpha\) chases recent data, low \(\alpha\) is sluggish and stable. Unrolling the recursion shows why it is called exponential: \(\ell_t = \alpha y_t + \alpha(1-\alpha)y_{t-1} + \alpha(1-\alpha)^2 y_{t-2} + \dots\) — every past point contributes, but its weight decays by a factor \((1-\alpha)\) each step back. SES has no notion of trend, so its forecast is a flat line.

The error-correction view in motion — each new observation tugs the running level a fraction \(\alpha\) of the way toward it, then it eases over and waits for the next:

SES: the level eases toward each new point by a step of size α observations yₜ running level ℓₜ

The decaying-weight idea, drawn out — each bar is how much one past observation counts toward today’s level:

Exponential smoothing weights (α = 0.5): recent points count most yₜyₜ₋₁yₜ₋₂ yₜ₋₃yₜ₋₄yₜ₋₅ weight = α(1−α)ᵏ

Holt’s linear method adds a trend component \(b_t\) (a second smoothing equation with parameter \(\beta\)). Holt-Winters adds the third piece — a seasonal component \(s_t\) with parameter \(\gamma\) — giving the full triple exponential smoothing that handles level, trend, and season at once. The additive-seasonal forecast \(h\) steps ahead is \(\hat y_{t+h} = \ell_t + h\,b_t + s_{t+h-m}\).

In words: project the current level, push it \(h\) steps along the current trend, then add back the seasonal bump for that point in the cycle. Also written: for the multiplicative-seasonal variant the season scales rather than adds, \(\hat y_{t+h} = (\ell_t + h\,b_t)\, s_{t+h-m}\).

This whole family of “which components do I switch on” choices is formalized as ETS (Error, Trend, Seasonal) state-space models, where each of the three slots is independently None / Additive / Multiplicative — SES is ETS(A,N,N), Holt is ETS(A,A,N), additive Holt-Winters is ETS(A,A,A). The ETS framing lets software pick the best combination by information criteria automatically.

# Simple exponential smoothing from scratch
import numpy as np
y = np.array([10, 12, 11, 13, 15, 14, 16], dtype=float)
alpha = 0.4
level = y[0]
for t in range(1, len(y)):
    level = alpha * y[t] + (1 - alpha) * level
print("forecast t+1:", round(level, 2))   # ~14.0, weighted toward recent points

The same Holt-Winters model in statsmodels, fit and forecast in a few lines:

import numpy as np, pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

idx = pd.date_range("2021-01-01", periods=48, freq="MS")
y = pd.Series(np.arange(48) * 0.4 + 5 * np.sin(2 * np.pi * idx.month / 12) + 10, index=idx)

model = ExponentialSmoothing(y, trend="add", seasonal="add", seasonal_periods=12).fit()
print(model.forecast(6).round(2))   # 6-month-ahead forecast, level+trend+season

ARIMA / SARIMA

ARIMA(p, d, q) is the cornerstone of classical forecasting. It glues three ideas:

  • AR(p) — AutoRegressive: regress \(y_t\) on its own \(p\) past values. \(y_t = c + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \varepsilon_t\).
  • I(d) — Integrated: difference the series \(d\) times to make it stationary (Section 22.2).
  • MA(q) — Moving Average: regress on the \(q\) past forecast errors. \(y_t = c + \varepsilon_t + \theta_1\varepsilon_{t-1} + \dots + \theta_q\varepsilon_{t-q}\).

In words: predict today from a recipe of recent values (AR) and recent surprises (MA), after differencing enough times (I) that the series stops drifting. Also written: compactly with the backshift operator, \(\phi(B)\,(1-B)^d y_t = c + \theta(B)\,\varepsilon_t\), where \(\phi(B)=1-\phi_1 B-\dots-\phi_p B^p\) and \(\theta(B)=1+\theta_1 B+\dots+\theta_q B^q\).

So ARIMA(1,1,0) means: difference once, then model the differenced series with one autoregressive term. You read the candidate \(p\) and \(q\) straight off the PACF and ACF (Section 22.3). SARIMA(p,d,q)(P,D,Q)\(_m\) bolts on a seasonal copy of the same machinery at lag \(m\) — seasonal AR, seasonal differencing, and seasonal MA — to capture yearly or weekly cycles.

A tiny worked AR(1) step makes the machinery concrete. Suppose a fitted model is \(\hat y_t = 2 + 0.6\,y_{t-1}\) and yesterday’s value was \(y_{t-1}=10\). Then the one-step forecast is \(2 + 0.6\times 10 = 8\). To go two steps out you feed the forecast back in: \(2 + 0.6\times 8 = 6.8\). Notice the forecasts march toward the model’s long-run mean \(c/(1-\phi)=2/0.4=5\) — a stationary AR process always reverts to its mean, which is exactly why non-stationary (trending) series must be differenced first.

flowchart LR
  A[Raw series] --> B{Stationary?}
  B -- no --> C["Difference d times (I)"]
  C --> B
  B -- yes --> D["Read ACF/PACF<br/>pick p, q"]
  D --> E["Fit ARIMA p,d,q"]
  E --> F["Seasonal?<br/>add P,D,Q,m → SARIMA"]
  F --> G[Forecast + invert differencing]

Rather than hand-picking \((p,d,q)\), practitioners often let a search routine grade candidates by information criteria (AIC/BIC, which reward fit but penalize extra parameters):

import numpy as np, pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

idx = pd.date_range("2021-01-01", periods=60, freq="MS")
y = pd.Series(np.arange(60) * 0.3 + 4 * np.sin(2 * np.pi * idx.month / 12) + 20, index=idx)

# SARIMA(1,1,1)(1,1,0)_12 — non-seasonal + seasonal differencing and terms
model = SARIMAX(y, order=(1, 1, 1), seasonal_order=(1, 1, 0, 12)).fit(disp=False)
print("AIC:", round(model.aic, 1))
print(model.forecast(6).round(2))   # 6-step forecast (differencing inverted internally)
# pmdarima.auto_arima(y, seasonal=True, m=12) automates the (p,d,q)(P,D,Q) search by AIC
Model Captures Misses Use when
Moving average local level trend, season quick smoothing baseline
SES level trend, season flat, noisy series
Holt-Winters level + trend + season non-linear effects clear seasonal pattern, one series
ARIMA autocorrelation + trend seasonality (use SARIMA) stationary-after-differencing series
SARIMA the above + seasonality exogenous drivers seasonal series, classic baseline
Tip

Holt-Winters versus SARIMA: both nail level/trend/season. Reach for Holt-Winters when you want a fast, robust, few-parameter fit; reach for SARIMA when you want a principled, residual-diagnosable model and possibly exogenous regressors (SARIMAX). Either makes a strong baseline that deep models must beat to justify themselves.

Residual diagnostics — did the model catch everything?

Fitting a classical model is only half the job; you must check that the residuals look like white noise. The logic is simple: a good model has squeezed out every predictable pattern, so whatever’s left should be structureless. If the residual ACF still shows spikes, there’s signal you failed to model. The formal test is the Ljung-Box test, whose null is “residuals are independent up to lag \(k\)”; a large p-value is the good outcome here (you fail to reject independence).

import numpy as np, pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.stats.diagnostic import acorr_ljungbox

idx = pd.date_range("2021-01-01", periods=60, freq="MS")
y = pd.Series(np.arange(60)*0.3 + 4*np.sin(2*np.pi*idx.month/12) + 20, index=idx)
res = SARIMAX(y, order=(1,1,1), seasonal_order=(1,1,0,12)).fit(disp=False)
lb = acorr_ljungbox(res.resid, lags=[12], return_df=True)
print(lb.round(3))    # large lb_pvalue -> residuals look like noise -> model is adequate

22.5 — Feature-based ML for time series (lag/window features with gradient boosting)

A different philosophy: stop using a special time-series model and instead reshape the forecasting problem into ordinary tabular supervised learning, then throw a strong regressor — usually gradient-boosted trees (XGBoost, LightGBM; see Ensemble Methods) — at it. This is often the pragmatic winner on real business data with calendars, promotions, and many related series.

The trick is feature engineering the temporal structure into columns. The mental model: each row of a spreadsheet is “a moment in time,” and you fill its columns with everything you’d have known just before that moment — what the value was yesterday, last week, its recent average, whether it’s a holiday — then let an ordinary regressor learn the mapping.

  • Lag features — past values as predictors: \(y_{t-1}, y_{t-7}, y_{t-365}\) (yesterday, last week, last year).
  • Window / rolling features — summaries over a trailing window: rolling mean, std, min, max over the last 7 days.
  • Calendar features — day-of-week, month, is-holiday, is-weekend, encoded so the tree can split on them.
  • Fourier / cyclical features — encode an angle like “hour of day” as a \((\sin, \cos)\) pair so that hour 23 sits right next to hour 0, instead of as a raw integer where 23 and 0 look maximally far apart. This is the cheap way to hand a tree (or a linear model) smooth seasonality.

The target is \(y_t\); each row is one timestamp with its lags and windows filled from strictly past data.

import numpy as np, pandas as pd
y = pd.Series([10,12,11,13,15,14,16,18,17,19], name="y")
df = pd.DataFrame({"y": y})
df["lag1"] = df["y"].shift(1)              # yesterday
df["lag2"] = df["y"].shift(2)
df["roll3_mean"] = df["y"].shift(1).rolling(3).mean()   # mean of last 3, shifted to avoid leak
print(df.dropna().head())
#     y  lag1  lag2  roll3_mean
# 3  13  11.0  12.0       11.00
# 4  15  13.0  11.0       12.00   <- predict y from past-only features

The single most important detail is in that .shift(1) before .rolling(): every feature for time \(t\) must use only data available before \(t\). A rolling mean that includes \(y_t\) itself leaks the answer.

Cyclical encoding in one line, since it’s the most reused trick:

import numpy as np, pandas as pd
hour = pd.Series([0, 6, 12, 18, 23])
sin_h = np.sin(2*np.pi*hour/24)
cos_h = np.cos(2*np.pi*hour/24)
# hour 23 and hour 0 are now neighbors on the unit circle, not 23 apart
print(np.round(sin_h, 2).tolist(), np.round(cos_h, 2).tolist())

Once the table is built, fitting the regressor is the easy part:

import numpy as np, pandas as pd
from lightgbm import LGBMRegressor   # or: from xgboost import XGBRegressor

rng = np.random.default_rng(0)
y = pd.Series(np.arange(300) * 0.1 + 5 * np.sin(np.arange(300) * 2*np.pi/7) + rng.normal(0,1,300))
df = pd.DataFrame({"y": y})
for L in (1, 7, 14):
    df[f"lag{L}"] = df["y"].shift(L)
df["roll7"] = df["y"].shift(1).rolling(7).mean()
df = df.dropna()

X, target = df.drop(columns="y"), df["y"]
cut = int(len(df) * 0.8)                       # time-ordered split, NEVER random
model = LGBMRegressor(n_estimators=200).fit(X.iloc[:cut], target.iloc[:cut])
pred = model.predict(X.iloc[cut:])
print("MAE:", round(np.mean(np.abs(pred - target.iloc[cut:])), 3))

flowchart LR
  A["Series yₜ"] --> B["Build lag cols<br/>yₜ₋₁, yₜ₋₇ ..."]
  A --> C["Rolling stats<br/>mean/std (shifted)"]
  A --> D["Calendar<br/>dow, month, holiday"]
  B --> E["Tabular X, y"]
  C --> E
  D --> E
  E --> F["Gradient boosting<br/>XGBoost / LightGBM"]
  F --> G["ŷₜ"]

The cost of this convenience: trees cannot extrapolate. A boosted model predicting from lag features will never forecast a value higher than anything it saw in training, because a tree leaf outputs a constant. If your series trends upward forever, either model the differenced series (predict the change, not the level) or detrend first.

Trees can’t extrapolate — leaf output is capped at the training max training range true trend tree forecast (flat past max)
Warning

The number-one bug in feature-based forecasting is lookahead leakage: a rolling mean, a target-encoded category, or a fillna computed over the whole series quietly mixes future into past. Compute every feature causally, and validate with time-ordered splits (Section 22.8).

22.6 — Deep forecasting (RNN/LSTM, temporal CNNs, Transformers, Prophet, DeepAR)

When you have many related series and lots of history, deep models can learn shared patterns no single classical fit would find. The intuition for “global” models: instead of fitting one curve per product, you train a single network on thousands of products at once, so the rare product with three months of history borrows the seasonal shape learned from its better-observed cousins. The major families:

RNNs / LSTMs process the sequence step by step, carrying a hidden state that summarizes the past (see Recurrent & Sequence Models). The LSTM gating mechanism lets it remember long-range structure — useful for long seasonal lags — at the cost of slow sequential training.

Temporal Convolutional Networks (TCNs) apply causal convolutions (filters that only look backward, never at future timesteps) stacked with dilations so the receptive field grows exponentially with depth — a few layers can see thousands of steps back, and unlike RNNs they train in parallel.

Dilated causal convolution — receptive field grows with depth output ŷ dilation 2 input

Transformers replace recurrence with self-attention (see Attention & Transformers), letting every timestep attend to every other in one shot — strong at long horizons. Time-series-specific variants (Informer, Autoformer, PatchTST, TFT) add tricks to tame attention’s quadratic cost and inject seasonality. A surprising research finding is that a well-tuned linear model (DLinear) sometimes matches these — a standing reminder to always benchmark fancy against simple.

Prophet (from Meta) is the friendly outlier — not deep at all, but a decomposable curve-fitting model: \(y_t = g(t) + s(t) + h(t) + \varepsilon_t\), an explicit trend \(g\) (with changepoints), Fourier-series seasonality \(s\), and holiday effects \(h\). It is robust, handles missing data, needs little tuning, and gives interpretable components — the go-to for “good enough, fast, with business calendars.”

DeepAR (from Amazon) trains a single autoregressive RNN globally across thousands of related series and outputs a probability distribution for each future step, not a point — naturally giving prediction intervals (Section 22.7).

A minimal LSTM forecaster in PyTorch — the windowing into (input, next-step) pairs is the part unique to forecasting:

import torch, torch.nn as nn, numpy as np

series = np.sin(np.arange(400) * 2 * np.pi / 20).astype("float32")
W = 20                                            # look-back window length
X = np.stack([series[i:i+W] for i in range(len(series)-W)])      # (N, W)
ynext = series[W:]                                                # (N,)
X = torch.tensor(X).unsqueeze(-1)                 # (N, W, 1) -> features per step
ynext = torch.tensor(ynext).unsqueeze(-1)

class LSTMForecaster(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)
    def forward(self, x):
        out, _ = self.lstm(x)                     # out: (N, W, hidden)
        return self.head(out[:, -1])              # use last step's state -> (N, 1)

model = LSTMForecaster()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
for epoch in range(50):                           # tiny demo loop
    opt.zero_grad()
    loss = loss_fn(model(X), ynext)
    loss.backward(); opt.step()
print("final train MSE:", round(loss.item(), 5))

And Prophet for the “fast, calendar-aware, interpretable” case:

import pandas as pd, numpy as np
from prophet import Prophet

idx = pd.date_range("2021-01-01", periods=730, freq="D")
df = pd.DataFrame({"ds": idx,
                   "y": np.arange(730)*0.02 + 5*np.sin(2*np.pi*idx.dayofyear/365) + 20})
m = Prophet(yearly_seasonality=True, weekly_seasonality=True).fit(df)
future = m.make_future_dataframe(periods=30)       # extend 30 days
fc = m.predict(future)
print(fc[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail(3))  # point + interval

flowchart TD
  Q{What do you have?} --> S1[Few series,<br/>clear seasonality]
  Q --> S2[Many related series,<br/>lots of history]
  Q --> S3[Need calendars +<br/>interpretability, fast]
  S1 --> A1[SARIMA / Holt-Winters]
  S2 --> A2[DeepAR / TFT /<br/>LSTM / TCN]
  S3 --> A3[Prophet]

Warning

Deep models are data-hungry and easy to overfit on a single short series. On one series with a few hundred points, a tuned SARIMA or LightGBM usually beats an LSTM. Deep forecasting earns its keep when you have many series to learn shared structure from — that is its whole advantage.

22.7 — State-space models and the Kalman filter

Many of the models above — exponential smoothing, ARIMA, structural time series — are secretly the same object wearing different clothes: a state-space model. The intuition is a tracking radar. There is a hidden “true state” you cannot see directly (an aircraft’s real position and speed), it drifts forward on its own, and all you actually get are noisy blips. A state-space model is just two short sentences about that picture: one says how the hidden thing moves on its own, the other says how a reading relates to it. In symbols:

\[\underbrace{\mathbf{x}_t = F\,\mathbf{x}_{t-1} + \mathbf{w}_t}_{\text{state transition}}, \qquad \underbrace{y_t = H\,\mathbf{x}_t + v_t}_{\text{observation}}\]

In words: the unseen state drifts forward by \(F\) with a little process noise \(\mathbf{w}\), and each reading \(y\) is the state seen through \(H\) plus measurement noise \(v\). Also written: with explicit noise covariances, \(\mathbf{w}_t \sim \mathcal{N}(0, Q)\) and \(v_t \sim \mathcal{N}(0, R)\), where \(Q\) is how unpredictably the state moves and \(R\) is how noisy the sensor is.

The Kalman filter is the optimal recursive recipe for estimating that hidden state as each new observation arrives. It alternates two moves: predict (push the state estimate forward with \(F\), growing the uncertainty) and update (pull the prediction toward the new measurement, shrinking the uncertainty). The pull strength is the Kalman gain \(K_t\) — high when the sensor is trustworthy (\(R\) small), low when it is noisy. It is exactly the error-correction logic of exponential smoothing (Section 22.4), generalized to vectors and made statistically optimal.

flowchart LR
  A["State estimate x̂₍t-1₎"] --> B["PREDICT<br/>x̂ = F·x̂, grow uncertainty"]
  B --> C["UPDATE<br/>blend with yₜ via gain Kₜ"]
  C --> D["Refined x̂ₜ<br/>(smaller uncertainty)"]
  D --> A

The predict→update breathing, animated — the cloud of uncertainty puffs out on predict, then snaps tight on update when a measurement arrives:

Kalman rhythm: predict puffs uncertainty out, update snaps it back true level new reading yₜ estimate x̂

Why care, when statsmodels hides all this? Three payoffs the framing buys you for free: it handles missing observations gracefully (just skip the update step and keep predicting), it naturally produces uncertainty at every step, and it lets you build custom structural models — declare a local-level + local-trend + seasonal state and let the filter fit it. A worked one-dimensional Kalman filter makes the predict/update rhythm concrete:

import numpy as np
# 1-D constant-level tracking: hidden state = true level, noisy readings y
rng = np.random.default_rng(0)
true_level = 10.0
y = true_level + rng.normal(0, 2.0, 20)        # noisy measurements

x, P = 0.0, 1.0          # state estimate and its variance
Q, R = 0.01, 4.0         # process noise (level barely moves), measurement noise
for obs in y:
    # predict
    P = P + Q
    # update
    K = P / (P + R)                  # Kalman gain: trust in the new reading
    x = x + K * (obs - x)            # error-correction step (cf. exponential smoothing)
    P = (1 - K) * P
print("filtered level estimate:", round(x, 2))   # converges near 10 despite noisy y

The same idea, declared structurally in statsmodels, fits a local-level-plus-seasonal model without hand-coding the recursion:

import numpy as np, pandas as pd
from statsmodels.tsa.statespace.structural import UnobservedComponents

idx = pd.date_range("2021-01-01", periods=48, freq="MS")
y = pd.Series(np.arange(48)*0.3 + 4*np.sin(2*np.pi*idx.month/12) + 20, index=idx)
# 'local linear trend' + seasonal(12) — a structural state-space model, fit by Kalman filter
res = UnobservedComponents(y, level="local linear trend", seasonal=12).fit(disp=False)
print(res.forecast(6).round(2))    # forecasts with intervals, missing-data tolerant
Tip

Reach for state-space / Kalman when you have missing or irregularly-sampled data, need online updating as observations stream in, or want to fuse multiple noisy sensors into one estimate (navigation, finance, IoT). It is the principled engine under ETS, ARIMA, and Prophet alike.

22.8 — Multivariate and probabilistic forecasting

So far each forecast was a single number for a single series. Two extensions matter in practice.

Multivariate forecasting uses more than one time-varying input. Distinguish two cases. With exogenous regressors (\(X\) known into the future — price, weather forecast, promotion calendar), you predict one target aided by side inputs: this is the “X” in SARIMAX or extra feature columns for a boosted model. With multiple targets that influence each other (a vector \(\mathbf{y}_t\) of related series), a Vector Autoregression (VAR) regresses the whole vector on its own lagged vectors, \(\mathbf{y}_t = \mathbf{c} + A_1\mathbf{y}_{t-1} + \dots + A_p\mathbf{y}_{t-p} + \boldsymbol{\varepsilon}_t\), capturing cross-series dependencies (electricity demand and temperature feeding each other).

In words: a VAR is “every series predicted from every series’ recent past at once,” so a bump in one can ripple into the forecasts of the others. Also written: stacking the lags into one big regressor, \(\mathbf{y}_t = \mathbf{c} + \sum_{i=1}^{p} A_i\,\mathbf{y}_{t-i} + \boldsymbol{\varepsilon}_t\), where each \(A_i\) is a \(k\times k\) matrix of cross-series coefficients.

A subtle but important warning when relating two series: spurious regression. Two unrelated series that both merely trend upward (ice-cream sales and drowning deaths) will look strongly correlated, because both ride a common rising tide — not because either drives the other. The cure is to difference (or test for cointegration, a genuine long-run equilibrium) before reading any cross-series relationship.

Probabilistic forecasting outputs a distribution — or at least an interval — instead of a point. This is what decisions actually need: a warehouse stocks to the 95th percentile of demand, not the mean. Three common routes:

  • Parametric — predict the parameters of a distribution (DeepAR predicts a Gaussian’s \(\mu_t, \sigma_t\) per step).
  • Quantile regression — directly predict chosen quantiles (10th, 50th, 90th) by minimizing the pinball loss, which penalizes over- and under-prediction asymmetrically per quantile.
  • Conformal / sampling — wrap any model with calibrated intervals, or Monte-Carlo sample many future paths.

The pinball (quantile) loss for a target quantile \(\tau\) is

\[L_\tau(y, \hat q) = \begin{cases} \tau\,(y - \hat q) & \text{if } y \ge \hat q \\ (1-\tau)\,(\hat q - y) & \text{if } y < \hat q \end{cases}\]

In words: charge \(\tau\) per unit when you fell short, and \((1-\tau)\) per unit when you overshot — for a high quantile like \(\tau=0.9\), being too low hurts nine times as much as being too high. Also written: as a single branch-free expression, \(L_\tau(y,\hat q) = \max\big(\tau(y-\hat q),\ (\tau-1)(y-\hat q)\big)\).

A worked pinball-loss feel: suppose you predict the 90th-percentile demand as \(\hat q = 100\) and the actual turns out to be \(y=130\). The pinball loss for quantile \(\tau=0.9\) charges \(\tau\,(y-\hat q) = 0.9\times 30 = 27\) for under-prediction. Had you instead over-predicted with \(\hat q = 160\), the penalty would be \((1-\tau)(\hat q - y) = 0.1\times 30 = 3\). The asymmetry (\(27\) versus \(3\)) is deliberate: for a high quantile, missing below the truth is punished far harder, which is exactly what pushes the predicted \(q\) up to where only ~10% of outcomes exceed it.

Gradient boosting can emit quantiles directly, giving cheap intervals from the same feature table as Section 22.5:

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = X[:, 0] * 2 + rng.normal(0, 1, 400)

# one model per quantile; together they form a predictive interval
models = {q: GradientBoostingRegressor(loss="quantile", alpha=q).fit(X, y)
          for q in (0.1, 0.5, 0.9)}
lo, mid, hi = (models[q].predict(X[:1]) for q in (0.1, 0.5, 0.9))
print(f"80% interval: [{lo[0]:.2f}, {hi[0]:.2f}]  median {mid[0]:.2f}")

A point forecast with an interval, drawn out:

Probabilistic forecast: median + 80% interval (fan widens with horizon) observed past forecast → 80% interval

A forecast interval that is honest should contain the truth about as often as it claims — an 80% interval should cover ~80% of realized values. That calibration check (coverage) is the right way to score uncertainty, alongside proper scoring rules like the CRPS (Continuous Ranked Probability Score).

Tip

Even when you only report a point forecast, fit a probabilistic model and report the median — the interval width is a free, invaluable signal of when the model is unsure (long horizons, regime shifts) and when downstream decisions should hedge.

22.9 — Proper time-series cross-validation (no shuffling)

Here is the cardinal rule, and the one most often broken: you may never shuffle a time series for validation. Picture studying for tomorrow’s exam by peeking at tomorrow’s answer key — that is exactly what random shuffling does when it lets a fold train on data from after the test point. Standard k-fold cross-validation randomly assigns rows to folds, so the model peeks at the future. The resulting score is fantastically optimistic and collapses in production. In forecasting, validation must always train on the past and test on the future.

The correct scheme is rolling-origin (also called walk-forward or time-series split) cross-validation: pick an origin, train on everything up to it, test on the next block, then roll the origin forward and repeat. Two flavors: expanding window (training set grows each fold, keeping all history) and sliding window (training set is a fixed-width window that moves, useful when old data goes stale after a regime change).

Expanding-window walk-forward CV — train past, test future, never shuffle Fold 1 Fold 2 Fold 3 Fold 4 train (past) test (future) time →
import numpy as np
# manual expanding-window splits — note test always sits AFTER train
y = np.arange(12)
n_splits, test_size = 4, 2
for i in range(n_splits):
    train_end = 4 + i * test_size
    train_idx = np.arange(0, train_end)
    test_idx  = np.arange(train_end, train_end + test_size)
    print("train", train_idx, "test", test_idx)
# train [0 1 2 3]       test [4 5]
# train [0 1 2 3 4 5]   test [6 7]   <- origin rolls forward, no overlap into past

scikit-learn ships this exact scheme as TimeSeriesSplit, with an optional gap to purge leaky window features:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

y = np.arange(12)
tscv = TimeSeriesSplit(n_splits=4, test_size=2, gap=1)   # gap=1 purges the boundary
for tr, te in tscv.split(y):
    print("train", y[tr], "test", y[te])
# train always precedes test; the gap drops the point straddling the split

Two extra precautions: insert a gap (purge) between train and test if features use long windows, so a rolling feature can’t straddle the boundary; and when tuning hyperparameters, keep a final hold-out at the very end that you touch only once.

Warning

sklearn.model_selection.cross_val_score(..., shuffle=True) on time series is the silent killer of forecasting projects: it reports a great CV score that has no relationship to real future performance. Use TimeSeriesSplit and order your data by time first.

22.10 — Evaluation (MAE, MAPE, RMSE)

A forecast is only as trustworthy as the metric you grade it with. All compare predicted \(\hat y_t\) to actual \(y_t\) over a test horizon, but they reward different behaviors.

\[\text{MAE} = \frac{1}{n}\sum_t |y_t - \hat y_t|, \qquad \text{RMSE} = \sqrt{\frac{1}{n}\sum_t (y_t - \hat y_t)^2}, \qquad \text{MAPE} = \frac{100\%}{n}\sum_t \left|\frac{y_t - \hat y_t}{y_t}\right|.\]

In words: MAE is the typical miss in raw units; RMSE is a miss-size that counts big errors extra; MAPE is the typical miss as a percent of the truth. Also written: with the error \(e_t = y_t - \hat y_t\), these are \(\text{MAE}=\operatorname{mean}|e_t|\), \(\text{RMSE}=\sqrt{\operatorname{mean}(e_t^2)}\) (the \(\ell_2\) norm of \(e\) over \(\sqrt n\)), and \(\text{MAPE}=100\%\cdot\operatorname{mean}|e_t/y_t|\).

  • MAE (Mean Absolute Error) — average size of the error, in the data’s own units. Robust to outliers; easy to explain (“off by 12 units on average”).
  • RMSE (Root Mean Squared Error) — squares errors before averaging, so it punishes large misses disproportionately. Same units as the data. Use it when one big error is much worse than several small ones.
  • MAPE (Mean Absolute Percentage Error) — error as a percentage of the actual, so it is scale-free and comparable across series. But it explodes when \(y_t \approx 0\) and is asymmetric (it penalizes over-forecasting more than under). For series with zeros, prefer sMAPE or MASE (error scaled by a naive baseline’s error — MASE < 1 means you beat naive).

A subtle but consequential fact: which metric you optimize determines what point of the distribution you forecast. Minimizing squared error (RMSE) pulls predictions toward the conditional mean; minimizing absolute error (MAE) pulls them toward the median. For a right-skewed demand series these differ, so a model “tuned for MAE” will systematically sit lower than one “tuned for RMSE” — pick the metric that matches the decision, not by habit.

A worked comparison on five points:

import numpy as np
y    = np.array([100, 50, 200, 10, 150], dtype=float)
yhat = np.array([110, 45, 170, 14, 150], dtype=float)
err  = y - yhat
mae  = np.mean(np.abs(err))
rmse = np.sqrt(np.mean(err**2))
mape = np.mean(np.abs(err / y)) * 100
print(f"MAE={mae:.2f}  RMSE={rmse:.2f}  MAPE={mape:.1f}%")
# MAE=11.80  RMSE=14.97  MAPE=14.5%
# RMSE > MAE: the single 30-unit miss on the 200 point is amplified by squaring
# the 4-unit miss on y=10 contributes 40% MAPE alone -> small denominators dominate MAPE

The same metrics come straight from scikit-learn, plus MASE — the one that bakes in the naive-baseline comparison:

import numpy as np
from sklearn.metrics import mean_absolute_error, root_mean_squared_error

y    = np.array([100, 50, 200, 10, 150], dtype=float)
yhat = np.array([110, 45, 170, 14, 150], dtype=float)
print("MAE :", round(mean_absolute_error(y, yhat), 2))
print("RMSE:", round(root_mean_squared_error(y, yhat), 2))

# MASE = your MAE / the MAE of a one-step naive forecast (ŷ_t = y_{t-1})
naive_mae = np.mean(np.abs(np.diff(y)))           # error of "tomorrow = today"
mase = mean_absolute_error(y, yhat) / naive_mae
print("MASE:", round(mase, 2), "(<1 beats naive)")
Metric Units Outlier sensitivity Scale-free? Watch out for
MAE data units low no —
RMSE data units high (squares) no one outlier can dominate
MAPE percent medium yes blows up near \(y=0\); asymmetric
MASE ratio low yes needs a seasonal naive baseline
Tip

Always report a naive baseline alongside your metric: last value (\(\hat y_{t+1}=y_t\)) or seasonal naive (\(\hat y_{t+1}=y_{t+1-m}\)). If your fancy model can’t beat “tomorrow equals today,” it is not adding value — and MASE bakes exactly this comparison into a single number.

22.11 — Hierarchical forecasting and reconciliation

Real forecasting problems are rarely one lonely series — they come in hierarchies. Total company sales = sum of regions = sum of stores = sum of products. The catch: if you forecast each level independently, the numbers won’t add up. The store forecasts might sum to something different from the region forecast, which embarrasses everyone in the planning meeting. Coherence — children summing to their parent — is the property you want, and reconciliation is how you enforce it.

The intuition is a courtroom with conflicting witnesses: the top-level forecast and the bottom-level forecasts each have a story, and reconciliation finds the single set of numbers that respects the sum constraints while staying as close as possible to everyone’s testimony. The classic approaches:

  • Bottom-up — forecast the leaves, sum upward. Simple and coherent, but noisy leaves make a noisy total.
  • Top-down — forecast the total, split down by historical proportions. Stable at the top, but loses leaf-level signal.
  • Optimal (MinT) reconciliation — forecast every level independently, then adjust all of them jointly to be coherent while minimizing the variance of the reconciliation error. This is the modern default because it borrows strength across the whole tree.

flowchart TD
  T["Total"] --> R1["Region A"]
  T --> R2["Region B"]
  R1 --> S1["Store 1"]
  R1 --> S2["Store 2"]
  R2 --> S3["Store 3"]
  R2 --> S4["Store 4"]

Formally, let \(\hat{\mathbf{y}}\) stack the base forecasts for all nodes and let the summing matrix \(S\) encode “which leaves add into which node.” A reconciled forecast is \(\tilde{\mathbf{y}} = S G \hat{\mathbf{y}}\), where \(G\) maps base forecasts to a coherent set.

In words: take everybody’s independent forecast, run it through a correction \(G\), then re-sum with \(S\) so every parent exactly equals its children. Also written: bottom-up is the special case \(G = [\,0 \mid I\,]\) (keep only the leaf forecasts, discard the rest), so \(\tilde{\mathbf{y}} = S\,\hat{\mathbf{y}}_{\text{leaves}}\).

A tiny coherence check makes the point. Suppose two stores forecast \(\hat y_1 = 30\) and \(\hat y_2 = 50\), while the region (forecast independently) says \(\hat y_{\text{region}} = 70\). They disagree — leaves sum to \(80\), parent says \(70\). Bottom-up declares the region to be \(80\); top-down keeps \(70\) and rescales the stores to \(26.25\) and \(43.75\); MinT splits the difference using each forecast’s reliability. All three return a coherent set; they differ in whom they trust.

import numpy as np
# bottom-up reconciliation: 2 stores -> region, by an explicit summing matrix S
S = np.array([[1, 1],    # region = store1 + store2
              [1, 0],    # store1
              [0, 1]])   # store2
base_leaves = np.array([30.0, 50.0])     # the two leaf forecasts
coherent = S @ base_leaves               # [region, store1, store2]
print(coherent)                          # [80. 30. 50.] -> parent now equals its children
assert coherent[0] == coherent[1] + coherent[2]
Tip

Libraries like hierarchicalforecast (Nixtla) and R’s fable implement bottom-up, top-down, and MinT out of the box. Reach for reconciliation whenever forecasts at different aggregation levels are reported together and must agree — retail demand by store/region/national, energy load by feeder/substation/grid.

22.12 — Anomalies, missing data, and intermittent demand

Textbook series are clean and evenly spaced; real ones have gaps, spikes, and stretches of pure zero. A model is only as good as the preprocessing in front of it, so three messy realities deserve their own playbook.

Missing values and irregular spacing. A skipped sensor reading or a holiday closure leaves a hole. Naively forward-filling can fabricate trends; the safer defaults are time-aware interpolation (respecting the actual timestamps, not row positions) for short gaps, and an explicit missing indicator column for models that can use it. For genuinely irregular timestamps, resample to a fixed grid first.

import pandas as pd, numpy as np
idx = pd.date_range("2024-01-01", periods=8, freq="D")
y = pd.Series([10, np.nan, np.nan, 16, 18, np.nan, 22, 25], index=idx)
filled = y.interpolate(method="time")        # respects timestamp spacing, not row index
print(filled.round(1).tolist())              # [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 25.0]

Anomalies and outliers. A one-off spike (a data-entry error, a flash sale) can wreck a fitted trend or seasonal estimate. The intuition: fit a quick model, look at the residuals, and flag points whose residual is implausibly large (say beyond 3 robust standard deviations). Decide deliberately whether each is signal to keep (a real promotion you’ll see again) or noise to clip before fitting. Anomaly detection in temporal data gets its own treatment in Anomaly & Fraud Detection.

import numpy as np
y = np.array([10, 11, 12, 95, 13, 14, 12, 11], dtype=float)   # 95 is the spike
med = np.median(y)
mad = np.median(np.abs(y - med)) + 1e-9       # robust spread, ignores the outlier itself
z = 0.6745 * (y - med) / mad                  # modified z-score
print("flagged indices:", np.where(np.abs(z) > 3.5)[0])   # [3] -> the 95

Intermittent / count demand. Many real series — spare parts, rare SKUs — are mostly zeros with occasional small counts. Squared-error models and percentage metrics both break here (MAPE divides by zero; a smoother predicts a meaningless fractional demand). The classic specialist is Croston’s method, which forecasts two things separately — the size of nonzero demands and the gap between them — then combines them.

Intermittent demand — mostly zero, sparse spikes (Croston territory) mostly zeros between sparse, bursty demands → forecast size & interval separately
Warning

For intermittent series, never grade with MAPE (division by zero) and be wary of MAE rewarding the all-zero forecast. Use MASE against a naive baseline, and report demand-over-a-lead-time rather than per-period point forecasts.

22.13 — Foundation models for time series

The newest shift mirrors what happened in language: instead of fitting a fresh model on your series, you reach for a pretrained foundation model that has already seen billions of timestamps across electricity, traffic, finance, weather, and retail, and ask it to forecast yours — often zero-shot, with no training on your data at all. The intuition is transfer learning: just as a language model that read the whole web can answer a question about a topic it never saw labeled, a time-series model that absorbed enough series learns the universal grammar of trends, seasonalities, and spikes, then applies it to a series it has never met.

A few notable families (as of the mid-2020s): TimesFM (Google), Chronos (Amazon, which tokenizes numeric values and reuses a language-model backbone), Moirai (Salesforce), and Lag-Llama. They differ in architecture but share the recipe: a large Transformer pretrained on a huge, diverse corpus of series, then served for forecasting via a context window of your recent history.

flowchart LR
  A["Huge diverse corpus<br/>(many domains)"] --> B["Pretrain large<br/>Transformer once"]
  B --> C["Your series'<br/>recent history"]
  C --> D["Zero-shot forecast<br/>(no fitting)"]
  D --> E["Optional fine-tune<br/>if you have data"]

A zero-shot forecast in a handful of lines (Chronos via Hugging Face):

import torch, numpy as np
from chronos import ChronosPipeline    # pip install chronos-forecasting

pipe = ChronosPipeline.from_pretrained("amazon/chronos-t5-small",
                                       device_map="cpu", torch_dtype=torch.float32)
context = torch.tensor(np.sin(np.arange(120) * 2*np.pi/12), dtype=torch.float32)
# predict 12 steps, drawing sample paths -> a full predictive distribution
forecast = pipe.predict(context, prediction_length=12, num_samples=20)  # (1, 20, 12)
median = np.median(forecast[0].numpy(), axis=0)
print("zero-shot median forecast:", np.round(median, 2))

When do these win? They shine for cold-start series with little history, for rapid prototyping (“a decent forecast in one line”), and as a strong zero-effort baseline. When do classical methods still win? On a single well-behaved series with clear seasonality and ample history, a tuned SARIMA or LightGBM is often more accurate, far cheaper to run, and fully interpretable. The honest summary: foundation models lower the floor (a reasonable forecast with no expertise) more than they raise the ceiling.

Warning

Foundation forecasters are heavy and can hallucinate spurious seasonality on series unlike anything in their training mix. Always benchmark them against the naive baseline and a classical model on your held-out tail before trusting them — the golden rule of Section 22.1 does not bend for big models.

22.14 — Quick reference

Term / formula Meaning in one line When / why it matters
Golden rule No info from time \(> t\) in a forecast made at \(t\) Every leakage bug and too-good CV score traces here
Naive / seasonal naive \(\hat y_{t+1}=y_t\) / \(\hat y_{t+h}=y_{t+h-m}\) The bar every model must clear; denominator of MASE
Recursive vs. direct Feed forecasts back vs. one model per horizon Recursive is simple but compounds error; direct avoids it
Decomposition \(y_t = T_t + S_t + \varepsilon_t\) (or \(\times\)) Split trend/season/noise (STL) before modeling
Stationarity Mean, variance, autocorrelation constant over time Classical models assume it; transform until it holds
Differencing \(\nabla y_t = y_t - y_{t-1}\); seasonal \(\nabla_m = y_t - y_{t-m}\) Removes trend / season; difference once, re-check
ADF / KPSS Tests with opposite nulls for a unit root Both flag non-stationary → you must difference
ACF / PACF Self-correlation; direct self-correlation Fingerprint AR\((p)\)/MA\((q)\) order; need stationarity first
SES \(\ell_t = \alpha y_t + (1-\alpha)\ell_{t-1}\) Flat, noisy series; weights decay exponentially
Holt-Winters / ETS Level + trend + season smoothing Clear seasonal pattern, fast few-parameter fit
ARIMA\((p,d,q)\) / SARIMA AR + differencing + MA (+ seasonal copy) Stationary-after-differencing series; classic baseline
Lag / window / Fourier features Past values, rolling stats, \((\sin,\cos)\) cycles Reshape forecasting into tabular ML; .shift() to avoid leak
State-space + Kalman Hidden state + noisy reading; predict→update Missing / streaming / multi-sensor data, free uncertainty
Pinball loss \(L_\tau = \max(\tau e,(\tau-1)e)\) Train quantiles for probabilistic intervals
Walk-forward CV Train past, test future, roll origin The only valid time-series CV; never shuffle
MAE / RMSE / MAPE / MASE Mean abs / RMS / % / naive-scaled error RMSE→mean, MAE→median; MASE \(<1\) beats naive
Reconciliation (MinT) Adjust all levels to be coherent Hierarchical forecasts must sum up across levels
Croston’s method Forecast demand size and gap separately Intermittent / mostly-zero count series
Foundation model Pretrained Transformer, zero-shot forecast Cold-start series, fast prototyping baseline

22.15 — Key takeaways

  • Forecasting lives by one golden rule: never let information from after time \(t\) touch a forecast made at \(t\). Fix the horizon, expect error to grow with it, and always start from a naive baseline.
  • Decompose every series into trend, seasonality, cyclic, noise (STL in practice); most classical models need stationarity, achieved by differencing (and logs / Box-Cox for growing variance), checked with ADF/KPSS.
  • ACF and PACF are your model-order fingerprint: ACF cuts off → MA order \(q\); PACF cuts off → AR order \(p\). Check residuals are white noise (Ljung-Box).
  • Classical baselines climb from moving average → exponential smoothing (ETS) → Holt-Winters → ARIMA → SARIMA; each adds a component (level, trend, season, autocorrelation).
  • Feature-based ML reshapes forecasting into tabular regression with lag/window/calendar/Fourier features and gradient boosting — pragmatic and strong, but trees cannot extrapolate and leak easily.
  • Deep forecasting (LSTM, TCN, Transformers, Prophet, DeepAR) shines with many related series; on one short series, simpler usually wins.
  • State-space models + Kalman filter unify ETS/ARIMA/structural models and handle missing, streaming, and multi-sensor data with built-in uncertainty.
  • Multivariate adds exogenous regressors (SARIMAX) or cross-series VAR (mind spurious regression); probabilistic forecasting outputs intervals/distributions (pinball loss, quantiles), judged by calibration and CRPS.
  • Hierarchical forecasts must be reconciled (bottom-up, top-down, MinT) so levels add up coherently.
  • Handle real-world mess: interpolate gaps time-aware, flag anomalies by robust residuals, use Croston/MASE for intermittent demand.
  • Never shuffle: validate with walk-forward (rolling-origin) CV (TimeSeriesSplit), train-past / test-future, with a gap to prevent leakage.
  • Grade with MAE / RMSE / MAPE chosen for the use case (RMSE→mean, MAE→median), always against a naive baseline (MASE).
  • Foundation models (Chronos, TimesFM, Moirai) give strong zero-shot forecasts for cold-start series — benchmark them against naive and classical before trusting.

22.16 — See also

  • [Probability & Statistics] — distributions, hypothesis tests (ADF), and the foundations of stationarity and uncertainty.
  • [Recurrent & Sequence Models] — RNN/LSTM/GRU internals that power deep forecasters.
  • [Attention & Transformers] — self-attention behind Informer, Autoformer, TFT, and time-series foundation models.
  • [Ensemble Methods] — gradient boosting (XGBoost/LightGBM) used for feature-based forecasting.
  • [Anomaly & Fraud Detection] — detecting outliers and regime changes in temporal data.
  • [Model Evaluation & Tuning] — cross-validation principles adapted here for time order.
  • [Regression] — the linear-model backbone underneath AR, VAR, and quantile forecasting.

↪ The thread continues → Chapter 23 · 📚 Large Language Models

Predicting the next value in a series and predicting the next word are the same game at scale. Push it to web-scale text and you get the large language model — the defining technology of the era.


📖 All chapters  |  ← 21 · 🔊 Speech & Audio Processing  |  23 · 📚 Large Language Models →

 

© Kader Mohideen