Chapter 35 — 🧬 Evolutionary Computation & Metaheuristics

📖 All chapters | ← 34 · 🗺️ Planning, Constraint Satisfaction & Game Playing | 36 · 🔍 Explainable AI & Interpretability →

📚 Jump to any chapter

🧮 Mathematical Foundations

🧭 The ML Workflow

🧩 Classical Machine Learning

🎲 Probabilistic Models

13 · 🕸️ Probabilistic Graphical Models

🧠 Deep Learning

🗣️ Applied AI: Vision, Language, Audio & Time

🕹️ Reinforcement Learning

25 · 🕹️ Reinforcement Learning

🛠️ Applied ML Systems & Industries

🚀 Production, Tooling & Infrastructure

📚 Classical & Symbolic AI

⚖️ Responsible AI & Frontier

🎓 Advanced & Specialized Topics

🎚️ Post-Training & Fine-Tuning

🚢 Model Serving & Deployment

47 · 🚢 Model Serving & Deployment in Production

Some optimization problems hand you no gradient: the search space is a tangle of discrete choices, the objective is a black-box simulation, or the landscape is riddled with local optima that would trap any hill-climber. Metaheuristics are general-purpose, gradient-free search strategies — many inspired by nature (evolution, swarms, annealing metal) — that explore such landscapes by sampling, recombining, and perturbing candidate solutions. They rarely promise the global optimum, but they routinely find excellent solutions where calculus cannot even start. This chapter sits in Classical & Symbolic AI: it is the toolbox you reach for when the smooth machinery of Optimization does not apply.

🧭 In context: Classical & Symbolic AI · used for gradient-free, black-box, and combinatorial optimization · the one key idea is explore a population of candidate solutions and let selection + variation steer them toward better regions, no derivatives required.

💡 Remember this: when there’s no usable gradient, keep a set of candidate solutions and let selection plus random variation push them toward better regions — every method here is just a different way to balance exploring widely against refining the best find so far.

The two-knob picture. Before any specific algorithm, hold one image in your head. Every metaheuristic is balancing two opposing urges: exploration (wander widely, sample new regions, don’t commit) and exploitation (zoom in on the best thing found so far, refine it). Too much exploration and you never settle; too much exploitation and you get stuck in the first decent valley you find. Each method in this chapter is, at heart, a different gadget for tuning that one dial over time.

35.1 — Genetic Algorithms

A genetic algorithm (GA) treats optimization like breeding. You keep a population of candidate solutions, each encoded as a chromosome (often a bit-string or vector of numbers called the genotype). A fitness function scores each one. The better-scoring individuals are more likely to be selected as parents; parents crossover to mix their genes into children; children mutate with small random tweaks; and the best individuals are sometimes carried over untouched (elitism). Repeat for many generations and the population drifts toward high-fitness regions — Darwin as an optimizer.

The intuition: crossover exploits good building blocks already discovered, while mutation explores new variations so the search does not stagnate. The balance between these two is the whole game.

Let’s define each operator on a tiny problem: maximize the number of 1-bits in an 8-bit string (the “OneMax” toy problem, optimum = 8).

Selection picks parents in proportion to fitness. Tournament selection (pick \(k\) at random, keep the best) is the common, robust choice.
Crossover (recombination): with single-point crossover, pick a cut point and swap tails.

Parent A: 1 1 0 1 | 0 0 1 0
Parent B: 0 0 1 0 | 1 1 1 1
             cut ↑
Child 1 : 1 1 0 1 | 1 1 1 1   ← A-head + B-tail
Child 2 : 0 0 1 0 | 0 0 1 0   ← B-head + A-tail

Mutation flips each bit with small probability \(p_m\) (e.g. \(1/L\) where \(L\) is the string length), injecting fresh genetic material.
Elitism copies the top few individuals into the next generation verbatim, guaranteeing the best fitness never decreases.

flowchart LR
    A[Initialize<br/>random population] --> B[Evaluate fitness]
    B --> C{Stop?<br/>converged / budget}
    C -->|yes| Z[Return best]
    C -->|no| D[Selection<br/>tournament]
    D --> E[Crossover<br/>recombine parents]
    E --> F[Mutation<br/>random tweaks]
    F --> G[Elitism<br/>carry best over]
    G --> B

Here is the full loop from scratch — short enough to read in one sitting.

import random

L = 8                                    # chromosome length
fitness = lambda c: sum(c)               # OneMax: count the 1s

def random_indiv():  return [random.randint(0, 1) for _ in range(L)]
def tournament(pop, k=3):                # pick k, return the fittest
    return max(random.sample(pop, k), key=fitness)
def crossover(a, b):                      # single-point
    p = random.randint(1, L - 1)
    return a[:p] + b[p:], b[:p] + a[p:]
def mutate(c, pm=1/L):                    # flip each bit w.p. pm
    return [1 - g if random.random() < pm else g for g in c]

pop = [random_indiv() for _ in range(20)]
for gen in range(50):
    pop.sort(key=fitness, reverse=True)
    elite = pop[:2]                      # elitism: keep top 2
    children = []
    while len(children) < len(pop) - 2:
        c1, c2 = crossover(tournament(pop), tournament(pop))
        children += [mutate(c1), mutate(c2)]
    pop = elite + children[:len(pop) - 2]
    if fitness(pop[0]) == L: break       # found the optimum (all 1s)

assert fitness(max(pop, key=fitness)) == L   # OneMax is solvable, must converge
print(gen, max(pop, key=fitness))

Run it and the best fitness climbs from roughly 4 (random 8-bit strings average four 1s) to 8 within a dozen generations. The population doesn’t compute the answer; it evolves toward it.

Tip

Rule of thumb for GA hyperparameters: population size 20–200, mutation rate near \(1/L\), crossover probability 0.6–0.9, and always use elitism — it is one line and it stops the best solution from being lost to unlucky variation.

The fitness-proportional selection formula

Tournament selection is the robust default, but the original GA used fitness-proportional (or “roulette wheel”) selection, which is worth seeing because it makes the breeding metaphor exact: an individual’s chance of being a parent is its slice of the total fitness pie.

\[P(\text{pick } i) = \frac{f_i}{\sum_{j=1}^{N} f_j}\]

In words: the probability of choosing individual \(i\) as a parent equals its fitness divided by the summed fitness of the whole population — fitter individuals get a proportionally bigger slice of the roulette wheel.

Also written: \(P(\text{pick } i) = f_i / (f_1 + f_2 + \dots + f_N)\), i.e. normalize the fitness vector \(\mathbf{f}\) to sum to 1, then sample from that categorical distribution.

A worked slice: four individuals with fitnesses \([6, 3, 1, 0]\) give a total of \(10\), so their selection probabilities are \([0.6, 0.3, 0.1, 0.0]\) — the best individual is picked six times as often as the third, and the dead-last one never. The danger this exposes is exactly why tournament selection won out: if one early individual scores \(100\) while the rest score \(1\), it grabs nearly the entire wheel and diversity collapses on the spot.

The roulette wheel below makes those four slices visible — fitness \(6\) owns more than half the wheel, fitness \(0\) owns nothing.

Warning

Premature convergence is the classic GA failure: one decent individual’s genes flood the population, diversity collapses, and the search freezes on a local optimum. Symptoms: every chromosome looks alike by generation 10. Fixes: larger population, higher mutation, tournament (not fitness-proportional) selection, or explicit diversity preservation (niching/crowding).

35.2 — Genetic Programming

Genetic programming (GP) is a GA whose individuals are programs rather than fixed-length vectors. Instead of evolving a string of bits, you evolve a syntax tree — internal nodes are functions (\(+\), \(\times\), if, sin), leaves are inputs and constants. Fitness measures how well the program’s output matches a target (e.g. how closely an evolved formula fits a dataset). The astonishing part: GP can invent the form of a solution, not just tune its parameters, which makes it a tool for symbolic regression — discovering an equation that explains data (an evolutionary cousin of the fitted models in Regression).

Consider evolving a formula for the points \((1,2),(2,6),(3,12),(4,20)\) — the true law is \(x^2 + x\). A candidate program is the tree below, which encodes \((x \times x) + x\):

To score this candidate, you run the tree on each data point and compare to the target. Here is the fitness computation from scratch — the tree as a nested tuple, a tiny recursive evaluator, and the sum-of-squared-errors (lower is better):

# tree: ('+', ('*', 'x', 'x'), 'x')   ==  x*x + x
def ev(node, x):
    if node == 'x': return x
    op, a, b = node
    a, b = ev(a, x), ev(b, x)
    return a + b if op == '+' else a * b      # only + and * here

tree = ('+', ('*', 'x', 'x'), 'x')
data = [(1, 2), (2, 6), (3, 12), (4, 20)]
preds = [ev(tree, x) for x, _ in data]        # -> [2, 6, 12, 20]
sse   = sum((ev(tree, x) - y)**2 for x, y in data)
print(preds, sse)                             # [2, 6, 12, 20]  0   perfect fit

bad = ('*', 'x', 'x')                          # candidate x*x
bad_sse = sum((ev(bad, x) - y)**2 for x, y in data)
assert sse == 0 and bad_sse == 30              # the right tree wins on fitness

The evolved tree predicts exactly \([2,6,12,20]\), matching the data for a fitness (SSE) of \(0\), while the simpler rival \(x^2\) scores \(30\) — so selection rightly prefers the correct formula. Crossover swaps subtrees between two parent programs (e.g. graft one parent’s \(x \times x\) branch onto another); mutation replaces a random subtree with a fresh one. Because trees vary in size, GP has a chronic disease called bloat — trees grow huge with code that does nothing (e.g. + 0, × 1), wasting evaluation time. Practitioners fight bloat with parsimony pressure: add a small penalty for tree size to the fitness so simpler programs win ties.

Subtree crossover is the engine that lets GP build new programs out of old parts. The figure shows the idea: snip a branch from each parent and swap them, and the child inherits whole working sub-expressions rather than single bits.

The parsimony-adjusted fitness is just the raw error plus a size tax:

\[\text{fitness}(\text{tree}) = \text{error}(\text{tree}) + c \cdot \text{size}(\text{tree})\]

In words: an individual’s score is how badly it fits the data plus a small penalty proportional to how many nodes the tree has — so among two equally accurate programs, the shorter one wins.

Also written (we minimize, so smaller is better): \(\arg\min_{\text{tree}} \big[\, \text{SSE}(\text{tree}) + c\,|\text{tree}|\, \big]\), where \(|\text{tree}|\) counts nodes and \(c\) is a small constant like \(0.01\).

Aspect	Genetic Algorithm	Genetic Programming
Individual	Fixed-length vector / bit-string	Variable-size program tree
Crossover	Swap segments	Swap subtrees
Output	A solution (parameters)	A program / formula
Signature pitfall	Premature convergence	Code bloat

Doing it with a real framework

You rarely hand-roll GP in practice. The gplearn library (scikit-learn-compatible) does symbolic regression out of the box — you hand it data, it hands you a readable formula:

from gplearn.genetic import SymbolicRegressor
import numpy as np

X = np.arange(1, 5).reshape(-1, 1)        # x = 1,2,3,4
y = X.ravel()**2 + X.ravel()              # target law x^2 + x  -> [2,6,12,20]

gp = SymbolicRegressor(population_size=500, generations=20,
                       function_set=('add', 'mul'),
                       parsimony_coefficient=0.01,  # bloat penalty (the c above)
                       random_state=0)
gp.fit(X, y)
print(gp._program)        # e.g.  add(mul(X0, X0), X0)   == x*x + x

Tip

GP shines when you want an interpretable model — a human-readable equation or rule set — rather than a black-box. That readability is its main edge over a neural net for the same symbolic-regression task.

35.3 — Evolution Strategies & Neuroevolution

Evolution strategies (ES) are an evolutionary family built for real-valued vectors, where mutation — not crossover — is the engine. The hallmark is self-adaptive Gaussian mutation: each individual carries its own step size \(\sigma\), and you perturb a parent by adding Gaussian noise \(\mathcal{N}(0, \sigma^2)\). The notation \((\mu, \lambda)\) means \(\mu\) parents generate \(\lambda\) children and the next generation is chosen only from the children; \((\mu + \lambda)\) keeps parents and children together (an elitist variant).

The 1/5 success rule is just an automatic step-size thermostat. Watch how often your random tweaks actually help. If they help often (more than 1 in 5), you’re in easy terrain — take bigger steps to move faster. If they help rarely (fewer than 1 in 5), you’re near the target overshooting — take smaller steps to fine-tune. That’s the whole rule: success is plentiful, grow the step; success is scarce, shrink it. It’s a gradient-free version of an adaptive learning rate. The code below measures that success rate over a sliding window of trials (matching the math, rather than the per-step shortcut):

import numpy as np
f = lambda x: -np.sum((x - 3)**2)        # max at x = [3,3], optimum 0
x, sigma = np.zeros(2), 1.0              # parent + step size
window = []                              # recent success flags (1=improved)
for t in range(200):
    child = x + sigma * np.random.randn(2)        # Gaussian mutation
    success = f(child) > f(x)
    if success: x = child                          # (1+1)-ES: keep child if better
    window.append(success)
    if len(window) == 10:                          # 1/5 rule over a 10-step window
        rate = sum(window) / 10.0
        sigma *= 1.22 if rate > 0.2 else 0.82      # >1/5 succeed -> grow; else shrink
        window = []
assert np.allclose(x, 3, atol=0.3)       # must converge near [3,3]
print(x, sigma)

A modern, scalable cousin is CMA-ES (Covariance Matrix Adaptation), which learns the full covariance of the mutation distribution — effectively discovering the local shape of the landscape and stretching its search ellipse along promising directions. It is one of the strongest off-the-shelf optimizers for expensive, low-dimensional continuous problems.

The natural-gradient ES estimator (intuition behind OpenAI-ES)

Here is the one idea, plainly: you can fake a gradient with random guesses. Stand in fog on a hill — you can’t feel the slope. So take a hundred tiny steps in random directions, note your height after each, and average the directions, weighting each by how much it raised you. Steps that went up vote strongly; steps that went down vote against. The weighted average points uphill. No slope ever measured — just rewards and the directions that earned them. That average is exactly what the formula below computes.

\[\nabla_\theta\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0,I)}\big[f(\theta + \sigma\epsilon)\big] = \frac{1}{\sigma}\,\mathbb{E}_{\epsilon}\big[f(\theta + \sigma\epsilon)\,\epsilon\big]\]

In words: the gradient of the smoothed objective equals the average of each random perturbation weighted by the reward it earned — good directions vote with positive weight, bad ones with negative, and the weighted vote points uphill.

Also written (Monte-Carlo estimate over \(n\) samples): \(\hat g = \dfrac{1}{n\sigma}\sum_{k=1}^{n} f(\theta + \sigma\epsilon_k)\,\epsilon_k\), then update \(\theta \leftarrow \theta + \alpha\,\hat g\).

A two-sample taste, with \(\sigma = 1\): probe \(\epsilon_1 = +1\) earns reward \(f = 8\), probe \(\epsilon_2 = -1\) earns reward \(f = 2\). The estimate is \(\hat g = \tfrac{1}{2}\big(8\cdot(+1) + 2\cdot(-1)\big) = 3\) — positive, so the update nudges \(\theta\) in the \(+\) direction, toward the bigger reward. The high-reward probe out-voted the low-reward one, and we never touched a derivative.

This is the engine of OpenAI’s “Evolution Strategies as a Scalable Alternative to RL” (2017): no backprop through the environment, every worker just evaluates one perturbed policy and reports a scalar reward, and the workers need only share random seeds — embarrassingly parallel.

Neuroevolution applies these ideas to neural networks: the genome is the network. The simplest form evolves the weight vector of a fixed architecture with ES — surprisingly competitive with reinforcement learning on control tasks because it parallelizes trivially (every worker evaluates one perturbation, no backprop needed). The more ambitious form, NEAT (NeuroEvolution of Augmenting Topologies), evolves the structure too — starting minimal and adding nodes and connections, using historical “innovation numbers” to align genomes during crossover and speciation to protect young topologies from being out-competed before they mature.

flowchart TD
    A[Neuroevolution] --> B[Fixed topology<br/>evolve weights only · ES/GA]
    A --> C[Variable topology<br/>NEAT: grow nodes + links]
    C --> D[Innovation numbers<br/>align genomes]
    C --> E[Speciation<br/>protect new structures]

CMA-ES with one library call

The cma package gives you a research-grade CMA-ES optimizer in a few lines — useful whenever you have a noisy, non-differentiable, low-dimensional objective:

import cma
# minimize the 2-D sphere; optimum 0 at the origin
es = cma.CMAEvolutionStrategy([5, 5], 0.5)          # initial mean, initial sigma
es.optimize(lambda x: sum(xi**2 for xi in x))
print(es.result.xbest, es.result.fbest)             # near [0, 0], ~0

Warning

ES needs \(O(d)\)–\(O(d^2)\) samples to make progress in \(d\) dimensions, so weight-evolving a million-parameter network directly is hopeless. Neuroevolution wins on small networks, non-differentiable rewards, or when you want to evolve architecture — not as a replacement for backprop on large supervised nets.

35.4 — Swarm Intelligence: Particle Swarm & Ant Colony

Swarm intelligence drops the breeding metaphor for a social one: many simple agents, following local rules and sharing information, produce intelligent global behavior with no central controller. Two algorithms dominate.

Particle Swarm Optimization (PSO) scatters a swarm of particles through a continuous search space. Each particle has a position \(x_i\) and velocity \(v_i\), and remembers its own best spot ever found, \(p_i\) (personal best), while the swarm shares the best spot anyone has found, \(g\) (global best). Every step, each particle is pulled toward both:

\[v_i \leftarrow w\,v_i + c_1 r_1 (p_i - x_i) + c_2 r_2 (g - x_i), \qquad x_i \leftarrow x_i + v_i\]

In words: a particle’s new velocity keeps a fraction of its old momentum, plus a tug back toward the best spot it has personally seen, plus a tug toward the best spot the whole swarm has seen — then it moves by that velocity.

Also written (component roles named): \(v_i \leftarrow \underbrace{w v_i}_{\text{inertia}} + \underbrace{c_1 r_1 (p_i - x_i)}_{\text{cognitive (self)}} + \underbrace{c_2 r_2 (g - x_i)}_{\text{social (swarm)}}\).

The term \(w v_i\) is inertia (keep cruising), \(c_1\) pulls toward personal memory (cognitive), \(c_2\) toward the swarm’s discovery (social); \(r_1, r_2\) are random in \([0,1]\) to add stochasticity. Particles swirl around, balancing independent exploration against rallying to the best-known region.

The three tugs add up like a small vector sum. The figure decomposes one particle’s update: keep some of where you were heading, lean back toward your own best, and lean toward the swarm’s best — the dashed resultant is the new velocity.

import numpy as np
f = lambda x: np.sum(x**2)               # minimize, optimum 0 at origin
n, d = 20, 2
x = np.random.uniform(-5, 5, (n, d)); v = np.zeros((n, d))
p = x.copy(); g = p[np.argmin([f(xi) for xi in p])]   # personal & global best
for _ in range(60):
    r1, r2 = np.random.rand(n, d), np.random.rand(n, d)
    v = 0.7*v + 1.5*r1*(p - x) + 1.5*r2*(g - x)        # inertia + cognitive + social
    x += v
    better = np.array([f(xi) for xi in x]) < np.array([f(pi) for pi in p])
    p[better] = x[better]                              # update personal bests
    g = p[np.argmin([f(pi) for pi in p])]              # update global best
assert f(g) < 1e-2                        # swarm must reach near the origin
print(g, f(g))

Ant Colony Optimization (ACO) solves combinatorial problems — classically the Traveling Salesman Problem (the kind of constraint-laden combinatorial search covered in Planning, CSP & Game Playing) — by imitating how ants find short paths using pheromone. Each artificial ant builds a tour, choosing the next city probabilistically from a blend of the pheromone \(\tau\) on an edge (learned desirability) and a heuristic \(\eta = 1/\text{distance}\) (greedy nearness):

\[P(i \to j) \;=\; \frac{\tau_{ij}^{\alpha}\,\eta_{ij}^{\beta}}{\sum_{k \in \text{allowed}} \tau_{ik}^{\alpha}\,\eta_{ik}^{\beta}}\]

In words: the chance an ant at city \(i\) steps to city \(j\) is the “attractiveness” of edge \(i\!\to\!j\) (its pheromone raised to power \(\alpha\), times its nearness raised to power \(\beta\)) divided by the summed attractiveness of every city the ant could still go to — so it’s a softmax over edge desirability.

Also written (proportionality form): \(P(i \to j) \propto \tau_{ij}^{\alpha}\,\eta_{ij}^{\beta}\), normalized over the unvisited cities; \(\alpha\) weights experience (pheromone), \(\beta\) weights greed (nearness).

After all ants finish, pheromone evaporates (\(\tau \leftarrow (1-\rho)\tau\), forgetting stale trails) and is reinforced on the edges of the best tours (good routes smell stronger). Over many rounds, short edges accumulate pheromone and the colony converges on a near-optimal tour — a memory stored in the environment, not in any single ant. The diagram shows a 4-city graph after a few rounds: thicker, darker edges carry more pheromone, and the colony’s preferred tour stands out.

A tiny numeric trace makes the choice rule concrete. Suppose an ant sits at city A with pheromone and heuristic values on its three outbound edges, using \(\alpha = \beta = 1\):

# edges A->B, A->C, A->D : (pheromone tau, heuristic eta = 1/dist)
tau = [4.0, 1.0, 2.0]
eta = [0.5, 1.0, 0.25]                  # B is far, C is near, D is mid
w   = [t*e for t, e in zip(tau, eta)]    # tau^1 * eta^1  -> [2.0, 1.0, 0.5]
s   = sum(w)
P   = [round(wi/s, 3) for wi in w]       # normalize to probabilities
print(P)                                 # [0.571, 0.286, 0.143]
assert abs(sum(P) - 1) < 1e-9            # valid distribution
# evaporation then reinforcement of the chosen edge (say A->B got used)
rho = 0.1
tau = [(1-rho)*t for t in tau]           # all trails fade a bit
tau[0] += 1.0                            # winning edge gets fresh pheromone
print([round(t,2) for t in tau])         # [4.6, 0.9, 1.8]

Edge A→B wins most of the probability (0.571) because it has both strong pheromone and decent nearness; after the round, every trail evaporates slightly and the used edge is topped up — so good edges compound over time while unused ones fade.

	PSO	ACO
Problem type	Continuous	Combinatorial (graphs, routes)
Shared memory	Global best position \(g\)	Pheromone trails \(\tau\) on edges
Each agent	Particle with velocity	Ant building a path
Forgetting mechanism	Inertia decay	Pheromone evaporation

Tip

Reach for PSO when the space is continuous and you want something even simpler to implement than a GA; reach for ACO when the problem is a graph/routing problem where partial solutions are built step by step (TSP, vehicle routing, network design).

Where this actually ships

These are not just toys. ACO and PSO underpin real logistics: vehicle-routing software for parcel delivery (which truck visits which stops in what order) routinely uses ACO-style heuristics; PSO tunes antenna array geometries, PID controller gains, and power-system dispatch. The common thread: a black-box simulator scores a candidate configuration, and there is no gradient to follow — exactly the metaheuristic sweet spot.

35.5 — Simulated Annealing & Tabu Search

The previous methods kept populations. The last two trajectory-based metaheuristics track a single current solution and move smartly through its neighborhood — proof that you don’t need a crowd to escape local optima.

Simulated annealing (SA) borrows from metallurgy: heat metal so atoms jiggle freely, then cool it slowly so they settle into a low-energy crystal. SA treats the objective as energy and uses a temperature \(T\) that starts high and cools. At each step it proposes a random neighbor; if it’s better, accept it; if it’s worse, accept it anyway with probability \(\exp(-\Delta E / T)\). Early on (hot \(T\)) it freely accepts downhill moves and roams the landscape; as \(T\) cools, those uphill escapes become rare and it behaves like greedy hill-climbing. The willingness to temporarily get worse is exactly what lets it climb out of local optima.

\[P(\text{accept worse move}) = \exp\!\left(-\frac{\Delta E}{T}\right), \qquad T \leftarrow \alpha T \ (\alpha \approx 0.95)\]

In words: when a proposed move makes things worse by amount \(\Delta E\), accept it anyway with a probability that shrinks as the move gets worse and shrinks as the temperature cools — so big mistakes and cold temperatures both make acceptance rare.

Also written (acceptance threshold form): accept the worse move iff a uniform draw \(u \sim U(0,1)\) satisfies \(u < e^{-\Delta E / T}\); equivalently, accept iff \(-T \ln u > \Delta E\).

A worked number nails the intuition: with a worsening of \(\Delta E = 2\), at a hot \(T = 10\) the acceptance probability is \(e^{-0.2} \approx 0.82\) (almost always taken), but at a cold \(T = 0.5\) it is \(e^{-4} \approx 0.018\) (almost never) — the same bad move that’s freely accepted while hot is firmly rejected once cold.

import random, math
f = lambda x: (x - 2)**2 + 3*math.sin(5*x)   # bumpy: many local minima
x, T = random.uniform(-5, 5), 5.0
best = x
for _ in range(2000):
    nbr = x + random.uniform(-0.5, 0.5)      # random neighbor
    dE = f(nbr) - f(x)
    if dE < 0 or random.random() < math.exp(-dE / T):  # accept better, or worse w.p. exp(-dE/T)
        x = nbr
    if f(x) < f(best): best = x
    T *= 0.995                               # cool down
print(round(best, 3), round(f(best), 3))     # lands in a deep basin, not the first one hit

A ready-made SA lives in scipy.optimize under the name dual annealing, which combines classical SA with a local polish step — handy for bumpy continuous objectives:

from scipy.optimize import dual_annealing
import numpy as np
f = lambda x: (x[0] - 2)**2 + 3*np.sin(5*x[0])   # same bumpy 1-D objective
res = dual_annealing(f, bounds=[(-5, 5)])
print(round(res.x[0], 3), round(res.fun, 3))     # global basin, not the nearest one

Tabu search is greedy hill-climbing made smart with memory. Plain hill-climbing always moves to the best neighbor and gets stuck oscillating around a local optimum. Tabu search forces continued movement by keeping a tabu list — a short-term memory of recently visited solutions (or recent moves) that are forbidden, so the search cannot immediately backtrack. It always takes the best allowed neighbor, even if that means temporarily worsening, and the tabu list pushes it into unexplored territory. An aspiration criterion overrides the ban if a tabu move would yield a new all-time best.

A short hand-trace makes the memory mechanism vivid. Imagine maximizing over states whose neighbor values are listed below, with a tabu tenure of 2 (a move stays forbidden for two iterations):

Step	Current (value)	Best allowed move	Tabu list (forbidden)	Note
0	\(S_0\) (10)	→ \(S_1\) (14)	{}	uphill, accept
1	\(S_1\) (14)	→ \(S_2\) (12)	{\(S_0\)}	best neighbor is worse; \(S_0\) banned so no backtrack
2	\(S_2\) (12)	→ \(S_3\) (18)	{\(S_0,S_1\)}	forced exploration pays off
3	\(S_3\) (18)	→ \(S_4\) (16)	{\(S_1,S_2\)}	\(S_0\) ban expired; keeps climbing elsewhere

At step 1 the only improving move would be back to \(S_0\), but the tabu list forbids it — so the search accepts a worse state \(S_2\), and that detour is exactly what lets it reach the better basin at \(S_3\). The flowchart formalizes the loop:

flowchart TD
    A[Current solution] --> B[Generate neighbors]
    B --> C[Pick best neighbor<br/>NOT on tabu list]
    C --> D{New global best?}
    D -->|yes, even if tabu| E[Aspiration: accept it]
    D -->|no| F[Accept best allowed]
    E --> G[Push move onto tabu list<br/>drop oldest entry]
    F --> G
    G --> A

A minimal tabu loop is short enough to keep in your head — note the tabu list is just a deque with a max length so old bans expire automatically:

from collections import deque

def tabu_search(start, neighbors, value, tenure=3, steps=50):
    cur = best = start
    tabu = deque(maxlen=tenure)               # auto-drops oldest ban
    for _ in range(steps):
        cands = [n for n in neighbors(cur) if n not in tabu]
        if not cands: break
        cur = max(cands, key=value)           # best *allowed* neighbor
        tabu.append(cur)
        if value(cur) > value(best): best = cur
    return best
# ponytail: deque(maxlen=) is the whole tabu mechanism; no custom ring buffer needed

	Simulated Annealing	Tabu Search
Escape mechanism	Probabilistic uphill acceptance	Forbidden recent moves (memory)
Memory	Memoryless (Markov)	Short-term tabu list
Key knob	Cooling schedule \(\alpha, T_0\)	Tabu tenure (list length)
Determinism	Stochastic	Largely deterministic

Warning

SA’s performance lives and dies by its cooling schedule. Cool too fast and you “quench” into a poor local optimum (no time to explore); cool too slowly and you waste an enormous budget. Start with geometric cooling \(T \leftarrow 0.95\,T\) and an initial \(T\) large enough that ~80% of worse moves are accepted at the start — then tune.

35.6 — Fuzzy Logic & Fuzzy Control

Fuzzy logic is a slight outsider in this chapter — it is not a search algorithm but a way of reasoning with vagueness, and it pairs naturally with metaheuristics (which are often used to tune fuzzy systems). Classical Boolean logic forces every statement to be exactly true (1) or false (0): at what temperature does “warm” become “hot”? Fuzzy logic replaces the hard cliff with a degree of truth in \([0,1]\). A temperature of 28°C might be “warm” to degree 0.33 and “hot” to degree 0.33 simultaneously — closer to how humans actually reason.

The bridge is a membership function \(\mu_A(x)\), which maps each value to its degree of belonging in fuzzy set \(A\). Triangular and trapezoidal shapes are common because they are cheap and intuitive. In the chart below the three fuzzy sets are cold (peak at 24°C), warm (triangle \(18 \to 24 \to 30\)), and hot (triangle \(26 \to 32 \to 40\)); the dashed query line at 28°C crosses warm at 0.33 and hot at 0.33.

The triangular membership function itself has a clean closed form, the ramp-up-then-ramp-down shape defined by three corners \(a < b < c\):

\[\mu(x;\,a,b,c) = \max\!\left(0,\; \min\!\left(\frac{x-a}{b-a},\; \frac{c-x}{c-b}\right)\right)\]

In words: belonging climbs linearly from 0 at the left corner \(a\) up to 1 at the peak \(b\), then falls linearly back to 0 at the right corner \(c\), and is flat zero outside \([a,c]\).

Also written (piecewise form): \(\mu(x) = \frac{x-a}{b-a}\) for \(a \le x \le b\), \(\;\frac{c-x}{c-b}\) for \(b \le x \le c\), and \(0\) otherwise.

A fuzzy controller turns vague rules into smooth control in three stages, the Mamdani pipeline:

Fuzzification — convert crisp inputs into membership degrees (28°C → warm 0.33, hot 0.33).
Rule evaluation — apply human-written IF–THEN rules, e.g. IF temp is hot THEN fan is high. The rule fires to the degree its antecedent is true; AND is min, OR is max.
Defuzzification — combine all fired rule outputs back into one crisp number, typically the centroid (center of gravity) of the aggregated output shape.

The pipeline below shows a crisp reading flowing through those three stages and out the other side as one crisp fan speed, with the signal pulsing along the way.

# Triangular membership + a 2-rule fan controller
def tri(x, a, b, c):                      # 0 outside [a,c], peak 1 at b
    if x <= a or x >= c: return 0.0
    return (x - a)/(b - a) if x < b else (c - x)/(c - b)

temp = 28
warm = tri(temp, 18, 24, 30)              # (30-28)/(30-24) = 0.33
hot  = tri(temp, 26, 32, 40)              # (28-26)/(32-26) = 0.33
# rules: warm->fan 40%, hot->fan 90%; defuzzify by weighted centroid
num = warm*40 + hot*90
fan = num / (warm + hot) if (warm + hot) else 0
print(round(warm, 2), round(hot, 2), round(fan, 1))   # 0.33 0.33 65.0
assert abs(fan - 65.0) < 0.5              # both rules equal -> fan midway 40..90

The weighted-centroid defuzzification used above is the discrete center-of-gravity:

\[y^{*} = \frac{\sum_k w_k\, c_k}{\sum_k w_k}\]

In words: the crisp output is the firing-strength-weighted average of each rule’s target value — rules that fire harder pull the answer toward their own recommendation.

Also written (our two-rule case): \(y^{*} = \dfrac{w_{\text{warm}}\cdot 40 + w_{\text{hot}}\cdot 90}{w_{\text{warm}} + w_{\text{hot}}} = \dfrac{0.33\cdot 40 + 0.33\cdot 90}{0.66} = 65\).

Because warm and hot fire equally at 28°C, the controller settles the fan exactly halfway between the two rule targets — 65% — instead of slamming between 40% and 90% at a hard threshold. That smoothness is the whole point. For a second taste, nudge the reading to 29°C: now warm = (30−29)/6 ≈ 0.17 while hot = (29−26)/6 = 0.50, so the fan glides up to (0.17·40 + 0.50·90)/0.67 ≈ 77% — one degree warmer, and the fan eases higher with no jolt.

Why it matters: fuzzy controllers gave robust, human-tunable control for appliances, cameras, subway braking, and cement kilns decades before deep learning — no training data required, just expert rules, and crucially smooth outputs (no jarring on/off switching at a threshold). Metaheuristics enter when you want to learn the membership-function shapes or rule weights automatically: a GA optimizing a fuzzy controller is a classic and powerful hybrid.

Tip

Fuzzy logic is the tool when domain experts can state rules in words (“if the load is heavy and the slope is steep, downshift hard”) but you have little labeled data. It encodes expertise directly and degrades gracefully near boundaries.

35.7 — Constraints, Penalties & Multi-Objective Search

Real problems rarely say “just maximize \(f\).” They say “maximize profit subject to staying under budget and visiting every city exactly once,” or worse, “maximize performance and minimize cost” — two goals that fight each other. Metaheuristics need a way to fold these wrinkles in, because a bit-string mutation or a subtree swap will happily produce infeasible candidates that violate the rules.

Handling constraints. The intuition is a fence with a slope rather than a wall: instead of rejecting infeasible candidates outright (which can leave evolution with nothing to climb), you let them live but dock their fitness in proportion to how badly they cheat. This is the penalty method — the same idea as in classical constrained optimization, just applied to a fitness function:

\[f_{\text{pen}}(x) = f(x) - \lambda \sum_{i} \max\!\big(0,\; g_i(x)\big)\]

In words: start from the raw objective, then subtract a penalty for every constraint that’s violated, scaled by how far over the line the candidate is — small violations cost a little, big violations cost a lot.

Also written (for a maximization problem with violation amounts \(v_i = \max(0, g_i(x))\)): \(f_{\text{pen}}(x) = f(x) - \lambda\, \|\mathbf{v}\|_1\), where \(\lambda\) controls how harshly cheating is punished.

A knapsack example makes it concrete: items worth \([60, 100, 120]\) with weights \([10, 20, 30]\) and a capacity of 50. The candidate “take all three” has value 280 but weight 60 — ten over budget. With \(\lambda = 10\), its penalized fitness is \(280 - 10 \times 10 = 180\), which scores worse than the feasible “take items 2 and 3” (value 220, weight 50, no penalty). Selection then naturally drifts toward feasibility without ever hard-banning the overweight candidate.

Two other tactics worth knowing: repair (after mutation, deterministically fix the candidate back into feasibility — e.g. drop random items until under capacity) and specialized operators (design crossover/mutation that can only produce valid solutions, like an order-crossover for TSP that always yields a legal tour).

Multiple objectives & Pareto fronts. When two goals conflict, there is no single best answer — only a frontier of trade-offs. A solution is Pareto-dominated if some other solution is at least as good on every objective and strictly better on one; the solutions nobody dominates form the Pareto front. The classic algorithm is NSGA-II, which sorts the population into domination layers and breeds from the best layers while preserving spread along the front.

# tiny Pareto filter: keep points not dominated on (perf↑, cost↓)
pts = [(0.9, 30), (0.7, 20), (0.95, 45), (0.6, 18), (0.8, 25)]   # (perf, cost)
def dominated(p, others):
    return any(o[0] >= p[0] and o[1] <= p[1] and o != p for o in others)
front = [p for p in pts if not dominated(p, pts)]
print(sorted(front))      # the non-dominated trade-off set
assert (0.95, 45) in front and (0.6, 18) in front   # both extremes survive

In pymoo (the standard Python multi-objective library) the same idea is one optimizer call: NSGA2 returns the whole front, and you pick the trade-off point that fits your budget after seeing the options — which is exactly how engineers like to choose.

Tip

When your problem has hard constraints, decide early between penalty (simple, but tune \(\lambda\)), repair (fast convergence, needs a fixer), and feasibility-preserving operators (cleanest, most design effort). For conflicting objectives, resist collapsing them into one weighted sum too soon — a Pareto front shows the trade-offs the weighted sum would have hidden.

35.8 — When Metaheuristics Beat Gradient Methods (and the No-Free-Lunch Caveat)

Gradient methods (Chapter 03) are the right default when they apply: given a smooth, differentiable objective, gradient descent and its relatives converge fast and scale to billions of parameters. Metaheuristics earn their keep precisely where those assumptions break.

Reach for a metaheuristic when:

No gradient exists or it’s useless — the objective is a black-box simulation, a discrete/combinatorial space (routing, scheduling, feature subsets), or non-differentiable (e.g. accuracy, which is piecewise-constant).
The landscape is rugged — many local optima and plateaus where a hill-climber stalls; population diversity and uphill acceptance escape traps.
The objective is noisy or expensive and you only get function values, not derivatives.
Dimensionality is modest. Most metaheuristics scale poorly past a few hundred variables — they need many function evaluations and have no gradient to amortize that cost.

flowchart TD
    A[Optimization problem] --> B{Smooth &<br/>differentiable?}
    B -->|yes| C{Gradient cheap<br/>to compute?}
    C -->|yes| D[Use gradient methods<br/>Ch. 03 — fast, scalable]
    C -->|no| E[Consider ES / CMA-ES]
    B -->|no| F{Continuous or<br/>combinatorial?}
    F -->|continuous, rugged| G[PSO / CMA-ES / SA]
    F -->|combinatorial| H[GA / ACO / Tabu]

Now the essential humility. The No-Free-Lunch (NFL) theorem (Wolpert & Macready, 1997) states that averaged over all possible objective functions, every optimization algorithm has identical expected performance. There is no universally best optimizer — including any of the nature-inspired methods in this chapter. Any algorithm that excels on one class of problems must, by conservation, do correspondingly worse on another.

The practical reading is not nihilistic: real problems are not drawn uniformly from all possible functions — they have structure (continuity, locality, modularity). An algorithm performs well exactly when its built-in assumptions match that structure. So NFL is a warning against two things: (1) believing your favorite metaheuristic is “best” in general, and (2) the cottage industry of renaming the same search loop after yet another animal. The skill is matching the method’s inductive bias to your problem, then tuning — not searching for a silver bullet that the math says cannot exist.

Warning

Don’t reach for a genetic algorithm when gradient descent would do. If your loss is differentiable and smooth, a metaheuristic typically needs orders of magnitude more evaluations to reach the same point. Metaheuristics are the tool for when calculus can’t start — not a fashionable substitute for when it can.

35.9 — Quick reference

Term / formula	Meaning in one line	When / why it matters
Explore ↔︎ exploit	The single dial: wander widely vs. refine the current best	The lens for understanding every method here
Genetic algorithm (GA)	Evolve a population by selection + crossover + mutation + elitism	Discrete/combinatorial or black-box objectives over fixed-length encodings
Tournament selection	Pick \(k\) at random, keep the fittest as parent	Robust default; avoids the diversity collapse of roulette selection
Fitness-proportional \(P_i = f_i/\sum_j f_j\)	Parent chance = its slice of total fitness	The exact breeding metaphor; fragile to one dominant individual
Elitism	Copy the top few individuals unchanged into the next generation	One line; guarantees best fitness never decreases
Premature convergence	Diversity collapses, search freezes on a local optimum	The classic GA failure — fix with mutation, larger pop, tournament
Genetic programming (GP)	Evolve program trees (functions + leaves), score by running them	Symbolic regression — invents the form of a solution, interpretable
Parsimony pressure \(\text{err}+c\cdot\text{size}\)	Penalize tree size in fitness	Fights GP code bloat; shorter program wins ties
Evolution strategy (ES)	Mutate real vectors with self-adapting step size \(\sigma\)	Continuous, noisy, non-differentiable, low-dimensional objectives
1/5 success rule	Grow \(\sigma\) if >1/5 of tweaks help, shrink otherwise	Gradient-free adaptive step size
CMA-ES	ES that learns the full covariance of the mutation cloud	Strongest off-the-shelf optimizer for expensive low-\(d\) continuous problems
ES gradient \(\hat g=\frac{1}{n\sigma}\sum f(\theta+\sigma\epsilon_k)\epsilon_k\)	Fake a gradient by reward-weighting random probes	Backprop-free, embarrassingly parallel policy search
NEAT	Neuroevolution that grows network nodes + links (topology)	Evolve small-net structure, not just weights
PSO \(v\leftarrow wv + c_1r_1(p-x)+c_2r_2(g-x)\)	Particles pulled by inertia + personal best + swarm best	Simple continuous optimization; easier to code than a GA
ACO \(P(i\!\to\!j)\propto\tau^\alpha\eta^\beta\)	Ants pick edges by pheromone × nearness; evaporate + reinforce	Graph/routing problems (TSP, vehicle routing) built step by step
Simulated annealing \(P=e^{-\Delta E/T}\)	Accept worse moves with cooling probability	Escape local optima in a single-solution trajectory search
Tabu search	Best allowed neighbor; recent moves forbidden via a `deque`	Memory-driven escape; largely deterministic
Membership \(\mu(x;a,b,c)\)	Degree of truth in \([0,1]\), triangular ramp up/down	Fuzzy logic — smooth expert-rule control with no training data
Penalty \(f-\lambda\sum\max(0,g_i)\)	Dock fitness in proportion to constraint violation	Fold hard constraints into a fitness function softly
Pareto front	Non-dominated trade-offs between conflicting objectives	Multi-objective search (NSGA-II); pick the point after seeing options
No-Free-Lunch theorem	No optimizer beats all others averaged over all functions	Match the method’s bias to your problem; no silver bullet exists

35.10 — Key takeaways

Metaheuristics are gradient-free, general-purpose search strategies for black-box, combinatorial, and multi-modal problems where calculus cannot start; every one of them is a different way to tune the single explore ↔︎ exploit dial.
Genetic algorithms evolve a population via selection, crossover, mutation, and elitism; fitness-proportional selection literally weights picks by \(f_i / \sum_j f_j\), and watch for premature convergence. Genetic programming evolves programs/formulas (scored by running the tree on data) and fights code bloat with parsimony pressure.
Evolution strategies mutate real vectors with self-adapting step sizes via the windowed 1/5 success rule (CMA-ES being the strong modern variant); the ES gradient estimate \(\hat g = \frac{1}{n\sigma}\sum f(\theta+\sigma\epsilon_k)\epsilon_k\) powers scalable, backprop-free policy search, and neuroevolution (e.g. NEAT) evolves network weights and even topology.
Swarm intelligence: PSO swirls particles toward personal and global bests in continuous space; ACO lays pheromone (with evaporation + reinforcement) to solve graph/routing problems via a softmax over edge desirability.
Simulated annealing escapes local optima by accepting worse moves with probability \(e^{-\Delta E/T}\) under a cooling schedule; tabu search uses a forbidden-move memory (a deque) to keep exploring.
Fuzzy logic reasons with degrees of truth via membership functions and IF–THEN rules, giving smooth, expert-driven control without training data (28°C → warm 0.33, hot 0.33 → fan 65% by weighted centroid).
Constraints fold in via penalties, repair, or feasibility-preserving operators; conflicting objectives yield a Pareto front (NSGA-II), not a single answer.
The No-Free-Lunch theorem means no optimizer is universally best — match the method’s inductive bias to your problem’s structure, and prefer gradient methods whenever the objective is smooth and differentiable.

35.11 — See also

Optimization — gradient descent, convexity, and the smooth-objective methods these metaheuristics complement.
Reinforcement Learning — neuroevolution and ES are competitive, gradient-free alternatives for policy search.
Calculus & Differentiation — why differentiability is the dividing line that sends you to gradient methods or metaheuristics.
Search & Problem Solving — classical and heuristic search (A*, hill-climbing) that trajectory-based metaheuristics generalize.
Planning, Constraint Satisfaction & Game Playing — combinatorial problems where GA, ACO, and tabu search are routinely applied.
Neural Networks (Core) — the backprop-trained networks that neuroevolution offers an alternative path to.

↪ The thread continues → Chapter 36 · 🔍 Explainable AI & Interpretability

We’ve built powerful models across two traditions; now the conscience chapters. The first asks the simplest hard question: why did the model decide that?

📖 All chapters | ← 34 · 🗺️ Planning, Constraint Satisfaction & Game Playing | 36 · 🔍 Explainable AI & Interpretability →