Chapter 34 — 🗺️ Planning, Constraint Satisfaction & Game Playing

📖 All chapters | ← 33 · 📖 Knowledge Representation & Reasoning | 35 · 🧬 Evolutionary Computation & Metaheuristics →

📚 Jump to any chapter

🧮 Mathematical Foundations

🧭 The ML Workflow

🧩 Classical Machine Learning

🎲 Probabilistic Models

13 · 🕸️ Probabilistic Graphical Models

🧠 Deep Learning

🗣️ Applied AI: Vision, Language, Audio & Time

🕹️ Reinforcement Learning

25 · 🕹️ Reinforcement Learning

🛠️ Applied ML Systems & Industries

🚀 Production, Tooling & Infrastructure

📚 Classical & Symbolic AI

⚖️ Responsible AI & Frontier

🎓 Advanced & Specialized Topics

🎚️ Post-Training & Fine-Tuning

🚢 Model Serving & Deployment

47 · 🚢 Model Serving & Deployment in Production

This chapter is about the classical, symbolic side of AI: getting a machine to figure out what to do by searching through possibilities rather than learning from data. Three closely related problems sit here — building a sequence of actions to reach a goal (planning), assigning values that satisfy a web of rules (constraint satisfaction), and choosing moves against an opponent (game playing). They all reduce to clever search over enormous spaces, and the algorithms below — backtracking, arc-consistency, alpha-beta, MCTS — are the engines behind everything from logistics schedulers to AlphaGo.

🧭 In context: Classical & Symbolic AI · used for automated decision-making, scheduling, puzzle-solving, and game agents · the one key idea: structure the world as states, actions, and constraints, then search smartly instead of trying everything.

💡 Remember this: planning, constraint satisfaction, and game playing are all the same move — turn the problem into states/variables/constraints, then search that space cleverly with pruning and heuristics instead of trying every possibility.

The three problems are really three flavours of the same activity — turning a tangled question into a shape a search algorithm can chew on. The map below shows where each lives and which engine drives it.

sequence A, GraphPlan, SATPlan Constraint satisfaction variables + constraints find an assignment backtrack, AC-3, SAT Game playing states + an opponent find a move* minimax, αβ, MCTS

34.1 — Automated Planning (STRIPS / PDDL)

Automated planning asks a deceptively simple question: given where I am now, where I want to be, and what actions I can take, what sequence of actions gets me there? Think of packing for a trip. The fridge is empty, the goal is a packed suitcase by the door, and your actions are things like “fold shirt,” “zip bag,” “find passport.” You cannot zip the bag before the shirts are in, and you cannot pack the passport before you find it. Planning is the discipline of working out those orderings automatically.

The classic framing is STRIPS (Stanford Research Institute Problem Solver). A state is simply a set of facts that are currently true — a snapshot of the world. An action has three parts: preconditions (facts that must already hold before you can apply it), an add list (facts the action makes true), and a delete list (facts the action makes false). The goal is a set of facts you want to be true at the end. A plan is an ordered list of actions that transforms the initial state into one satisfying the goal.

PDDL (Planning Domain Definition Language) is the standard text format for writing all this down. A domain file declares the action schemas (the rules of the world), and a problem file gives the specific objects, initial state, and goal. Separating the two is what makes planning reusable: one “blocks world” domain can be paired with thousands of different starting configurations without rewriting the rules each time.

Here is the same unstack action written in real PDDL, so you can see the text format planners actually consume:

(define (domain blocks)
  (:requirements :strips)
  (:action unstack
    :parameters (?x ?y)
    :precondition (and (on ?x ?y) (clear ?x) (handempty))
    :effect (and (holding ?x) (clear ?y)
                 (not (on ?x ?y)) (not (clear ?x)) (not (handempty)))))

The (not …) effects are the delete list; the plain effects are the add list. Tools like Fast Downward or the pddl/unified-planning Python libraries read this file plus a problem file and return a plan.

Here is a tiny blocks world: a robot arm stacks blocks A and B on a table. The facts we track are on(x,y) (x is on y), onTable(x), clear(x) (nothing is on top of x), handEmpty, and holding(x).

# A STRIPS action = (name, preconditions, add, delete)
unstack = ("unstack(A,B)",
    {"on(A,B)", "clear(A)", "handEmpty"},   # preconditions: must all hold
    {"holding(A)", "clear(B)"},              # add: now true
    {"on(A,B)", "clear(A)", "handEmpty"})    # delete: now false

def apply(state, action):
    name, pre, add, dele = action
    assert pre <= state, f"{name}: preconditions not met"   # subset check
    return (state - dele) | add               # delete first, then add

s0 = {"on(A,B)", "onTable(B)", "clear(A)", "handEmpty"}
s1 = apply(s0, unstack)        # now holding(A), B is clear
assert s1 == {"onTable(B)", "clear(B)", "holding(A)"}   # self-check
print(s1)

The mechanism is just set arithmetic: confirm the preconditions are a subset of the current state, subtract the delete list, then add the add list. That is the whole engine for applying one action. The genuinely hard part is which actions, in what order — and that is a search over the graph of reachable states.

The animation below shows one action firing as a “diff”: the precondition guard checks, the delete facts fade out, and the add facts slide in.

We can write the state-transition rule as a single formula. Applying action \(a\) to state \(s\) gives the successor:

\[ \text{Result}(s, a) = \big(s \setminus \text{Del}(a)\big) \cup \text{Add}(a), \quad \text{defined only when } \text{Pre}(a) \subseteq s \]

In words: to take an action, first check all its preconditions are already true in the current state; then remove the facts it makes false and add the facts it makes true.

Also written: \(\text{Result}(s,a) = (s - \text{Del}(a)) + \text{Add}(a)\) with the guard \(\text{Pre}(a) \subseteq s\) — exactly the (state - dele) | add line of code above.

flowchart LR
    A["Initial state<br/>on(A,B), handEmpty"] -->|unstack A,B| B["holding A<br/>clear B"]
    B -->|putdown A| C["onTable A<br/>handEmpty"]
    C -->|pickup B| D["holding B"]
    D -->|stack B,A| E["Goal:<br/>on(B,A)"]

Tip

Intuition: a STRIPS action is a “diff” on the world — the preconditions are the guard that says whether the patch may apply, and the add/delete lists are the patch itself. Planning is finding a sequence of diffs that turns the start commit into the goal commit.

Forward versus backward search

There are two directions you can search. Forward (progression) planning starts from the initial state and applies every applicable action, generating successor states until it stumbles onto the goal. It is natural and the states are always fully concrete, which makes them easy to evaluate — but the branching factor is brutal, because at every step every applicable action is a candidate, including thousands that have nothing to do with the goal.

Backward (regression) planning runs the other way. It starts from the goal and asks: “what action could have produced these goal facts, and what would the world have needed to look like just before it?” This is more focused, because it only ever considers actions that are relevant to achieving the goal — irrelevant actions never enter the picture. The cost is that you now reason about partial states (sets of subgoals rather than complete snapshots), which is fiddlier and easier to get wrong.

Direction	Starts from	Branching	Strength	Weakness
Forward (progression)	initial state	high (all applicable actions)	states are concrete, heuristics easy	explores many irrelevant actions
Backward (regression)	goal	lower (only goal-relevant actions)	ignores irrelevant actions	partial-state reasoning is tricky

In practice, modern planners overwhelmingly run forward search guided by powerful, automatically-derived heuristics. The most famous is the relaxed plan heuristic: temporarily pretend actions have no delete lists (so nothing ever becomes false), solve that much easier problem, and use its length as an estimate of the true distance to the goal. Because the relaxed problem can only be easier, the estimate never overshoots — exactly the admissibility property A* needs. The A*/heuristic-search machinery these planners reuse lives in Chapter 32.

Partial-order and hierarchical planning

Partial-order planning (POP) refuses to commit to a total ordering of actions when it does not have to. Instead of insisting “do A, then B, then C,” it produces a plan carrying only the ordering constraints that genuinely matter. The canonical example is getting dressed: you must put socks on before shoes, but the left foot and right foot are completely independent — forcing an order between them is an arbitrary choice you might later have to undo. This is the least-commitment principle: leave every decision open as long as you can, so you never backtrack over a choice that never mattered. The output is a partial order that can be linearized (flattened into a sequence) many different ways — which is also exactly what a parallel executor wants, since unordered actions can run simultaneously.

Hierarchical Task Network (HTN) planning attacks the size of the problem from a different angle: it plans at multiple levels of abstraction. A high-level task such as travel(home, airport) is recursively broken down through methods — recipes that decompose one task into subtasks — until everything bottoms out in primitive, directly-executable actions like call_taxi, ride, pay. This mirrors how people actually plan: settle the coarse shape first (“I’ll take a taxi”), fill in details later. It scales beautifully because the method library encodes a great deal of human knowledge up front, which prunes the search space dramatically rather than rediscovering good structure from scratch every time.

flowchart TD
    T["travel(home → airport)"] --> M{which method?}
    M -->|by taxi| X1["call_taxi"] --> X2["ride_taxi"] --> X3["pay_driver"]
    M -->|by car| Y1["walk_to_car"] --> Y2["drive"] --> Y3["park"]

Warning

Common mistake: forgetting the frame problem. STRIPS assumes that anything not mentioned in an action’s add or delete lists is left completely unchanged. That assumption is what makes the model tractable, but it bites hard if you under-specify an action. If unstack forgets to delete handEmpty, the planner will cheerfully believe the hand is both empty and holding a block, and build physically impossible plans on top of that lie. Whenever you model an action, ask not just “what does this make true?” but “what does this stop being true?”

GraphPlan and SAT-based planning

So far we have searched the state graph directly. Two landmark ideas reframe planning so that other machinery does the heavy lifting.

The intuition for GraphPlan is a “ripple of possibility.” Imagine dropping the initial state into a pond: in one step, every fact reachable by one action lights up; in two steps, everything reachable by two; and so on. GraphPlan builds exactly this layered planning graph — alternating layers of facts and actions — propagating forward and recording mutex (mutually-exclusive) pairs: two actions that interfere, or two facts that cannot co-occur. Once all goal facts appear in a layer with no mutex among them, GraphPlan searches backward through that compact graph for a real plan. The graph is a cheap relaxation that both bounds the plan length and supplies a strong heuristic — and it is where the relaxed-plan heuristic of forward planners comes from.

SATPlan turns the plan into a giant true/false puzzle. First you guess a length \(k\) — “let me look for a plan with at most \(k\) steps.” Then you make one yes/no variable for every “did action \(a\) happen at step \(t\)?” and every “is fact \(f\) true at step \(t\)?” You write down the rules connecting them (the start, the goal at the end, and what each action needs and changes) as a big logic formula. A SAT solver then tries to make the whole formula true at once. If it can, the actions it marked “yes” spell out a working \(k\)-step plan. If it can’t, no plan that short exists, so you raise \(k\) by one and try again.

flowchart LR
    P["Planning problem"] --> Q{"horizon k"}
    Q --> R["Encode as one<br/>CNF formula Φ_k"]
    R --> S["SAT solver"]
    S -->|SAT| T["Read true action vars<br/>= a k-step plan"]
    S -->|UNSAT| U["k ← k+1, retry"]
    U --> R

This is the deep reason SAT keeps reappearing in this chapter: planning, scheduling, and verification all quietly compile down to satisfiability. Modern optimal planners blend both ideas — planning-graph heuristics to guide search, SAT/SMT back-ends to settle the combinatorics.

34.2 — Constraint Satisfaction Problems

A Constraint Satisfaction Problem (CSP) is defined by exactly three ingredients: a set of variables, a domain of possible values for each variable, and a set of constraints restricting which combinations of values are allowed. A solution is an assignment of one value to every variable that violates no constraint. The crucial difference from planning is that order does not matter — there is no sequence of actions, just a consistent assignment you are hunting for. This makes CSPs a wonderfully general hammer: scheduling, timetabling, hardware configuration, and a huge range of puzzles all fit the mold.

The textbook example is map coloring: color each region of a map so that no two adjacent regions share a color, using as few colors as possible. The variables are the regions, the domain for each is {red, green, blue}, and there is one constraint region ≠ neighbor for every adjacent pair.

Each line connects two regions that must differ; a valid coloring is one where no line ever joins two same-colored circles. Notice that WA and Q can safely share a color (red) because they are not adjacent — only the constraints matter, not global uniqueness.

Real-world CSPs. University exam timetabling (don’t schedule two exams a student shares in the same slot), register allocation in a compiler (variables = program values, colors = CPU registers — literally graph coloring), crew rostering for airlines, and configuring a car’s option packages so no incompatible parts are selected together. All are the same three ingredients underneath.

Backtracking search

The baseline solver is backtracking: assign one variable at a time; if the partial assignment is still consistent, recurse to the next variable; if some variable has no legal value left, undo the most recent choice (backtrack) and try a different value. It is nothing more than depth-first search over partial assignments, but it is the foundation everything else builds on.

def backtrack(assignment, vars, domains, neighbors):
    if len(assignment) == len(vars):      # all assigned → solved
        return assignment
    var = next(v for v in vars if v not in assignment)   # pick an unassigned var
    for value in domains[var]:
        # consistent? no already-assigned neighbor uses the same color
        if all(assignment.get(n) != value for n in neighbors[var]):
            assignment[var] = value
            result = backtrack(assignment, vars, domains, neighbors)
            if result: return result
            del assignment[var]            # undo on failure, try next value
    return None

Naive backtracking can explore a tree of size up to \(d^n\) for \(n\) variables of domain size \(d\), so the entire art lies in ordering the choices and pruning doomed branches early. Two cheap, high-impact heuristics carry most of the weight. MRV (Minimum Remaining Values) says: assign the most-constrained variable first — the one with the fewest legal values left — because if it is going to fail, you want to find out now rather than after a thousand pointless assignments. (When a CSP is too big to search systematically, metaheuristic local search takes over — see Chapter 35.) LCV (Least Constraining Value) says: among the values for the variable you picked, try the one that rules out the fewest options for its neighbors, keeping your future flexibility as high as possible. MRV chooses where to look next; LCV chooses what to try first there.

Constraint propagation and arc-consistency

Checking constraints only lazily, at assignment time, wastes information. It is far better to propagate constraints — to deduce their consequences before guessing. The workhorse here is arc-consistency. Picture each constraint as a directed arc \(X \rightarrow Y\). That arc is consistent when, for every value remaining in \(X\)’s domain, there exists at least one value in \(Y\)’s domain that satisfies the constraint. If some value in \(X\) has no such “support” in \(Y\), that value can never appear in any solution, so you simply delete it. Deleting it may in turn knock out the support for values in other variables pointing at \(X\), so you re-examine those arcs too. Repeating this until nothing changes is the AC-3 algorithm.

We can state arc-consistency for the arc \(X \to Y\) as a condition on \(X\)’s domain \(D_X\):

\[ \forall\, x \in D_X \;\;\exists\, y \in D_Y \;:\; (x, y) \text{ satisfies } C_{XY} \]

In words: every value left for \(X\) must have at least one partner value in \(Y\) that keeps the constraint happy; values with no partner are deleted.

Also written: \(D_X \leftarrow \{\, x \in D_X : \exists\, y \in D_Y,\ C_{XY}(x,y)\,\}\) — keep only the supported values.

def ac3(domains, neighbors, constraint):
    queue = [(x, y) for x in domains for y in neighbors[x]]
    while queue:
        x, y = queue.pop()
        revised = False
        for vx in set(domains[x]):
            # if NO value in y's domain is consistent with vx, drop vx
            if not any(constraint(vx, vy) for vy in domains[y]):
                domains[x].remove(vx); revised = True
        if revised:
            if not domains[x]: return False        # domain wiped → no solution
            queue += [(z, x) for z in neighbors[x] if z != y]   # recheck arcs into x
    return True

The animation below shows that ripple: fix one variable, and the now-impossible value drains out of every neighbour’s domain on its own.

Worked example. Start map-coloring with domains {R, G, B} on every region. Now suppose we fix SA = blue. Arc-consistency immediately fires: because SA is a neighbor of WA, NT, and Q, the value blue loses its support for those three and is deleted from each of their domains, leaving them all at {R, G}. We did no guessing, yet three domains shrank for free. WA and NT are still neighbors and still arc-consistent (WA can be R while NT is G, and vice versa), so propagation stops — but the search tree we now face is vastly smaller. Interleaving AC-3 inside the backtracking loop, re-running propagation after every assignment, is called MAC (Maintaining Arc Consistency), and it prunes enormous subtrees before they are ever explored.

Solving a CSP with a real framework

You almost never hand-roll the solver. Google OR-Tools ships a battle-tested CP-SAT engine; here is the whole Australia map-coloring problem in a dozen lines:

from ortools.sat.python import cp_model

m = cp_model.CpModel()
regions = ["WA", "NT", "SA", "Q", "NSW", "V", "T"]
# 0,1,2 stand for red, green, blue
color = {r: m.NewIntVar(0, 2, r) for r in regions}

edges = [("WA","NT"),("WA","SA"),("NT","SA"),("NT","Q"),
         ("SA","Q"),("SA","NSW"),("SA","V"),("Q","NSW"),("NSW","V")]
for a, b in edges:
    m.Add(color[a] != color[b])           # adjacent regions differ

solver = cp_model.CpSolver()
assert solver.Solve(m) in (cp_model.OPTIMAL, cp_model.FEASIBLE)
print({r: ["red","green","blue"][solver.Value(color[r])] for r in regions})

CP-SAT folds MRV-style ordering, propagation, and clause learning together internally — you just declare variables and constraints. The same m.Add(... != ...) pattern scales straight up to exam timetables and rosters.

Sudoku as a CSP

Sudoku is a CSP in disguise: 81 variables (the cells), each with domain {1..9}, plus all-different constraints over every row, every column, and every 3×3 box. Filling in a cell triggers propagation — that value vanishes from the domains of all 20 of the cell’s “peers” (its row, column, and box). Plain constraint propagation, combined with the naked single rule (if a cell’s domain ever shrinks to a single value, assign it immediately), is enough to solve easy and medium puzzles outright with no guessing at all. Backtracking only gets called in to break ties on the genuinely hard puzzles where propagation stalls.

flowchart LR
    A["Assign a cell"] --> B["Propagate:<br/>remove value from 20 peers"]
    B --> C{"Any domain<br/>now empty?"}
    C -->|yes| D["Backtrack"]
    C -->|no| E{"All cells<br/>assigned?"}
    E -->|no| F["Pick MRV cell"] --> A
    E -->|yes| G["Solved"]

SAT — the canonical constraint problem

Boolean satisfiability (SAT) is the special, foundational case of a CSP where every variable is simply true or false and the constraints are clauses written in conjunctive normal form (CNF) — an AND of ORs, for example \((x_1 \lor \neg x_2) \land (x_2 \lor x_3)\). The question is whether any assignment of truth values makes the entire formula true. SAT holds a special place in computer science: it was the very first problem proven NP-complete (the Cook–Levin theorem, 1971), meaning every problem in NP can be re-encoded as a SAT instance. And yet, despite that worst-case hardness, modern CDCL (Conflict-Driven Clause Learning) solvers routinely dispatch real-world instances with millions of variables. They quietly power hardware verification, software dependency resolution, and even planning itself — you can encode “is there a plan of length \(k\)?” as a SAT formula and hand it to the solver.

The classic engine underneath is DPLL: pick an unassigned variable, try one truth value, and propagate unit clauses — whenever a clause has all but one literal already falsified, that last literal is forced, so you set it and cascade the consequences. On hitting a contradiction you backtrack. It is, once again, backtracking search — this time over Booleans, supercharged with propagation. CDCL adds one more trick: when it hits a conflict, it analyzes why and records a brand-new “learned clause” that prevents the solver from ever repeating that mistake elsewhere in the tree.

You don’t write CDCL yourself either. The Python python-sat (PySAT) library wraps industrial solvers like Glucose and MiniSat:

from pysat.solvers import Glucose3

# (x1 ∨ ¬x2) ∧ (x2 ∨ x3) ;  variables are 1,2,3 ; negative int = NOT
s = Glucose3()
s.add_clause([1, -2])
s.add_clause([2, 3])
print(s.solve())        # True  -> satisfiable
print(s.get_model())    # e.g. [1, -2, 3]  = x1=T, x2=F, x3=T
s.delete()

For richer constraints (integers, arithmetic, “exactly-one-of”), the Z3 SMT solver is the go-to — it speaks SAT plus theories, and powers a great deal of program verification.

Tip

Rule of thumb: if you can phrase your problem as “assign values subject to local rules,” reach for an off-the-shelf CSP or SAT solver before writing a custom search. Decades of brutal engineering (MRV, AC-3, CDCL, clause learning) are already baked into tools like MiniSat, Z3, and OR-Tools. Re-implementing that by hand almost never pays off.

34.3 — Adversarial Search and Game Playing

Games change the problem in one fundamental way: there is no longer a fixed goal state you can quietly plan your way toward, because an opponent is actively working to stop you. We focus on the cleanest case — two-player, zero-sum (your gain is exactly the opponent’s loss), perfect-information games like chess, checkers, and tic-tac-toe. The core algorithm is minimax. Label the two players MAX (you, trying to maximize the final score) and MIN (the opponent, trying to minimize it). You build out the game tree, score the leaf positions from MAX’s point of view, and then propagate those values back up: at MAX nodes you take the maximum child value, and at MIN nodes you take the minimum. The value that arrives at the root is the best outcome you can force, assuming the opponent plays perfectly against you.

flowchart TD
    R["MAX (root)<br/>= max(3,2) = 3"]
    R --> A["MIN<br/>= min(3,5) = 3"]
    R --> B["MIN<br/>= min(2,9) = 2"]
    A --> A1["3"]
    A --> A2["5"]
    B --> B1["2"]
    B --> B2["9"]

Reading the tree from the bottom: MAX is choosing between two moves. The left move leads into a MIN node, where the opponent will pick \(\min(3,5)=3\); the right move leads into a MIN node where the opponent picks \(\min(2,9)=2\). MAX therefore prefers the left move and the root value is 3. Crucially, MAX never actually gets the tempting 5 or the 9 — those are leaves a rational opponent would simply never permit, so they cannot be counted on.

The whole back-up rule is one recursive definition. Writing \(V(s)\) for the minimax value of state \(s\):

\[ V(s) = \begin{cases} \text{Utility}(s) & s \text{ is terminal} \\ \max_{a} V(\text{Result}(s,a)) & s \text{ is a MAX node} \\ \min_{a} V(\text{Result}(s,a)) & s \text{ is a MIN node} \end{cases} \]

In words: the value of a finished position is just its score; otherwise MAX takes the best child and MIN takes the worst, looking ahead through every move.

Also written (Negamax form): since \(\min_a V = -\max_a (-V)\), you can collapse both cases into one — \(V(s) = \max_a \big(-V(\text{Result}(s,a))\big)\) — by always scoring from the side-to-move’s perspective and flipping the sign each ply. This is how nearly every real engine is coded, because one routine handles both players.

def minimax(node, is_max):
    if node.is_leaf:
        return node.value
    if is_max:
        return max(minimax(c, False) for c in node.children)
    else:
        return min(minimax(c, True)  for c in node.children)

Beyond two players: expectimax and chance

Not every game is a duel against a perfect adversary. Two common twists need a different back-up rule.

When there is randomness — a dice roll in backgammon, a shuffled deck — you add chance nodes whose value is the expected (probability-weighted average) value of their children, not a max or a min. This is expectimax:

\[ V(\text{chance node}) = \sum_{i} P(i)\, V(\text{child}_i) \]

In words: at a random event, average the children’s values weighted by how likely each outcome is.

Also written: \(V = \mathbb{E}_{i \sim P}[V(\text{child}_i)]\) — the expectation over outcomes.

The everyday intuition: against a coin flip you cannot assume the worst or the best — you have to plan for the average of what might happen. A subtle consequence is that, unlike minimax, expectimax values are sensitive to the scale of your evaluation function (doubling all scores can change which gamble looks best), so the numbers must mean something, not just rank-order positions.

Worked example. You face a choice between two moves. Move A is safe: it leads straight to a position worth \(+4\). Move B rolls a die — with probability \(0.5\) you land on a \(+10\) position, and with probability \(0.5\) on a \(-2\) one. The chance node for B is worth \(0.5(10) + 0.5(-2) = +4\), exactly tying the safe move. Now double every score: A becomes \(+8\), but B becomes \(0.5(20) + 0.5(-4) = +8\) as well — still tied. The tie survived because both sides scaled together; what breaks under rescaling is comparisons where the two options weight outcomes differently, which is why expectimax needs meaningful magnitudes, not just an ordering.

For more than two players, each node stores a vector of utilities, one per player, and each player maximizes their own component — the natural generalization of minimax to three or more competitors.

Evaluation functions and depth limits

Real game trees are astronomically large — chess has on the order of \(10^{120}\) distinct positions, more than there are atoms in the observable universe. You obviously cannot search to the end of the game. Instead you search to a fixed depth and, at the cut-off, apply an evaluation function that estimates how good a non-terminal position is without looking further. For chess, the simplest sensible eval is material: sum \(9Q + 5R + 3B + 3N + 1P\) with opposite signs for the two sides, then layer on positional terms such as center control, king safety, and pawn structure. The entire strength of a classical engine boils down to two things: how deep it can search, and how good its evaluation function is.

A typical weighted evaluation is a dot product of features and weights:

\[ \text{Eval}(s) = \sum_{k} w_k\, f_k(s) = \mathbf{w}^\top \mathbf{f}(s) \]

In words: score a position by measuring a handful of features (material balance, mobility, king safety…) and adding them up, each scaled by how much it matters.

Also written: the explicit sum \(w_1 f_1(s) + w_2 f_2(s) + \cdots + w_n f_n(s)\) — and modern engines simply learn the weights \(\mathbf{w}\) (and the features) with a neural network, as in NNUE evaluations.

The famous trap here is the horizon effect: a fixed depth cut-off can hide a disaster lurking just one move beyond where the search stops. The engine evaluates a position as calm and favorable, never seeing that the opponent’s queen captures on the very next ply. The standard fix is quiescence search — when you reach the depth limit, do not stop blindly; instead keep searching through any “noisy” forcing sequences (captures, checks) until you reach a quiet position where the static evaluation can be trusted.

Alpha-beta pruning

Plain minimax wastes enormous effort evaluating branches that cannot possibly change the final decision. Alpha-beta pruning eliminates that waste while returning the exact same answer. You carry two bounds down the tree as you search: \(\alpha\) is the best (highest) value MAX can already guarantee somewhere along the current path, and \(\beta\) is the best (lowest) value MIN can already guarantee. The moment \(\alpha \ge \beta\) at a node, you stop expanding its remaining children — because one player has already found a better option elsewhere and would never let the game reach this node. That cut is a prune.

Walk it through on the tree above. MAX explores the left subtree first and learns it is worth 3, so \(\alpha = 3\). It then begins the right MIN node and immediately sees a child worth 2. That MIN node can only get worse from MAX’s perspective — its value will be at most 2 — which is already below the 3 MAX has in hand. So MAX will never choose this branch no matter what the other child (the 9) holds, and that 9 need never be examined at all. That single skipped node is the entire idea; on a real game tree the same logic prunes the vast majority of the work.

The pruned node, drawn out — the grey 9 is proven irrelevant before it is ever looked at:

def alphabeta(node, alpha, beta, is_max):
    if node.is_leaf:
        return node.value
    if is_max:
        v = -float('inf')
        for c in node.children:
            v = max(v, alphabeta(c, alpha, beta, False))
            alpha = max(alpha, v)
            if alpha >= beta: break        # β cut-off: MIN won't allow this line
        return v
    else:
        v = float('inf')
        for c in node.children:
            v = min(v, alphabeta(c, alpha, beta, True))
            beta = min(beta, v)
            if alpha >= beta: break        # α cut-off: MAX won't allow this line
        return v

With perfect move ordering, alpha-beta examines only about \(O(b^{d/2})\) nodes instead of minimax’s \(O(b^d)\) — which effectively doubles the depth you can search for the same cost. That is precisely why move ordering matters so much: if you try the likely-best moves first (captures, or the best move remembered from a shallower search via a transposition table), you trigger cut-offs as early as possible and get closer to that best-case bound.

Doing it with a framework

For building an actual playing agent you reach for a library that already implements negamax + alpha-beta. easyAI lets you describe the game and get a solver for free:

from easyAI import TwoPlayerGame, AI_Player, Negamax

class Nim(TwoPlayerGame):                 # take 1–3 from a pile; taking last loses
    def __init__(self, players):
        self.players, self.pile, self.current_player = players, 12, 1
    def possible_moves(self): return [str(m) for m in (1, 2, 3) if m <= self.pile]
    def make_move(self, m):   self.pile -= int(m)
    def is_over(self):        return self.pile <= 0
    def scoring(self):        return 100 if self.is_over() else 0   # mover-just-moved view

ai = Negamax(8)                            # search 8 plies deep with alpha-beta
game = Nim([AI_Player(ai), AI_Player(ai)])
game.play(verbose=False)
print("loser:", game.current_player)       # perfect play decided from the start

For chess specifically, the python-chess package gives you legal-move generation and board state, and you bolt your own negamax+eval on top — a common teaching project that mirrors how real engines are structured.

Warning

Common mistake: believing alpha-beta changes the result. It does not. It returns the identical minimax value — it merely skips branches it can prove are irrelevant. If your alpha-beta ever disagrees with plain minimax on the same tree, you have a bug, almost always a swapped \(\alpha\)/\(\beta\) update or a wrong cut-off condition. Treat plain minimax as your ground-truth oracle when testing.

34.4 — Monte Carlo Tree Search (UCT)

Minimax with alpha-beta lives or dies by its evaluation function — but for some games no good hand-crafted evaluation exists. Go is the poster child: its branching factor is around 250 (versus chess’s ~35), and capturing what makes a Go position “good” in a formula has defeated decades of expert effort. Monte Carlo Tree Search (MCTS) sidesteps both problems at once. Rather than scoring a position with a formula, it estimates a move’s value by playing out many fast, semi-random games to the very end and averaging who won. And rather than searching uniformly, it grows its tree asymmetrically, pouring effort into the lines that look promising and barely touching the rest. This is the search algorithm that, fused with deep learning, became the heart of AlphaGo.

Every iteration runs the same four phases:

flowchart LR
    S["Selection<br/>walk down via UCT"] --> E["Expansion<br/>add a new child"]
    E --> R["Simulation<br/>random rollout to game end"]
    R --> B["Backpropagation<br/>update wins/visits up the path"]
    B --> S

Selection — starting at the root, repeatedly descend to a child using a rule that balances exploiting moves that have done well against exploring moves that have barely been tried, until you reach a node that still has unexpanded children.
Expansion — add one new child node, for one previously untried move.
Simulation (rollout) — from that new node, play a quick, often-random game all the way to a terminal win or loss.
Backpropagation — carry the result back up the path you came down, incrementing each node’s visit count and adding to its win count.

After a fixed budget of iterations — say 10,000 — you do not pick the move with the best average; you pick the root child with the most visits, the move the search trusted enough to investigate most thoroughly.

The doodle below captures MCTS’s signature asymmetry: the promising branch (left) keeps deepening as visits pour into it, while the unpromising one (right) is barely touched.

The cleverness is concentrated in the selection rule, UCT (Upper Confidence bounds applied to Trees). For each child you compute:

\[ \text{UCT} = \underbrace{\frac{w_i}{n_i}}_{\text{exploit: win rate}} + \; c\,\underbrace{\sqrt{\frac{\ln N}{n_i}}}_{\text{explore: uncertainty}} \]

In words: prefer a move that has been winning a lot, plus a bonus for moves you’ve barely tried — so nothing promising goes unexamined just because it was unlucky early.

Also written: \(\text{UCT}_i = \hat{\mu}_i + c\sqrt{\ln N / n_i}\), where \(\hat{\mu}_i = w_i/n_i\) is the empirical win rate — this is the UCB1 bandit formula (Chapter 25) applied at every tree node.

Here \(w_i\) is the number of wins recorded through child \(i\), \(n_i\) is how many times child \(i\) has been visited, \(N\) is how many times the parent has been visited, and \(c\) is an exploration constant (a common choice is \(\sqrt{2}\)). The first term rewards moves with a strong track record. The second term inflates the score of moves that have been tried only a few times (small \(n_i\)), guaranteeing every move gets its day in court. As a move accumulates visits its exploration bonus steadily shrinks, so the search’s attention drifts naturally toward the genuinely strongest options.

The doodle below shows the two terms tugging on a move’s score in opposite directions as visits pile up — exploit climbs toward the true win rate, explore decays away:

Worked example. Suppose a parent node has been visited \(N = 100\) times, and we are deciding between two children, with \(c = 1.4\):

Move	wins \(w_i\)	visits \(n_i\)	win rate \(w_i/n_i\)	bonus \(c\sqrt{\ln N / n_i}\)	UCT
A	30	40	0.750	\(1.4\sqrt{4.605/40}=0.475\)	1.225
B	8	10	0.800	\(1.4\sqrt{4.605/10}=0.950\)	1.750

Move B wins the comparison despite having been explored far less — its larger uncertainty bonus is the algorithm saying, in effect, “we have only ten data points on B; go gather more before judging it.” That is UCT deliberately probing the under-explored move so its estimate can sharpen. Over many iterations the bonus on B shrinks as \(n_i\) climbs, and whichever move is genuinely stronger eventually rises to the top.

import math
def uct(child, parent_visits, c=1.4):
    if child.visits == 0:
        return float('inf')                # always try an unvisited move first
    exploit = child.wins / child.visits
    explore = c * math.sqrt(math.log(parent_visits) / child.visits)
    return exploit + explore

# selection: from a node, repeatedly descend to the highest-UCT child
def select(node):
    while node.children and node.fully_expanded():
        node = max(node.children, key=lambda ch: uct(ch, node.visits))
    return node

Why MCTS beats minimax for Go comes down to three properties. It needs no domain-specific evaluation function, because the random rollouts supply the value signal directly. It is anytime: you can stop the moment you run out of thinking time and immediately return the best move found so far, with quality that degrades gracefully. And it concentrates computation on the promising lines instead of searching every branch uniformly. AlphaGo’s breakthrough was to replace the two weakest pieces — purely random rollouts and uniform expansion — with neural networks: a policy network that biases selection toward strong moves, and a value network that evaluates positions without needing a full rollout. That marriage of MCTS with deep learning (Chapters 14–15) and reinforcement learning (Chapter 25) is what cracked Go.

In AlphaZero, the UCT formula is replaced by a PUCT variant that folds in the policy network’s prior \(P(i)\) for each move:

\[ \text{PUCT}_i = \frac{w_i}{n_i} + c\, P(i)\,\frac{\sqrt{N}}{1 + n_i} \]

In words: same explore/exploit trade-off, but the exploration bonus is steered toward moves the neural net already thinks are good, so search wastes far fewer rollouts on obvious junk.

Also written: \(\text{PUCT}_i = Q_i + c\,P(i)\frac{\sqrt{N}}{1+n_i}\), with \(Q_i\) the mean action value — the same shape as UCT, with the prior \(P(i)\) multiplying the exploration term.

Tip

Intuition: minimax assumes a flawless adversary and reasons exhaustively; MCTS gambles, sampling thousands of quick games and trusting the law of large numbers to surface the best move. Tune the exploration constant \(c\) up to explore more (helpful early, when win-rate estimates are noisy and you don’t want to commit) and down to exploit more (helpful late, when you trust your accumulated statistics).

34.5 — Backtracking search: depth-first assignment with heuristics

A constraint satisfaction problem (CSP) hands you a set of variables, a domain of allowed values for each, and a set of constraints saying which combinations are legal. The job is to assign every variable a value so that no constraint is violated. The brute-force instinct is to try every full assignment and check it, but that explodes: \(n\) variables with \(d\) values each give \(d^n\) combinations. Backtracking search is the simple fix — build the assignment one variable at a time, and the instant a partial assignment breaks a constraint, abandon that whole branch and back up.

The intuition is filling in a form where some answers must agree. You write one entry, check it against what you’ve already written, and if it clashes you erase it and try the next option rather than blindly finishing the form. Because a violated constraint on three filled-in variables rules out every way of filling the rest, pruning early saves the entire subtree below.

flowchart TD
    A["assignment so far"] --> B{"complete?"}
    B -->|yes| C["return solution"]
    B -->|no| D["pick an unassigned variable"]
    D --> E["try a value from its domain"]
    E --> F{"consistent with<br/>constraints?"}
    F -->|yes| G["recurse deeper"]
    F -->|no| H["try next value"]
    G --> I{"succeeded?"}
    I -->|yes| C
    I -->|no| H
    H --> J{"values left?"}
    J -->|yes| E
    J -->|no| K["backtrack: undo,<br/>fail upward"]

The skeleton is a depth-first recursion: choose a variable, loop over its values, keep the consistent ones, recurse, and undo on failure.

def backtrack(assignment, csp):
    if len(assignment) == len(csp.vars):       # all assigned
        return assignment
    var = select_unassigned(assignment, csp)   # heuristic choice
    for val in order_values(var, assignment, csp):
        if consistent(var, val, assignment, csp):
            assignment[var] = val
            result = backtrack(assignment, csp)
            if result is not None:
                return result
            del assignment[var]                # undo, try next
    return None                                # trigger backtrack

Plain backtracking already beats brute force, but two free choices govern its speed: which variable to assign next and which value to try first. Those choices are where the heuristics live, and they cost almost nothing to compute.

Tip

Backtracking commits to one variable’s value and explores fully before reconsidering. Order doesn’t change whether a solution is found — only how much of the tree you walk before finding it. Good ordering can turn an intractable search into an instant one.

34.6 — Variable and value ordering: MRV, degree, and least-constraining-value

If you must guess, guess where you have the least freedom — that is the whole idea behind variable ordering. The minimum-remaining-values heuristic (MRV, also called “most-constrained variable” or “fail-first”) picks the unassigned variable with the fewest legal values left. The reasoning is blunt: a variable down to one or two options is a near-forced move, so commit to it now. If it has no legal values, you discover the dead end immediately instead of after pointlessly assigning other variables first.

When several variables tie on MRV — common at the very start, when every domain is still full — the degree heuristic breaks the tie by choosing the variable involved in the most constraints with other unassigned variables. Pinning down the most entangled variable first prunes the most branches downstream.

Value ordering pulls in the opposite direction. Once you’ve picked a variable, the least-constraining-value heuristic tries the value that rules out the fewest options for the neighboring variables. Here you are not failing fast — you want to succeed, and you are betting this branch contains a solution, so you keep your future options as open as possible.

The asymmetry is worth pausing on: pick the variable most likely to fail (so failures surface early), but pick the value least likely to fail (so you don’t needlessly destroy a workable branch).

Choice	Heuristic	Goal	Rule
Which variable?	MRV	fail fast	fewest legal values left
Tie-break variable	Degree	prune most	most constraints on unassigned neighbors
Which value?	LCV	succeed	leaves most options open for neighbors

Consider coloring a map of Australia with three colors — red, green, blue — where adjacent regions must differ. The regions are WA, NT, SA, Q, NSW, V, and the island T (with no land neighbors).

SA touches WA, NT, Q, NSW, and V — five neighbors — so the degree heuristic flags it first. Assign SA = red. Now every neighbor of SA has lost red from its domain, so all five drop to two values \(\{\)green, blue\(\}\) and T still has three. MRV now prefers any of those five over T. Say we take WA = green; its neighbor NT can no longer be green, leaving NT = blue, which forces Q = green, then NSW = blue, then V = green. T, constrained by nothing, takes any color. The map is solved with zero backtracking — the heuristics steered us straight to a solution.

Warning

These heuristics are cheap but not free. MRV and LCV both scan neighbor domains on every step. For small CSPs the bookkeeping can outweigh the savings; their payoff grows with problem size and constraint density. Profile before assuming they help on a tiny instance.

34.7 — Constraint propagation and arc consistency (AC-3)

Heuristics choose better; propagation lets you deduce. Constraint propagation uses the constraints to shrink domains before and during search — turning “this might be illegal” into “this is provably impossible, delete it.” The most useful form is arc consistency.

An arc is a directed pair of variables \((X, Y)\) sharing a constraint. The arc is consistent when, for every value remaining in \(X\)’s domain, there exists some value in \(Y\)’s domain that satisfies the constraint. If some value \(x\) in \(X\) has no compatible partner in \(Y\), then \(x\) can never be part of a solution, so we delete it. Deleting from \(X\) may in turn break the arcs pointing into \(X\) from its other neighbors, so those must be rechecked — propagation cascades.

AC-3 is the standard algorithm that runs this to a fixpoint. It keeps a queue of arcs; for each arc it revises \(X\)’s domain against \(Y\), and whenever a revision removes anything, it re-enqueues every arc \((Z, X)\) for \(Z\) a neighbor of \(X\).

def ac3(csp):
    queue = [(x, y) for x in csp.vars for y in csp.neighbors[x]]
    while queue:
        x, y = queue.pop(0)
        if revise(csp, x, y):                  # domain of x shrank
            if not csp.domain[x]:              # wiped out -> no solution
                return False
            for z in csp.neighbors[x] - {y}:   # recheck arcs into x
                queue.append((z, x))
    return True

def revise(csp, x, y):
    removed = False
    for vx in list(csp.domain[x]):
        # keep vx only if SOME vy satisfies the constraint
        if not any(csp.allowed(x, vx, y, vy) for vy in csp.domain[y]):
            csp.domain[x].remove(vx)
            removed = True
    return removed

A worked Sudoku slice shows the cascade. Take three cells in a row, \(A, B, C\), all needing different digits. Suppose propagation from elsewhere has already reduced their domains to \(A=\{1,2\}\), \(B=\{2\}\), \(C=\{1,2,3\}\). Because \(B\) is fixed at 2, the arc \((A, B)\) deletes 2 from \(A\) — no partner left — giving \(A=\{1\}\). That revision re-enqueues arcs into \(A\); the arc \((C, A)\) now deletes 1 from \(C\) (and \((C,B)\) deletes 2), leaving \(C=\{3\}\). The slice is fully solved by pure deduction, no guessing.

flowchart LR
    A0["A = {1,2}"] -->|"(A,B): drop 2"| A1["A = {1}"]
    B0["B = {2}"] --> A1
    A1 -->|"(C,A): drop 1"| C1
    B0 -->|"(C,B): drop 2"| C1["C = {3}"]
    C0["C = {1,2,3}"] --> C1

Run AC-3 once as preprocessing and it often prunes domains dramatically before search even begins. The catch: arc consistency is sound but incomplete. It can certify that individual arcs are satisfiable yet miss higher-order contradictions — a graph can be fully arc-consistent and still have no solution (3-coloring a triangle leaves every domain at all three colors, yet no assignment works). Propagation prunes; it does not generally solve.

The cost of AC-3 is bounded: with \(e\) arcs (binary constraints) and domain size \(d\),

\[ T_{\text{AC-3}} = O(e\, d^3) \]

In words: each arc can be re-checked at most \(d\) times (once per value its neighbor loses), and each check scans up to \(d^2\) value pairs — so the work grows roughly with the number of constraints times the cube of the domain size.

Also written: \(O(n^2 d^3)\) in the dense case where every pair of the \(n\) variables is constrained, since then \(e = O(n^2)\).

Warning

Common mistake: treating a successful AC-3 run as “the CSP is solved.” It is not. Arc consistency only deletes values that provably belong in no solution; a problem can be fully arc-consistent and still be unsatisfiable (the triangle 3-coloring above is the canonical trap). AC-3 is a powerful pruner you run before and during search — never a standalone solver.

34.8 — Forward checking and local search (min-conflicts)

Forward checking is the lightweight cousin of full propagation, run inside backtracking search. Each time you assign a variable a value, you immediately delete that value from the domains of its still-unassigned neighbors. The moment any neighbor’s domain becomes empty, you know this branch is doomed and backtrack at once — without descending further. It is cheaper than AC-3 (it only checks arcs pointing at the just-assigned variable, not the full cascade) and pairs naturally with MRV: forward checking is what updates the remaining-values counts that MRV reads.

flowchart LR
    A["assign X = v"] --> B["delete v from<br/>each unassigned<br/>neighbor's domain"]
    B --> C{"any domain<br/>empty?"}
    C -->|yes| D["backtrack now"]
    C -->|no| E["continue search<br/>with smaller domains"]

In the Australia map, assigning SA = red and forward-checking immediately strikes red from WA, NT, Q, NSW, and V. If a later assignment ever empties one of those domains, search reverses on the spot rather than blundering deeper. Forward checking catches failures one step earlier than plain consistency-checking but later than full arc consistency — a deliberate middle ground on the cost/pruning curve.

For very large CSPs — think scheduling thousands of exams, or an \(n\)-queens board with \(n\) in the millions — building a search tree at all is wasteful. Local search with the min-conflicts heuristic throws out backtracking entirely. Start from a complete assignment (every variable has a value, constraints be damned), then repeatedly pick a variable that is currently in conflict and reassign it to the value that minimizes the total number of violated constraints. Repeat until zero conflicts remain.

The min-conflicts step is a tiny argmin:

\[ \text{value}(X) \leftarrow \arg\min_{v \in D_X}\ \text{Conflicts}(X = v,\ \text{rest of assignment}) \]

In words: for the conflicted variable, count how many constraints each candidate value would break and pick the value that breaks the fewest.

Also written: \(v^\* = \arg\min_v \sum_{C} \mathbb{1}[\,C \text{ violated when } X=v\,]\) — minimize the number of violated constraints, ties broken at random.

def min_conflicts(csp, max_steps=10000):
    current = {v: random_value(v) for v in csp.vars}   # full random start
    for _ in range(max_steps):
        conflicted = [v for v in csp.vars if conflicts(v, current, csp) > 0]
        if not conflicted:                             # no violations: solved
            return current
        var = random.choice(conflicted)                # a variable in conflict
        # value that violates the fewest constraints (ties broken randomly)
        current[var] = min(csp.domain[var],
                           key=lambda val: conflicts_if(var, val, current, csp))
    return None                                        # gave up

The classic demonstration is \(n\)-queens. Place eight queens, one per column, and count attacking pairs. Pick a column whose queen is attacked, slide it to the row with the fewest attacks, repeat. Min-conflicts solves the million-queens problem in roughly fifty steps on average — a result that feels impossible until you see that most random starting positions are only a few moves from a solution, and the heuristic walks straight downhill toward it.

Warning

Min-conflicts is hill-climbing, so it can stall on a plateau or in a local minimum where every single-variable move increases or ties the conflict count. The standard escapes are random restarts (begin again from a fresh assignment) and random sideways moves or noise (occasionally accept an equal-conflict move to drift off a plateau). It is also incomplete: it cannot prove a CSP unsatisfiable — it just runs until the step budget expires.

The practical takeaway is a layered toolkit: backtracking with MRV, degree, and LCV for systematic search; forward checking and AC-3 to prune domains as you go; and min-conflicts local search when the problem is too large to enumerate but solutions are dense. Real solvers mix them — propagate to shrink, branch with heuristics, and fall back to local search at scale.

Which search method should I use?

flowchart TD
    A{"Problem type?"} -->|"reach a goal via<br/>a sequence of actions"| B["Planning:<br/>forward search + relaxed-plan<br/>heuristic, or SAT/GraphPlan"]
    A -->|"assign values<br/>under constraints"| C{"Too large to<br/>search a tree?"}
    A -->|"play vs an opponent"| D{"Good cheap<br/>evaluation function?"}
    C -->|"no"| E["Backtracking + MRV/degree/LCV,<br/>maintain AC-3 (MAC)"]
    C -->|"yes, but solutions dense"| F["Local search:<br/>min-conflicts"]
    C -->|"pure Boolean / huge"| G["Off-the-shelf SAT (CDCL)<br/>or CP-SAT / Z3"]
    D -->|"yes (e.g. chess)"| H["Minimax + alpha-beta<br/>+ quiescence"]
    D -->|"no (e.g. Go)"| I["MCTS / UCT<br/>(+ neural nets → AlphaZero)"]

34.9 — Quick reference

Term / method	What it means	When / why to use it
STRIPS action	Preconditions + add list + delete list — a “diff” on the set of true facts	Modeling planning domains; `Result(s,a)=(s∖Del)∪Add` when `Pre⊆s`
PDDL	Standard text format: a domain file (action schemas) + a problem file (objects, init, goal)	Feeding planners like Fast Downward; reuse one domain across many problems
Forward vs backward	Progression from the start state vs regression from the goal	Forward (concrete states, easy heuristics) dominates in practice
Relaxed-plan heuristic	Drop all delete lists, solve the easy problem, use its length as a distance estimate	Admissible guide for A*-style forward planners
Frame problem	Anything not in an action’s add/delete list is assumed unchanged	Under-specify an effect and you build impossible plans — always ask “what stops being true?”
GraphPlan / SATPlan	Layered planning graph with mutexes / compile a length-\(k\) plan into one CNF formula	Let propagation and SAT solvers do the combinatorics
CSP	Variables + domains + constraints; find an assignment violating none	Scheduling, timetabling, coloring, configuration — order doesn’t matter
Backtracking	DFS over partial assignments; undo on a dead end	Baseline CSP solver; pairs with ordering heuristics
MRV / degree / LCV	Fewest legal values first / most-constrained tie-break / value ruling out fewest neighbor options	Pick the variable likely to fail fast, the value likely to succeed
AC-3 (arc consistency)	Delete domain values with no supporting partner; cascade; \(O(e\,d^3)\)	Prune before/during search (MAC); sound but incomplete
Forward checking	On assignment, strike the value from unassigned neighbors’ domains	Cheaper than AC-3; feeds the counts MRV reads
Min-conflicts	Start complete, repeatedly move a conflicted variable to its fewest-conflict value	Huge CSPs with dense solutions (million-queens in ~50 steps); incomplete
SAT / CDCL	Boolean satisfiability in CNF; conflict-driven clause learning	NP-complete archetype; solvers (MiniSat, Glucose, Z3) handle millions of vars
Minimax	MAX maximizes, MIN minimizes, backed up the game tree	Two-player zero-sum perfect-information games
Alpha-beta	Prune branches that can’t change the result; \(O(b^{d/2})\) with good ordering	Same value as minimax, roughly doubles search depth
Evaluation + quiescence	Estimate non-terminal positions; keep searching forcing moves to beat the horizon effect	Depth-limited engines (chess); learned weights = NNUE
Expectimax	Chance nodes take the probability-weighted average of children	Games with randomness (dice, shuffled decks); needs meaningful magnitudes
MCTS / UCT	Random rollouts + \(\frac{w_i}{n_i}+c\sqrt{\ln N/n_i}\) explore/exploit selection	No eval function, anytime; huge branching (Go); add nets → AlphaZero PUCT

34.10 — Key takeaways

Planning models the world as states + actions (STRIPS/PDDL: preconditions, add list, delete list) and searches for an action sequence from the initial state to the goal. Forward search keeps states concrete; backward search stays goal-focused; partial-order and HTN methods shrink the space by committing late and planning hierarchically.
The frame problem is the classic planning trap: everything not in an action’s delete list is assumed unchanged, so a missing effect silently produces impossible plans.
GraphPlan (layered planning graph with mutexes) and SATPlan (compile a bounded plan into one Boolean formula) reframe planning so that propagation and SAT solvers do the heavy lifting — and supply the heuristics modern forward planners run on.
CSPs are defined by variables, domains, and constraints; backtracking (with MRV/LCV ordering) plus constraint propagation (arc-consistency / AC-3, maintained as MAC) prunes the search dramatically. Map-coloring, Sudoku, and SAT are all CSPs.
SAT is the NP-complete archetype; modern CDCL solvers with clause learning handle millions of variables and serve as a practical back-end for many constraint and even planning problems. In practice you reach for OR-Tools, PySAT, or Z3 rather than rolling your own.
Minimax solves perfect-information zero-sum games by assuming an optimal opponent; alpha-beta pruning returns the identical value while skipping provably irrelevant branches, roughly doubling reachable depth with good move ordering. Expectimax handles chance, and evaluation functions (plus quiescence search to beat the horizon effect) make deep games tractable.
MCTS/UCT replaces hand-crafted evaluation with random rollouts and a principled explore/exploit rule, is anytime and domain-light, and — fused with neural networks via the PUCT rule — became the engine of AlphaGo and AlphaZero.

34.11 — See also

Chapter 32 — Search & Problem Solving — the uninformed and heuristic search (BFS, DFS, A*) that planning and CSP backtracking build directly upon.
Chapter 33 — Knowledge Representation & Reasoning — the logic and symbolic facts that STRIPS states and SAT clauses encode.
Chapter 35 — Evolutionary Computation & Metaheuristics — alternative search strategies for the same hard combinatorial problems.
Chapter 25 — Reinforcement Learning — the value-estimation and exploration ideas behind UCT, and the RL that trained AlphaGo’s networks.
Chapter 03 — Optimization — the broader view of searching for optima that CSP and game search specialize.
Chapters 14–15 — Neural Networks & CNNs — the policy and value networks that modern MCTS systems plug into.

↪ The thread continues → Chapter 35 · 🧬 Evolutionary Computation & Metaheuristics

Classical search is exact but brittle. When the space is vast and gradients vanish, we borrow from nature — evolution and swarms — to optimize by selection rather than calculus.

📖 All chapters | ← 33 · 📖 Knowledge Representation & Reasoning | 35 · 🧬 Evolutionary Computation & Metaheuristics →