flowchart LR
W((Physical<br/>world)) -->|sensors| P[Perceive<br/>pixels → features]
P --> E[Estimate<br/>features → belief]
E --> PL[Plan<br/>belief → actions]
PL --> A[Act<br/>actions → torques]
A -->|motion| W
E -.->|state feedback| A
style W fill:#1a2a3a,stroke:#4a90d9,color:#fff
Chapter 41 — 🤖 Robotics & Autonomy
📖 All chapters | ← 40 · 🔗 Graph Machine Learning | 42 · 📐 Learning Theory →
📚 Jump to any chapter
🧮 Mathematical Foundations
- 01 · 🧮 Linear Algebra
- 02 · ∂ Calculus & Differentiation
- 03 · 📉 Optimization
- 04 · 🎲 Probability & Statistics
🧭 The ML Workflow
🧩 Classical Machine Learning
- 08 · 📈 Regression
- 09 · 📐 Classification Algorithms
- 10 · 🌳 Ensemble Methods
- 11 · 🔮 Clustering & Unsupervised Learning
- 12 · 🎯 Model Evaluation & Tuning
🎲 Probabilistic Models
🧠 Deep Learning
- 14 · 🧠 Neural Networks (Core)
- 15 · 🖼️ Convolutional Neural Networks
- 16 · 🔁 Recurrent & Sequence Models
- 17 · ⚡ Attention & Transformers
- 18 · 🎨 Generative Models
🗣️ Applied AI: Vision, Language, Audio & Time
- 19 · 👁️ Computer Vision
- 20 · 💬 Natural Language Processing
- 21 · 🔊 Speech & Audio Processing
- 22 · ⏳ Time Series & Forecasting
- 23 · 📚 Large Language Models
- 24 · 🌈 Multimodal AI
🕹️ Reinforcement Learning
🛠️ Applied ML Systems & Industries
🚀 Production, Tooling & Infrastructure
📚 Classical & Symbolic AI
- 32 · 🧭 Search & Problem Solving
- 33 · 📖 Knowledge Representation & Reasoning
- 34 · 🗺️ Planning, Constraint Satisfaction & Game Playing
- 35 · 🧬 Evolutionary Computation & Metaheuristics
⚖️ Responsible AI & Frontier
- 36 · 🔍 Explainable AI & Interpretability
- 37 · 🧷 Causal Inference
- 38 · ⚖️ AI Ethics, Fairness & Safety
- 39 · 🌠 Frontier & Emerging Directions
🎓 Advanced & Specialized Topics
- 40 · 🔗 Graph Machine Learning
- 41 · 🤖 Robotics & Autonomy
- 42 · 📐 Learning Theory
- 43 · 🔎 Information Retrieval & Data Mining
- 44 · 🏗️ LLM Systems: Building LLMs from Scratch
🎚️ Post-Training & Fine-Tuning
- 45 · 🎚️ Post-Training I — Transfer, Fine-Tuning & PEFT
- 46 · 🏅 Post-Training II — Alignment & Evaluation
🚢 Model Serving & Deployment
A robot is a machine that senses, decides, and acts in a physical world that never quite matches the model in its head. The hard part is not the motors or the math in isolation — it is closing the loop between them fast enough, and reliably enough, that the robot does the right thing while sensors lie and gears slip. This chapter walks the autonomy stack from raw sensing to motor commands, leaning on intuition first and the precise equations second.
🧭 In context: autonomous systems — mobile robots, manipulators, self-driving — all run the same loop: perceive → estimate → plan → act, closed continuously under uncertainty. Perception leans on computer vision (Ch 16–18), estimation on probabilistic models (Ch 12), control and learning on RL & control (Ch 25), and SLAM back-ends on optimization (Ch 06). This chapter is the glue that wires them into something that moves.
💡 Remember this: autonomy is one closed loop — perceive → estimate → plan → act — run forever under uncertainty, and a robot’s reliability comes from how fast and how surely it closes that loop, not from any single algorithm in it.
41.1 The autonomy stack: perceive → estimate → plan → act
The intuition is a loop, not a pipeline. A robot does not perceive once and then drive forever; it re-senses every few milliseconds and corrects. Think of walking across a dark room: you take a step, feel where you are, adjust, step again. Robots formalize that instinct into four repeating stages.
Perceive turns raw sensor bytes (camera pixels, LiDAR returns, wheel encoder ticks, IMU accelerations) into features — edges, obstacles, landmarks. Estimate fuses those noisy features over time into a belief about the world state: where am I, where is everything else, how fast am I going. Plan chooses a sequence of actions to reach a goal given that belief. Act sends torques and velocities to actuators — and immediately feeds the resulting motion back into perception.
The loop below runs forever; watch the active stage travel around it — that traveling pulse is the robot’s heartbeat, and a real system runs several of these nested at different speeds.
Each stage maps to later sections: perception leans on computer vision (and the CNN backbones behind detection and depth); estimation is §41.4–41.5; planning is §41.6–41.7; acting is §41.3. The loop runs at different rates — a low-level control loop at 1 kHz, a planner at 10 Hz, perception at 30 Hz — and a real system is really nested loops, a fast inner control loop wrapped by slow outer planning.
The single most useful robotics question is “what rate does this loop close at, and what is its latency?” A perfect plan delivered 200 ms late can be worse than a rough plan delivered now. Autonomy is a real-time discipline before it is an AI discipline.
Sensors and coordinate frames: the substrate everything sits on
Before any algorithm runs, the robot must answer two physical questions: what can I measure, and measured relative to what. Think of frames like the way you give directions — “two meters to my left” only means something once you know where “I” am facing. Robots make this explicit.
Each sensor has strengths and blind spots, and good autonomy fuses complementary ones rather than trusting any single sensor:
| Sensor | Measures | Strength | Weakness |
|---|---|---|---|
| Camera | light / color | rich, cheap, dense | no native depth, fails in dark/glare |
| LiDAR | range via laser ToF | precise 3-D geometry | expensive, sparse, struggles in rain/fog |
| Radar | range + velocity (Doppler) | works in weather, direct speed | low resolution |
| IMU | acceleration + angular rate | fast (kHz), no external dependence | drifts badly when integrated |
| Wheel encoder | wheel rotation | simple odometry | slips, drifts (§41.2) |
| GPS/GNSS | global position | absolute, no drift | ~meters, blocked indoors/urban canyons |
A coordinate frame is a local set of axes attached to something — the world, the robot body, the camera, a gripper. A pose is always of one frame, expressed in another: “the camera frame in the world frame.” The bookkeeping that relates them is a rigid transform from linear algebra — a rotation \(R\) plus a translation \(t\) — composed as you hop from frame to frame. A point \(p\) known in the camera frame becomes, in the world frame,
\[p^{\text{world}} = R^{\text{world}}_{\text{cam}}\, p^{\text{cam}} + t^{\text{world}}_{\text{cam}}.\]
In words: to move a point from the camera’s viewpoint into world coordinates, rotate it to align the axes, then shift it by where the camera sits. Also written: in homogeneous form one \(4\times4\) matrix does both, \(\tilde p^{\text{world}} = T^{\text{world}}_{\text{cam}}\, \tilde p^{\text{cam}}\) where \(T = \begin{bmatrix} R & t \\ 0 & 1 \end{bmatrix}\) and \(\tilde p = (x,y,z,1)\).
Why bother stuffing rotation and translation into one \(4\times4\) matrix? Because then chaining frames is just multiplication. Picture the frames as a family tree: world → base → arm → camera. To turn a point the camera sees into world coordinates, you walk up that tree, and each hop is one matrix. Multiply them in order and you get a single transform \(T^{\text{world}}_{\text{cam}} = T^{\text{world}}_{\text{base}}\, T^{\text{base}}_{\text{arm}}\, T^{\text{arm}}_{\text{cam}}\) that does the whole trip at once. Want the reverse trip (world → camera)? Just invert the matrix. That is the entire reason robotics code is full of \(4\times4\) matrices. ROS keeps this whole tree of frames updated live (the tf system), and getting one hop wrong — a flipped sign, a swapped pair — is the single most common robotics bug (§41.10).
41.2 Configuration space and kinematics
To plan motion you need coordinates for “what the robot is doing.” The configuration \(q\) is the minimal set of numbers that pins down every part of the robot. A point robot in a plane has \(q=(x,y)\); a planar mobile robot adds heading, \(q=(x,y,\theta)\); a 7-joint arm has \(q=(\theta_1,\dots,\theta_7)\). The set of all legal configurations is the configuration space (C-space), and its dimension is the robot’s number of degrees of freedom (DOF).
The key mental shift: plan in C-space, not in the world. A fat L-shaped robot squeezing through a doorway is hard to reason about as a shape sliding among shapes. But map each obstacle into C-space — inflating it by the robot’s footprint — and the robot becomes a single point navigating among “configuration-space obstacles.” Planning a point through free space is far easier (this is exactly what §41.7 exploits).
The doodle below shows that trick: a disk robot near a wall (left) becomes a point near a thickened wall (right), grown by the robot’s radius.
Kinematics relates joint/wheel coordinates to where things end up in the world, ignoring forces and mass (that’s dynamics). Two directions:
- Forward kinematics (FK): given \(q\), where is the end-effector / robot? Always a well-defined function, \(x = f(q)\).
- Inverse kinematics (IK): given a desired end-effector pose \(x\), what \(q\) achieves it? Often nonlinear, with zero, one, many, or infinitely many solutions.
The Jacobian: from joint speeds to tip speeds
There is a layer between FK and full dynamics that does the day-to-day work of controlling an arm: the Jacobian. Intuition first — FK tells you where the hand is for a given pose; the Jacobian tells you which way the hand drifts when you nudge each joint a little. It is the dictionary that translates joint-space velocities into workspace velocities, built from the partial derivatives of calculus.
\[\dot{x} = J(q)\,\dot{q}, \qquad J(q) = \frac{\partial f}{\partial q}.\]
In words: the end-effector’s velocity equals the Jacobian matrix times the joint velocities — each column of \(J\) is the tip motion caused by spinning one joint at unit speed. Also written: componentwise, \(\dot{x}_i = \sum_j \dfrac{\partial f_i}{\partial q_j}\,\dot{q}_j\).
Running it backwards gives a practical, iterative way to do inverse kinematics: to move the tip by a small desired step \(\Delta x\), solve \(\Delta q = J^{+}\Delta x\) (using the pseudo-inverse \(J^{+}\) for redundant arms), step, recompute \(J\), repeat. When \(J\) loses rank the arm is at a singularity — a stretched-straight elbow, say — where some tip directions need infinite joint speed; controllers must damp these or steer around them.
The differential-drive / unicycle model
Most wheeled robots (and, abstracted, most cars) are modeled as a unicycle: a body at \((x,y)\) pointing at angle \(\theta\), commanded by a forward speed \(v\) and turn rate \(\omega\). Its kinematics are
\[\dot{x} = v\cos\theta, \qquad \dot{y} = v\sin\theta, \qquad \dot{\theta} = \omega.\]
In words: the robot always moves in the direction it is facing at speed \(v\), and its heading turns at rate \(\omega\) — it can drive and spin, but never slide sideways. Also written: in vector form \(\dot{q} = \begin{bmatrix}\cos\theta & 0\\ \sin\theta & 0\\ 0 & 1\end{bmatrix}\begin{bmatrix}v\\ \omega\end{bmatrix}\), exposing the two control columns of this \(3\times2\) map.
A real differential-drive robot has two wheels of radius \(r\) a distance \(L\) apart, spinning at \(\omega_L, \omega_R\). Convert wheel speeds to body commands:
\[v = \frac{r(\omega_R + \omega_L)}{2}, \qquad \omega = \frac{r(\omega_R - \omega_L)}{L}.\]
In words: average the two wheel surface speeds to get forward speed; their difference, divided by the track width, gives turn rate — equal wheels drive straight, opposite wheels spin in place. Also written: \(\begin{bmatrix} v \\ \omega \end{bmatrix} = \dfrac{r}{2}\begin{bmatrix} 1 & 1 \\ 2/L & -2/L \end{bmatrix}\begin{bmatrix}\omega_R\\ \omega_L\end{bmatrix}\).
This model is nonholonomic: the robot cannot slide sideways. There are 3 configuration DOF \((x,y,\theta)\) but only 2 controls \((v,\omega)\). You can reach any pose, but not by any path — hence parallel parking takes a wiggle, not a sideways glide.
Worked example — dead reckoning. A differential-drive robot with \(r=0.05\) m wheels, \(L=0.3\) m, runs \(\omega_R=10\), \(\omega_L=8\) rad/s for \(0.5\) s starting at the origin facing \(\theta=0\).
\[v = \frac{0.05(10+8)}{2} = 0.45 \text{ m/s}, \qquad \omega = \frac{0.05(10-8)}{0.3} = 0.333 \text{ rad/s}.\]
After \(0.5\) s, \(\theta \approx 0.167\) rad. Integrating with the midpoint heading \(\approx 0.083\) rad: \(x \approx 0.45 \cdot 0.5 \cdot \cos(0.083) \approx 0.224\) m, \(y \approx 0.45 \cdot 0.5 \cdot \sin(0.083) \approx 0.0187\) m. Notice the error this accrues: integrating wheel speeds open-loop is odometry, and it drifts without bound — the motivation for §41.4.
41.3 Feedback control: PID, feedback linearization, flatness
Kinematics tells you how commands move the robot; control decides what commands to send so the robot tracks a target despite disturbances. The governing idea is feedback: measure the error between where you are and where you want to be, and drive it to zero.
PID — the workhorse
PID forms a command from three views of the error \(e(t) = \text{target} - \text{measured}\):
\[u(t) = K_p\, e(t) + K_i \int_0^t e(\tau)\,d\tau + K_d\, \frac{de(t)}{dt}.\]
In words: push in proportion to how far off you are now, plus a correction for error that has piled up over time, plus a brake that responds to how fast the error is changing. Also written: in discrete time, \(u_k = K_p e_k + K_i \sum_{j\le k} e_j\,\Delta t + K_d\,(e_k - e_{k-1})/\Delta t\) — exactly the code below.
- Proportional (\(K_p\)): push harder the farther off you are. Alone, it leaves a steady-state offset and can oscillate.
- Integral (\(K_i\)): accumulate past error to erase steady offset (e.g., gravity sag on an arm). Too much causes windup and overshoot.
- Derivative (\(K_d\)): react to how fast error is changing — damping that anticipates and reduces overshoot. Sensitive to noisy measurements.
A picture of the three roles: \(P\) is a spring pulling toward target, \(D\) is a shock absorber damping the swing, \(I\) slowly removes any leftover gap.
# from-scratch PID controller, one step
class PID:
def __init__(self, kp, ki, kd, dt):
self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
self.int = 0.0; self.prev = 0.0
def step(self, target, measured):
e = target - measured
self.int += e * self.dt # accumulate
deriv = (e - self.prev) / self.dt # rate of change
self.prev = e
return self.kp*e + self.ki*self.int + self.kd*deriv
pid = PID(kp=2.0, ki=0.5, kd=0.1, dt=0.1)
pos = 0.0
for _ in range(3): # drive position toward 1.0
u = pid.step(1.0, pos)
pos += 0.1 * u # crude plant: integrates command
print(round(pos, 3)) # 0.2, 0.46, 0.74... climbing to targetIntegral windup is the classic PID failure: while the robot is saturated (motor maxed, arm blocked), the integral term keeps growing, then overshoots wildly when the obstruction clears. Always clamp the integral term, and consider disabling it during saturation.
A production controller adds the guards the from-scratch loop omits — output saturation, anti-windup clamping, and derivative-on-measurement to avoid kicks when the target jumps:
# idiomatic discrete PID with the guards real controllers need
import numpy as np
class PIDController:
def __init__(self, kp, ki, kd, dt, u_min=-1.0, u_max=1.0):
self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
self.u_min, self.u_max = u_min, u_max
self.integral = 0.0
self.prev_meas = None
def step(self, target, measured):
e = target - measured
# derivative on measurement (not error) -> no spike on setpoint change
d = 0.0 if self.prev_meas is None else -(measured - self.prev_meas) / self.dt
self.prev_meas = measured
u_unsat = self.kp * e + self.ki * self.integral + self.kd * d
u = float(np.clip(u_unsat, self.u_min, self.u_max))
# anti-windup: only integrate when not saturated (back-calculation lite)
if u == u_unsat:
self.integral += e * self.dt
return u
ctrl = PIDController(kp=2.0, ki=0.5, kd=0.1, dt=0.1)
print(round(ctrl.step(1.0, 0.0), 3)) # first command toward targetWhen PID isn’t enough: feedback linearization and flatness
PID treats the system as roughly linear near the setpoint. Real robots are nonlinear — the unicycle’s \(\cos\theta\), an arm’s coupled inertias. Two ideas tame this:
Feedback linearization cancels the nonlinearity with the control law itself. If the dynamics are \(\dot{x} = f(x) + g(x)u\), choose \(u = g(x)^{-1}(-f(x) + v)\), leaving the clean linear system \(\dot{x} = v\), which you then control with PID or LQR — itself a small optimization over feedback gains. It is exact only if your model \(f, g\) is exact — model error leaks straight through.
In words: measure the messy nonlinear part of the dynamics and deliberately push back exactly enough to erase it, so what’s left behaves like a simple straight-line system you already know how to control. Also written: substituting the chosen \(u\) into \(\dot x = f(x)+g(x)u\) gives \(\dot x = f(x) - f(x) + v = v\).
Differential flatness is a structural gift some systems have: all states and inputs can be written as functions of a few flat outputs and their derivatives. A quadrotor is flat in \((x,y,z,\text{yaw})\) — pick a smooth trajectory for those four and every motor command is determined algebraically, no integration of dynamics needed. This is why aggressive drone-trajectory planners work: plan a smooth curve in flat space, recover controls in closed form.
For the optimal-control treatment of the linearized system — LQR, which picks the feedback gains that minimize a quadratic cost on error and effort — see reinforcement learning & control. PID is the hand-tuned cousin; LQR is its principled, model-based sibling.
Model-predictive control (MPC): planning a few steps ahead
PID reacts to the present; MPC looks ahead. The everyday analogy is driving: you don’t steer based only on where the car is now, you glance down the road, imagine the next few seconds, and pick the steering that keeps that whole imagined path on track — then you do it again a moment later with a fresh glance.
Formally, every control cycle MPC solves a short-horizon optimization, applies only the first command, then re-solves at the next step (receding horizon):
\[\min_{u_0,\dots,u_{N-1}} \;\sum_{k=0}^{N-1}\Big[ (x_k - x_k^{\text{ref}})^\top Q (x_k - x_k^{\text{ref}}) + u_k^\top R\, u_k \Big] \quad \text{s.t. } x_{k+1}=f(x_k,u_k),\; u_k\in\mathcal{U}.\]
In words: over the next \(N\) steps, pick the commands that keep the predicted state close to the reference while spending little control effort, respecting the physics and the actuator limits — then commit only to the first command and repeat. Also written: the per-step penalty \(\|x_k-x_k^{\text{ref}}\|_Q^2 + \|u_k\|_R^2\) uses the weighted-norm shorthand \(\|z\|_M^2 = z^\top M z\).
The animation makes the receding horizon idea concrete: at each tick the planner imagines a short path ahead (the faint horizon), commits only the first step, then slides the whole window forward and re-imagines.
MPC’s selling point is that it handles constraints natively — speed caps, lane boundaries, joint limits, obstacle clearance go straight into the optimization, which PID cannot do. Its cost is computational: an optimization problem solved every cycle. It is the workhorse of modern self-driving steering, legged-robot balance, and process control. It is also the fast replanning loop referenced in §41.7.
41.4 State estimation and the recursive Bayes filter
Odometry drifts (§41.2) and every sensor is noisy. State estimation answers “what is the state, given everything I’ve sensed?” not as a point but as a probability distribution — the belief \(bel(x_t)\). Maintaining a distribution, not a guess, is what lets a robot know how sure it is and fuse a fresh measurement with the right weight.
The recursive Bayes filter is the master algorithm behind Kalman and particle filters alike. Each cycle has two steps:
Predict (motion update): push the belief through the motion model. The robot acted, so its uncertainty grows.
\[\overline{bel}(x_t) = \int p(x_t \mid u_t, x_{t-1})\, bel(x_{t-1})\, dx_{t-1}\]
In words: to predict where you are now, take every place you might have been, ask where each would land given the action you just took, and add up those possibilities weighted by how likely each starting place was. Also written: as an expectation, \(\overline{bel}(x_t) = \mathbb{E}_{x_{t-1}\sim bel}\!\left[\,p(x_t \mid u_t, x_{t-1})\,\right]\).
Update (measurement correction): multiply by the likelihood of the observation, then renormalize. A measurement arrived, so uncertainty shrinks.
\[bel(x_t) = \eta\, p(z_t \mid x_t)\, \overline{bel}(x_t)\]
In words: reweight each candidate state by how well it explains the measurement you just got, then rescale so the weights sum to one. Also written: \(bel(x_t) \propto p(z_t \mid x_t)\,\overline{bel}(x_t)\), with \(\eta = 1/\!\int p(z_t\mid x_t)\,\overline{bel}(x_t)\,dx_t\).
The whole filter is one breathing rhythm: acting widens the belief, sensing narrows it. Watch the bell curve below pulse wide (predict) then snap sharp (update), forever.
flowchart LR
B["bel(x_{t-1})"] -->|"predict: apply motion u_t"| BP["bel-bar(x_t)<br/>wider, less certain"]
BP -->|"update: weigh by measurement z_t"| BN["bel(x_t)<br/>narrower, more certain"]
BN -.->|next step| B
Everything downstream is this loop with a specific representation of the belief: a Gaussian (Kalman), a Gaussian after local linearization (EKF/UKF), or a cloud of samples (particle filter). The two-step rhythm — act widens, sense narrows — never changes.
41.5 Kalman, EKF/UKF, and particle filters
The Kalman filter
When motion and measurement are linear and noise is Gaussian, the Bayes filter has a closed form: the belief stays Gaussian forever, summarized by a mean \(\hat{x}\) and covariance \(P\). The Kalman filter is the optimal estimator in that setting, and it is just the two Bayes steps written for Gaussians.
The magic number is the Kalman gain \(K\) — how much to trust the new measurement versus the prediction. High measurement noise → small \(K\) → mostly believe the prediction. High process noise → large \(K\) → mostly believe the sensor.
\[K = \frac{P}{P+R}, \qquad \hat{x} \leftarrow \hat{x} + K(z - \hat{x}), \qquad P \leftarrow (1-K)P.\]
In words: the gain is the prediction’s share of the total uncertainty; nudge the estimate toward the measurement by that fraction of the surprise, and shrink the uncertainty accordingly. Also written: in full matrix form \(K = P H^\top (H P H^\top + R)^{-1}\), with \(\hat{x} \leftarrow \hat{x} + K(z - H\hat{x})\) and \(P \leftarrow (I - KH)P\); the scalar case above is \(H=1\).
The picture below is the whole filter in one frame: a fuzzy prediction (blue) and a fuzzy measurement (amber) blend into a sharper fused belief (green) sitting between them — narrower than either input.
Worked 1-D example. Estimate a robot’s position. We believe \(\hat{x}=0\) m with variance \(P=4\) (std 2 m). The robot does not move (\(\hat{x}\) prediction stays 0), but motion adds process noise \(Q=1\), so after predict \(P = 4+1 = 5\). A GPS reads \(z=3\) m with measurement variance \(R=2\).
\[K = \frac{P}{P+R} = \frac{5}{5+2} = 0.714\] \[\hat{x} \leftarrow \hat{x} + K(z - \hat{x}) = 0 + 0.714(3-0) = 2.14 \text{ m}\] \[P \leftarrow (1-K)P = (1-0.714)\cdot 5 = 1.43\]
The estimate moved 71% of the way to the GPS reading, and variance dropped from 5 to 1.43 — fusing prior and measurement left us more certain than either alone. That variance reduction is the whole point.
# 1-D Kalman update, matches the worked numbers
def kalman_1d(x, P, z, Q, R, dx=0.0):
x = x + dx; P = P + Q # predict
K = P / (P + R) # gain
x = x + K * (z - x) # update mean
P = (1 - K) * P # update covariance
return x, P
x, P = kalman_1d(0.0, 4.0, 3.0, Q=1.0, R=2.0)
assert abs(x - 2.142) < 1e-2 and abs(P - 1.428) < 1e-2 # check
print(round(x, 2), round(P, 2)) # 2.14 1.43The multivariate filter is the same two steps in matrix form. In practice you reach for a library rather than re-deriving the algebra:
# multivariate Kalman filter with FilterPy: constant-velocity tracker
import numpy as np
from filterpy.kalman import KalmanFilter
kf = KalmanFilter(dim_x=2, dim_z=1) # state = [position, velocity]
dt = 1.0
kf.F = np.array([[1, dt], [0, 1]]) # motion: x += v*dt
kf.H = np.array([[1.0, 0.0]]) # we measure position only
kf.P *= 4.0 # initial covariance
kf.R = np.array([[2.0]]) # measurement noise
kf.Q = np.array([[1.0, 0.0], [0.0, 1.0]]) # process noise
for z in [3.0, 3.9, 5.2, 6.1]: # noisy position stream
kf.predict()
kf.update(z)
print(np.round(kf.x.ravel(), 2)) # fused position + inferred velocityEKF and UKF — coping with nonlinearity
Real motion (\(\cos\theta\)) and sensors (range = \(\sqrt{\Delta x^2+\Delta y^2}\)) are nonlinear, so the belief is no longer exactly Gaussian. Two pragmatic fixes keep the Kalman machinery:
- Extended Kalman Filter (EKF): linearize the models with a first-order Taylor expansion (Jacobians) around the current estimate, then apply the linear Kalman equations. Cheap and dominant in practice, but the linearization can diverge if the function curves sharply or uncertainty is large.
- Unscented Kalman Filter (UKF): skip Jacobians. Push a small set of deterministically chosen sigma points through the true nonlinear function and refit a Gaussian to where they land. More accurate for strong nonlinearity, no derivatives to derive, similar cost.
The particle filter
When the belief is genuinely non-Gaussian — multi-modal (“I’m at one of three identical doorways”), or arbitrarily shaped — represent it with a swarm of weighted samples (particles), each a hypothesis about the state. The loop: move every particle by the motion model (with noise), weight each by how well it explains the measurement, then resample — duplicate heavy particles, kill light ones. Survivors cluster on the truth.
flowchart LR
S[particles<br/>spread out] -->|move + noise| M[scattered<br/>hypotheses]
M -->|weight by p of z| Wt[some heavy,<br/>some light]
Wt -->|resample| C[cluster on<br/>likely states]
C -.->|next step| S
This is Monte Carlo localization, the standard way a robot finds itself on a known map from a uniform prior. It handles arbitrary distributions and nonlinearity, at the cost of needing many particles in high dimensions (the curse of dimensionality).
A minimal NumPy particle filter makes the three steps concrete:
# 1-D particle filter: localize from a noisy distance-to-landmark sensor
import numpy as np
rng = np.random.default_rng(0)
N, landmark = 2000, 10.0
px = rng.uniform(0, 20, N) # particles: spread over the corridor
w = np.ones(N) / N
def step(px, w, move, z, mn=0.3, sn=1.0):
px = px + move + rng.normal(0, mn, len(px)) # predict (move + noise)
pred = np.abs(landmark - px) # expected measurement
w = w * np.exp(-0.5 * ((z - pred) / sn) ** 2) # weight by likelihood
w += 1e-300; w /= w.sum() # normalize
idx = rng.choice(len(px), len(px), p=w) # resample
return px[idx], np.ones(len(px)) / len(px)
for z in [6.0, 5.1, 4.0, 3.2]: # robot approaching the landmark
px, w = step(px, w, move=1.0, z=z)
print(round(px.mean(), 2)) # converges near true position ~6| Filter | Belief shape | Nonlinear? | Cost | Use when |
|---|---|---|---|---|
| Kalman | single Gaussian | linear only | cheap | linear-Gaussian system |
| EKF | single Gaussian | yes (Jacobian) | cheap | mild nonlinearity, fast loop |
| UKF | single Gaussian | yes (sigma points) | moderate | stronger nonlinearity, no derivatives |
| Particle | arbitrary samples | yes | expensive | multi-modal, global localization |
Rule of thumb for picking a filter: reach for the plain Kalman filter first; only climb to EKF/UKF when a model is genuinely nonlinear, and only to a particle filter when the belief is genuinely multi-modal (you cannot summarize it with one mean and covariance). Particle filters are powerful but pay an exponential price in dimension — a global-localization tool, not a default.
41.6 SLAM: simultaneous localization and mapping
There is a chicken-and-egg trap in §41.5: to know where you are you need a map, but to build a map you need to know where you are. SLAM solves both at once — estimate the robot’s trajectory and the map from the same stream of sensor data.
The intuition is loop closure. A robot drives a long loop; odometry drift means the map’s start and end don’t line up even though the robot recognizes it has returned to a known place. SLAM detects “I’ve been here before” and snaps the whole trajectory taut to make the ends meet, redistributing the accumulated error backward across the path.
Modern SLAM is posed as a graph optimization (pose-graph / factor-graph): nodes are robot poses and landmarks, edges are noisy constraints (odometry between consecutive poses, observations of landmarks, loop closures). Solving means finding the configuration of all nodes that best satisfies all constraints — a big nonlinear least-squares problem, optimized with the techniques of optimization. Front-ends differ by sensor (visual SLAM from cameras, LiDAR SLAM from point clouds), but the back-end is this same graph squeezed to consistency.
The objective the back-end minimizes makes “best satisfies all constraints” precise:
\[\hat{X} = \arg\min_{X} \sum_{(i,j)\in\mathcal{E}} \big\lVert z_{ij} - h(x_i, x_j) \big\rVert^2_{\Omega_{ij}}\]
In words: choose the poses and landmarks \(X\) that make every predicted measurement match its observed value as closely as possible, weighting each constraint by how trustworthy that sensor reading is. Also written: with \(e_{ij} = z_{ij} - h(x_i,x_j)\), the weighted norm expands to \(\sum_{ij} e_{ij}^\top\,\Omega_{ij}\,e_{ij}\), where \(\Omega_{ij}\) is the information (inverse-covariance) matrix of edge \((i,j)\).
The reason loop closure matters so much: without it, error grows roughly with distance traveled. With a single good loop closure, error is bounded by the loop — one recognized landmark can correct hundreds of meters of drift. Recognizing places (place recognition) is therefore as critical to SLAM as the geometry.
41.7 Motion and path planning
Given a map and a goal, planning finds a collision-free path through C-space (§41.2). Two families dominate, split by how they handle continuous space.
Search on a graph: Dijkstra and A*
Discretize C-space into a grid (or roadmap) and run a graph search on it. Dijkstra expands outward from the start by cheapest-cost-so-far \(g(n)\), guaranteeing the shortest path but exploring blindly in all directions. A* adds a heuristic \(h(n)\) — an optimistic estimate of remaining cost (e.g., straight-line distance to goal) — and expands by \(f(n) = g(n) + h(n)\), steering the search toward the goal.
\[f(n) = g(n) + h(n).\]
In words: rank each candidate node by the cost already paid to reach it plus an optimistic guess of the cost still to go, and always expand the most promising one next. Also written: Dijkstra is the special case \(h(n)=0\) (no guess, expand by cost-so-far alone).
A* returns the optimal path whenever \(h\) is admissible (never overestimates true remaining cost). The straight-line (Euclidean) distance is the canonical admissible heuristic for spatial grids: a real path can only be longer than the crow-flies line. A better (larger but still admissible) heuristic explores fewer nodes.
The contrast below is the whole reason A* exists: Dijkstra (left) fans out in a fat blind disk; A*’s heuristic (right) pulls the explored set into a narrow corridor aimed at the goal — same optimal path, a fraction of the work.
Tiny worked example. Grid from S to G, \(h\) = Manhattan distance, unit step cost:
S . . # g(S)=0
. # . . at a node 2 right of S: g=2, h=3 → f=5
. # . G a detour node: g=4, h=5 → f=9 (expanded later)
A* pops the lowest-\(f\) node each step. The node with \(f=5\) is explored before the \(f=9\) detour, so A* drives toward G instead of fanning out like Dijkstra — same optimal path, far fewer expansions.
Worked example — heuristic saves work. On an open \(100 \times 100\) grid, a goal \(60\) cells straight ahead, no obstacles. Dijkstra has no sense of direction, so it expands cells in a growing disk around the start: reaching distance \(60\) touches on the order of \(\pi \cdot 60^2 \approx 11{,}000\) cells. A* with the (admissible) straight-line heuristic only expands cells along the corridor toward the goal — on an obstacle-free grid that is essentially the \(\sim\!60\) cells of the path itself plus a thin fringe, a few hundred at most. Same optimal length, roughly 20–50× fewer expansions. Add obstacles and the gap shrinks (the heuristic gets fooled by walls), but the lesson holds: a good heuristic is what makes search tractable at scale.
Sampling-based planning: PRM, RRT, RRT*
Grids explode in high dimensions — a 7-DOF arm has no usable grid. Sampling-based planners dodge this by randomly sampling configurations and connecting only the valid ones, never enumerating the whole space.
- PRM (Probabilistic Roadmap): sample many random configurations, connect nearby collision-free pairs into a graph, then run A* on it. Built once, queried many times — ideal for a fixed environment.
- RRT (Rapidly-exploring Random Tree): grow a tree from the start. Repeatedly sample a random point, find the nearest tree node, and extend a short step toward the sample. The tree rapidly fills space, biased toward unexplored regions. Single-query and fast, but the path is jagged and not optimal.
- RRT*: RRT with rewiring — when adding a node, reconnect nearby nodes through it if that lowers their cost. As samples → ∞, the path converges to optimal (asymptotically optimal), at extra computation per node.
The SVG shows an RRT tree exploring from the start (green), reaching toward the goal (red) while avoiding an obstacle:
| Planner | Space | Optimal? | Query | Best for |
|---|---|---|---|---|
| Dijkstra | discrete grid | yes | single | small grids, no goal heuristic |
| A* | discrete grid | yes (admissible \(h\)) | single | grids with a goal direction |
| PRM | continuous | asymptotically | multi | fixed maps, repeated queries |
| RRT | continuous | no | single | fast feasible path, high-DOF |
| RRT* | continuous | asymptotically | single | high-DOF when path quality matters |
A planned geometric path is rarely the final word. Two refinements usually follow. Trajectory optimization (e.g., CHOMP, TrajOpt) takes a rough path and bends it into a smooth, dynamically feasible trajectory by minimizing a cost that trades off smoothness against obstacle clearance — useful when the sampled path is jagged. And because the world moves, planning is wrapped in a fast replanning loop: a local planner (or a model-predictive controller, §41.3’s cousin) re-solves a short-horizon plan every control cycle as new obstacles appear, while the global plan updates less often. Plan globally, react locally.
Here is A* end-to-end on a small grid, the algorithm behind the worked example:
# A* on a 4-connected grid; returns the optimal path or None
import heapq
def astar(grid, start, goal): # grid: set of blocked (r,c) cells
def h(n): return abs(n[0]-goal[0]) + abs(n[1]-goal[1]) # Manhattan, admissible
open_pq = [(h(start), 0, start, None)] # (f, g, node, parent)
came, g_best = {}, {start: 0}
while open_pq:
f, g, n, parent = heapq.heappop(open_pq)
if n in came: continue
came[n] = parent
if n == goal:
path = [n]
while came[path[-1]] is not None: path.append(came[path[-1]])
return path[::-1]
for dr, dc in [(1,0),(-1,0),(0,1),(0,-1)]:
m = (n[0]+dr, n[1]+dc)
if m in grid: continue # blocked
ng = g + 1
if ng < g_best.get(m, 1e9):
g_best[m] = ng
heapq.heappush(open_pq, (ng + h(m), ng, m, n))
return None
blocked = {(1,1),(2,1),(0,3)} # matches the worked-example walls
print(astar(blocked, (0,0), (2,3))) # optimal path around the wall41.8 Grasping, manipulation, and the contact problem
Mobile robots avoid contact; manipulators depend on it. Grasping is where geometry meets physics: the question is not just “can the fingers touch the object” but “will the object stay put when I lift, twist, and the world pushes back.”
The central concept is force closure: a grasp achieves force closure if the contacts can resist any external wrench (force + torque) by applying suitable contact forces. Intuitively, the fingers can squeeze in enough independent directions that no push or twist can break the hold. A frictionless point contact resists only along its normal; add friction and each contact resists within a friction cone (forces angled up to \(\arctan\mu\) from the normal). Enough cones spanning all directions → force closure. A practical rule of thumb: in the plane, two hard-finger contacts with friction on opposite faces can suffice; in 3-D you generally need at least three or four well-placed contacts.
The friction-cone condition that each contact must obey is worth stating crisply:
\[\lVert f_t \rVert \le \mu\, f_n, \qquad f_n \ge 0.\]
In words: a contact can only push (never pull), and the sideways friction it provides can be at most \(\mu\) times how hard it presses — exceed that and the finger slips. Also written: equivalently the contact force vector must lie inside the cone \(\arctan(\lVert f_t\rVert / f_n) \le \arctan\mu\), i.e. within angle \(\arctan\mu\) of the surface normal.
A second axis is prehensile vs. non-prehensile manipulation. Prehensile means grasping — the hand fully constrains the object (a pick-and-place gripper). Non-prehensile means controlling an object without caging it: pushing, sliding, throwing, tilting, toppling. Sliding a heavy box across the floor, or a hockey-style flick, are non-prehensile — cheaper and sometimes the only option for objects too big or oddly shaped to grasp, but harder to control because the object can slip away. Much modern manipulation research blends the two: nudge an object to a graspable pose (non-prehensile), then pick it up (prehensile).
Worked example — will it slip? A two-finger gripper holds a \(0.5\) kg part (\(mg \approx 4.9\) N) by pinching it, each finger pressing with normal force \(f_n = 8\) N against a rubber-on-plastic contact, \(\mu = 0.4\). Each finger can supply up to \(\mu f_n = 0.4 \times 8 = 3.2\) N of vertical friction, and two fingers give \(6.4\) N — comfortably above the \(4.9\) N weight, so the part holds with a safety factor of about \(1.3\). Drop \(\mu\) to \(0.25\) (a greasy part) and the budget falls to \(2 \times 0.25 \times 8 = 4.0\) N \(< 4.9\) N: the part slides. This is why grasp planners squeeze harder than theory demands and prefer high-friction contacts.
Force closure is a geometric guarantee that assumes your friction coefficient, contact locations, and object model are correct. Real grasps fail because \(\mu\) is lower than assumed (a greasy part), the object is heavier or its mass off-center, or the fingers land a millimeter off. Plan grasps with margin, and prefer grasps that are robust to small pose errors over grasps that are theoretically optimal but brittle.
41.9 Learning for robotics: imitation, sim-to-real, RL
Hand-engineering every behavior does not scale to dexterous, contact-rich, or visually complex tasks. Learning fills the gap, in three complementary modes.
Imitation learning (IL) teaches from demonstrations. The simplest form, behavioral cloning, is plain supervised learning: record an expert’s (observation → action) pairs and fit a policy. Its notorious failure is covariate shift / compounding error — the policy makes a small mistake, drifts to states the expert never visited, and has no idea what to do, so errors snowball. Fixes include DAgger (query the expert on the policy’s own visited states) and richer demonstration coverage. The deeper RL/IL treatment lives in reinforcement learning; here the point is that demonstrations are often the cheapest way to specify a hard-to-reward behavior.
A behavioral-cloning policy is just a supervised regressor from observations to actions — a few lines in PyTorch:
# behavioral cloning: fit a policy net to (observation -> action) demos
import torch, torch.nn as nn
policy = nn.Sequential( # obs_dim=8 (e.g. joint state), act_dim=2
nn.Linear(8, 64), nn.ReLU(),
nn.Linear(64, 64), nn.ReLU(),
nn.Linear(64, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss() # regress continuous actions
# obs, act: tensors of expert demonstrations
for epoch in range(100):
opt.zero_grad()
loss = loss_fn(policy(obs), act) # match expert action at each state
loss.backward(); opt.step()
# at deploy time: action = policy(current_obs) -- but beware covariate shiftReinforcement learning for control learns a policy by trial and error against a reward (Chapter 25’s algorithms). In robotics, raw RL on hardware is usually impractical — millions of trials would destroy the robot — so it is run in simulation, which is fast, safe, and resettable.
Sim-to-real is then the central challenge: a policy trained in a perfect simulator meets the reality gap — unmodeled friction, sensor latency, slightly wrong masses — and fails on the real robot. The dominant remedy is domain randomization: during training, randomize physics and rendering parameters (friction, masses, lighting, textures, latencies) so widely that the real world looks like just another sample. A policy robust across thousands of simulated variations tends to transfer, because reality falls inside the training distribution.
flowchart LR
subgraph Sim["Simulation (cheap, fast)"]
DR[Domain randomization:<br/>vary mass, friction,<br/>lighting, latency] --> RL[Train policy by RL]
end
RL -->|deploy weights| Real[Real robot]
Real -.->|reality gap:<br/>collect failures| DR
style Sim fill:#1a2a3a,stroke:#4a90d9,color:#fff
A newer current worth naming: end-to-end and foundation-model policies. Instead of the hand-built perceive→estimate→plan→act stack, a single large network maps pixels (and language instructions) straight to actions — vision-language-action models — a multimodal policy trained on huge multi-robot demonstration datasets, aiming for one policy that generalizes across tasks and embodiments. They trade the interpretability and verifiability of the classical stack for breadth; in practice the two coexist, with a safety layer (§41.10) underneath either. Frontier directions cover this in depth.
The unifying theme: learning shifts effort from coding behaviors to specifying objectives and providing data/experience — powerful, but it inherits all of ML’s data and distribution-shift pitfalls, now with a robot that can physically break.
41.10 Safety, human interaction, and calibration realities
A robot exerts force in a world containing people. Three practical realities separate a demo from a deployable system.
Safety is layered, not a single check. A common pattern: a fast, simple, certifiable safety layer runs underneath the smart-but-fallible autonomy stack — emergency-stop on collision-imminent, speed and force limits, a watchdog that halts if the control loop misses its deadline. The principle is that the simple layer is the one you can verify and trust; the learned planner proposes, the safety monitor disposes. Formal tools like control-barrier functions and reachability analysis make “the robot cannot enter this unsafe set” a provable property rather than a hope.
The mental model is a referee sitting under the planner: the clever stack suggests a move, and the simple verified monitor either passes it through or vetoes it down to a safe fallback.
Human-robot interaction (HRI) adds that humans are part of the environment and they are reactive and hard to predict. A robot sharing a space must estimate human intent, signal its own (a self-driving car nudging forward to indicate it will go), and stay legible — moving so a person can read what it will do next. Predictability often beats raw efficiency: a slightly slower motion a human can anticipate is safer than a fast optimal one that surprises them.
Calibration is the unglamorous foundation everything rests on. Models assume the camera is exactly where the CAD says, the wheels are exactly \(r\), the IMU reads zero at rest — and reality disagrees. Sensors need intrinsic calibration (a camera’s lens distortion and focal length) and extrinsic calibration (the rigid transform between the camera and the arm base — the “hand-eye” calibration). Wheels wear, gears have backlash, an IMU has a bias that drifts with temperature, and a PWM servo driver often runs a few percent off its nominal frequency. The best estimator and planner in the world produce garbage if fed an uncalibrated transform.
The most common “the algorithm is broken” bug in real robotics is actually a calibration or units error — a transform with the wrong sign, degrees fed where radians were expected, a frame mounted upside-down, or a sensor timestamp that is 50 ms stale. Before blaming the filter or planner, verify frames, units, signs, and time synchronization. Leave a calibration knob exposed for every physical constant; the real world will need it tuned.
41.11 The ROS ecosystem and middleware
Almost no real robot is one monolithic program. The everyday analogy is a newsroom: many specialists (a camera reporter, a planning editor, a motor desk) work in parallel, and a wire service routes each story to whoever subscribes to that topic. Robotics middleware — overwhelmingly ROS 2 (Robot Operating System) — is that wire service.
The core abstraction is a graph of independent nodes that communicate without knowing about each other:
- Topics — anonymous publish/subscribe streams for continuous data (a camera node publishes images on
/camera/image; a perception node subscribes). Many-to-many, fire-and-forget. - Services — request/response for occasional queries (“what is the current map?”).
- Actions — long-running goals with feedback and the ability to cancel (“navigate to the kitchen,” streaming progress).
- Transforms (
tf2) — the live tree of coordinate frames from §41.1, so any node can ask “where is the gripper in the map frame, right now?”
flowchart LR
CAM[camera node] -->|/image| PER[perception node]
LID[lidar node] -->|/scan| SLAM[slam node]
PER -->|/obstacles| NAV[planner node]
SLAM -->|/map + tf| NAV
NAV -->|/cmd_vel| BASE[motor driver node]
style NAV fill:#6366f1,stroke:#6366f1,color:#fff
This decoupling is why robotics is composable: swap the LiDAR SLAM node for a visual one without touching the planner, record every topic to a log (rosbag) and replay it to debug offline, or run nodes across several machines. ROS 2 rides on DDS for real-time-capable, configurable-reliability transport. A minimal publisher shows the shape of node code:
# ROS 2 node publishing a velocity command (rclpy)
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import Twist
class Driver(Node):
def __init__(self):
super().__init__("driver")
self.pub = self.create_publisher(Twist, "/cmd_vel", 10)
self.create_timer(0.1, self.tick) # 10 Hz control loop
def tick(self):
msg = Twist()
msg.linear.x = 0.2 # v = 0.2 m/s forward
msg.angular.z = 0.5 # omega = 0.5 rad/s turn
self.pub.publish(msg)
rclpy.init(); rclpy.spin(Driver())The takeaway: the algorithms in this chapter rarely ship as standalone scripts — they ship as nodes wired into a graph, and “which node publishes what, at what rate, in which frame” is the architecture you actually debug.
When a ROS system “freezes” or behaves erratically, suspect the plumbing before the algorithm: a topic nobody publishes (a subscriber waiting forever), a queue depth so small that messages are dropped under load, a node publishing in the wrong frame, or two nodes fighting over /cmd_vel. Tools like ros2 topic hz, ros2 topic echo, and rqt_graph tell you what is actually flowing — check that the graph matches the one in your head before touching any math.
41.1 — Quick reference
| Term / formula | Meaning in one line | When / why it matters |
|---|---|---|
| Perceive → estimate → plan → act | The four-stage autonomy loop, run continuously | The master pattern of every autonomous system (§41.1) |
| Rigid transform \(T=\begin{bmatrix}R & t\\0&1\end{bmatrix}\) | Pose of one frame in another; chains by matrix product | Relate camera ↔︎ body ↔︎ world; a wrong hop is the classic bug |
| Configuration \(q\) / C-space | Minimal coordinates of the whole robot; obstacles inflated by footprint | Plan as a point among grown obstacles, not a shape (§41.2) |
| Jacobian \(\dot x = J(q)\dot q\) | Maps joint speeds to tip speeds; columns = per-joint tip motion | Iterative IK via \(\Delta q = J^{+}\Delta x\); watch singularities |
| Unicycle \(\dot x=v\cos\theta,\ \dot y=v\sin\theta,\ \dot\theta=\omega\) | Wheeled-robot kinematics; can drive/turn, not slide | Nonholonomic: fewer controls than DOF, so not every path works |
| PID \(u=K_p e+K_i\!\int e+K_d\dot e\) | Feedback from present, accumulated, and changing error | The workhorse tracker; clamp integral to avoid windup (§41.3) |
| MPC (receding horizon) | Optimize \(N\) steps, apply the first, re-solve | Handles constraints (limits, clearance) PID cannot |
| Bayes filter (predict / update) | Belief widens on motion, narrows on measurement | The recursion behind every filter below (§41.4) |
| Kalman gain \(K=P/(P+R)\) | Fraction of the surprise to trust the measurement | Optimal fusion in the linear-Gaussian case |
| EKF / UKF | Linearize (Jacobian) or sigma-points for nonlinear models | Keeps Kalman machinery when motion/sensors curve |
| Particle filter | Belief as weighted samples: move, weight, resample | Multi-modal / global localization; costly in high dimensions |
| SLAM + loop closure | Estimate trajectory and map jointly; snap returns taut | One loop closure bounds otherwise-unbounded drift (§41.6) |
| A* \(f(n)=g(n)+h(n)\) | Cost-so-far plus admissible guess-to-go | Optimal grid path with far fewer expansions than Dijkstra |
| RRT / RRT* | Sample-and-extend tree; RRT* rewires toward optimal | High-DOF planning where grids explode (§41.7) |
| Force closure / friction cone \(\lVert f_t\rVert\le\mu f_n\) | Contacts span all disturbance directions; push-only, bounded friction | Whether a grasp holds under twist and load (§41.8) |
| Domain randomization | Vary sim physics/rendering so reality is “just another sample” | The standard sim-to-real transfer remedy (§41.9) |
| Safety layer | Simple, verified monitor vetoes the smart stack | Trust the verifiable layer; planner proposes, monitor disposes (§41.10) |
| ROS 2 node / topic / tf2 | Pub-sub graph of processes sharing a live frame tree | The architecture you actually debug (§41.11) |
41.2 — Key takeaways
- Autonomy is a closed loop — perceive → estimate → plan → act — run continuously and in nested loops at different rates; latency and loop-closure rate matter as much as algorithmic quality.
- Sensors are complementary (camera, LiDAR, radar, IMU, GPS), and everything is expressed in coordinate frames chained by rigid transforms; a wrong transform is the classic robotics bug.
- Plan in configuration space, where the robot is a point and obstacles are inflated by its footprint. Wheeled robots are nonholonomic: fewer controls than DOF, so not every path is achievable. The Jacobian links joint speeds to tip speeds and drives iterative IK.
- Kinematics maps commands to motion; control (PID, feedback linearization, flatness, MPC, and LQR from Ch 25) drives tracking error to zero despite disturbances. Watch for integral windup; MPC adds look-ahead and native constraints.
- State estimation maintains a belief, not a guess, via the recursive Bayes filter (predict widens, update narrows). The Kalman filter is its linear-Gaussian closed form; EKF/UKF handle nonlinearity; particle filters handle arbitrary, multi-modal beliefs.
- SLAM estimates trajectory and map together; loop closure bounds otherwise-unbounded odometry drift, and the back-end is a weighted nonlinear least-squares graph optimization.
- Search planners (Dijkstra, A* with an admissible heuristic) suit low-dimensional grids; sampling planners (PRM, RRT, RRT) scale to high-DOF arms, with RRT asymptotically optimal; trajectory optimization smooths the result and a fast replanning loop reacts to a changing world.
- Grasping hinges on force closure within friction cones (a contact can only push, with friction \(\le \mu f_n\)); manipulation spans prehensile (grasping) and non-prehensile (pushing, sliding) contact.
- Learning (imitation, RL in sim, sim-to-real with domain randomization, and emerging end-to-end vision-language-action policies) trades behavior-coding for objective-and-data specification, inheriting distribution-shift risks — now with hardware that can break.
- Safety is a simple verifiable layer under the smart stack; HRI demands legible, predictable motion; and calibration/units/frames/time-sync errors masquerade as algorithm bugs more often than not.
- Real systems are built as a graph of ROS 2 nodes communicating over topics, services, and actions — the algorithms ship as composable nodes, not standalone scripts.
41.3 — See also
- Chapter 06 — Optimization: nonlinear least-squares behind SLAM back-ends and bundle adjustment.
- Chapters 16–18 — Computer vision: the perception front-end (detection, depth, visual features) feeding estimation.
- Chapter 25 — Reinforcement learning & control: LQR, policy learning, and the full imitation/RL treatment pointed to here.
- Chapter 12 — Probabilistic models: Bayes filtering, Gaussians, and graphical-model foundations under Kalman and particle filters.
- Chapter 39 — Frontier & embodied AI: large-scale robot learning, foundation models for control, and embodied agents.
↪ The thread continues → Chapter 42 · 📐 Learning Theory
We’ve built and shipped a great deal on intuition; now theory asks why learning works at all — the generalization guarantees beneath the magic.
📖 All chapters | ← 40 · 🔗 Graph Machine Learning | 42 · 📐 Learning Theory →