Chapter 24 — 🌈 Multimodal AI

Most of the models in this book see one kind of thing: pixels, or text, or audio. Multimodal AI is the study of systems that handle several of these at once — describing a photo in words, answering questions about a chart, generating an image from a sentence, or transcribing the emotion in a voice. The trick that makes it all work is surprisingly simple: teach different kinds of data to live in the same space, so a model can compare a picture and a caption the way it would compare two words. This chapter sits at the top of the Applied AI stack — it stitches together the vision, language, and audio chapters into systems that reason across the senses.

🧭 In context: Applied AI · used for cross-modal search, captioning, visual question answering, text-to-image/video, voice assistants · the one key idea — map every modality into one shared embedding space, then learn relationships across them.

💡 Remember this: Multimodal AI works by mapping every kind of input — pixels, words, sound — into one shared vector space, so that comparing, retrieving, or generating across modalities all reduce to measuring distance between vectors.

24.1 — The Shared Embedding Space Idea

Imagine a library where every book, song, and photograph is placed on a shelf by meaning rather than by format. A photo of a beach and the phrase “sandy shore at sunset” end up next to each other; a picture of a stock chart and the word “finance” land in a different aisle. That single organizing principle — one coordinate system for all modalities — is the foundation of nearly everything in this chapter.

Concretely, a shared embedding space (or joint embedding space) is a vector space $\mathbb{R}^d$ into which we map inputs of different types using a separate encoder per modality. An image encoder $f_{\text{img}}$ turns a picture into a vector; a text encoder $f_{\text{txt}}$ turns a sentence into a vector of the same dimension $d$. The whole point is that distance means semantic similarity regardless of where a point came from. If $f_{\text{img}}(\text{cat photo})$ sits close to $f_{\text{txt}}(\text{"a cat"})$, the model has learned to align the two modalities.

Closeness is measured with cosine similarity — the cosine of the angle between two vectors, which ignores their length and cares only about direction:

\[\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|}\]

In words: measure how much two vectors point the same way, ignoring how long they are — pure direction, not magnitude. Also written: $\text{sim}(\mathbf{u},\mathbf{v}) = \hat{\mathbf{u}} \cdot \hat{\mathbf{v}}$ where $\hat{\mathbf{u}} = \mathbf{u}/\|\mathbf{u}\|$ — i.e. the plain dot product of the two unit-normalized vectors.

A value near $1$ means “very similar,” near $0$ means “unrelated,” and near $-1$ means “opposite.”

The diagram below shows the basic shape: two towers, two modalities, one space.

flowchart LR
  I[🖼️ Image] --> EI[Image encoder]
  T[📝 Text] --> ET[Text encoder]
  EI --> Z[(Shared space ℝᵈ)]
  ET --> Z
  Z --> S[cosine similarity → match?]

Here is the idea in miniature. Suppose after training we have these 2-D embeddings (real spaces are hundreds of dimensions, but the math is identical):

import numpy as np
def cos(u, v):                       # cosine similarity
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

img_cat = np.array([0.9, 0.1])       # embedding of a cat photo
txt_cat = np.array([0.8, 0.2])       # embedding of "a cat"
txt_car = np.array([0.1, 0.9])       # embedding of "a car"

print(round(cos(img_cat, txt_cat), 3))   # 0.991  → strong match
print(round(cos(img_cat, txt_car), 3))   # 0.22   → weak match

The cat photo is far closer to “a cat” than to “a car” — exactly the alignment we want. Below, a sketch of such a space:

Tip

Intuition: a shared space turns “is this image about cats?” into “is this image’s vector near the word cat’s vector?” — a plain dot product. Once everything is a vector in one space, cross-modal problems collapse into nearest-neighbor lookups.

The alignment gap. Even in a well-trained shared space, image vectors and text vectors don’t sit perfectly on top of each other. Picture two clouds of points that point the same way but live in two slightly separate clumps — all the image vectors in one clump, all the text vectors in another, with a thin no-man’s-land between them. This modality gap is mostly a leftover from how training starts (random weights) and the temperature knob in contrastive learning. It’s usually harmless: ranking a text query against images still works, because you’re always comparing across the gap the same way. The one thing to avoid is averaging an image vector with a text vector — the midpoint lands in the empty gap, matching nothing.
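A toy illustration of the gap may help. The sketch below uses hypothetical 3-D embeddings (not output from any real model): the image vectors share the text directions but are shifted by a constant offset that stands in for the modality gap. Retrieval still ranks correctly, even though the two centroids sit well apart.

```python
import numpy as np

def unit(v):
    """L2-normalize rows so dot products are cosine similarities."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical toy embeddings: two concepts, three dimensions.
txt = unit(np.array([[1.0, 0.0, 0.0],     # "a dog"
                     [0.0, 1.0, 0.0]]))   # "a beach"

# Images share the text directions but carry a constant offset,
# which plays the role of the modality gap.
img = unit(txt + np.array([0.0, 0.0, 0.5]))

sims = txt @ img.T                                # rows: text queries, cols: images
best_image = sims.argmax(axis=1)                  # nearest image for each text query
gap = np.linalg.norm(img.mean(0) - txt.mean(0))   # distance between the two centroids

print(best_image)            # [0 1]: each text still retrieves its own image
print(round(float(gap), 3))  # clearly nonzero: the two clouds do not overlap
```

Measuring the distance between modality centroids is only one simple way to quantify the gap; it is used here because it is easy to read.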

24.2 — CLIP: Contrastive Image–Text Pretraining

The famous model that made the shared space practical at scale is CLIP (Contrastive Language–Image Pre-training, OpenAI 2021). Its recipe is almost embarrassingly direct: scrape ~400 million (image, caption) pairs from the web, and train two encoders so that each image lands near its own caption and far from everyone else’s.

Intuition first. Picture a speed-dating event with $N$ images on one side and their $N$ captions on the other. Every caption was secretly written for exactly one image. The training game: for each image, pick its true caption out of the lineup. To win, the encoders must learn what genuinely makes an image and a sentence “about the same thing,” and the other $N-1$ captions in the room act as free negative examples that the model must learn to push away. No human labels are needed; the pairing itself is the supervision.

The training mechanism is contrastive learning. Take a batch of $N$ image–text pairs. Encode all images and all texts, then form the $N \times N$ matrix of cosine similarities between every image and every text. The diagonal entries are the true pairs (image $i$ with its caption $i$); the off-diagonal entries are mismatches. CLIP’s loss pushes the diagonal up and the off-diagonal down — a softmax over each row and each column, the InfoNCE objective:

\[\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\text{sim}(I_i, T_i)/\tau)}{\sum_{j=1}^{N}\exp(\text{sim}(I_i, T_j)/\tau)}\]

In words: for each image, treat “which caption is mine?” as a multiple-choice question over the whole batch, and average how surprised the model is by the right answer; minimizing surprise pulls true pairs together and shoves the rest apart. Also written: with rows L2-normalized this is a cross-entropy on logits $\tfrac{1}{\tau} I T^\top$ with targets on the diagonal, $\mathcal{L} = \tfrac{1}{N}\sum_i \text{CE}\big(\text{softmax}_j(\text{sim}(I_i,T_j)/\tau),\, y_i = i\big)$, and the full CLIP loss averages this image→text term with its symmetric text→image counterpart.

where $\tau$ is a learned temperature that sharpens or softens the distribution. Read it as a classification problem: “given image $i$, which of the $N$ captions is the right one?”

Why bigger batches help. The other captions in the batch are the negatives, so a batch of $N$ gives each image $N-1$ things to contrast against. Double the batch and every example sees twice as many distractors — which is exactly why CLIP-style training is so hungry for large batches (tens of thousands of pairs) and why much of the engineering effort goes into fitting those batches across many GPUs.
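One way to see how the task hardens with batch size: a model that assigns the same similarity to every caption gets a per-row loss of exactly $\log N$. The sketch below computes that chance-level loss for a few batch sizes chosen only for illustration; `chance_level_loss` is a hypothetical helper name.

```python
import math

def chance_level_loss(n):
    """InfoNCE loss of one row when all n similarities are equal."""
    p_correct = 1.0 / n           # softmax over n equal logits
    return -math.log(p_correct)   # equals log(n)

for n in (8, 256, 32768):
    # the loss an uninformed model pays grows with the number of candidates
    print(n, round(chance_level_loss(n), 2))
```

A larger batch raises the starting loss, so each step carries a stronger signal about which negatives the encoders still confuse with the true pair.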

flowchart TB
  subgraph Batch of N pairs
    direction LR
    A[Images I1..IN] --> IE[Image encoder]
    B[Texts T1..TN] --> TE[Text encoder]
  end
  IE --> M[N×N similarity matrix]
  TE --> M
  M --> L["pull diagonal up, push off-diagonal down (InfoNCE)"]

The similarity matrix is the heart of it. Here is a doodle of what training is trying to carve out — a bright diagonal, dim everywhere else:

Worked example. Take a tiny batch of 3 pairs. After encoding, suppose the similarity matrix (rows = images, cols = texts) is:

	T₁ “dog”	T₂ “beach”	T₃ “pizza”
I₁ 🐕	0.90	0.10	0.20
I₂ 🏖️	0.15	0.85	0.05
I₃ 🍕	0.25	0.00	0.80

The diagonal is high and the off-diagonal low: a well-trained batch. The loss for row 1 is $-\log\frac{e^{0.90/\tau}}{e^{0.90/\tau}+e^{0.10/\tau}+e^{0.20/\tau}}$; with $\tau=0.1$ this is tiny (about $0.001$), because the diagonal dominates. Early in training the matrix is roughly uniform and the loss is large; gradient descent steadily carves out the diagonal.

import numpy as np
def info_nce(sim, tau=0.1):
    N = sim.shape[0]
    logits = sim / tau
    logits -= logits.max(1, keepdims=True)          # numerical stability
    p = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
    return -np.log(p[np.arange(N), np.arange(N)]).mean()   # diagonal = true pairs

sim = np.array([[.9,.1,.2],[.15,.85,.05],[.25,0,.8]])
print(round(info_nce(sim), 3))     # 0.002  small loss, good alignment
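The function above covers only the image→text direction. A minimal sketch of the symmetric version, which averages in the text→image term by reusing the same logits transposed, could look like this. The helper names `row_cross_entropy` and `clip_loss` are illustrative, not taken from any library.

```python
import numpy as np

def row_cross_entropy(logits):
    """Mean cross-entropy of each row against its diagonal entry."""
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_p).mean()

def clip_loss(sim, tau=0.1):
    """Average of the image->text and text->image InfoNCE terms."""
    logits = sim / tau
    loss_i2t = row_cross_entropy(logits)       # each image picks its caption
    loss_t2i = row_cross_entropy(logits.T)     # each caption picks its image
    return 0.5 * (loss_i2t + loss_t2i)

sim = np.array([[0.90, 0.10, 0.20],
                [0.15, 0.85, 0.05],
                [0.25, 0.00, 0.80]])
print(float(clip_loss(sim)))                   # small: the diagonal dominates
print(float(clip_loss(np.zeros((3, 3)))))      # log(3): a uniform matrix is pure chance
```

The transpose is all that changes between the two directions, because the same similarity matrix is read by rows for images and by columns for captions.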

The payoff is zero-shot classification. CLIP can classify images into categories it was never explicitly trained on, because classification becomes a similarity lookup. To label a photo, you write each candidate class as a sentence — “a photo of a dog”, “a photo of a cat” — encode all of them, encode the image, and pick the class whose text vector is closest.

# zero-shot: no labeled training, just compare image to class prompts
# (sketch: encode_image, encode_text, and photo are placeholders for a real
#  CLIP model and input; cos is the cosine helper from §24.1)
classes  = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
img_vec  = encode_image(photo)                 # CLIP image encoder
txt_vecs = [encode_text(c) for c in classes]   # CLIP text encoder
scores   = [cos(img_vec, t) for t in txt_vecs]
print(classes[int(np.argmax(scores))])         # highest similarity wins

With a real framework. In practice you reach for open_clip (or Hugging Face transformers) and the whole zero-shot pipeline is a few lines:

import torch, open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

image  = preprocess(Image.open("pet.jpg")).unsqueeze(0)          # (1,3,224,224)
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text   = tokenizer(labels)

with torch.no_grad():
    img_f = model.encode_image(image)
    txt_f = model.encode_text(text)
    img_f /= img_f.norm(dim=-1, keepdim=True)        # L2-normalize → cosine
    txt_f /= txt_f.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_f @ txt_f.T).softmax(dim=-1)  # temperature-scaled

print(labels[int(probs.argmax())], float(probs.max()))

This same trick powers semantic image search (encode a query sentence, find nearest image vectors), content moderation, and the text-conditioning inside many generators. The wording of the prompt matters — “a photo of a dog” usually beats the bare word “dog,” a quirk called prompt engineering for CLIP. A common refinement is prompt ensembling: encode the same class through many templates (“a photo of a {}”, “a blurry photo of a {}”, “a close-up of a {}”), average the resulting text vectors, and classify against that mean — a free accuracy bump that smooths over any single template’s quirks.
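Prompt ensembling is easy to sketch with toy numbers. The embeddings below are hypothetical 2-D stand-ins for what a text encoder would return for each template; the one detail worth noting is that the averaged vector is renormalized before it is used for cosine comparison.

```python
import numpy as np

def unit(v):
    """L2-normalize along the last axis."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def ensemble_class_vectors(template_vecs):
    """template_vecs: (n_classes, n_templates, d) text embeddings.
    Normalize each, average over templates, then renormalize the mean."""
    return unit(unit(template_vecs).mean(axis=1))

# Hypothetical embeddings: 2 classes x 3 templates each.
template_vecs = np.array([
    [[0.9, 0.1], [0.8, 0.3], [1.0, 0.0]],   # "dog" through three templates
    [[0.1, 0.9], [0.3, 0.8], [0.0, 1.0]],   # "cat" through three templates
])
class_vecs = ensemble_class_vectors(template_vecs)   # (2, 2), unit length
img_vec = unit(np.array([0.7, 0.2]))                 # hypothetical image embedding

pred = int(np.argmax(class_vecs @ img_vec))          # 0 = dog, 1 = cat
print(pred)                                          # 0
```

With a real model, `template_vecs` would come from encoding each filled-in template with the text encoder; the averaging step stays the same.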

Warning

CLIP inherits the biases and noise of web data, and it is bag-of-concepts, not compositional: it often scores “a red cube on a blue sphere” and “a blue cube on a red sphere” almost identically because it keys on which concepts are present more than how they relate. Don’t assume CLIP understands spatial or relational structure.

24.3 — Binding Many Modalities Through One Anchor

CLIP aligns two modalities by training on paired data. But what if you want six modalities — image, text, audio, depth, thermal, motion — in one space? Collecting aligned data for every pair (image–audio, audio–depth, depth–thermal, …) explodes combinatorially. The elegant fix, popularized by Meta’s ImageBind (2023), is to bind everything to a single anchor modality.

Intuition. Think of a train station with one central hub. You don’t need a direct track between every pair of towns; you just need every town connected to the hub. Route through the hub and any town can reach any other. ImageBind makes images the hub: it trains image↔︎text, image↔︎audio, image↔︎depth, and so on — always pairing each modality with images, never with each other. The surprise is emergent alignment: because audio was pulled toward images and text was pulled toward images, audio and text end up aligned too, even though no audio–text pairs were ever shown.

flowchart TB
  IMG((🖼️ Image\nanchor)) --- T[📝 Text]
  IMG --- A[🔊 Audio]
  IMG --- D[🌐 Depth]
  IMG --- TH[🌡️ Thermal]
  IMG --- M[📈 IMU motion]
  T -. emergent .- A
  A -. emergent .- D

The mechanism is just CLIP-style contrastive training, repeated once per (image, X) pairing, all sharing the same image encoder so every modality is pulled into one common frame of reference:

\[\mathcal{L}_{\text{bind}} = \sum_{X \in \{\text{text, audio, depth, ...}\}} \mathcal{L}_{\text{InfoNCE}}\big(f_{\text{img}}, f_X\big)\]

In words: for each non-image modality, run the usual “match each image to its partner” contrastive loss against images, and add them all up; one shared image encoder anchors them all. Also written: $\mathcal{L}_{\text{bind}} = \sum_X \tfrac{1}{N}\sum_i \text{CE}\big(\text{softmax}_j\big(\text{sim}(f_{\text{img}}(I_i), f_X(x_j))/\tau\big),\; y_i=i\big)$, a sum of per-modality CLIP losses sharing the anchor encoder $f_{\text{img}}$.
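The sum above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not ImageBind's implementation: the embeddings are hand-made, every modality reuses one image batch (a real system would draw a separate batch per pairing), and only the image→X direction is computed.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def info_nce(anchor, other, tau=0.1):
    """One-directional InfoNCE: each anchor row must pick its partner row."""
    logits = unit(anchor) @ unit(other).T / tau
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.diag(log_p).mean()

def bind_loss(img_emb, other_embs, tau=0.1):
    """Sum of image-anchored InfoNCE terms, one per non-image modality."""
    return sum(info_nce(img_emb, emb, tau) for emb in other_embs.values())

img = np.eye(4, 8)                         # 4 hypothetical image embeddings
others = {
    "text":  img + 0.1,                    # partners close to their images
    "audio": np.roll(img, 1, axis=0),      # deliberately mismatched partners
}
for name, emb in others.items():
    print(name, round(float(info_nce(img, emb)), 4))   # text: near 0, audio: large
print(round(float(bind_loss(img, others)), 4))
```

The mismatched audio term dominates the total, which is the signal that would pull the audio encoder toward the image anchor during training.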

The practical payoff is cross-modal arithmetic and retrieval you never trained for: hum a tune and retrieve matching images, or add an image embedding of a beach to an audio embedding of waves to query a generator. Once modalities share a hub, you compose them like word vectors. This anchor-and-bind pattern is the conceptual generalization of CLIP from a two-tower model to an n-tower one.

Tip

Intuition: you don’t need every pairwise bridge — you need one good hub. Bind every modality to a shared anchor (images, because image data co-occurs with almost everything), and the rest of the alignment comes for free.

24.4 — Vision-Language Models & Multimodal LLMs

CLIP can match an image to text but cannot talk about it. To get fluent description and reasoning, we bolt vision onto a language model, producing a vision-language model (VLM) — also called a multimodal LLM. The core question is mechanical: an LLM consumes a sequence of token embeddings, so how do we turn an image into tokens the LLM will accept?

The standard answer has three parts. First, a vision encoder (usually a pretrained CLIP-style Vision Transformer; see Attention & Transformers) splits the image into patches and produces a grid of patch embeddings, say $256$ vectors for a $16\times16$ grid. Second, a small projector (a linear layer or tiny MLP, as in the LLaVA models, or a learned resampler that compresses many patches into a few query tokens, as in Flamingo’s Perceiver Resampler or BLIP-2’s Q-Former) maps those vision vectors into the LLM’s embedding dimension. Third, those projected vectors are inserted into the token stream as if they were ordinary word embeddings, and the LLM proceeds exactly as it would on text.

flowchart LR
  IMG[🖼️ Image] --> VE[Vision encoder ViT]
  VE --> P[Patch embeddings]
  P --> PR[Projector → LLM dim]
  PR --> SEQ
  TXT[📝 What is in this image?] --> TOK[Tokenizer]
  TOK --> SEQ[Token sequence: img tokens + text tokens]
  SEQ --> LLM[Transformer LLM]
  LLM --> OUT[📝 A cat on a sofa.]

The mental model: an image becomes a handful of “soft tokens” — vectors that occupy slots in the prompt but carry visual rather than lexical meaning. From the LLM’s perspective the prompt is just a longer sequence; attention does the rest, letting text tokens look at image tokens and vice versa.

# sketch of a VLM forward pass
patches    = vit(image)                # (256, 1024) patch embeddings
img_tokens = projector(patches)        # (256, 4096) → LLM embedding dim
txt_tokens = embed(tokenize("Describe this image."))
seq = concat([img_tokens, txt_tokens]) # image tokens sit IN the prompt
answer = llm.generate(seq)             # ordinary autoregressive decoding
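To make the shapes in that sketch concrete, here is a runnable NumPy version of just the projector and concatenation steps. Random arrays stand in for the ViT output and the embedded text, and the projector is assumed to be a single linear layer; no real model is involved.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, d_vision, d_llm = 256, 1024, 4096          # sizes used in the sketch above

patches = rng.normal(size=(n_patches, d_vision))      # stand-in for ViT patch embeddings
W = rng.normal(scale=0.02, size=(d_vision, d_llm))    # linear projector weights (random here)
b = np.zeros(d_llm)                                   # projector bias

img_tokens = patches @ W + b                          # soft tokens in the LLM's dimension
txt_tokens = rng.normal(size=(5, d_llm))              # stand-in for 5 embedded text tokens
seq = np.concatenate([img_tokens, txt_tokens], axis=0)   # image tokens first, then text

print(img_tokens.shape, seq.shape)                    # (256, 4096) (261, 4096)
```

The image contributes 256 of the 261 positions in this toy prompt, which already hints at how much of the context window visual input can consume.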

With a real framework. Modern VLMs ship ready-to-run on Hugging Face. A LLaVA-style model in a few lines:

from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16, device_map="auto")
proc  = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

prompt = "USER: <image>\nHow many people are wearing hats? ASSISTANT:"
inputs = proc(images=Image.open("crowd.jpg"), text=prompt,
              return_tensors="pt").to(model.device, torch.float16)

out = model.generate(**inputs, max_new_tokens=40)
print(proc.decode(out[0], skip_special_tokens=True))

The <image> placeholder is exactly where the processor splices the projected patch tokens into the prompt — the “soft tokens” made concrete.

Two flagship tasks fall out of this design. Image captioning asks the model to generate a description from the image alone (“a golden retriever catching a frisbee on grass”). Visual question answering (VQA) conditions on image plus a question (“How many people are wearing hats?” → “Three”). Because the backbone is a full LLM, modern VLMs go far beyond these: reading documents, interpreting charts, writing code from a UI screenshot, or doing step-by-step visual reasoning.

Training is typically two-stage. First alignment pretraining: freeze the vision encoder and the LLM, and train only the projector on image–caption pairs so the soft tokens land in meaningful regions of the LLM’s space — cheap, because few parameters move. Then instruction tuning on multimodal conversations (“look at this and answer…”) so the model learns to follow visual instructions, not just caption.

The token-count tax. A practical cost that bites in production: image tokens are expensive. A single high-resolution image can expand into hundreds or thousands of soft tokens, all of which sit in the LLM’s context window and incur attention cost. This is why resampler designs (Q-Former, Perceiver) that compress many patches into a few query tokens matter — and why feeding a VLM ten images can blow your context budget faster than ten paragraphs of text. When latency or cost is tight, downscale the image or pick a model with an aggressive resampler.
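A back-of-the-envelope calculation shows how fast the count grows. The sketch assumes a plain ViT that tiles the image into square patches with no resampler, and uses a patch size of 14 as an illustrative default; real models may add special tokens or split large images into tiles, which this ignores.

```python
def image_token_count(height, width, patch=14):
    """Patch tokens for a ViT that tiles the image into patch x patch squares."""
    return (height // patch) * (width // patch)

for side in (224, 336, 672):
    print(side, image_token_count(side, side))
# 224 256
# 336 576
# 672 2304
```

Doubling the side length quadruples the token count, which is why downscaling is often the cheapest lever when context or latency is tight.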

Tip

Rule of thumb: a VLM is “an LLM with extra eyes.” Almost all the heavy lifting is done by the frozen vision encoder and the frozen LLM; the learnable bridge between them (the projector) is small. That is why you can build a capable VLM on a modest budget by reusing two strong pretrained pieces.

24.5 — Text-to-Image and Text-to-Video

Run the shared space in reverse and you get generation: instead of mapping an image to text, you synthesize an image from text. Modern text-to-image systems are overwhelmingly diffusion models (the full mechanism — forward noising, the denoising network, samplers, and latent diffusion — is the subject of Chapter 18). Here we cover only the multimodal hinge: how the words steer the pixels.

The animation below shows the loop: random noise on the left, gradually denoised into a picture on the right, with the text prompt pulling each step toward the target.

The link is conditioning. The denoiser is not asked to remove noise blindly; it removes noise given a text embedding. A text encoder (often the CLIP text encoder of §24.2, or a T5 language model) turns the prompt into a sequence of vectors, and the denoising U-Net or transformer attends to them via cross-attention (§24.7) at every step. The text acts as a steering wheel nudging each denoising step toward images consistent with the prompt.

flowchart LR
  P[📝 a fox in snow] --> TE[Text encoder]
  TE --> C[Text embeddings]
  N[Random noise] --> D[Denoiser]
  C -->|cross-attention guides each step| D
  D -->|iterate t = T … 1| D
  D --> IMG[🖼️ Generated image]

The strength of that steering is set by classifier-free guidance (CFG). At each step the model predicts the denoising direction twice — once with the prompt and once without (an empty prompt) — and extrapolates away from the unconditioned prediction:

\[\hat{\epsilon} = \epsilon_{\varnothing} + w\,(\epsilon_{\text{text}} - \epsilon_{\varnothing})\]

In words: start from the model’s “no instructions” guess, see which way the prompt would tug it, and then step extra hard in that direction — the bigger $w$, the more the prompt wins over the generic guess. Also written: $\hat{\epsilon} = (1-w)\,\epsilon_{\varnothing} + w\,\epsilon_{\text{text}}$ — a linear blend (an extrapolation once $w>1$) of the unconditioned and text-conditioned predictions.

The guidance scale $w$ trades faithfulness against diversity: $w=1$ applies no extrapolation and simply returns the text-conditioned prediction (looser, more varied), while $w=7$–$12$ pulls hard toward the prompt (sharper adherence, but risks over-saturated or unnatural images). It is the single knob users feel most.

The doodle below shows the extrapolation geometry: the final direction overshoots the text-conditioned guess, away from the unconditioned one.

# classifier-free guidance: blend conditioned and unconditioned predictions
eps_text   = denoiser(x_t, t, cond=text_emb)    # follow the prompt
eps_uncond = denoiser(x_t, t, cond=empty_emb)   # ignore the prompt
w = 7.5                                          # guidance scale
eps = eps_uncond + w * (eps_text - eps_uncond)  # extrapolate toward prompt
x_next = step(x_t, eps, t)                       # one denoising step
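The guidance formula can be checked on toy numbers without a denoiser. The two prediction vectors below are made up for illustration; the sketch confirms the special cases $w=0$ and $w=1$ and that the extrapolation form matches the linear-blend form.

```python
import numpy as np

def cfg(eps_uncond, eps_text, w):
    """Classifier-free guidance combination of two noise predictions."""
    return eps_uncond + w * (eps_text - eps_uncond)

eps_uncond = np.array([0.5, -1.0])   # hypothetical unconditioned prediction
eps_text   = np.array([1.0,  2.0])   # hypothetical text-conditioned prediction

print(cfg(eps_uncond, eps_text, 0.0))   # equals eps_uncond: prompt ignored
print(cfg(eps_uncond, eps_text, 1.0))   # equals eps_text: no extrapolation
print(cfg(eps_uncond, eps_text, 7.5))   # overshoots well past eps_text

# The blend form (1 - w) * eps_uncond + w * eps_text gives the same vector.
w = 7.5
blend = (1 - w) * eps_uncond + w * eps_text
print(np.allclose(cfg(eps_uncond, eps_text, w), blend))   # True
```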

With a real framework. The diffusers library wraps all of this — text encoder, scheduler, CFG loop — behind one call; guidance_scale is the $w$ above:

from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

image = pipe(
    prompt="a red fox sitting in fresh snow, golden hour, photorealistic",
    num_inference_steps=30,
    guidance_scale=7.5,            # w: faithfulness vs. diversity
).images[0]
image.save("fox.png")

Text-to-video extends the same idea into time. The hard new constraint is temporal consistency: the fox must keep the same fur and move coherently across frames, not flicker into a different animal each frame. Systems achieve this by generating in a compressed latent space and adding temporal attention layers that let frames attend to one another, plus 3-D (space + time) convolutions. Because video is heavy, models often generate a few keyframes and interpolate, or produce low resolution then upscale. The trajectory from Imagen Video and Make-A-Video to Sora and Veo is essentially diffusion plus ever-better temporal modeling.

Warning

Push the guidance scale too high and images “burn” — garish colors, fried contrast, mangled hands and text. More guidance is not more quality; it is more obedience, and past a point obedience destroys realism. Tune $w$, don’t max it.

24.6 — Audio-Language and Any-to-Any Models

Sound fits the shared-space recipe just as cleanly as images. The trick is to make audio look like something a transformer already eats. Most systems first convert a waveform into a spectrogram — a 2-D image of frequency (vertical) versus time (horizontal) — and then treat it almost exactly like a picture (the audio front-end, spectrograms, and speech specifics live in Speech & Audio Processing).

A spectrogram is the bridge: it redraws sound as a picture, so the same vision machinery applies. The doodle below shows a sound’s energy shimmering across frequency (up) and time (right) — a 2-D image a transformer can read patch by patch, just like a photo.
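A bare-bones magnitude spectrogram takes only a few lines of NumPy: slice the waveform into overlapping frames, apply a window, and take the FFT of each frame. The frame length, hop size, and sample rate below are illustrative assumptions, and the mel filterbank and log scaling that real audio front-ends usually add are left out.

```python
import numpy as np

def spectrogram(wave, n_fft=256, hop=128):
    """Magnitude spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(n_fft)
    starts = range(0, len(wave) - n_fft + 1, hop)
    frames = np.stack([wave[s:s + n_fft] * window for s in starts])   # (T, n_fft)
    return np.abs(np.fft.rfft(frames, axis=1)).T                      # (n_fft // 2 + 1, T)

sr = 8000                                   # assumed sample rate in Hz
t = np.arange(sr) / sr                      # one second of audio
wave = np.sin(2 * np.pi * 1000 * t)         # a pure 1 kHz tone

spec = spectrogram(wave)
peak_bin = int(spec.mean(axis=1).argmax())  # strongest frequency bin on average
print(spec.shape, peak_bin * sr / 256)      # (129, 61) 1000.0
```

The resulting 2-D array can be cut into patches exactly like an image, which is what lets a vision-style transformer consume it.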

Audio-language models pair an audio encoder with a text encoder in one space, mirroring CLIP. CLAP (Contrastive Language–Audio Pretraining) trains on (sound, description) pairs so that the clip of a barking dog lands near the text “a dog barking,” giving zero-shot audio classification and text-to-sound retrieval. On the generation side, Whisper maps speech-spectrograms to text (transcription/translation), while text-to-audio diffusion models like AudioLDM run §24.5 in the sound domain to synthesize effects and music from prompts.

With a real framework. Whisper transcription is one pipeline call:

from transformers import pipeline
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
print(asr("meeting.wav")["text"])     # spectrogram → text, under the hood

The frontier is any-to-any models: a single network whose inputs and outputs can each be text, image, audio, or video in any combination. The enabling idea is tokenizing every modality into one shared discrete vocabulary. A VQ (vector-quantized) tokenizer turns an image patch or an audio frame into an integer code from a learned codebook — the same way a tokenizer turns text into integer IDs. Once everything is a sequence of tokens drawn from one combined vocabulary, a single transformer can read a mixed sequence and predict the next token, whether that token decodes back into a word, a pixel block, or an audio snippet.

flowchart LR
  T[📝 Text] --> TT[Text tokens]
  I[🖼️ Image] --> IT[Image VQ tokens]
  A[🔊 Audio] --> AT[Audio VQ tokens]
  TT --> U[Unified token stream]
  IT --> U
  AT --> U
  U --> TR[One transformer]
  TR --> O[Next token → decode to text / image / audio]

This unification is what lets a model take a spoken question about a photo and reply with both spoken words and a generated illustration — the inputs and outputs are all just tokens in the same stream. Models such as GPT-4o, Gemini, and Chameleon move in this direction, processing voice, vision, and text natively rather than gluing separate systems together.
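The quantization step behind VQ tokens is a nearest-neighbor lookup into the codebook. The sketch below uses a tiny hand-written codebook and made-up patch features; in a real tokenizer the codebook is learned and the features come from an encoder network. The final offset into a combined vocabulary is one simple convention, with `text_vocab_size` chosen arbitrarily.

```python
import numpy as np

def vq_encode(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    # squared Euclidean distance between every feature and every code
    d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def vq_decode(codes, codebook):
    """Look the integer codes back up to get approximate feature vectors."""
    return codebook[codes]

codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # hypothetical, 4 codes
patches  = np.array([[0.1, -0.1], [0.9, 0.2], [0.8, 1.1]])              # hypothetical patch features

codes = vq_encode(patches, codebook)
print(codes)                          # [0 1 3]
print(vq_decode(codes, codebook))     # the codebook rows those tokens stand for

# Place image codes after the text ids so both share one vocabulary.
text_vocab_size = 50000               # hypothetical size of the text vocabulary
unified_ids = text_vocab_size + codes
print(unified_ids)                    # [50000 50001 50003]
```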

Where this shows up: the voice mode in modern assistants is exactly this — you speak (audio tokens in), the model “sees” your camera feed (image tokens in), and it speaks back (audio tokens out) with no separate speech-to-text-to-speech pipeline bolted on. Because it’s one stream, the model can be interrupted mid-sentence and can react to a tone of voice, which a chain of separate systems handles clumsily.

Tip

Intuition: “tokens are a universal currency.” Once an image, a sound, and a sentence are all sequences of integers from one vocabulary, the difference between captioning, transcription, and text-to-image is just which tokens you feed in and which you ask the model to produce.

24.7 — Fusion Strategies: Early, Late, and Cross-Attention

A recurring design choice underlies every model so far: at what point do the modalities meet? This is the question of fusion, and there are three canonical answers, each with a different cost–power tradeoff.

A kitchen analogy keeps them straight. Early fusion is throwing every ingredient into one pot from the start — flavors blend completely, but you can’t take the soup apart again. Late fusion is plating two finished dishes side by side and judging them together — clean and reusable, but they never actually mixed. Cross-attention is letting one dish taste the other and adjust its seasoning — they influence each other where it matters, while staying distinct dishes.

Early fusion combines modalities at the input — concatenate raw features or tokens and feed one joint stream to a single model (as in the any-to-any transformers of §24.6). This is maximally expressive, because the model can mix the modalities at every layer, but it is heavy: everything must be processed together, and the two streams cannot be reused independently.

Late fusion keeps the modalities separate through their own encoders and combines them only at the end — concatenating or averaging the final vectors before a small head, or just comparing them as in CLIP. It is modular and efficient (encode each modality once, even offline) and lets you swap encoders, but it can only model shallow interactions, since the streams never see each other internally.

Cross-attention fusion is the middle path that dominates modern VLMs. One modality forms the queries and attends into the other modality’s keys and values, so information flows between streams at chosen layers without fully merging them. In a VLM, text tokens (queries) attend to image patches (keys/values); in diffusion (§24.5), image features attend to text. Attention computes, for query $\mathbf{q}$ over keys $K$ and values $V$:

\[\text{Attn}(\mathbf{q}, K, V) = \text{softmax}\!\left(\frac{\mathbf{q}K^\top}{\sqrt{d}}\right)V\]

In words: score how well the query matches every key, turn those scores into weights that sum to 1, and return a weighted blend of the values — a soft “look up the most relevant items and average them.” Also written: $\sum_j \alpha_j \mathbf{v}_j$ with weights $\alpha_j = \dfrac{\exp(\mathbf{q}\cdot\mathbf{k}_j/\sqrt{d})}{\sum_{j'}\exp(\mathbf{q}\cdot\mathbf{k}_{j'}/\sqrt{d})}$ — an attention-weighted sum of the value vectors.

— a content-addressed lookup that pulls in exactly the relevant patches (the mechanics of attention are Chapter 17; here it is the bridge between modalities).
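The formula translates almost line for line into NumPy. In this minimal sketch, random vectors stand in for text queries and image-patch keys and values, and the learned projection matrices that real models apply before attention are omitted.

```python
import numpy as np

def cross_attention(Q, K, V):
    """Scaled dot-product attention: queries from one modality, keys/values from another."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                           # (n_queries, n_keys)
    scores = scores - scores.max(axis=1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
txt_q = rng.normal(size=(3, 16))      # 3 hypothetical text-token queries
img_k = rng.normal(size=(9, 16))      # 9 hypothetical image-patch keys
img_v = rng.normal(size=(9, 16))      # the matching image-patch values

out, weights = cross_attention(txt_q, img_k, img_v)
print(out.shape)                      # (3, 16): one blended vector per text token
print(weights.sum(axis=1))            # each row of weights sums to 1
```

Each row of `weights` shows which patches a given text token looked at, which is the selectivity that late fusion lacks.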

flowchart TB
  subgraph Early
    a1[Img feats] --> j[concat] --> m1[Joint model] --> o1[Out]
    a2[Txt feats] --> j
  end
  subgraph Late
    b1[Img encoder] --> c[combine at end] --> o2[Out]
    b2[Txt encoder] --> c
  end
  subgraph CrossAttention
    d1[Txt queries] --> x[cross-attention]
    d2[Img keys/values] --> x --> o3[Out]
  end

Strategy	Where they meet	Interaction depth	Cost / modularity	Typical use
Early	input tokens	deepest	costly, not modular	any-to-any transformers
Late	final vectors	shallowest	cheap, very modular	CLIP, retrieval, two-tower
Cross-attention	middle layers	deep, selective	moderate	VLMs, diffusion conditioning

Worked contrast. For a VQA question “what color is the umbrella?”, late fusion would compress the whole image into one vector and hope the color survived — fragile. Cross-attention lets the word umbrella query the patches, attend to the umbrella region, and read its color directly. That selectivity is why cross-attention beats late fusion on fine-grained tasks.

Warning

Don’t reflexively reach for early fusion because it is “most powerful.” It multiplies compute and destroys modularity — you can no longer precompute and cache image embeddings for fast retrieval. For search at scale, late fusion’s reusable vectors are the right answer; save cross-attention for tasks that truly need fine-grained interaction.

24.8 — Multimodal Retrieval-Augmented Generation

A VLM knows only what its weights absorbed in training — it cannot recall your product catalog, this patient’s prior scans, or yesterday’s news photo. Multimodal retrieval-augmented generation (multimodal RAG) fixes that by bolting the shared embedding space (§24.1) onto a generative model: retrieve relevant items across modalities, then condition the generator on them.

Intuition. It is an open-book exam. Instead of forcing the VLM to answer a question about a photo from memory, you first let it pull the most relevant pages — images, captions, table rows — out of a library, lay them on the desk, and answer with the book open. The shared space is what makes “most relevant” computable across formats: encode the query (text or image), find nearest neighbors among indexed items of any modality, and feed the winners to the model.

flowchart LR
  Q[📝/🖼️ Query] --> E[Encoder → query vector]
  E --> R[Nearest-neighbor search]
  DB[(Vector index:\nimages + text + tables)] --> R
  R --> K[Top-k items across modalities]
  K --> G[VLM generator]
  Q --> G
  G --> A[📝 Grounded answer]

The pipeline has two halves. Index (offline): encode every catalog image, document page, and caption with a CLIP-style encoder and store the vectors in an approximate-nearest-neighbor index (for example FAISS or a vector database; see Information Retrieval & Data Mining). Query (online): encode the user’s query into the same space, retrieve the top-$k$ neighbors, and splice them into the VLM’s prompt as context. Because everything shares one space, a text query can retrieve images and vice versa, which is true cross-modal recall.

# multimodal RAG: index images, retrieve by text query, then generate
import faiss
import numpy as np
import open_clip
import torch
from PIL import Image

model, _, prep = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
model.eval()                                   # inference mode
tok = open_clip.get_tokenizer("ViT-B-32")

def embed_image(path):
    """Return a unit-length (1, 512) float32 embedding for one image file."""
    img = prep(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        v = model.encode_image(img)
    return (v / v.norm(dim=-1, keepdim=True)).numpy().astype("float32")

# --- offline: build the index ---
paths = ["a.jpg", "b.jpg", "c.jpg"]
# Exact inner-product search; on unit vectors this equals cosine similarity.
# At scale, swap in an approximate index (e.g. IVF or HNSW).
index = faiss.IndexFlatIP(512)                 # 512 = ViT-B-32 embedding width
index.add(np.vstack([embed_image(p) for p in paths]))

# --- online: retrieve by a text query ---
with torch.no_grad():
    q = model.encode_text(tok(["a dog playing in snow"]))
    q = (q / q.norm(dim=-1, keepdim=True)).numpy().astype("float32")
scores, ids = index.search(q, 2)               # top-2 nearest images
retrieved = [paths[i] for i in ids[0]]
# → feed `retrieved` images + the question into a VLM (e.g. LLaVA) to answer

The payoff is grounded, updatable, attributable generation. Update the knowledge by re-indexing, not retraining; cut hallucination (§24.10) because the model answers from retrieved evidence rather than priors; and cite which retrieved image or page an answer came from. Real systems use it for visual product search (“find the lamp in this photo, in stock under $80”), document QA over scanned PDFs with figures, and medical report drafting against a library of prior cases.
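
The attribution step can be sketched as plain string assembly. `build_grounded_prompt` is a hypothetical helper, the `<image: …>` placeholder is an assumed stand-in for whatever image-token syntax a given VLM expects, and the scores are made up; the point is only that numbering the retrieved evidence gives the model something to cite.

```python
def build_grounded_prompt(question, retrieved, scores):
    """Format retrieved items as numbered evidence so the answer can cite them."""
    lines = ["Answer using only the numbered evidence. Cite sources like [1]."]
    for rank, (item, score) in enumerate(zip(retrieved, scores), start=1):
        lines.append(f"[{rank}] <image: {item}> (similarity {score:.2f})")
    lines.append(f"Question: {question}")
    return "\n".join(lines)

retrieved = ["b.jpg", "a.jpg"]   # e.g. paths returned by the index search
scores = [0.31, 0.27]            # made-up inner-product scores
prompt = build_grounded_prompt("Is the dog wearing a collar?", retrieved, scores)
print(prompt)
```

Because each evidence item carries a stable number, an answer such as "Yes, see [1]" can be traced back to a specific file in the index.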

Tip

Intuition: retrieval is how you give a frozen model fresh, private, or rare knowledge without touching its weights. The shared embedding space is the quiet engine — it is what lets a sentence fetch a picture and a picture fetch a paragraph from the very same index.

24.9 — Evaluating Multimodal Models

Once you have built a captioner or a VQA system, an immediate question follows: how good is it, and how would you know? Multimodal evaluation is genuinely harder than single-modality evaluation, because a “correct” caption is not unique — a photo can be described in a hundred valid ways — and because a model can be right for the wrong reasons (guessing “tennis” whenever it sees a green court, with no real grounding).

Evaluation splits into two families. Discriminative tasks have a checkable answer: VQA accuracy (did it say “three”?), image–text retrieval measured by Recall@K (is the true caption among the top-$K$ nearest?), and zero-shot classification accuracy on benchmarks like ImageNet. These are easy to score but can be gamed by dataset shortcuts.
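
Recall@K is short enough to write out. The sketch below assumes a square similarity matrix in which row i is an image, column j is a caption, and caption i is the one true match for image i; the numbers are invented for illustration.

```python
import numpy as np

def recall_at_k(sim, k):
    """Fraction of images whose true caption (same index) is among the top-k."""
    topk = np.argsort(-sim, axis=1)[:, :k]          # k best caption indices per image
    truth = np.arange(sim.shape[0])[:, None]
    return (topk == truth).any(axis=1).mean()

# Toy 4x4 similarity matrix; the diagonal holds the true image-caption pairs.
sim = np.array([[0.9, 0.1, 0.2, 0.0],
                [0.3, 0.2, 0.8, 0.1],   # true caption (column 1) ranks 3rd
                [0.1, 0.0, 0.7, 0.6],
                [0.2, 0.5, 0.1, 0.4]])  # true caption (column 3) ranks 2nd

r1, r2, r3 = (recall_at_k(sim, k) for k in (1, 2, 3))
print(f"R@1={r1:.2f}  R@2={r2:.2f}  R@3={r3:.2f}")
```

Reporting several values of K together shows whether misses are near misses (recovered at K=2 or 3) or outright failures.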

Generative tasks — captioning, text-to-image — have no single ground truth, so we lean on overlap metrics and learned scores. The classic worked example is captioning: compare the model’s sentence against several human reference captions.

Metric	What it measures	Caveat
BLEU / CIDEr / METEOR	n-gram overlap with reference captions	rewards wording, not correctness
CLIPScore	cosine similarity between image and generated caption (no references needed)	inherits CLIP’s blind spots
FID (text-to-image)	distance between real and generated image feature distributions	needs many samples; not per-image
Human / VLM-as-judge	holistic preference, faithfulness	slow or noisy; judge has its own biases

A tiny CIDEr-style intuition: if the model writes “a dog catching a frisbee” and a human reference says “a dog catches a frisbee on grass,” the shared n-grams (“a dog”, “a frisbee”, and, once words are reduced to their stems as CIDEr does, “catching”/“catches”) drive the score up, while a fabricated word like “umbrella” earns nothing. The metric would, however, also reward a fluent wrong caption that happened to reuse common phrases. That gap is exactly why CLIPScore (which checks the image, not just the words) and human judgment remain necessary.
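
The weakness is easy to reproduce with a stripped-down overlap score. The sketch below computes clipped n-gram precision only; it is not CIDEr itself, which additionally stems words and applies TF-IDF weighting across multiple references. The "wrong" caption is an invented example that names the wrong object but copies the reference's phrasing.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference (counts clipped)."""
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

reference = "a dog catches a frisbee on grass"
good = "a dog catching a frisbee"          # correct, slightly different wording
wrong = "a dog catches a ball on grass"    # fluent, but the object is wrong

for name, caption in [("good", good), ("wrong", wrong)]:
    p1 = clipped_precision(caption, reference, 1)
    p2 = clipped_precision(caption, reference, 2)
    print(f"{name:5s} unigram={p1:.2f} bigram={p2:.2f}")
```

Without stemming, “catching” earns no credit here, and the factually wrong caption outscores the correct one on both unigrams and bigrams simply because it reuses more of the reference's wording.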

# reference-free caption quality with CLIPScore (torchmetrics)
import torch
from torchmetrics.multimodal.clip_score import CLIPScore

clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
# Placeholder batch of random pixels; replace with real images shaped (N, 3, H, W).
images   = torch.randint(0, 255, (1, 3, 224, 224))
captions = ["a red fox sitting in fresh snow"]
# torchmetrics reports 100 x cosine similarity, floored at 0; higher = caption fits image better.
print(clip_score(images, captions).item())   # random noise should score low
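
For reference, the two scores in the table that do not rely on reference captions have compact closed forms. The notation is introduced here as an assumption rather than taken from earlier sections: $E_I$ and $E_T$ are CLIP's image and text encoders, $v$ is an image, $c$ a caption, $w$ a fixed rescaling constant, and $(\mu_r, \Sigma_r)$, $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception features over real and generated images.

```latex
\[
\mathrm{CLIPScore}(c, v) \;=\; w \cdot \max\!\bigl(\cos\bigl(E_T(c),\, E_I(v)\bigr),\; 0\bigr)
\]
\[
\mathrm{FID} \;=\; \lVert \mu_r - \mu_g \rVert_2^2
  \;+\; \operatorname{Tr}\!\bigl(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\bigr)
\]
```

The original CLIPScore formulation uses $w = 2.5$, while torchmetrics rescales by 100, so absolute values are not comparable across implementations. FID compares two estimated distributions, which is why it needs many samples and cannot score a single image.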

Warning

n-gram metrics (BLEU, CIDEr) reward sounding like the references, not being true to the image. A caption can score well while hallucinating, or score poorly while being a perfectly valid alternative phrasing. Never report a single number — pair an overlap metric with a grounding-aware one (CLIPScore) and spot-check with human eyes.

24.10 — Applications and Challenges: Grounding & Cross-Modal Hallucination

The shared-space toolkit powers a wide application surface: semantic search over mixed media, automatic captioning and alt-text for accessibility, visual question answering for documents and charts, content moderation, robotics (mapping camera + instruction to action), medical-image reporting, and creative text-to-image/video tools. But two hard problems recur, and both trace back to the same root — the model has the words without a firm hold on the referents.

The first is grounding: connecting a symbol to the specific thing it denotes. A grounded model that says “the cat is left of the dog” actually located both animals and checked their spatial relation; an ungrounded one pattern-matched a plausible caption. Grounding failures show up as wrong counts (“how many chairs?”), confused spatial relations, and — recalling §24.2 — CLIP’s blindness to “red cube on blue sphere” vs. “blue cube on red sphere.” Stronger grounding is pursued with region-level supervision: training on bounding boxes and referring expressions (“the second person from the left”) so the model learns to point, not just to name.
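
When a model does point, its answer can be scored geometrically. The sketch below assumes boxes in `(x1, y1, x2, y2)` pixel coordinates with invented values. Counting a predicted region as correct when its intersection-over-union with the reference box reaches 0.5 is a common convention in referring-expression benchmarks, and the `left_of` test on box centers is one simple, assumed way to check a spatial claim.

```python
def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    return inter / (area(a) + area(b) - inter)

def left_of(a, b):
    """True if box a's horizontal center lies left of box b's."""
    return (a[0] + a[2]) / 2 < (b[0] + b[2]) / 2

# Hypothetical annotations and one model prediction.
cat_true, dog_true = (10, 40, 110, 140), (200, 30, 320, 150)
cat_pred = (20, 50, 120, 150)          # region the model pointed at for "the cat"

score = iou(cat_pred, cat_true)
print(f"IoU = {score:.3f}, counted correct: {score >= 0.5}")
print("cat left of dog:", left_of(cat_true, dog_true))
```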

The second, and more notorious, is cross-modal hallucination: the model confidently describes things that are not in the input. A VLM may report “a person holding an umbrella” for a rainy street with no umbrella, because its language prior (rain co-occurs with umbrellas in training text) overrides what the pixels show. The danger is that fluent, well-formed language makes these fabrications sound authoritative.

flowchart TB
  IMG[🖼️ Rainy street, no umbrella] --> VLM[VLM]
  PRIOR[Language prior: rain implies umbrellas] --> VLM
  VLM --> OUT["⚠️ 'a person holding an umbrella'"]
  OUT --> NOTE[Hallucination: language prior overrode the pixels]

Mitigations attack the imbalance between priors and evidence: stronger visual grounding so the model must attend to the region it describes; training that penalizes ungrounded claims; retrieval or tool use to verify facts (the multimodal RAG of §24.8 is a direct lever here — answering from retrieved evidence rather than memory); and asking the model to cite where in the image it looked. None fully solve it, so the practical rule is to treat multimodal outputs as drafts to verify, especially in high-stakes settings — a misread medical scan or a fabricated detail in a legal document carries real cost.
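
One way to make "drafts to verify" operational is to count object hallucinations automatically, in the spirit of the CHAIR metric: list the objects a caption mentions and flag those absent from the image. The sketch below is a simplified assumption, using an invented object vocabulary, a hand-written set of objects present (in practice from annotations or a detector), and exact word matching with no handling of synonyms or plurals.

```python
def object_mentions(caption, vocabulary):
    """Vocabulary objects named in the caption (exact word match only)."""
    words = {w.strip(".,!?").lower() for w in caption.split()}
    return words & set(vocabulary)

def hallucination_rate(caption, present, vocabulary):
    """Return (hallucinated objects, fraction of mentioned objects that are absent)."""
    mentioned = object_mentions(caption, vocabulary)
    hallucinated = sorted(mentioned - set(present))
    rate = len(hallucinated) / len(mentioned) if mentioned else 0.0
    return hallucinated, rate

vocabulary = {"person", "umbrella", "car", "street", "dog"}   # hypothetical object list
present = {"person", "street", "car"}                          # what the image contains

caption = "A person holding an umbrella on a rainy street."
hallucinated, rate = hallucination_rate(caption, present, vocabulary)
print("hallucinated:", hallucinated)
print(f"rate: {rate:.2f}")
```

A check like this only catches fabricated objects from a known vocabulary; wrong attributes, counts, and relations still need grounding-aware evaluation or human review.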

Warning

A multimodal model’s fluency is not evidence of faithfulness. The most dangerous failure is a perfectly worded sentence about something that isn’t there. Always ask whether a claim is grounded in the actual input before trusting it.

24.11 — Quick reference

Term / method	One-line meaning	When / why
Shared embedding space	One vector space $\mathbb{R}^d$ holding every modality	The foundation — turns cross-modal tasks into nearest-neighbor lookups
Cosine similarity	$\frac{\mathbf{u}\cdot\mathbf{v}}{\|\mathbf{u}\|\,\|\mathbf{v}\|}$: direction match, ignores length	Scoring how close two embeddings are, across any modality
Modality gap	Image and text vectors form separate clumps that point the same way	Harmless for ranking; never average an image and a text vector
CLIP / InfoNCE	Contrastive loss pulling true image–text pairs together, mismatches apart	Building a shared space from web pairs; powers zero-shot classification
Temperature $\tau$	Learned scale that sharpens/softens the contrastive softmax	Lower $\tau$ = sharper separation of the diagonal
Zero-shot classification	Label an image by nearest class-prompt text vector	Classify novel categories with no labeled training data
ImageBind (anchor binding)	Bind every modality to one hub (images); cross-alignment emerges free	Adding many modalities without all-pairs aligned data
Soft tokens	Image patches projected into the LLM’s embedding dim, spliced into the prompt	How a VLM lets an LLM “see”; resamplers compress them to cut cost
Projector / Q-Former	Small learned bridge mapping vision vectors → LLM space	Train this cheap part first (alignment), then instruction-tune
Cross-attention conditioning	One modality’s tokens (denoiser latents, or text) attend to the other modality’s tokens at each layer	Steering pixels by a prompt in text-to-image; fine-grained VQA
Classifier-free guidance $w$	$\hat\epsilon=\epsilon_\varnothing+w(\epsilon_{\text{text}}-\epsilon_\varnothing)$	The main text-to-image knob; faithfulness vs. diversity, don’t max it
Spectrogram	Sound redrawn as a frequency × time image	Lets vision/transformer machinery process audio (CLAP, Whisper)
VQ tokenization	Encode image/audio chunks as integer codes from a codebook	Any-to-any models: one transformer over one shared vocabulary
Early / Late / Cross fusion	Mix at input / output / middle layers	Late = cacheable retrieval; cross = fine-grained; early = max-power, costly
Multimodal RAG	Retrieve cross-modal items from a shared-space index, then generate	Fresh/private knowledge, attribution, less hallucination, no retraining
Recall@K / CLIPScore / FID	Retrieval hit-rate / image-caption fit / image distribution distance	Evaluation — pair an overlap metric with a grounding-aware one
Grounding	Tying a word to the specific thing it denotes	Failures = wrong counts, confused spatial relations
Cross-modal hallucination	Language prior overrides the pixels (umbrella in the rain)	Treat fluent multimodal output as a draft to verify

24.12 — Key Takeaways

Shared embedding space is the unifying idea: map every modality into one vector space where distance = semantic similarity, then cross-modal tasks become nearest-neighbor lookups.
CLIP trains two encoders contrastively on web image–text pairs (pull true pairs together, push mismatches apart via InfoNCE), enabling zero-shot classification by comparing an image to text prompts; the other items in the batch are the negatives, so big batches help.
Binding through one anchor (ImageBind) generalizes CLIP from two towers to many: bind every modality to a shared hub (images), and cross-modal alignment between the other modalities emerges for free.
Vision-language models / multimodal LLMs turn image patches into soft tokens via a vision encoder + projector, insert them into an LLM’s prompt, and so do captioning and visual QA; train the cheap projector first, then instruction-tune. Image tokens are costly — resamplers compress them.
Text-to-image is diffusion (Ch. 18) conditioned on text via cross-attention, steered by classifier-free guidance $w$; text-to-video adds temporal attention for frame consistency.
Audio-language models (CLAP, Whisper) reuse the recipe on spectrograms; any-to-any models tokenize every modality into one shared vocabulary so a single transformer handles all input/output combinations.
Fusion has three modes: early (input, deep, costly), late (output, shallow, modular/cacheable), and cross-attention (middle, deep + selective). Cross-attention is the usual choice when a task needs fine-grained interaction, as in VLMs and diffusion conditioning.
Multimodal RAG retrieves relevant items across modalities from a shared-space index, then conditions the generator on them — giving fresh, private, attributable, hallucination-resistant answers without retraining.
Evaluation is two-sided: discriminative tasks use accuracy and Recall@K; generative tasks need overlap metrics (BLEU/CIDEr), a grounding-aware score (CLIPScore), a distribution-level image metric (FID), and human judgment. Never rely on a single number.
Grounding (tying words to referents) and cross-modal hallucination (language priors overriding pixels) are the central open challenges; treat fluent multimodal output as a draft to verify.

24.13 — See also

Attention & Transformers — the attention mechanism behind cross-attention fusion and VLM backbones.
Generative Models (Diffusion) — the full diffusion machinery underlying text-to-image and text-to-video.
Computer Vision — Vision Transformers and patch embeddings used as the eyes of every VLM.
Natural Language Processing — text encoders, tokenization, and language modeling on the other side of the bridge.
Large Language Models — the LLM backbones that multimodal LLMs extend with vision and audio.
Speech & Audio Processing — spectrograms and the audio front-ends reused by audio-language models.
Dimensionality Reduction & Embeddings — embeddings and vector spaces, the mathematical home of the shared space.
Retrieval-Augmented Generation & Vector Search — nearest-neighbor indexes and the retrieve-then-generate pattern behind multimodal RAG.
Explainable AI & Interpretability — methods for probing grounding and diagnosing hallucination.

↪ The thread continues → Chapter 25 · 🕹️ Reinforcement Learning

Everything so far learns from a fixed dataset. But intelligence also means acting and learning from consequences — trial, error, and reward.
