Chapter 20 — 💬 Natural Language Processing

📖 All chapters | ← 19 · 👁️ Computer Vision | 21 · 🔊 Speech & Audio Processing →

📚 Jump to any chapter

🧮 Mathematical Foundations

🧭 The ML Workflow

🧩 Classical Machine Learning

🎲 Probabilistic Models

13 · 🕸️ Probabilistic Graphical Models

🧠 Deep Learning

🗣️ Applied AI: Vision, Language, Audio & Time

🕹️ Reinforcement Learning

25 · 🕹️ Reinforcement Learning

🛠️ Applied ML Systems & Industries

🚀 Production, Tooling & Infrastructure

📚 Classical & Symbolic AI

⚖️ Responsible AI & Frontier

🎓 Advanced & Specialized Topics

🎚️ Post-Training & Fine-Tuning

🚢 Model Serving & Deployment

47 · 🚢 Model Serving & Deployment in Production

Natural Language Processing (NLP) is the part of AI that teaches machines to read, understand, and generate human language — the messy, ambiguous medium people actually communicate in. It sits at the meeting point of linguistics, classical ML, and deep learning, and it underpins search engines, translators, spam filters, voice assistants, and the large language models that have reshaped the field. This chapter walks the full arc: from turning raw text into numbers, through the embeddings and language models that capture meaning, up to translation and dialogue systems.

🧭 In context: Applied AI · turning human language into representations a model can compute on, then predicting/generating language · the one key idea: meaning lives in context, so good NLP is the art of encoding context numerically.

💡 Remember this: every NLP system is the same two-step recipe — turn text into vectors that capture meaning from context, then do math on those vectors to predict, label, or generate.

The whole chapter is one climb — raw text gets steadily richer representations, and a small animated pulse runs along that ladder here to set the map in your head:

20.1 — Tokenization & text preprocessing

A computer cannot multiply the word “running” by a weight. Before any math happens, text must be sliced into discrete units and mapped to numbers. Tokenization is that slicing step: breaking a string into tokens — words, subwords, or characters — which become the atoms a model sees.

The naive approach is whitespace/word tokenization: split on spaces and punctuation. "The cats ran." → ["The", "cats", "ran", "."]. Simple, but it explodes the vocabulary (every inflection — run, runs, running, ran — is a separate token) and chokes on words it never saw in training (out-of-vocabulary, OOV).

Three classic tools tame the vocabulary:

Stop-word removal drops ultra-frequent, low-information words (the, is, of, a). Useful for bag-of-words search; harmful for tasks where not or the changes meaning.
Stemming chops words to a crude root with rule-based suffix stripping (Porter stemmer): running → run, studies → studi. Fast, but the stem need not be a real word.
Lemmatization maps a word to its dictionary lemma using morphology and part-of-speech: better → good, studies → study. Slower, linguistically correct.

# Tiny stemming-by-rule demo (toy Porter-style)
def stem(w):
    for suf in ("ing", "ies", "ed", "s"):      # order matters: longest first
        if w.endswith(suf) and len(w) - len(suf) >= 2:
            return w[:-len(suf)] + ("y" if suf == "ies" else "")
    return w

for w in ["running", "studies", "jumped", "cats"]:
    print(w, "->", stem(w))
# running -> runn   studies -> study   jumped -> jump   cats -> cat

Notice running → runn — that ugliness is exactly why real stemmers have dozens of rules, and why lemmatization is preferred when correctness matters.

The modern answer to the OOV problem is subword tokenization: build a vocabulary of frequent character chunks so that any word can be reconstructed from pieces. Think of it like LEGO: instead of stocking one pre-built model for every word you might ever say, you keep a box of small reusable bricks (un, believ, able) and snap together whatever word arrives — including ones you have never seen. Byte-Pair Encoding (BPE) starts from individual characters and greedily merges the most frequent adjacent pair, over and over, until the vocabulary reaches a target size.

Watch the bricks snap together: a never-before-seen word still tokenizes cleanly because the pieces already exist.

Worked BPE example on the corpus low low lower newest newest widest (treat each word as a sequence of characters with a word-boundary marker _):

Step	Most frequent pair	Merge → new token
1	`e` `s` (in newest, widest)	`es`
2	`es` `t`	`est`
3	`l` `o`	`lo`
4	`lo` `w`	`low`

After a few merges, newest tokenizes as new + est, and a brand-new word slowest becomes s + low + est — no OOV, no exploded vocabulary. This is why GPT, BERT, and friends all use subword schemes (BPE, WordPiece, or SentencePiece/Unigram).

The merge rule that BPE applies at each step is just “pick the pair that shows up most often”:

\[ (a, b) = \arg\max_{(x,y)} \ \text{count}(x, y) \]

In words: scan all adjacent token pairs in the corpus and merge the single pair that occurs most frequently into one new token; repeat. Also written: \(\text{merge} = \operatorname*{argmax}_{(x,y)\in \text{pairs}} \ \text{freq}(xy)\) — choose the bigram \(xy\) with maximum frequency.

This figure shows the same vocabulary trade-off the three schemes make — words are reliable but explode and break on new words, characters never break but lose all word identity, and subwords sit in the sweet spot:

flowchart LR
  A["Raw text<br/>'The cats ran.'"] --> B[Normalize<br/>lowercase, strip]
  B --> C[Tokenize]
  C --> D{Scheme?}
  D -->|word| E["The · cats · ran · ."]
  D -->|subword BPE| F["the · cat · s · ran · ."]
  E --> G[Map to integer IDs]
  F --> G

In real projects you almost never hand-roll a tokenizer; you load a pretrained one that ships with the model. Here is the idiomatic Hugging Face version:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = tok("The cats ran unbelievably fast.")
print(tok.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'cats', 'ran', 'un', '##bel', '##ie', '##va', '##bly', 'fast', '.', '[SEP]']
# note: 'unbelievably' is split into WordPiece subwords (## = continuation)

Tip

Rule of thumb: for classical bag-of-words pipelines, lemmatize + remove stop-words. For anything feeding a neural net or transformer, use subword tokenization and skip stemming/stop-words entirely — the model learns what to ignore.

20.2 — Bag of Words & TF-IDF

Once you have tokens, the simplest way to turn a document into a fixed-length vector is the Bag of Words (BoW): ignore order entirely, and just count how often each vocabulary word appears. Picture emptying a sentence into a bag, shaking it, and only recording which words fell out and how many of each — the sequence is gone, only the tallies remain. The document becomes a vector as long as the vocabulary, mostly zeros (sparse).

Tiny corpus of two documents:

\(d_1\): “the cat sat”
\(d_2\): “the dog sat the mat”

Vocabulary = {the, cat, sat, dog, mat}. Count vectors:

	the	cat	sat	dog	mat
\(d_1\)	1	1	1	0	0
\(d_2\)	2	0	1	1	1

The problem: “the” dominates by raw count yet carries no topical signal. TF-IDF (Term Frequency–Inverse Document Frequency) fixes this by down-weighting words that appear in many documents. The intuition: a word is informative about a document only if it is common here but rare elsewhere. “The” is everywhere, so it tells you nothing; “amortization” in one report screams what that report is about. TF-IDF multiplies two factors:

\[ \text{tf}(t,d) = \frac{\text{count of } t \text{ in } d}{\text{total terms in } d}, \qquad \text{idf}(t) = \log\frac{N}{1 + \text{df}(t)} \]

In words: tf is how often a word shows up in this one document (its local prominence); idf is how rare the word is across the whole collection (its global distinctiveness). Multiply them so a word scores high only when it is both frequent here and rare everywhere else. Also written: \(\text{idf}(t) = \log N - \log(1 + \text{df}(t))\), and the final score in vector form is \(\text{tfidf}(t,d) = \text{tf}(t,d)\cdot\text{idf}(t)\).

where \(N\) is the number of documents and \(\text{df}(t)\) is how many documents contain \(t\). The score is \(\text{tfidf}(t,d) = \text{tf}(t,d)\cdot\text{idf}(t)\).

Our two-document corpus is too small to show idf working — with only two docs, every word is either in one or both, so the numbers collapse to near-zero or negative and look broken. That collapse is itself the lesson: idf only separates common from rare words when you have many documents. Picture a realistic corpus of thousands of news articles. “The” appears in nearly all of them, so its idf is ≈ 0 and TF-IDF erases it. “Cat” appears in just a handful, so its idf is large and TF-IDF lights it up. Give idf a big collection and it automatically surfaces each document’s distinctive words.

import numpy as np
docs = [["the","cat","sat"], ["the","dog","sat","the","mat"]]
vocab = sorted({w for d in docs for w in d})
N = len(docs)
df = {w: sum(w in d for d in docs) for w in vocab}
def tfidf(doc):
    v = np.zeros(len(vocab))
    for i, w in enumerate(vocab):
        tf = doc.count(w) / len(doc)
        idf = np.log(N / (1 + df[w]))
        v[i] = tf * idf
    return v
print(vocab)
print(tfidf(docs[0]).round(3))   # 'the','sat' shrink toward 0

In practice, scikit-learn does the whole thing in two lines — and is what you would actually ship:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the dog sat the mat"]
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)          # sparse TF-IDF matrix, shape (2, n_vocab)
print(vec.get_feature_names_out())     # ['cat' 'dog' 'mat' 'sat' 'the']
print(X.toarray().round(3))            # rows = documents, columns = words

TF-IDF vectors plug directly into cosine similarity for search and into linear classifiers for text classification. They are a strong, interpretable baseline — but they know nothing about word meaning: “car” and “automobile” are as unrelated as “car” and “banana”. That gap is what embeddings close.

Warning

BoW/TF-IDF throw away word order. “Dog bites man” and “man bites dog” get identical vectors. When order or negation matters, you need sequence models (Recurrent & Sequence Models) or transformers (Attention & Transformers).

20.2b — n-grams: putting a little order back

Pure bag-of-words is amnesiac about order, but there is a cheap halfway fix: instead of counting single words (unigrams), count short adjacent runs of words — n-grams. A bigram model counts pairs like “not good”; a trigram counts triples like “not very good”. By treating “not good” as its own feature, the model can finally tell a negated phrase from a bare positive word.

For our sentence “the dog sat the mat”, the bigrams are:

	the dog	dog sat	sat the	the mat
count	1	1	1	1

This is why “New York” survives as a meaningful unit instead of dissolving into two ordinary words. The cost is a much larger, sparser feature space — the number of possible bigrams is roughly the vocabulary size squared — so n-grams trade memory for a small, local memory of order. In scikit-learn it is a one-argument change:

# include both single words and adjacent pairs as features
vec = TfidfVectorizer(ngram_range=(1, 2))   # unigrams + bigrams
X = vec.fit_transform(corpus)

n-grams are the bridge between order-blind BoW and the genuinely sequential models (RNNs, transformers) later in this chapter — and the same n-gram counting idea reappears in language modeling (Section 20.4) and in BLEU (Section 20.8).

20.3 — Word embeddings

A word embedding is a dense, low-dimensional vector (say 100–300 numbers) that represents a word’s meaning, learned so that words used in similar contexts land near each other in space. The guiding principle, the distributional hypothesis, is a line from linguist J.R. Firth: “You shall know a word by the company it keeps.” If two words show up around the same neighbors, they probably mean similar things. The everyday version: you can guess a stranger’s personality from the friends they hang out with — words work the same way.

Once words become points in space, “similar meaning” is just “nearby” — here a query word pulses and its nearest neighbors light up around it:

word2vec learns these vectors with a tiny prediction task slid over a corpus. It comes in two flavors:

CBOW (Continuous Bag of Words): given the surrounding context words, predict the center word. Fast, good for frequent words.
Skip-gram: given the center word, predict each surrounding context word. Slower, better for rare words and small corpora.

flowchart LR
  subgraph CBOW
    c1[context] --> ct((center?))
    c2[context] --> ct
    c3[context] --> ct
  end
  subgraph Skip-gram
    cn((center)) --> o1[context?]
    cn --> o2[context?]
    cn --> o3[context?]
  end

In skip-gram, the model wants the dot product of a center word’s vector \(v_c\) and a true context word’s vector \(u_o\) to be large; the softmax over the vocabulary gives the probability

\[ P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}. \]

In words: the chance that word \(o\) appears near center word \(c\) grows with how aligned their two vectors are (big dot product = pointing the same way), normalized against every other word in the vocabulary so the probabilities sum to one. Also written: \(P(o \mid c) = \operatorname{softmax}_o\!\big(U v_c\big)\), where \(U\) stacks every word’s context vector into a matrix and the softmax is taken over its rows.

Because that denominator sums over the whole vocabulary, real implementations use negative sampling: for each true (center, context) pair, sample a few random “negative” words and just train the model to say yes to the real pair and no to the fakes — a handful of cheap logistic-regression updates instead of a giant softmax.

GloVe (Global Vectors) reaches a similar place from a different angle: it factorizes a global word co-occurrence matrix, fitting vectors so that \(v_i^\top v_j\) approximates \(\log\) of how often words \(i\) and \(j\) co-occur. word2vec is “predictive/local,” GloVe is “count-based/global”; in practice they yield comparably useful vectors.

The famous payoff is that analogies become vector arithmetic. The vector offset from man to woman turns out to be roughly the same as king to queen, so:

\[ \vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}. \]

In words: start at “king”, subtract the “maleness” direction (the step from man to woman, taken backward), add the “femaleness” direction, and you land essentially on “queen” — the geometry encodes the gender relationship as a reusable direction. Also written: \(\vec{king} - \vec{queen} \approx \vec{man} - \vec{woman}\) — the two gender offsets are (approximately) the same vector.

The two green arrows are nearly parallel and equal — that geometric regularity is why the arithmetic works. These embeddings are static: one vector per word, so “bank” (river vs. money) gets a single blurred vector. Contextual embeddings from transformers (Attention & Transformers) fix that by giving each occurrence its own vector.

In code, you rarely train word2vec from scratch — you load pretrained vectors and ask for neighbors:

import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")          # 100-dim GloVe vectors
print(wv.most_similar("king", topn=3))            # [('queen', ...), ('prince', ...), ...]
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# [('queen', 0.77)]  -> the king - man + woman analogy, computed for real

Tip

Cosine similarity, not Euclidean distance, is the standard way to compare embeddings — direction encodes meaning, magnitude largely encodes word frequency.

The cosine similarity that does this comparing is worth stating once:

\[ \cos(\theta) = \frac{a \cdot b}{\lVert a\rVert\,\lVert b\rVert}. \]

In words: measure the angle between two word vectors — point the same way (small angle) → score near 1 → similar meaning; point oppositely → near −1; perpendicular → 0 (unrelated). It deliberately ignores how long the vectors are. Also written: \(\cos(\theta) = \hat{a}\cdot\hat{b}\), the dot product of the two vectors after each is normalized to unit length (\(\hat{a} = a/\lVert a\rVert\)).

Worked cosine example with tiny 2-D vectors. Let \(a = \vec{cat} = (2, 1)\) and \(b = \vec{dog} = (3, 1)\):

Dot product: \(a\cdot b = 2\cdot 3 + 1\cdot 1 = 7\).
Lengths: \(\lVert a\rVert = \sqrt{4+1} = \sqrt5 \approx 2.24\), \(\lVert b\rVert = \sqrt{9+1} = \sqrt{10} \approx 3.16\).
\(\cos(\theta) = 7 / (2.24 \times 3.16) \approx 0.99\) — nearly 1, so “cat” and “dog” point almost the same way and count as very similar. A word like \(\vec{idea} = (-1, 2)\) gives \(a\cdot b = -2+2 = 0 \Rightarrow \cos = 0\): perpendicular, unrelated. The geometry matches the intuition.

20.4 — word2vec, in depth

Section 20.3 introduced word embeddings and named word2vec as the predictive way to learn them. This section opens the box: how skip-gram actually turns a sliding window of words into trained vectors, why the training objective is reshaped into a handful of tiny logistic problems, and which knobs decide whether the vectors come out topical or syntactic. Everything here is still the distributional hypothesis — “a word is known by the company it keeps” — but made mechanical.

The skip-gram architecture

Skip-gram is a deceptively shallow network: one hidden layer, no nonlinearity, trained on a fake task it never actually cares about. The setup is a center word \(c\) and the context words inside a window around it; the task is “given the center word, predict each context word.” We will never use the predictions — we only want the weights the network is forced to learn to make those predictions well.

Concretely the data path is three steps. The center word arrives as a one-hot vector of length \(V\) (vocabulary size): all zeros except a single 1 at the word’s index. Multiplying that one-hot by an input embedding matrix \(W\) of shape \(V \times d\) is just a row lookup — it selects row \(c\) of \(W\), which is the center word’s vector \(v_c \in \mathbb{R}^d\). That \(v_c\) is then scored against an output matrix \(W'\) of shape \(d \times V\), whose columns \(u_w\) are the context vectors; the dot products \(u_w^\top v_c\) become logits, and a softmax turns them into a distribution over which word sits in the context.

The crucial structural fact: every word owns two vectors — a center (input) vector in \(W\) and a context (output) vector in \(W'\). The same word “dog” has one row in \(W\) (used when “dog” is the center) and one column in \(W'\) (used when “dog” is a context word). After training you throw \(W'\) away and keep \(W\) as your embeddings (or, commonly, average the two).

Negative sampling, in detail

The honest skip-gram softmax of Section 20.3 has a denominator that sums over the entire vocabulary — hundreds of thousands of dot products per training word. Negative sampling sidesteps that sum entirely. Instead of asking “which of all \(V\) words is the context?”, it asks a much cheaper yes/no question: for the true pair \((c, o)\), push their vectors together; for a few random “negative” words \(w_1,\dots,w_k\), push their vectors apart. Each training step is then just \(k+1\) tiny logistic regressions. The objective to maximize for one center–context pair is

\[ J = \log \sigma(u_o^\top v_c) + \sum_{i=1}^{k} \log \sigma(-u_{w_i}^\top v_c), \]

where \(\sigma(x) = 1/(1+e^{-x})\) is the logistic sigmoid and the \(w_i\) are the sampled negatives.

In words: make the real context word’s vector align with the center vector (drive \(\sigma(u_o^\top v_c)\) toward 1, so its log toward 0), and make each random negative word’s vector anti-align (drive \(\sigma(-u_{w_i}^\top v_c)\) toward 1, i.e. \(u_{w_i}^\top v_c\) toward \(-\infty\)). One “yes” and \(k\) “no”s per step. Also written: using \(\sigma(-x) = 1 - \sigma(x)\), the equivalent minimized loss is \(\mathcal{L} = -\log \sigma(u_o^\top v_c) - \sum_{i=1}^{k} \log\big(1 - \sigma(u_{w_i}^\top v_c)\big)\) — a binary cross-entropy with label 1 on the true pair and label 0 on each negative.

Where do the negatives come from? Not uniformly, and not from the raw word frequencies — but from the unigram distribution raised to the 3/4 power:

\[ P_n(w) = \frac{f(w)^{3/4}}{\sum_{w'} f(w')^{3/4}}. \]

The \(3/4\) exponent is a deliberate compromise found empirically by the word2vec authors. Sampling proportional to raw frequency \(f(w)\) would drown training in ultra-common words like “the”; sampling uniformly would waste updates on rare junk. The \(3/4\) power flattens the distribution part-way — it pulls down the towering frequent words and lifts the rare ones, so negatives are drawn from a healthier mix that still respects frequency but is not dominated by it.

Subsampling and the dynamic window

Two more tricks sharpen the vectors, both targeting the same problem: frequent words are loud but uninformative.

Subsampling of frequent words randomly discards word occurrences during training, with a drop probability that rises for common words:

\[ P_{\text{drop}}(w) = 1 - \sqrt{\frac{t}{f(w)}}, \]

where \(f(w)\) is the word’s frequency and \(t\) is a small threshold (typically \(\sim 10^{-5}\)). A word rarer than \(t\) is essentially never dropped; “the” is dropped most of the time. This thins out the billions of near-useless “the … the … the” co-occurrences, which both speeds training and — counter-intuitively — improves the vectors, because the rare, meaning-bearing words now dominate the gradient.

The dynamic context window does something similar on the other axis. Rather than always using a fixed window of size \(m\), word2vec samples a window size uniformly between 1 and \(m\) for each center word. The effect is a soft distance weighting: nearby words land in the window on almost every sample, while words at the far edge of the maximum window only count when a large size happens to be drawn. Closer context — which is more syntactically and semantically relevant — therefore gets weighted more heavily, for free, without any explicit decay term.

Both tricks point the same direction: spend the model’s capacity on the informative, nearby, rarer words and stop letting frequent or distant words blur the vectors.

Hierarchical softmax (the alternative)

Negative sampling is the usual choice, but word2vec ships a second way to dodge the \(V\)-way softmax: hierarchical softmax. It arranges the vocabulary as the leaves of a binary Huffman tree (frequent words near the root, rare words deep). The probability of a word becomes the product of left/right branching decisions along its root-to-leaf path, each decision a single logistic unit. A word’s probability then costs \(O(\log V)\) dot products instead of \(O(V)\), and crucially the probabilities are still normalized for free by the tree structure. Rule of thumb from practice: hierarchical softmax tends to do better on rare words (their short Huffman-coded paths share parameters), while negative sampling is simpler, often faster, and usually better on frequent words — which is why negative sampling became the default.

Hyperparameters that matter

Hyperparameter	Typical range	Intuition
Vector dim \(d\)	100–300	More dimensions hold finer distinctions but cost memory and risk overfitting small corpora
Window size	2–10	Larger → topical (“doctor”≈“hospital”); smaller → syntactic (“doctor”≈“dentist”)
#negatives \(k\)	5–20	Small corpora want more (5–20); huge corpora do fine with 2–5
Epochs	5–15	More passes help small data; large corpora converge in a few
min-count	5–100	Drop words rarer than this — too-rare words never get enough context to learn a good vector

The window size is the lever worth remembering: widen it and the vectors capture what topic a word belongs to; narrow it and they capture what grammatical slot a word fills.

Framework code

In practice you train word2vec with gensim in a few lines — feed it tokenized sentences and ask for neighbors or analogies:

from gensim.models import Word2Vec

sentences = [["the", "king", "ruled", "the", "land"],
             ["the", "queen", "ruled", "wisely"]]            # real corpora: millions of these
model = Word2Vec(sentences, vector_size=100, window=5,
                 negative=10, min_count=1, sg=1, epochs=10)   # sg=1 → skip-gram
print(model.wv.most_similar("king", topn=3))
# analogy: king - man + woman ≈ ?
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))

To see what negative=10 is really doing, here is the same skip-gram-with-negative-sampling forward pass and loss from scratch in PyTorch — two embedding tables, a sigmoid, nothing else:

import torch, torch.nn as nn, torch.nn.functional as F

class SkipGramNS(nn.Module):
    def __init__(self, vocab, dim):
        super().__init__()
        self.center  = nn.Embedding(vocab, dim)   # W  : center vectors v_c
        self.context = nn.Embedding(vocab, dim)    # W' : context vectors u_w

    def forward(self, c, pos, neg):                # c:(B,) pos:(B,) neg:(B,k)
        v_c  = self.center(c)                      # (B, d)
        u_o  = self.context(pos)                   # (B, d)
        u_n  = self.context(neg)                   # (B, k, d)
        pos_score = (u_o * v_c).sum(-1)            # u_o · v_c        -> (B,)
        neg_score = torch.bmm(u_n, v_c.unsqueeze(-1)).squeeze(-1)  # (B, k)
        # maximize log σ(pos) + Σ log σ(-neg)  ==  minimize the negative
        loss = -(F.logsigmoid(pos_score) + F.logsigmoid(-neg_score).sum(-1))
        return loss.mean()

The F.logsigmoid(-neg_score) term is exactly the \(\sum \log \sigma(-u_{w_i}^\top v_c)\) from the objective above — the model’s “say no to the fakes” half.

Evaluating the vectors

Embeddings are judged two ways. Intrinsic evaluation tests the vectors directly: the word-analogy task (does \(\vec{king}-\vec{man}+\vec{woman}\) land nearest “queen”?) and word-similarity benchmarks like WordSim-353 or SimLex-999, which compare cosine similarities against human-rated word-pair similarity scores. Extrinsic evaluation ignores the geometry and asks the only question that ultimately matters: do these vectors help a downstream task? — feed them into an NER tagger, a sentiment classifier, or a retrieval system and measure that system’s accuracy. Intrinsic tests are fast and diagnostic; extrinsic tests are slow but decisive, and the two do not always agree, so good practice reports both.

The limitation that defined what came next

word2vec’s vectors are static: each word gets exactly one vector, fixed after training. That single vector must average over all the word’s senses. “bank” by the river and “bank” holding your money collapse into one blurred point that is neither — and “apple” the fruit and “Apple” the company are forced to share an address. There is no mechanism for the vector to change based on the sentence it appears in.

This is precisely the gap that contextual models closed. By reading the whole sentence, Attention & Transformers give each occurrence of a word its own vector — “bank” near “river” and “bank” near “loan” come out different — and that contextual representation is the foundation the modern Large Language Models are built on. word2vec is where the embedding idea was made practical; contextual embeddings are where it was made dynamic. (For the geometry-of-vectors ideas underlying all of this, see dimensionality reduction; for the same skip-gram trick applied to graphs rather than text, see node2vec.)

20.5 — Language models (n-gram to neural; perplexity)

A language model (LM) assigns probabilities to sequences of words — equivalently, it predicts the next word given the previous ones. Intuitively it is a very well-read autocomplete: having seen oceans of text, it knows that after “I poured myself a cup of” the word “coffee” is far likelier than “concrete”. This single capability powers autocomplete, speech recognition, translation, and ChatGPT. By the chain rule, the probability of a sentence factorizes as

\[ P(w_1,\dots,w_n) = \prod_{i=1}^{n} P(w_i \mid w_1,\dots,w_{i-1}). \]

In words: the probability of a whole sentence is the running product of “how likely is each next word, given everything written so far” — read the sentence left to right, multiplying one next-word probability at a time. Also written: in log space (how it is actually computed) this is a sum, \(\log P(w_1,\dots,w_n) = \sum_{i=1}^{n} \log P(w_i \mid w_{<i})\), where \(w_{<i}\) is shorthand for all words before position \(i\).

The full history is impossible to count, so the classical n-gram model makes a Markov assumption: the next word depends only on the previous \(n-1\) words. A bigram model (\(n=2\)) estimates probabilities by simple counting:

\[ P(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i)}{\text{count}(w_{i-1})}. \]

In words: to guess the next word, look only at the single word before it; the probability is just “of all the times this previous word appeared, what fraction of the time was it followed by this next word.” Also written: \(P(w_i \mid w_{i-1}) = \dfrac{c(w_{i-1} w_i)}{\sum_{w} c(w_{i-1} w)}\) — the denominator is equivalently the total count of the previous word as a left-context.

Worked bigram example. Training text: "I like cats . I like dogs ."

count(“I like”) = 2, count(“I”) = 2 → \(P(\text{like}\mid \text{I}) = 2/2 = 1.0\)
count(“like cats”) = 1, count(“like”) = 2 → \(P(\text{cats}\mid\text{like}) = 0.5\)

So the model assigns \(P(\text{cats}\mid\text{like}) = 0.5\) and \(P(\text{dogs}\mid\text{like}) = 0.5\) — it learned that “like” is followed equally by cats or dogs. If we now ask about “like fish”, the count is 0 → probability 0, which would zero out the whole sentence. The fix is smoothing (e.g., add-one/Laplace), which shaves probability mass off seen events and hands it to unseen ones.

Add-one (Laplace) smoothing makes the zero impossible by pretending every pair was seen one extra time:

\[ P_{\text{add-1}}(w_i \mid w_{i-1}) = \frac{\text{count}(w_{i-1}, w_i) + 1}{\text{count}(w_{i-1}) + |V|}. \]

In words: give every possible next word a tiny “starter credit” of one count so nothing is ever flatly impossible; the denominator grows by the vocabulary size \(|V|\) to keep the probabilities summing to one. Also written: the general additive form \(P_{\text{add-}k} = \dfrac{c(w_{i-1}w_i) + k}{c(w_{i-1}) + k|V|}\), with \(k=1\) recovering Laplace.

n-grams are simple and fast but suffer two fatal limits: they cannot see beyond \(n-1\) words (no long-range context), and the table grows explosively. Neural language models replace the count table with a network that maps word embeddings of the context to a softmax over the next word — generalizing to unseen contexts because similar words share similar vectors. RNNs (Recurrent & Sequence Models) and then Transformers (Attention & Transformers) extended the context window dramatically.

How do we score an LM? With perplexity (PPL) — intuitively, the average number of equally-likely choices the model feels it has at each step. A perplexity of 8 means the model is, on average, as unsure as if it were rolling an 8-sided die for each next word. Lower is better. For a test sequence,

\[ \text{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \ln P(w_i \mid w_{<i})\right). \]

In words: take the model’s average surprise per word (the average negative log-probability it assigned to the words that actually came), then undo the log to turn it back into an effective “number of choices”; the more confident and correct the model, the smaller this number. Also written: \(\text{PPL} = \left(\prod_{i=1}^{N} P(w_i \mid w_{<i})\right)^{-1/N}\) — the inverse geometric mean of the per-word probabilities.

A perplexity of 1 means perfect prediction; a perplexity of \(|V|\) means the model is no better than guessing uniformly. If our bigram model assigns probabilities \(0.5, 0.5, 1.0\) to a 3-word test, \(\text{PPL} = \exp(-\frac13(\ln 0.5 + \ln 0.5 + \ln 1.0)) = \exp(0.462) \approx 1.59\) — between 1 and 2 effective choices per word.

import numpy as np
probs = [0.5, 0.5, 1.0]                       # P(word | history) for test tokens
ppl = np.exp(-np.mean(np.log(probs)))
print(round(ppl, 2))                          # 1.59

Warning

Perplexity is only comparable across models that share the same tokenization and vocabulary. A subword model and a word model can’t be ranked by raw PPL. And low perplexity ≠ useful output — it measures prediction, not truthfulness or helpfulness.

20.6 — Named Entity Recognition & POS tagging (sequence labeling)

Many NLP tasks are sequence labeling: assign a label to every token in a sentence, where the labels depend on each other. Think of a highlighter pass over a page — you color each word according to its job, but the colors follow rules (a “middle-of-a-name” color can’t appear unless a “start-of-a-name” color came just before it). Two staples:

Part-of-Speech (POS) tagging: label each word with its grammatical role — noun, verb, adjective, etc.
Named Entity Recognition (NER): find and classify spans naming real-world things — people, organizations, locations, dates.

Because entities can span multiple tokens, NER uses the BIO scheme: B- marks the Beginning of an entity, I- marks tokens Inside it, and O marks tokens Outside any entity.

Token	Jane	Smith	works	at	Acme	Corp	.
POS	PROPN	PROPN	VERB	ADP	PROPN	PROPN	PUNCT
NER (BIO)	B-PER	I-PER	O	O	B-ORG	I-ORG	O

“Jane Smith” is one PERSON span (B-PER then I-PER); “Acme Corp” is one ORG span. The labels are not independent — I-PER can only follow B-PER or I-PER, never O. Capturing those transition constraints is the heart of sequence labeling.

Classical models do this with a Conditional Random Field (CRF), which scores a whole label sequence using both how well each label fits its word (emission) and how compatible adjacent labels are (transition), then picks the best path with the Viterbi algorithm:

\[ \text{score}(x, y) = \sum_{i} \big(\underbrace{\psi(y_i, x_i)}_{\text{emission}} + \underbrace{T(y_{i-1}, y_i)}_{\text{transition}}\big). \]

In words: rate a candidate labeling of the whole sentence by adding up two things at every position — how well the label fits the word under it (emission), and how legal it is to jump from the previous label to this one (transition) — then keep the labeling with the highest total. Also written: \(\text{score}(x,y) = \sum_i \psi(y_i, x_i) + \sum_i T(y_{i-1}, y_i)\), splitting the per-word fit and the label-to-label compatibility into two separate sums.

flowchart LR
  W1[Jane] --> T1((B-PER))
  W2[Smith] --> T2((I-PER))
  W3[works] --> T3((O))
  T1 -. legal .-> T2
  T2 -. legal .-> T3
  T1 -. "O→I-PER illegal" .-x TX((I-PER))

The modern recipe is a BiLSTM (Recurrent & Sequence Models) or Transformer (Attention & Transformers) encoder that reads the whole sentence both directions to produce context-aware token vectors, often with a CRF layer on top to enforce legal transitions. Fine-tuned BERT-style models are today’s default for NER and POS.

In practice a pretrained pipeline gives you spans straight out of the box:

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Jane Smith works at Acme Corp in Tokyo.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Jane Smith PERSON
# Acme Corp  ORG
# Tokyo      GPE

Tip

The transition matrix is doing real work: even a mediocre per-token classifier improves sharply once a CRF forbids impossible label sequences like O → I-ORG.

20.7 — Sentiment analysis & text classification

Text classification assigns a whole document to one of a fixed set of categories: spam vs. ham, topic = {sports, politics, tech}, or — the canonical example — sentiment analysis, labeling text as positive, negative, or neutral. It is the workhorse of applied NLP: review monitoring, content moderation, support-ticket routing.

The pipeline is the general supervised-learning loop (AI, ML & the Learning Process) wearing a text costume: turn each document into a vector (TF-IDF or embeddings), then train any classifier (logistic regression, Naive Bayes, an LSTM, or a fine-tuned transformer).

Naive Bayes is the classic, fast sentiment baseline. The “naive” intuition: pretend each word votes on the sentiment independently, ignoring how words combine, then tally the votes weighted by how strongly each word leans positive or negative. It applies Bayes’ rule and “naively” assumes words are independent given the class:

\[ P(\text{pos} \mid \text{doc}) \propto P(\text{pos}) \prod_{w \in \text{doc}} P(w \mid \text{pos}). \]

In words: the document’s score for “positive” is the base rate of positive documents multiplied by each word’s individual probability under the positive class — every word chips in a multiplicative vote, and whichever class scores higher wins. Also written: in log space, \(\log P(\text{pos}\mid\text{doc}) = \log P(\text{pos}) + \sum_{w\in\text{doc}} \log P(w\mid\text{pos}) + \text{const}\) — products of probabilities become a sum of log-probabilities.

Worked example. Suppose training gives these word likelihoods:

word	\(P(w\mid\text{pos})\)	\(P(w\mid\text{neg})\)
great	0.40	0.05
terrible	0.04	0.45
movie	0.20	0.20

Classify “great movie” with equal priors \(P(\text{pos})=P(\text{neg})=0.5\):

pos score: \(0.5 \times 0.40 \times 0.20 = 0.040\)
neg score: \(0.5 \times 0.05 \times 0.20 = 0.005\)

Positive wins by 8×. Notice “movie” is uninformative (same likelihood both ways) and cancels out — the polarity words drive the decision.

import numpy as np
Lpos = {"great":0.40, "terrible":0.04, "movie":0.20}
Lneg = {"great":0.05, "terrible":0.45, "movie":0.20}
def classify(doc):
    p = 0.5 * np.prod([Lpos[w] for w in doc])   # assumes every word is in vocab
    n = 0.5 * np.prod([Lneg[w] for w in doc])
    return "pos" if p > n else "neg", round(p,4), round(n,4)
print(classify(["great","movie"]))   # ('pos', 0.04, 0.005)

The real-world version is a short scikit-learn pipeline that vectorizes and classifies in one object:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

X = ["great movie", "terrible film", "loved it", "awful waste"]
y = ["pos", "neg", "pos", "neg"]
clf = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(X, y)
print(clf.predict(["a great film"]))   # ['pos']

And the modern, highest-accuracy route — a fine-tuned transformer behind a one-line Hugging Face pipeline:

from transformers import pipeline
sentiment = pipeline("sentiment-analysis")     # downloads a fine-tuned model
print(sentiment("This movie was not great at all."))
# [{'label': 'NEGATIVE', 'score': 0.99}]  <- handles the negation BoW misses

The trap in sentiment is negation and sarcasm: “not great” flips polarity but a bag-of-words model sees the positive word “great”. Word-order-aware models (LSTM, transformer) handle this far better, which is why fine-tuned transformers now dominate sentiment benchmarks. For evaluating these classifiers — accuracy, precision, recall, F1 — see Section 20.11 and Model Evaluation & Tuning.

Warning

Multiplying many small probabilities underflows to zero on real documents. Always work in log space: sum \(\log P(w\mid c)\) instead of multiplying. The toy example above is short enough to skip it, but production code must not.

20.8 — Topic modeling (LDA)

Sometimes you have a pile of documents and no labels, and you want to discover the hidden themes running through them. Imagine inheriting ten thousand unsorted news clippings and wanting to know, without reading them all, “what topics is this stack actually about?” — that is exactly the job. Topic modeling is this unsupervised task, and Latent Dirichlet Allocation (LDA) is its classic engine. The core idea is a generative story: pretend each document was written by a simple random process, then run it backward to infer the hidden structure.

LDA’s story for writing a document:

A document is a mixture of topics — e.g., 70% “sports”, 30% “finance”.
Each topic is a probability distribution over words — “sports” puts high probability on game, team, score; “finance” on market, stock, price.
To generate each word: pick a topic from the document’s mixture, then pick a word from that topic’s word distribution.

flowchart TD
  D[Document] --> M[Topic mixture<br/>70% sports, 30% finance]
  M --> P{pick topic<br/>per word}
  P --> S[Topic: sports<br/>game .3 team .25 score .2]
  P --> F[Topic: finance<br/>market .3 stock .25 price .2]
  S --> W1[word: 'team']
  F --> W2[word: 'stock']

Training inverts this story: given only the observed words, LDA infers (via Gibbs sampling or variational inference) the two hidden sets of distributions — each document’s topic mixture, and each topic’s word distribution. The Dirichlet prior in the name is a distribution-over-distributions that gently pushes mixtures to be sparse: most documents use few topics, most topics emphasize few words — which matches how real text behaves and keeps topics interpretable.

The output is human-readable. A topic is just its top words; you read them and name it:

Topic	Top words (by probability)	You label it
1	game, team, season, player, win	Sports
2	market, stock, price, bank, rate	Finance
3	film, movie, director, role, star	Cinema

Fitting LDA in scikit-learn is a few lines — and the top words per topic come straight out of the learned components:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["team won the game", "stock market price rose", "the film director won"]
counts = CountVectorizer().fit(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts.transform(docs))
terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[-3:][::-1]]
    print(f"topic {k}:", top)

LDA needs you to choose the number of topics \(K\) up front, ignores word order (it’s bag-of-words underneath), and gives “soft” topics you must interpret. Modern alternatives embed documents with transformers and cluster them (e.g., BERTopic), often producing cleaner topics — but LDA remains the interpretable, probabilistic baseline worth understanding.

Tip

LDA topics are discovered, not defined — there’s no guarantee they match the categories you care about. Treat the top-words list as a starting point for human naming, and try a few values of \(K\).

20.9 — Machine translation (statistical to neural seq2seq, BLEU)

Machine translation (MT) converts text from a source language to a target language. Its history is a clean illustration of NLP’s broader shift from hand-engineered statistics to end-to-end neural learning.

Statistical Machine Translation (SMT), dominant into the 2010s, decomposed translation with Bayes’ rule into a translation model (which target words/phrases correspond to source phrases, learned from aligned bilingual corpora) and a language model (does the target output read fluently?). Phrase-based SMT stitched together translated chunks and reordered them with complex heuristics — powerful but brittle, with many separately-tuned components.

Neural Machine Translation (NMT) replaced the whole pipeline with one network trained end-to-end: the sequence-to-sequence (seq2seq) architecture. The intuition: one network reads the whole source sentence and squeezes its meaning into a vector (like jotting the gist on a sticky note), and a second network writes the translation from that note, one word at a time. An encoder RNN reads the source sentence into a context vector; a decoder RNN generates the target sentence one word at a time, each step conditioned on what it has produced so far.

flowchart LR
  subgraph Encoder
    x1[Je] --> h1 --> h2
    x2[t'aime] --> h2 --> h3
  end
  h3 -->|context| d1
  subgraph Decoder
    d1[I] --> d2[love] --> d3[you]
  end

The early bottleneck: cramming an entire sentence into one fixed context vector lost information on long inputs — like summarizing a paragraph on a single sticky note, the longer the input the more falls off the edge. The fix — attention (Attention & Transformers) — lets the decoder look back at all encoder states and focus on the relevant source words at each step. Attention was invented for translation before it grew into the Transformer that now defines the field.

To measure translation quality automatically, the standard metric is BLEU (Bilingual Evaluation Understudy). It compares the machine output to one or more human reference translations by counting overlapping n-grams (shared 1-grams, 2-grams, 3-grams, 4-grams), then combining them. Two ingredients keep it honest:

Modified n-gram precision: a word can only be credited up to the number of times it appears in the reference (so spamming “the the the” doesn’t score).
Brevity penalty (BP): multiplies the score down if the candidate is shorter than the reference, preventing a one-word “high-precision” cheat.

\[ \text{BLEU} = \text{BP}\cdot\exp\!\Big(\sum_{n=1}^{4} w_n \log p_n\Big), \qquad \text{BP} = \min\!\big(1,\ e^{1 - r/c}\big) \]

In words: BLEU is the geometric average of how much the candidate’s 1-, 2-, 3-, and 4-word chunks overlap the reference, then shrunk by a penalty if the candidate came out too short. Also written: the exponential-of-log-sum is exactly a weighted geometric mean, \(\text{BLEU} = \text{BP}\cdot\prod_{n=1}^{4} p_n^{w_n}\), with equal weights \(w_n = 1/4\).

where \(p_n\) is the modified \(n\)-gram precision, \(w_n=1/4\), \(c\) is candidate length and \(r\) reference length. Worked unigram example — reference: “the cat is on the mat” (length 6); candidate: “the cat the mat” (length 4):

Unigram matches: “the” appears 2× in candidate but is capped at 2 (its reference count) → credited 2; “cat” → 1; “mat” → 1. Total clipped matches = 4 of 4 candidate words → \(p_1 = 4/4 = 1.0\).
Brevity penalty: \(c=4 < r=6\) → \(\text{BP} = e^{1 - 6/4} = e^{-0.5} \approx 0.607\).
So even with perfect unigram precision, BLEU is pulled down to ~0.61 by the brevity penalty — the metric punishes the too-short output.

In practice you call a library implementation (here sacrebleu, the de-facto standard for comparable scores) rather than re-deriving the clipping yourself:

import sacrebleu
refs = [["the cat is on the mat"]]                 # one reference set
sys  = ["the cat the mat"]                          # system output
print(sacrebleu.corpus_bleu(sys, refs).score)       # corpus-level BLEU (0-100)

BLEU is cheap and correlates with human judgment in aggregate, but it is surface-level: it rewards exact n-gram overlap and misses valid paraphrases (“big” vs. “large”). Newer metrics like METEOR (allows synonyms/stems) and embedding-based BERTScore (semantic similarity) address this; see Section 20.11.

Warning

BLEU is meaningful only as a corpus-level average against the same reference set. A single-sentence BLEU is noisy, and scores aren’t comparable across datasets or different numbers of references.

20.10 — Text generation & decoding (summarization, sampling)

Translation is one instance of a much broader family: conditional text generation (Generative Models), where a model produces an output sequence from some input. Summarization is the other everyday member — condensing a long document into a short one — and it splits into two styles. Extractive summarization copies the most important sentences verbatim (safe, never hallucinates, but choppy); abstractive summarization rewrites the gist in new words with a seq2seq or LLM (fluent, but can invent facts). The same encoder–decoder machinery from translation powers abstractive summarization; only the training data changes (document → summary instead of source → target).

Whatever the task, once the model gives you a probability distribution over the next token, you still have to choose a token — and that choice, the decoding strategy, matters as much as the model. The intuition: the model hands you a ranked menu of next words each step; decoding is your policy for ordering off that menu. Greedy always orders the top item; sampling rolls dice weighted by the menu; beam search keeps several half-finished plates and serves the best.

Greedy decoding: always take the single highest-probability token. Fast, but myopic — locking in the best word now can paint you into a corner later, and it produces bland, repetitive text.
Beam search: keep the \(k\) best partial sequences (“beams”) at every step and expand all of them, so a locally worse word that leads somewhere better can survive. Standard for translation and summarization where there is one “right” answer.
Sampling (temperature, top-\(k\), top-\(p\)/nucleus): draw the next token at random from the distribution. Temperature \(T\) flattens (\(T>1\), more surprising) or sharpens (\(T<1\), more conservative) it; top-\(k\) samples only from the \(k\) likeliest tokens; top-\(p\) samples from the smallest set whose probabilities sum to \(p\). This is what gives chatbots and story generators their creativity.

Temperature reshapes the distribution before sampling by dividing the logits \(z\) by \(T\):

\[ P(w) = \frac{\exp(z_w / T)}{\sum_{v}\exp(z_v / T)}. \]

In words: before turning scores into probabilities, divide every score by the temperature — a big \(T\) squashes the gaps so even unlikely words get a real shot (creative, wild), a small \(T\) exaggerates the gaps so the top word almost always wins (focused, safe). Also written: \(P(w) = \operatorname{softmax}(z / T)_w\); the limit \(T \to 0\) recovers greedy decoding (argmax), while \(T \to \infty\) approaches a uniform random pick.

The dial in motion: as temperature breathes up and down, the same logits go from a sharp single spike to a flat spread — sharp means “safe and predictable”, flat means “creative and risky”.

A concrete summarization example shows the extractive/abstractive split. Source: “The committee met on Friday. After a long debate, it approved the new budget.”

Style	Output
Extractive	“It approved the new budget.” (a sentence lifted as-is)
Abstractive	“Budget passed after debate.” (reworded, shorter)

In code, modern generation is one Hugging Face call, and the decoding knobs are just arguments:

from transformers import pipeline

summarizer = pipeline("summarization")           # abstractive, seq2seq under the hood
text = ("The committee met on Friday. After a long debate, "
        "it approved the new budget for the coming year.")
print(summarizer(text, max_length=20, min_length=5)[0]["summary_text"])

# decoding knobs on a generator (greedy vs. sampled):
gen = pipeline("text-generation", model="gpt2")
gen("Once upon a time", do_sample=True, temperature=0.8, top_p=0.9, max_new_tokens=20)

Tip

Match the decoder to the task: beam search when there is essentially one correct output (translation, summarization, code), sampling with temperature/top-p when you want diversity and voice (dialogue, story writing, brainstorming).

20.11 — Chatbots & dialogue (retrieval vs generative, intent/slot)

A dialogue system (chatbot) carries on a conversation with a user. Two architectural families, and a spectrum between them — think of it as a help-desk worker reading from a binder of pre-approved answers (retrieval) versus one improvising replies on the spot (generative):

Retrieval-based: the bot has a fixed library of candidate responses and selects the best-matching one for the user’s message (rank by similarity, return top answer). Safe and controllable — it never says anything off-script — but limited to its library and can feel rigid.
Generative: the bot writes a fresh response token-by-token with a language model (seq2seq or, today, an LLM). Fluent and flexible, but can hallucinate facts or go off the rails.

flowchart TD
  U[User message] --> K{Architecture}
  K -->|Retrieval| R[Match against<br/>response library] --> RB[Return best<br/>existing reply]
  K -->|Generative| G[LLM generates<br/>new tokens] --> GB[Novel reply]

Modern assistants blend both: Retrieval-Augmented Generation (RAG) retrieves relevant documents, then generates an answer grounded in them — combining retrieval’s factual control with generation’s fluency. The mental model: an open-book exam — the model looks up the relevant pages first, then writes the answer in its own words instead of reciting from memory. (The retrieval half of RAG is exactly the semantic search of Section 20.12.)

Task-oriented bots (booking a flight, ordering food) traditionally run a Natural Language Understanding (NLU) front end built on two jobs:

Intent classification: what does the user want? (“book_flight”, “check_weather”) — a text-classification problem (Section 20.6).
Slot filling: extract the parameters the intent needs — a sequence-labeling problem (Section 20.5).

Worked example. User: “Book a flight to Tokyo on Friday.”

	Result
Intent	`book_flight`
Slot: destination	Tokyo
Slot: date	Friday

With the intent and filled slots in hand, the dialogue manager decides the next action — ask for the missing departure city, confirm, or call the booking API. If a slot is empty, the bot asks a follow-up; this slot-tracking loop is what gives task bots their structured, goal-directed feel.

flowchart LR
  M[User utterance] --> I[Intent classifier]
  M --> S[Slot filler]
  I --> DM[Dialogue manager]
  S --> DM
  DM -->|slot missing| Q[Ask follow-up]
  DM -->|all slots filled| A[Call API / respond]

LLM-based assistants increasingly fold intent and slot extraction into a single prompt (often emitting structured JSON), but the intent/slot decomposition is still the clearest mental model for what a task bot must figure out. In practice this looks like asking an LLM to return the parse as JSON:

# Sketch: modern LLM-as-NLU — one prompt replaces the intent + slot models
prompt = '''Extract intent and slots as JSON.
User: "Book a flight to Tokyo on Friday."'''
# LLM returns:
# {"intent": "book_flight", "slots": {"destination": "Tokyo", "date": "Friday"}}

For the LLM techniques behind modern chatbots — instruction tuning, RAG, function calling — see Large Language Models.

Tip

Choose by risk tolerance: retrieval (or RAG) when wrong answers are costly and must stay on-brand; pure generation when open-ended fluency matters more than guaranteed correctness.

20.12 — Evaluation metrics

NLP spans classification, sequence labeling, generation, and ranking, and each needs its own yardstick. Choosing the wrong metric is one of the most common ways an NLP project quietly fails. Here is the map.

For classification & sequence labeling (sentiment, NER, intent), the core metrics come from the confusion matrix (Model Evaluation & Tuning): precision = of what I labeled positive, how much was right; recall = of the true positives, how many I caught; F1 = their harmonic mean.

\[ P = \frac{TP}{TP+FP}, \quad R = \frac{TP}{TP+FN}, \quad F_1 = \frac{2PR}{P+R}. \]

In words: precision asks “when I shouted ‘positive’, how often was I right?”; recall asks “of all the real positives out there, how many did I find?”; F1 blends the two so you can’t win by acing one and tanking the other. Also written: \(F_1 = \left(\dfrac{P^{-1} + R^{-1}}{2}\right)^{-1}\) — the harmonic mean of precision and recall, equivalently \(\dfrac{2TP}{2TP + FP + FN}\).

For NER, F1 is computed at the entity (span) level — a prediction counts as correct only if the entire span and its type match, which is stricter than per-token accuracy.

For generation tasks, exact-match accuracy is meaningless (many correct phrasings exist), so we use overlap or similarity metrics:

Metric	Used for	What it measures	Key weakness
Perplexity	language modeling	how well next-word is predicted	not comparable across tokenizers
BLEU	translation	n-gram precision + brevity penalty	misses paraphrases
ROUGE	summarization	n-gram recall vs. reference	surface overlap only
METEOR	translation	matches synonyms/stems too	needs linguistic resources
BERTScore	generation	embedding (semantic) similarity	needs a model, less interpretable

The split between BLEU and ROUGE is worth internalizing: translation cares about precision (don’t add wrong words → BLEU), while summarization cares about recall (don’t miss key content → ROUGE). They are mirror images of the same n-gram idea.

The deepest caution: all automatic generation metrics are proxies. A fluent, correct answer that happens to share few n-grams with the reference scores poorly under BLEU/ROUGE. That is why serious NLP evaluation still leans on human evaluation (rating fluency, adequacy, helpfulness) and, increasingly, LLM-as-a-judge — using a strong model to score outputs. Automatic metrics are for fast iteration; human judgment is the ground truth.

Warning

Never report a single metric on an imbalanced task. A spam classifier that labels everything “ham” can hit 95% accuracy while catching zero spam — its recall is 0. Always pair accuracy with precision/recall or F1, and report on a held-out test set (Model Evaluation & Tuning).

20.13 — Semantic search & vector databases

TF-IDF (Section 20.2) finds documents that share the exact words of your query — search “automobile” and a page that only says “car” is invisible to it. Semantic search fixes this by matching on meaning instead of surface words: embed the query and every document into the same vector space (Section 20.3), then return the documents whose vectors point most nearly in the query’s direction. The intuition: instead of matching keywords, you place every document as a pin on a map of meaning and grab the pins nearest to where the query lands — synonyms and paraphrases sit close together, so “how do I reset my password” finds a doc titled “account recovery steps.”

The score is the same cosine similarity from Section 20.3, now between a query vector \(q\) and a document vector \(d\):

\[ \text{score}(q, d) = \frac{q \cdot d}{\lVert q\rVert\,\lVert d\rVert}. \]

In words: rank documents by how closely their meaning-vector points the same way as the query’s meaning-vector; the smaller the angle, the more relevant. Also written: if the vectors are pre-normalized to unit length, this collapses to a plain dot product, \(\text{score}(q,d) = \hat{q}\cdot\hat{d}\) — which is why production systems normalize once and store \(\hat{d}\).

The modern document embedder is not word2vec but a sentence/passage encoder (e.g., a Sentence-BERT model) that maps a whole chunk of text to one vector. The flow is: embed your corpus once offline, store the vectors, then at query time embed the query and find its nearest neighbors.

flowchart LR
  subgraph Offline indexing
    C[Documents] --> E1[Embed each<br/>passage] --> V[(Vector index)]
  end
  Q[User query] --> E2[Embed query] --> NN[Nearest-neighbor<br/>search in V]
  V --> NN
  NN --> R[Top-k relevant<br/>passages]

Scanning every vector is fine for thousands of documents but hopeless for millions. Vector databases (FAISS, Pinecone, Milvus, pgvector) solve this with approximate nearest neighbor (ANN) indexes — structures like HNSW graphs that find almost the closest vectors in near-constant time, trading a sliver of recall for an enormous speedup. This indexed semantic search is the retrieval half of RAG (Section 20.10): the documents it returns become the grounding an LLM writes its answer from.

A minimal end-to-end semantic search, embedding with sentence-transformers and searching with FAISS:

from sentence_transformers import SentenceTransformer
import faiss, numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["how to reset your password", "store hours and locations", "refund policy details"]
emb = model.encode(docs, normalize_embeddings=True)        # unit vectors

index = faiss.IndexFlatIP(emb.shape[1])                    # inner product = cosine (normalized)
index.add(np.asarray(emb))

q = model.encode(["I forgot my login"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(q), k=1)
print(docs[ids[0][0]])     # -> 'how to reset your password'  (no shared keywords!)

Tip

Hybrid search — combining keyword (BM25/TF-IDF) and semantic (vector) retrieval, then merging the rankings — usually beats either alone: keywords nail exact terms and rare names, embeddings catch paraphrases and synonyms.

20.14 — Bias, fairness & safety in NLP

It is tempting to treat a model’s output as neutral arithmetic, but a language model is a mirror of its training text — and that text carries every prejudice, stereotype, and gap of the people who wrote it. Garbage in, garbage out has a social edge here: bias in, bias out. Before deploying any NLP system that touches people, you have to ask not just “is it accurate?” but “is it fair, and is it safe?”

The clearest illustration is hiding inside the very embeddings of Section 20.3. Because analogies are vector arithmetic, you can ask the model to complete *“man is to computer-programmer as woman is to ___“* — and classic word2vec/GloVe vectors famously answered “homemaker”. The geometry didn’t invent that; it faithfully captured a skew in the corpus. The same mechanism that gives us \(\vec{king}-\vec{man}+\vec{woman}\approx\vec{queen}\) also encodes occupational and racial stereotypes.

Where bias creeps in, and what to do:

Data bias — the corpus over-represents some groups, dialects, or viewpoints. Mitigation: audit and rebalance training data; document it (a “datasheet” / “model card”).
Representational bias — embeddings and models associate identities with stereotypes. Mitigation: debiasing techniques, counterfactual data augmentation (swap “he”↔︎“she” and retrain), bias probes.
Allocational harm — a biased classifier makes unfair decisions (résumé screening, loan-related text, content moderation that silences a dialect). Mitigation: evaluate metrics per subgroup, not just in aggregate.

A small worked check — subgroup evaluation. Suppose a toxicity classifier reports 92% accuracy overall. Split the test set by the dialect of the comment:

Group	Accuracy	False-positive rate (flagged as toxic, but wasn’t)
Group A	94%	4%
Group B	88%	19%

The headline 92% hid a real harm: Group B’s harmless comments are flagged nearly 5× as often. Aggregate metrics can pass while a subgroup quietly suffers — which is exactly the failure mode fairness evaluation exists to catch.

Beyond fairness sit two more safety concerns that any generative deployment must plan for:

Hallucination — generative models state false things fluently and confidently. Grounding with retrieval (RAG, Section 20.10) and citing sources is the main defense.
Toxicity & misuse — models can produce harmful, hateful, or unsafe text, and can be prompted to. Safety filters, content moderation, and alignment training (covered in Large Language Models) are the guardrails.

Warning

A model that is accurate on average can still be unfair to a subgroup. Fairness is not a single number — slice every evaluation by the protected groups your system will affect, and look at error types (false positives vs. false negatives), not just overall accuracy.

20.15 — Evaluation metrics in code

Having mapped the metrics conceptually (Section 20.11), it helps to see them computed with real tooling rather than by hand — in practice you never re-derive precision or BLEU from scratch.

# Classification / sequence-labeling metrics with scikit-learn
from sklearn.metrics import precision_recall_fscore_support, classification_report

y_true = ["pos", "neg", "pos", "pos", "neg"]
y_pred = ["pos", "neg", "neg", "pos", "neg"]
p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
print(classification_report(y_true, y_pred))   # per-class P/R/F1 in one table

For NER, the seqeval library scores at the entity-span level (the strict version from Section 20.11), and for generation, libraries wrap BLEU/ROUGE/BERTScore so you pass strings and get scores:

# ROUGE for summarization (recall-oriented n-gram overlap)
from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
s = scorer.score("the cat sat on the mat", "a cat was on the mat")
print(s["rougeL"].fmeasure)   # longest-common-subsequence F-measure

The point is reflexive metric hygiene: pick the metric that matches the task (Section 20.11), compute it with a trusted library, slice it by subgroup (Section 20.13), and only then trust the number.

20.16 — Quick reference

Term / formula	One-line meaning	When / why
Tokenization	Slice text into tokens (word / subword / char)	Always first; subword (BPE/WordPiece) for neural models
BPE	Greedily merge the most frequent adjacent token pair	Builds a subword vocab that has no out-of-vocabulary words
Bag of Words	Count each vocab word, ignore order	Sparse baseline vector for search / linear classifiers
TF-IDF \(=\text{tf}\cdot\text{idf}\)	Weight up words frequent here but rare across the corpus	Surfaces a document’s distinctive words; needs many docs
n-gram	Count adjacent runs of \(n\) words	Cheaply puts a little word-order back into BoW
Word embedding	Dense vector; similar-context words land near each other	Captures meaning; analogies become vector arithmetic
Cosine similarity \(\frac{a\cdot b}{\lVert a\rVert\lVert b\rVert}\)	Angle between two vectors, ignoring length	Standard way to compare embeddings / rank search hits
Language model \(P(w_i\mid w_{<i})\)	Probability of the next word given history	Powers autocomplete, ASR, translation, LLMs
Perplexity \(\exp(-\frac1N\sum\ln P)\)	Effective number of choices per word; lower better	Scores an LM; only comparable within one tokenizer
Smoothing (add-1)	Give unseen pairs a starter count so \(P\neq 0\)	Stops a single unseen n-gram zeroing a sentence
BIO / CRF	Span tags + label-transition scoring	Sequence labeling (NER, POS) with legal transitions
Naive Bayes	Each word votes independently via Bayes’ rule	Fast, interpretable sentiment/text-classification baseline
LDA	Docs = mixtures of topics, topics = word distributions	Unsupervised topic discovery on unlabeled corpora
seq2seq + attention	Encoder reads source, decoder writes target	Neural translation/summarization; attention beats the bottleneck
BLEU \(=\text{BP}\cdot\prod p_n^{w_n}\)	n-gram precision \(\times\) brevity penalty	Corpus-level MT metric; misses paraphrases
Decoding	Policy for choosing the next token	Beam for one-right-answer; temperature/top-p for creativity
Temperature \(\text{softmax}(z/T)\)	Flatten (\(T{>}1\)) or sharpen (\(T{<}1\)) the distribution	Tunes randomness of sampled generation
RAG	Retrieve relevant docs, then generate grounded answer	Combines retrieval’s facts with generation’s fluency
Semantic search	Embed query + corpus, rank by cosine	Matches meaning, not keywords; scale with vector DBs / ANN
F1 \(\frac{2PR}{P+R}\)	Harmonic mean of precision and recall	Default labeling/classification metric on imbalanced data

20.17 — Key takeaways

Text must become numbers first. Tokenization (increasingly subword BPE/WordPiece) plus optional stemming/lemmatization/stop-word handling is the unavoidable front door of every NLP pipeline.
Representations climbed a ladder of meaning: raw counts (BoW) → frequency-weighted counts (TF-IDF) → dense semantic vectors (word2vec/GloVe) → contextual vectors (transformers); n-grams add a little local word-order back to the order-blind models.
Embeddings encode meaning geometrically — similar words sit close, and analogies become vector arithmetic (\(\vec{king}-\vec{man}+\vec{woman}\approx\vec{queen}\)) — which also means they inherit and reflect the biases in their training text, and power semantic search (embed query + corpus, rank by cosine, scale with vector databases).
A language model predicts the next token; n-grams count, neural LMs generalize, and perplexity scores prediction (lower = better, but tokenizer-dependent).
Generation needs a decoding strategy: greedy and beam search for one-right-answer tasks (translation, summarization), temperature/top-p sampling for creative, diverse output.
Sequence labeling (POS, NER with BIO tags) needs label-transition structure (CRF) on top of per-token scores; classification (sentiment, intent) maps a whole document to a category.
Translation went statistical → neural seq2seq → attention, evaluated by BLEU (n-gram precision + brevity penalty); summarization splits into extractive (copy) vs. abstractive (rewrite).
Dialogue spans retrieval (safe, fixed) to generative (fluent, risky), with RAG blending them; task bots decompose into intent + slots.
Match the metric to the task: F1 for labeling, BLEU for translation, ROUGE for summarization, and human/LLM judgment as the real ground truth — and always slice metrics by subgroup to catch fairness failures aggregate numbers hide.

20.18 — See also

Recurrent & Sequence Models — the RNN/LSTM backbone behind seq2seq, BiLSTM tagging, and neural LMs.
Attention & Transformers — contextual embeddings and the architecture that now dominates NLP.
Large Language Models — instruction tuning, RAG, function calling, and alignment/safety that power modern chatbots.
Generative Models — the broader family of sequence and text generation.
Speech & Audio Processing — turning audio into text that feeds these NLP pipelines.
Model Evaluation & Tuning — precision/recall/F1, confusion matrices, and cross-validation in depth.
Dimensionality Reduction — the embedding and matrix-factorization ideas underlying word vectors and topic models.
Multimodal AI — combining language with vision and other modalities.

↪ The thread continues → Chapter 21 · 🔊 Speech & Audio Processing

Language lives not only in text but in sound. Speech and audio extend our sequence models to waveforms — recognition and synthesis.

📖 All chapters | ← 19 · 👁️ Computer Vision | 21 · 🔊 Speech & Audio Processing →