Kader Mohideen
  • About
  • Blog
  • Projects
  • Health
  • AI & ML Encyclopedia
  • Extra
    • Interview Guide
    • AI Interview Prep
    • Book References
    • Quest for AGI
    • AI Papers
    • Lupus

In this chapter

  • 43.1 The inverted index and Boolean retrieval
  • 43.2 The vector-space model and TF-IDF
  • 43.3 BM25 and BM25F
  • 43.4 Query expansion and relevance feedback
  • 43.5 Evaluating retrieval: precision, recall, MAP, nDCG
  • 43.6 Learning to rank and LambdaMART
  • 43.7 Neural IR: dense retrieval, late interaction, learned sparse
  • 43.8 Link analysis: PageRank and HITS
  • 43.9 Near-duplicate detection: shingling, MinHash, LSH
  • 43.10 Mining streams with sublinear memory
  • 43.11 Frequent itemsets and the PCY refinement
  • 43.12 Submodular optimization for coverage and influence
  • 43.13 CUR decomposition: interpretable low-rank approximation
  • 43.1 — Putting it together: index-and-retrieve
  • 43.2 — Quick reference
  • 43.3 — Key takeaways
  • 43.4 — See also

Chapter 43 — 🔎 Information Retrieval & Data Mining

📖 All chapters  |  ← 42 · 📐 Learning Theory  |  44 · 🏗️ LLM Systems: Building LLMs from Scratch →

📚 Jump to any chapter

🧮 Mathematical Foundations

  • 01 · 🧮 Linear Algebra
  • 02 · ∂ Calculus & Differentiation
  • 03 · 📉 Optimization
  • 04 · 🎲 Probability & Statistics

🧭 The ML Workflow

  • 05 · 🌐 AI, ML & the Learning Process
  • 06 · 🧹 Data Preprocessing
  • 07 · 🗜️ Dimensionality Reduction

🧩 Classical Machine Learning

  • 08 · 📈 Regression
  • 09 · 📐 Classification Algorithms
  • 10 · 🌳 Ensemble Methods
  • 11 · 🔮 Clustering & Unsupervised Learning
  • 12 · 🎯 Model Evaluation & Tuning

🎲 Probabilistic Models

  • 13 · 🕸️ Probabilistic Graphical Models

🧠 Deep Learning

  • 14 · 🧠 Neural Networks (Core)
  • 15 · 🖼️ Convolutional Neural Networks
  • 16 · 🔁 Recurrent & Sequence Models
  • 17 · ⚡ Attention & Transformers
  • 18 · 🎨 Generative Models

🗣️ Applied AI: Vision, Language, Audio & Time

  • 19 · 👁️ Computer Vision
  • 20 · 💬 Natural Language Processing
  • 21 · 🔊 Speech & Audio Processing
  • 22 · ⏳ Time Series & Forecasting
  • 23 · 📚 Large Language Models
  • 24 · 🌈 Multimodal AI

🕹️ Reinforcement Learning

  • 25 · 🕹️ Reinforcement Learning

🛠️ Applied ML Systems & Industries

  • 26 · 🛒 Recommender Systems
  • 27 · 🚨 Anomaly & Fraud Detection
  • 28 · 🏦 ML Across Industries

🚀 Production, Tooling & Infrastructure

  • 29 · 🔧 MLOps & Deployment
  • 30 · 🚀 AI Infrastructure & Efficient Inference
  • 31 · 🧰 Tools & Frameworks

📚 Classical & Symbolic AI

  • 32 · 🧭 Search & Problem Solving
  • 33 · 📖 Knowledge Representation & Reasoning
  • 34 · 🗺️ Planning, Constraint Satisfaction & Game Playing
  • 35 · 🧬 Evolutionary Computation & Metaheuristics

⚖️ Responsible AI & Frontier

  • 36 · 🔍 Explainable AI & Interpretability
  • 37 · 🧷 Causal Inference
  • 38 · ⚖️ AI Ethics, Fairness & Safety
  • 39 · 🌠 Frontier & Emerging Directions

🎓 Advanced & Specialized Topics

  • 40 · 🔗 Graph Machine Learning
  • 41 · 🤖 Robotics & Autonomy
  • 42 · 📐 Learning Theory
  • 43 · 🔎 Information Retrieval & Data Mining
  • 44 · 🏗️ LLM Systems: Building LLMs from Scratch

🎚️ Post-Training & Fine-Tuning

  • 45 · 🎚️ Post-Training I — Transfer, Fine-Tuning & PEFT
  • 46 · 🏅 Post-Training II — Alignment & Evaluation

🚢 Model Serving & Deployment

  • 47 · 🚢 Model Serving & Deployment in Production

How do you find a handful of relevant documents among billions, or count distinct visitors to a website without storing every ID? Information retrieval (IR) and data mining answer two halves of the same problem: matching a query against a corpus, and extracting structure from data too large to fit in memory. This chapter walks from the classical inverted index through neural retrieval, then turns to the sublinear-memory algorithms that make web-scale mining possible.

🧭 In context: search & large-scale mining · find relevant documents and summarize massive streams · build an index once, then rank candidates cheaply and approximate what you cannot store exactly

💡 Remember this: retrieval and mining at scale are the same trick — build a cheap index or sketch offline so that online you only ever touch a tiny, relevant slice of the data, never the whole thing.

43.1 The inverted index and Boolean retrieval

The naive way to search a corpus is to scan every document for the query terms. At web scale that is hopeless. The fix is an inverted index: instead of mapping documents to their words (the forward index), you map each word to the list of documents that contain it. That per-term list is called a postings list.

Think of the index at the back of a textbook. You do not read every page looking for “entropy”; you flip to the index, find “entropy → 142, 197, 301”, and jump straight there. The inverted index is exactly this, built for every term in the vocabulary.

Here is the forward-to-inverted flip as a picture — documents fan out into words, then the arrows reverse so each word gathers its documents:

forward inverted doc 1 doc 2 doc 3 catsatdog catsatdog 1, 3 1, 2 2

Tiny worked example. Three documents:

docID text
1 the cat sat
2 the dog sat
3 the cat ran

After lowercasing and tokenizing, the dictionary maps each term to a postings list of docIDs:

term postings
the 1, 2, 3
cat 1, 3
sat 1, 2
dog 2
ran 3

A Boolean query combines terms with AND, OR, NOT by intersecting, unioning, or differencing postings lists. The query cat AND sat intersects {1,3} with {1,2} to return {1}. Because postings are kept sorted by docID, intersection is a linear merge — walk two pointers, advance the smaller, emit on a match — in \(O(|A|+|B|)\) rather than \(O(|A|\cdot|B|)\).

Here is the two-pointer merge as a picture — two sorted lists, fingers walking down, emitting only where they meet:

A: cat B: sat 1 3 1 2 1 = 1 → emit ✓ then 3 vs 2: advance smaller
# merge-intersect two sorted postings lists
def intersect(a, b):
    i = j = 0; out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]: out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]: i += 1          # advance the smaller
        else: j += 1
    return out

assert intersect([1,3], [1,2]) == [1]     # cat AND sat -> doc 1

Index construction at scale uses block-based sort-merge: parse documents into (term, docID) pairs, sort, and group. When the corpus exceeds memory, sort blocks on disk and merge them (BSBI/SPIMI). The dictionary itself is kept in memory; postings live on disk.

In words: the formula for the cost of intersection, \(O(|A|+|B|)\), says we touch each posting in each list at most once — there is no nested loop comparing every pair. Also written: \(O(n)\) where \(n = |A| + |B|\) is the total length of both postings lists.

Tip

When intersecting many terms, start with the shortest postings list (the rarest term) and intersect against it. The result can only shrink, so each later merge runs over a smaller list — the rare term does the pruning for free.

Postings compression

Postings lists are huge, so we compress them. Two ideas do most of the work. First, store gaps (delta encoding) instead of absolute docIDs: a postings list 824, 829, 1001 becomes 824, 5, 172. Gaps are small for frequent terms, and small numbers compress well. Second, encode those gaps with a variable-byte or bit-level code (Elias-gamma, or modern SIMD-friendly schemes like PForDelta) so a gap of 5 costs one byte, not four.

Tip

Gap encoding works because a sorted list of increasing integers has small successive differences. The more frequent the term, the denser the postings, the smaller the gaps — so the most expensive lists compress the best.

Phrase and proximity queries: positional indexes

Boolean cat AND sat finds documents with both words anywhere. But searching for the phrase “new york” should not match a document about a “new park in york.” The fix is a positional index: each posting stores not just the docID but the list of positions where the term occurs.

The intuition: knowing a word is present is not enough; you need to know where so you can check that query words sit next to each other. Extend the postings to term → docID: [pos1, pos2, ...]:

term postings (docID: positions)
new 1: [0, 17], 4: [3]
york 1: [1], 4: [9]

A phrase query “new york” intersects the docID lists as before, then for each shared doc checks whether some position of “york” equals a position of “new” plus one. In doc 1, “new” at 0 and “york” at 1 → adjacent → match. In doc 4, “new” at 3 and “york” at 9 → 6 apart → no phrase match. A proximity query (“new” NEAR/3 “york”) relaxes “+1” to “within \(k\).” Positional indexes are larger — every occurrence is stored — but they unlock phrase, proximity, and the field-position features that downstream rankers love.

43.2 The vector-space model and TF-IDF

Boolean retrieval returns a set — a document either matches or it does not — with no notion of which match is best. Real search needs ranking. The vector-space model treats each document and the query as a vector in a space with one dimension per vocabulary term, then ranks documents by how close their vector points to the query’s.

The everyday picture: imagine every document as an arrow in a giant space where each axis is a word. Two documents about the same topic point in nearly the same direction, even if one is a paragraph and the other a book. Ranking is just measuring the angle between the query’s arrow and each document’s arrow.

Here are three documents as arrows on two word-axes — same topic means same direction, no matter the length:

“cat” “dog” doc A (cat-heavy) doc B (same mix, shorter) doc C (dog-heavy) small angle = similar

The weight on each term needs to capture two intuitions:

  • A term that appears often in a document is more important to that document → term frequency \(\text{tf}_{t,d}\).
  • A term that appears in almost every document (“the”, “is”) carries little discriminating power → down-weight by document frequency.

This gives TF-IDF. With \(N\) documents and \(\text{df}_t\) documents containing term \(t\):

\[ \text{idf}_t = \log \frac{N}{\text{df}_t}, \qquad w_{t,d} = (1 + \log \text{tf}_{t,d}) \cdot \text{idf}_t \]

In words: a term’s weight in a document is how much it shows up here (with diminishing returns from the log) times how rare it is across the whole corpus (the idf). Also written: \(w_{t,d} = (1 + \log \text{tf}_{t,d}) \cdot \log(N / \text{df}_t)\), i.e. substitute the idf definition directly into the weight.

The log on tf reflects diminishing returns (the 10th occurrence adds less than the 2nd); the idf log dampens the dominance of common terms.

Documents are ranked by cosine similarity between query and document vectors:

\[ \cos(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q} \cdot \mathbf{d}}{\|\mathbf{q}\|\,\|\mathbf{d}\|} \]

In words: line up the two vectors and ask how aligned are they? — 1 means same direction (identical term mix), 0 means no shared terms; the denominator cancels out document length. Also written: \(\cos(\mathbf{q},\mathbf{d}) = \hat{\mathbf{q}} \cdot \hat{\mathbf{d}}\) where \(\hat{\mathbf{q}} = \mathbf{q}/\|\mathbf{q}\|\) and \(\hat{\mathbf{d}} = \mathbf{d}/\|\mathbf{d}\|\) are the unit-normalized vectors — cosine is just the dot product of normalized vectors.

Cosine, not raw dot product, because we want direction (which terms, in what proportion) not magnitude (a long document should not win just for being long).

Worked example. Corpus of \(N=3\) docs from §43.1. The term “cat” appears in 2 docs, so \(\text{idf}_{\text{cat}} = \log(3/2) \approx 0.18\). The term “dog” appears in 1 doc, so \(\text{idf}_{\text{dog}} = \log(3/1) \approx 0.48\) — “dog” is rarer, hence more informative, hence weighted higher. The term “the” appears in all 3, so \(\text{idf}_{\text{the}} = \log(3/3) = 0\) — it vanishes from ranking entirely, exactly as it should.

tf weight (1 + log tf) — diminishing returns raw tf — linear, unbounded

Doing it with scikit-learn. In practice you never build TF-IDF by hand — TfidfVectorizer builds the term–document matrix and cosine_similarity ranks:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat", "the dog sat", "the cat ran"]
vec = TfidfVectorizer()                 # lowercases + tokenizes for you
X = vec.fit_transform(docs)             # sparse (n_docs x n_terms) TF-IDF matrix
q = vec.transform(["cat ran"])          # query uses the SAME vocabulary
scores = cosine_similarity(q, X).ravel()
ranking = scores.argsort()[::-1]        # doc 3 ("the cat ran") ranks first
print(ranking, scores.round(2))

43.3 BM25 and BM25F

TF-IDF works, but its tf term grows without bound and it ignores document length crudely. BM25 (Best Match 25) is the probabilistic-retrieval refinement that has been the production default for two decades. It fixes two things: tf saturation and length normalization.

For query \(q\) with terms \(t\), document \(d\) of length \(|d|\), average document length \(\text{avgdl}\):

\[ \text{BM25}(q,d) = \sum_{t \in q} \text{idf}_t \cdot \frac{\text{tf}_{t,d}\,(k_1 + 1)}{\text{tf}_{t,d} + k_1\!\left(1 - b + b\,\dfrac{|d|}{\text{avgdl}}\right)} \]

In words: for each query term, score how rare it is times a saturating function of how often it appears here, with a penalty for the document being longer than average; sum those over the query terms. Also written: \(\text{BM25}(q,d) = \sum_{t \in q} \text{idf}_t \cdot \dfrac{\text{tf}_{t,d}}{\text{tf}_{t,d}/(k_1+1) + K/(k_1+1)}\) with \(K = k_1(1 - b + b\,|d|/\text{avgdl})\) — the same fraction, scaled so the saturation ceiling reads off directly.

Two intuitions live in the formula:

  • Saturation via \(k_1\) (typically 1.2–2.0): the fraction approaches the ceiling \(k_1+1\) as tf grows, so the 20th occurrence of a term barely beats the 5th. Raw TF-IDF would keep rewarding it linearly.
  • Length normalization via \(b\) (typically 0.75): the \(b\cdot|d|/\text{avgdl}\) term penalizes long documents, because a long document accumulates term occurrences just by being long. \(b=0\) disables it; \(b=1\) fully normalizes.

Here is saturation as a picture — BM25’s curve climbs fast, then flattens at its ceiling, while raw TF-IDF keeps rising forever:

tf → score ceiling k₁+1 BM25 — saturates raw tf — unbounded

The idf in BM25 is the probabilistic form \(\log\frac{N - \text{df}_t + 0.5}{\text{df}_t + 0.5}\).

Worked example. Take \(k_1 = 1.2\), \(b = 0.75\), a term with \(\text{idf}=2.0\), and a document of average length (\(|d| = \text{avgdl}\), so the bracket is just \(1\)).

tf tf-contribution \(\frac{\text{tf}(k_1+1)}{\text{tf}+k_1}\) × idf
1 \(\frac{1\cdot2.2}{1+1.2}=1.00\) 2.00
3 \(\frac{3\cdot2.2}{3+1.2}=1.57\) 3.14
10 \(\frac{10\cdot2.2}{10+1.2}=1.96\) 3.93
100 \(\frac{100\cdot2.2}{100+1.2}=2.17\) 4.35

Going from tf=10 to tf=100 — a 10× jump — barely moves the score (3.93 → 4.35). That is saturation doing its job; TF-IDF’s \((1+\log\text{tf})\) would have grown much faster and unboundedly.

Using a real engine. Production search uses BM25 out of the box. With the rank-bm25 library you can score a toy corpus in a few lines:

from rank_bm25 import BM25Okapi
corpus = [d.split() for d in ["the cat sat", "the dog sat", "the cat ran"]]
bm25 = BM25Okapi(corpus, k1=1.2, b=0.75)     # builds df, avgdl, idf internally
print(bm25.get_scores("cat ran".split()))    # one BM25 score per document
# Lucene / Elasticsearch / OpenSearch use BM25 as the default similarity.

BM25F extends this to structured (fielded) documents — title, body, anchor text. Instead of scoring fields separately and summing (which double-counts saturation), BM25F combines field term frequencies into a single weighted pseudo-frequency before applying saturation:

\[ \tilde{\text{tf}}_{t,d} = \sum_{f} w_f \cdot \frac{\text{tf}_{t,f}}{1 - b_f + b_f\,\frac{|d_f|}{\text{avgdl}_f}} \]

In words: before saturating, build one combined term-frequency that adds up the term’s count in each field — title, body, anchors — weighted by how much that field matters and length-normalized per field. Also written: \(\tilde{\text{tf}}_{t,d} = \sum_f w_f\, \text{tf}_{t,f} / B_f\) where \(B_f = 1 - b_f + b_f\,|d_f|/\text{avgdl}_f\) is the per-field length normalizer.

then feeds \(\tilde{\text{tf}}\) through the BM25 saturation. The point: a term in the title (\(w_{\text{title}}\) large) counts more than the same term buried in the body, with per-field length normalization \(b_f\).

Warning

Don’t run BM25 per field and add the scores. Saturation is nonlinear, so summing per-field BM25 lets a term that appears a little in every field rack up a high total. BM25F deliberately pools the frequencies first, then saturates once.

43.4 Query expansion and relevance feedback

A query is a few words; the relevant documents may phrase the idea differently. Query expansion widens the query with related terms to bridge that gap — the lightweight cousin of the neural vocabulary-mismatch fixes in §43.7.

The everyday picture: you search “car insurance,” and the engine quietly also looks for “auto,” “vehicle,” “coverage,” because it has learned those travel together. You get the documents you meant, not just the ones that used your exact words.

Relevance feedback uses the user’s (or a model’s) judgment of the first results to refine the query. The classic Rocchio update, in the vector-space model, pushes the query vector toward documents marked relevant and away from non-relevant ones:

\[ \mathbf{q}_{\text{new}} = \alpha\,\mathbf{q}_{0} + \beta\,\frac{1}{|D_r|}\sum_{\mathbf{d}\in D_r}\mathbf{d} \;-\; \gamma\,\frac{1}{|D_{nr}|}\sum_{\mathbf{d}\in D_{nr}}\mathbf{d} \]

In words: the new query is the old query, plus a nudge toward the average relevant document, minus a nudge away from the average irrelevant one. Also written: \(\mathbf{q}_{\text{new}} = \alpha\,\mathbf{q}_0 + \beta\,\bar{\mathbf{d}}_r - \gamma\,\bar{\mathbf{d}}_{nr}\), where \(\bar{\mathbf{d}}_r\) and \(\bar{\mathbf{d}}_{nr}\) are the centroids of the relevant and non-relevant sets.

When no human is in the loop, pseudo-relevance feedback (blind feedback) just assumes the top-\(k\) results are relevant, expands from their frequent terms, and re-runs — often a free quality bump, at the risk of query drift if the top-\(k\) were off-topic. Modern systems add synonym expansion (WordNet, learned embeddings) and spelling/normalization; LLM-based query rewriting (HyDE: generate a hypothetical answer document, then retrieve with its embedding) is the current incarnation of the same idea.

q₀ relevant non-rel q_new Rocchio pulls the query toward relevant, away from non-relevant

43.5 Evaluating retrieval: precision, recall, MAP, nDCG

You cannot improve a ranker you cannot measure. IR evaluation needs relevance judgments — for each query, which documents are relevant — then scores how well the ranking surfaces them.

Precision and recall are the base. For a returned set:

\[ \text{Precision} = \frac{\text{relevant retrieved}}{\text{retrieved}}, \qquad \text{Recall} = \frac{\text{relevant retrieved}}{\text{all relevant}} \]

In words: precision is how clean was what I showed; recall is how much of the good stuff did I catch. Also written: with true positives \(tp\), false positives \(fp\), false negatives \(fn\): \(\text{Precision} = tp/(tp+fp)\), \(\text{Recall} = tp/(tp+fn)\).

Precision asks “of what I showed, how much was good?”; recall asks “of all the good stuff, how much did I find?”. (See Model Evaluation for the classification view.) But ranking quality is about order, so we need order-aware metrics.

Average Precision (AP) rewards putting relevant documents early. Compute precision at each rank where a relevant document appears, then average. MAP is the mean of AP across queries.

Worked AP. A query has 3 relevant documents in the corpus. The ranker returns 6 results; relevant ones are marked ✓:

rank 1 2 3 4 5 6
relevant? ✓ ✗ ✓ ✗ ✗ ✓
precision@rank 1/1 — 2/3 — — 3/6

\[ \text{AP} = \frac{1.0 + 0.667 + 0.5}{3} = 0.722 \]

We divide by 3 (total relevant), so missing a relevant document entirely is penalized — it contributes 0 to the sum.

nDCG (normalized Discounted Cumulative Gain) handles graded relevance (not just yes/no — a result can be perfect=3, good=2, fair=1, irrelevant=0) and position discounting (a great result at rank 1 is worth more than at rank 10):

\[ \text{DCG@}k = \sum_{i=1}^{k} \frac{2^{\text{rel}_i} - 1}{\log_2(i+1)}, \qquad \text{nDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k} \]

In words: add up each result’s usefulness (\(2^{\text{rel}}-1\), so higher grades count a lot more), shrinking each by how far down the list it sits, then divide by the score of the best-possible ordering so it lands between 0 and 1. Also written: \(\text{DCG@}k = \sum_{i=1}^{k} g_i \cdot d_i\) with gain \(g_i = 2^{\text{rel}_i}-1\) and discount \(d_i = 1/\log_2(i+1)\) — a gain term times a position-discount term.

IDCG is the DCG of the ideal ranking (relevances sorted descending), so nDCG lands in \([0,1]\) and is comparable across queries.

Worked nDCG@3. Ranking produces relevance grades \([3, 1, 2]\).

\[ \text{DCG} = \frac{2^3-1}{\log_2 2} + \frac{2^1-1}{\log_2 3} + \frac{2^2-1}{\log_2 4} = \frac{7}{1} + \frac{1}{1.585} + \frac{3}{2} = 9.13 \]

Ideal order is \([3,2,1]\):

\[ \text{IDCG} = \frac{7}{1} + \frac{3}{1.585} + \frac{1}{2} = 9.39, \qquad \text{nDCG} = \frac{9.13}{9.39} = 0.97 \]

The ranking is nearly ideal — it only swapped the 2nd and 3rd results, which the log discount makes cheap.

The position discount is what makes “great result early” pay off. Here is the discount curve \(1/\log_2(i+1)\) — steep at the top, nearly flat by the bottom of the page:

rank position i discount 1234 5678 rank 1 worth ~2× rank 3
import numpy as np
def dcg(rels):
    rels = np.asarray(rels, float)
    return np.sum((2**rels - 1) / np.log2(np.arange(2, len(rels)+2)))
def ndcg(rels):
    return dcg(rels) / dcg(sorted(rels, reverse=True))
assert abs(ndcg([3,1,2]) - 0.97) < 0.01

With scikit-learn. ndcg_score does the same, taking true relevance grades and predicted scores:

from sklearn.metrics import ndcg_score
import numpy as np
true_rel = np.array([[3, 1, 2]])        # graded relevance of 3 docs
scores   = np.array([[0.9, 0.4, 0.6]])  # ranker's scores (same order)
print(round(ndcg_score(true_rel, scores), 2))   # 0.97

43.6 Learning to rank and LambdaMART

BM25 has a handful of hand-tuned knobs. Modern search blends hundreds of signals — BM25 over several fields, freshness, click-through rate, PageRank, query-document embeddings. Learning to rank (LTR) trains a model to combine them, supervised by relevance judgments. The three families differ in what the loss looks at:

flowchart LR
    P["Pointwise<br/>predict each doc's score<br/>regression/classification"] --> X[loss sees ONE doc]
    A["Pairwise<br/>which of two docs<br/>ranks higher?"] --> Y[loss sees a PAIR]
    L["Listwise<br/>optimize the whole<br/>ranked list directly"] --> Z[loss sees the LIST + metric]

  • Pointwise treats each (query, doc) independently and predicts a relevance score, then sorts. Simple, but it optimizes absolute scores, not order — and getting absolute scores right is harder than getting order right.
  • Pairwise learns from preferences: for two documents under the same query, predict which should rank higher. RankNet’s loss is the cross-entropy on the probability that doc \(i\) beats doc \(j\). This matches what we care about (order) much better.
  • Listwise optimizes a loss tied to the whole list and ideally the target metric (nDCG) directly. Hardest, usually best.

LambdaMART is the workhorse that won many LTR competitions and still anchors production rankers. It is gradient-boosted regression trees (MART, see Ensemble Methods) trained with lambda gradients. The trick solves a real problem: nDCG is flat or discontinuous in the model scores (swapping two docs changes nDCG in a step), so you cannot take its gradient directly.

LambdaMART sidesteps this. For each pair of documents \((i,j)\) where \(i\) should outrank \(j\), it defines a “force” — the lambda — that pulls \(i\) up and \(j\) down, scaled by how much swapping them would change nDCG:

\[ \lambda_{ij} = \frac{-1}{1 + e^{\,s_i - s_j}} \cdot \bigl|\Delta\text{nDCG}_{ij}\bigr| \]

In words: the push on a pair is how surprised we are that \(i\) isn’t already clearly above \(j\) (the sigmoid factor) times how much the metric would improve if we got this pair right (the \(\Delta\)nDCG factor). Also written: \(\lambda_{ij} = -\sigma(s_j - s_i)\cdot|\Delta\text{nDCG}_{ij}|\) where \(\sigma(x) = 1/(1+e^{-x})\) is the logistic sigmoid — the RankNet gradient weighted by the metric delta.

The first factor is the pairwise (RankNet) gradient; the \(|\Delta\text{nDCG}|\) factor is the listwise insight — pairs whose swap most damages the metric get the strongest push. Each document’s gradient is the sum of lambdas over all pairs it participates in, and gradient boosting fits trees to those lambdas. The result is a model trained pairwise but aware of a listwise metric.

In a framework. LightGBM and XGBoost both ship LambdaMART under an lambdarank/rank:ndcg objective; you pass a group array telling the model which rows belong to the same query:

import lightgbm as lgb
# X: features, y: relevance grades, group: #docs per query (sums to len(X))
train = lgb.Dataset(X, label=y, group=group)
ranker = lgb.train(
    {"objective": "lambdarank", "metric": "ndcg", "ndcg_eval_at": [5, 10]},
    train, num_boost_round=100,
)
# ranker.predict(X_new) -> scores; sort within each query group to rank.
Warning

The group array is the part everyone gets wrong. If you forget it (or pass the wrong basket sizes), LightGBM treats all rows as one giant query and the lambdas compare documents across unrelated searches — the model trains without error and ranks badly. The group sizes must sum to len(X), in the same row order.

43.7 Neural IR: dense retrieval, late interaction, learned sparse

BM25 matches exact terms. It cannot tell that “car” and “automobile” mean the same thing — this is the vocabulary mismatch problem. Neural IR uses learned representations (neural networks, generative models) to match on meaning. Three architectures dominate, trading off quality against cost.

flowchart TB
    subgraph DR["Dense / Two-tower (DPR)"]
        q1[query] --> qe[encoder] --> qv[1 vector]
        d1[doc] --> de[encoder] --> dv[1 vector]
        qv -. dot product .-> dv
    end
    subgraph LI["Late interaction (ColBERT)"]
        q2[query] --> qt[per-token vectors]
        d2[doc] --> dt[per-token vectors]
        qt -. MaxSim per query token .-> dt
    end
    subgraph LS["Learned sparse (SPLADE)"]
        q3[query] --> qs[weighted term expansion] --> inv[inverted index]
    end

Dense retrieval / two-tower (DPR). Encode the query and each document into a single dense vector (e.g. 768-d) with a BERT-style encoder, and score by dot product. The two towers are independent, so document vectors are precomputed offline and indexed for approximate nearest-neighbor search (HNSW, IVF-PQ — see LLMs & RAG). At query time you encode the query once and do an ANN lookup. Fast and semantic, but compressing a whole document to one vector loses fine-grained term matching.

Late interaction (ColBERT). Keep one vector per token for both query and document. Score with MaxSim: for each query token, take its maximum similarity over all document tokens, then sum:

\[ S(q,d) = \sum_{i \in q} \max_{j \in d} \; \mathbf{v}_i^q \cdot \mathbf{v}_j^d \]

In words: every query word finds its single best-matching word in the document and contributes that match score; add those best matches up. Also written: \(S(q,d) = \sum_{i \in q} \max_{j \in d} \langle \mathbf{v}_i^q, \mathbf{v}_j^d \rangle\), the sum over query tokens of the row-maxima of the query-by-document token similarity matrix.

Here is MaxSim on a tiny grid — each query row lights up its single brightest document cell, and those winners are summed:

document tokens → query tokens row-max row-max row-max S(q,d) = sum of the three green winners

This recovers token-level matching (so “car” can find “automobile” and still reward exact overlaps) while keeping encoding independent enough to precompute document tokens. The cost is storage — many vectors per document.

Learned sparse (SPLADE). Predict a sparse vector over the vocabulary: each document is expanded into a weighted bag of terms, including terms it does not literally contain (learned expansion fixes vocabulary mismatch), with most weights zero. Because the output is sparse and term-indexed, you can serve it on a classic inverted index — getting neural semantics with the mature, fast infrastructure of §43.1.

Building a dense retriever in practice. sentence-transformers wraps a two-tower encoder; you embed the corpus once, then do a cosine/dot-product search per query:

from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer("all-MiniLM-L6-v2")
corpus = ["a cat sat on the mat", "automobiles need insurance", "dogs love to run"]
doc_emb = model.encode(corpus, convert_to_tensor=True)        # precompute offline
q_emb   = model.encode("car coverage", convert_to_tensor=True)
hits = util.semantic_search(q_emb, doc_emb, top_k=2)          # ANN-style lookup
print(hits[0])   # ranks "automobiles need insurance" first despite no shared word

These power RAG (retrieval-augmented generation): retrieve passages with dense/sparse/hybrid search, then feed them to an LLM as context. The retriever quality caps the generator — garbage retrieved is garbage generated. See Ch. 23 for the full RAG pipeline.

Tip

In practice the strongest systems are hybrid: BM25 (exact, cheap, great on rare terms and IDs) plus a dense retriever (semantic), with scores fused (e.g. reciprocal-rank fusion) and a cross-encoder reranker on the top-k. Each method covers the other’s blind spot.

43.8 Link analysis: PageRank and HITS

When documents link to each other — web pages, citations, social follows — the link graph itself carries a query-independent signal of importance, on top of the textual match. The founding idea of web search: a page is important if important pages link to it.

The everyday picture: imagine a bored web surfer clicking links at random forever. Pages they end up visiting most often are the “important” ones. PageRank is exactly the long-run fraction of time this random surfer spends on each page.

Here is the random surfer in motion — a token drifting along links, pooling where many arrows point:

A B C D C is biggest: most arrows land here

\[ \text{PR}(p) = \frac{1-d}{N} + d \sum_{q \to p} \frac{\text{PR}(q)}{L(q)} \]

In words: a page’s rank is a small baseline everyone gets (the random teleport), plus a share of the rank flowing in from each page that links to it, split evenly across that linking page’s outgoing links. Also written: in matrix form \(\mathbf{r} = \frac{1-d}{N}\mathbf{1} + d\,M\mathbf{r}\), where \(M_{pq} = 1/L(q)\) if \(q\to p\) — the stationary distribution (dominant eigenvector) of the “random surfer” transition matrix. \(d \approx 0.85\) is the damping factor; the \((1-d)\) teleport keeps the chain from getting stuck in dead ends and guarantees a unique solution.

You solve it by power iteration: start uniform, repeatedly apply the update until the ranks stop moving.

Worked intuition. Three pages: A→B, A→C, B→C, C→A. C is pointed to by both A and B, so rank pools at C; C then feeds A, which feeds B and C again. Iterating converges to C highest, A next, B lowest — matching the eyeball verdict that “most-linked-to wins, but who links matters.”

import numpy as np
M = np.array([[0,0,1],[0.5,0,0],[0.5,1,0]])   # column q -> rows it links to
r = np.ones(3)/3; d = 0.85
for _ in range(50):
    r = (1-d)/3 + d * M @ r                     # power iteration
print(r.round(3))                               # C > A > B

HITS (Hyperlink-Induced Topic Search) splits importance into two roles: hubs (pages that link to many good pages — a directory) and authorities (pages many hubs link to — the canonical source). Each reinforces the other: a good hub points to good authorities, a good authority is pointed to by good hubs. Where PageRank is one global query-independent score, HITS is computed on a query-focused subgraph. The same eigenvector machinery underlies both, and both connect to the graph-centrality ideas in Graph ML.

43.9 Near-duplicate detection: shingling, MinHash, LSH

The web is full of near-duplicates — mirror sites, boilerplate, lightly-edited reposts. Comparing every pair of \(N\) documents is \(O(N^2)\), impossible at scale. The shingling → MinHash → LSH pipeline finds near-duplicate pairs in roughly linear time.

Step 1 — Shingling. Represent each document as the set of its \(k\)-shingles (contiguous \(k\)-token sequences). Document similarity becomes Jaccard similarity of the two sets:

\[ J(A,B) = \frac{|A \cap B|}{|A \cup B|} \]

In words: of all the distinct shingles appearing in either document, what fraction appear in both? Also written: \(J(A,B) = \dfrac{|A \cap B|}{|A| + |B| - |A \cap B|}\) — expand the union via inclusion–exclusion.

Step 2 — MinHash. Jaccard on full shingle sets is expensive to store and compare. MinHash compresses each set to a short signature of \(n\) integers with a magic property: the probability that two sets’ MinHash values agree equals their Jaccard similarity.

\[ \Pr[\,h_{\min}(A) = h_{\min}(B)\,] = J(A,B) \]

In words: under a random shuffle of all possible shingles, the chance that the two sets’ smallest elements coincide is exactly their Jaccard overlap. Also written: \(\mathbb{E}\big[\frac{1}{n}\sum_{k=1}^{n}\mathbb{1}[h^{(k)}_{\min}(A) = h^{(k)}_{\min}(B)]\big] = J(A,B)\) — averaging agreement over \(n\) independent hashes is an unbiased estimator of Jaccard.

Here is the simpler way to see why it works. Shuffle every possible shingle into a random order. For each set, look only at which of its shingles came out first in that shuffle. The two sets pick the same first shingle exactly when the overall-first shingle (out of everything in either set) happens to sit in their overlap — and the overlap is a \(|A\cap B| / |A\cup B|\) slice of that combined pool. So “do the two first-picks match?” is a coin flip weighted by the Jaccard. Repeat with \(n\) independent shuffles and the fraction of matching signature slots estimates \(J\).

Worked MinHash. Universe of shingles \(\{a,b,c,d,e\}\), two sets \(A=\{a,c,d\}\), \(B=\{a,d,e\}\). True Jaccard: \(|A\cap B|=|\{a,d\}|=2\), \(|A\cup B|=|\{a,c,d,e\}|=4\), so \(J=2/4=0.5\).

Take two hash functions \(h_1(x)=(x+1)\bmod 5\) and \(h_2(x)=(3x+1)\bmod 5\) with \(a..e = 0..4\):

x \(h_1\) \(h_2\)
a(0) 1 1
c(2) 3 2
d(3) 4 0
e(4) 0 3

\(h_{1,\min}(A)=\min(1,3,4)=1\); \(h_{1,\min}(B)=\min(1,4,0)=0\) → disagree. \(h_{2,\min}(A)=\min(1,2,0)=0\); \(h_{2,\min}(B)=\min(1,0,3)=0\) → agree.

Estimated \(J = 1/2 = 0.5\) — matching the true value (two hashes is too few to be reliable in general; real signatures use 100–200).

Step 3 — LSH banding. Even with signatures, comparing all pairs is \(O(N^2)\). Locality-sensitive hashing makes only similar items collide. Split each signature of length \(n\) into \(b\) bands of \(r\) rows (\(n = b \cdot r\)). Hash each band to a bucket; two documents become candidate pairs if they collide in at least one band.

The probability that two documents with Jaccard \(s\) become candidates is:

\[ P(\text{candidate}) = 1 - (1 - s^{\,r})^{b} \]

In words: a pair matches a single band only if all \(r\) of its rows agree (\(s^r\)); it becomes a candidate unless it fails every one of the \(b\) bands — hence one minus the chance of all-band failure. Also written: \(P(\text{candidate}) = 1 - \big(1 - s^{r}\big)^{b}\), with single-band match \(p = s^r\) and all-band miss \((1-p)^b\).

This is an S-curve with a threshold near \(t \approx (1/b)^{1/r}\): pairs above the threshold almost always collide, pairs below almost never do. Tune \(b\) and \(r\) to place the steep part where you want it.

Jaccard similarity s P(candidate) threshold t ≈ (1/b)^(1/r) no collision collision
Warning

LSH trades exactness for speed. False negatives (similar pairs that never collide) below the threshold are lost forever — there is no second pass. Pick \(b,r\) so the S-curve threshold sits below your true similarity cutoff, accepting more false-positive candidates (cheap to filter) rather than missing real duplicates.

43.10 Mining streams with sublinear memory

When data arrives as an unbounded stream — clicks, packets, log lines — you cannot store it all. The goal is a one-pass algorithm using memory sublinear in the stream size, accepting an approximate answer. Five classics cover the common questions.

Reservoir sampling — keep a uniform sample of \(k\) items. Keep the first \(k\). For the \(i\)-th item (\(i > k\)), keep it with probability \(k/i\), evicting a random current member. Every item seen so far ends up in the reservoir with equal probability \(k/n\), in one pass with \(O(k)\) memory and no knowledge of \(n\) in advance.

import random
def reservoir(stream, k):
    res = []
    for i, x in enumerate(stream):
        if i < k: res.append(x)
        elif random.random() < k/(i+1):
            res[random.randrange(k)] = x       # evict a random slot
    return res

Bloom filter — approximate set membership. A bit array of \(m\) bits and \(h\) hash functions. To insert \(x\), set the \(h\) bits at \(\text{hash}_1(x)\dots\text{hash}_h(x)\). To query, check those bits: if any is 0, \(x\) is definitely not present; if all are 1, \(x\) is probably present. No false negatives, tunable false positives, and no stored elements — just bits. Used everywhere: “have I seen this URL?”, database read-skipping, CDN caches. False-positive rate \(\approx (1 - e^{-hn/m})^h\).

In words: the false-positive rate is the chance that all \(h\) of an absent item’s bits happen to have been switched on by the \(n\) items already inserted. Also written: \(\varepsilon \approx \big(1 - e^{-hn/m}\big)^{h}\), minimized when \(h = (m/n)\ln 2\), giving \(\varepsilon \approx 0.6185^{\,m/n}\).

Here is the Bloom filter as a doodle — three hash functions lighting up three bits per insert; a query that finds any unlit bit is a guaranteed “no”:

insert “cat” h₁,h₂,h₃ → set 3 bits to 1 query “dog” 0 one 0 bit → definitely absent

Flajolet–Martin / HyperLogLog — count distinct elements. How many unique users, with kilobytes instead of gigabytes? The trick: hash each element to a bit string; track the maximum number of leading zeros seen. A hash with \(\rho\) leading zeros appears with probability \(2^{-(\rho+1)}\), so seeing \(\rho\) leading zeros suggests roughly \(2^{\rho}\) distinct elements. HyperLogLog refines this by splitting the hash into \(2^p\) registers (each holding a max-leading-zeros count), then taking a bias-corrected harmonic mean across registers. With ~1.5 KB it estimates cardinalities into the billions at ~2% error.

Worked intuition. If the longest run of leading zeros you have ever seen in a hashed stream is 10, then a 1-in-\(2^{10}\) ≈ 1-in-1024 event has occurred — evidence of roughly 1024 distinct items. One number, the running max, summarizes the whole stream.

Count–Min sketch — approximate frequency of any item. A 2D array of counters with \(d\) rows, each row owning an independent hash into \(w\) columns. To record an item, increment one counter per row (\(d\) counters total). To query its count, take the minimum across the \(d\) rows — collisions only ever add to a counter, so the smallest row is the least-polluted estimate. It never underestimates, overestimates with bounded error, and answers “how often did this IP/term/key appear?” in \(O(d\cdot w)\) space instead of one counter per distinct key. The natural companion to HyperLogLog: HLL counts how many distinct, Count–Min counts how often each.

In words: the estimate is the smallest of the \(d\) counters the item hashes to, because every collision can only inflate a counter, never deflate it — so the minimum is the least-contaminated guess. Also written: \(\hat{f}(x) = \min_{r=1}^{d} C[r,\,h_r(x)]\), with the guarantee \(f(x) \le \hat{f}(x) \le f(x) + \varepsilon\|f\|_1\) holding with probability \(1-\delta\) when \(w = \lceil e/\varepsilon\rceil\), \(d = \lceil\ln(1/\delta)\rceil\).

Worked Count–Min. Two rows, four columns. Insert “a” three times and “b” once. Suppose row 1 sends both “a” and “b” to column 2 (a collision), while row 2 sends “a” to column 0 and “b” to column 3 (no collision). The counters end up:

col 0 col 1 col 2 col 3
row 1 0 0 4 0
row 2 3 0 0 1

Querying “a”: row 1 gives 4 (polluted by “b”), row 2 gives 3. The minimum is 3 — the true count. The collision inflated one row, but taking the min reads past it. Querying “b”: \(\min(4,1)=1\), also exact. The min rescues the estimate whenever some row avoided a collision.

DGIM — count 1s in a sliding window. Over the last \(N\) bits of a 0/1 stream, how many are 1? Storing \(N\) bits is too much. DGIM keeps buckets of exponentially increasing size (\(1,1,2,2,4,4,\dots\) ones each), at most two of each size, storing only each bucket’s end timestamp. Drop buckets that fall out of the window; estimate the count by summing bucket sizes and halving the oldest (still-partially-in-window) bucket. This needs only \(O(\log^2 N)\) bits and answers within a guaranteed 50% relative error (tightenable by keeping more buckets per size).

question algorithm memory
uniform sample of \(k\) reservoir sampling \(O(k)\)
“have I seen \(x\)?” Bloom filter \(O(m)\) bits
how many distinct? HyperLogLog \(O(2^p)\) registers
how often is \(x\)? Count–Min sketch \(O(d\cdot w)\)
how many 1s in last \(N\)? DGIM \(O(\log^2 N)\)

Off-the-shelf sketches. You rarely hand-roll these in production. datasketch provides battle-tested HyperLogLog and MinHash-LSH:

from datasketch import HyperLogLog
hll = HyperLogLog(p=14)                  # ~16 KB, ~1% error
for visitor_id in stream:                # one pass, no per-id storage
    hll.update(visitor_id.encode("utf8"))
print(int(hll.count()))                  # estimated distinct visitors
# datasketch also ships MinHash + MinHashLSH for the §43.9 dedup pipeline.
Tip

These sketches are mergeable: build one per shard or per hour, then combine them (OR the Bloom bits, max the HLL registers, add the Count–Min cells) to get the sketch of the union — exactly, with no re-scan. That is what makes them the backbone of distributed analytics; each worker summarizes its slice independently and you fold the summaries at the end.

43.11 Frequent itemsets and the PCY refinement

A staple of data mining: find sets of items that co-occur frequently in transactions — “customers who buy bread and butter also buy jam”. An itemset is frequent if it appears in at least a support threshold \(s\) of baskets. The classic Apriori algorithm (covered in Clustering & Unsupervised Learning) exploits the downward-closure property: a \(k\)-itemset can be frequent only if all its \((k\!-\!1)\)-subsets are frequent, so it builds up level by level, pruning aggressively.

Apriori’s bottleneck is the first pass into pairs: counting all candidate pairs needs a count for every pair of frequent items, which can blow up memory. PCY (Park–Chen–Yu) refines pass 1 to shrink the pair candidates.

The idea: pass 1 already scans every basket to count single items, and during that scan it has spare time. PCY uses it to hash every pair into a hash table of counters (just counters, not the pairs themselves) and increment buckets. After pass 1, any bucket whose count is below \(s\) cannot contain a frequent pair. Collapse that bucket table into a bitmap (1 = frequent bucket). In pass 2, a pair is a candidate only if both items are frequent AND the pair hashes to a frequent bucket — the extra bitmap condition eliminates many pairs that Apriori would have counted.

flowchart LR
    P1["Pass 1: count items<br/>+ hash pairs into buckets"] --> BM["Bitmap:<br/>1 = frequent bucket"]
    BM --> P2["Pass 2: count pair only if<br/>both items frequent AND<br/>bucket bit = 1"]
    P2 --> FP["Frequent pairs"]

The payoff is memory: the bucket bitmap is tiny compared to a full pair-count table, letting PCY handle datasets where Apriori’s candidate set would not fit. The win is largest when the bucket table is well-filled early — turning idle pass-1 cycles into pruning power for free.

With mlxtend. For mining frequent itemsets and association rules in practice, mlxtend provides Apriori and FP-Growth on a one-hot basket matrix:

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
baskets = pd.DataFrame({                 # one-hot: rows = transactions
    "bread":  [1, 1, 0, 1],
    "butter": [1, 1, 0, 1],
    "jam":    [1, 0, 0, 1],
})
freq = apriori(baskets, min_support=0.5, use_colnames=True)
rules = association_rules(freq, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])

43.12 Submodular optimization for coverage and influence

Many mining tasks are selection problems: pick \(k\) sensors to cover an area, \(k\) documents to summarize a topic, \(k\) seed users to maximize influence in a social network. These share a structure that makes a simple greedy algorithm provably near-optimal: submodularity, the formal version of diminishing returns.

A set function \(f\) is submodular if adding an element to a smaller set helps at least as much as adding it to a larger one. For \(A \subseteq B\) and element \(x \notin B\):

\[ f(A \cup \{x\}) - f(A) \;\ge\; f(B \cup \{x\}) - f(B) \]

In words: the same new item is worth more when you have little than when you already have a lot — diminishing returns, written exactly. Also written: defining the marginal gain \(\Delta_f(x \mid S) = f(S \cup \{x\}) - f(S)\), submodularity is \(\Delta_f(x \mid A) \ge \Delta_f(x \mid B)\) for all \(A \subseteq B\).

It is monotone if adding elements never hurts (\(f(A) \le f(B)\) when \(A \subseteq B\)). Coverage is the canonical example: the first document you pick covers many topics; once you have ten, an eleventh covers little new ground — exactly diminishing returns.

Here is diminishing returns as a picture — each new pick adds a smaller fresh slice of coverage than the last:

# items selected coverage +big +tiny

The remarkable result: for a monotone submodular \(f\) with a cardinality constraint, the greedy algorithm — repeatedly add the element with the largest marginal gain — achieves at least

\[ \left(1 - \tfrac{1}{e}\right) \approx 0.632 \]

In words: the dumb “grab the biggest remaining gain each round” loop is guaranteed to reach at least 63% of the best-possible value, no matter the problem. Also written: \(f(S_{\text{greedy}}) \ge (1 - 1/e)\,f(S^{*})\), where \(S^{*}\) is the optimal size-\(k\) set.

of the optimal value. No polynomial algorithm does better in general (under standard assumptions), so a one-line greedy loop gets you within ~37% of optimal on an otherwise NP-hard problem.

Worked example — set cover. Cover the universe \(\{1,2,3,4,5,6,7,8\}\) picking 2 sets:

set elements
\(S_1\) 1,2,3,4,5
\(S_2\) 4,5,6
\(S_3\) 6,7,8

Greedy step 1: \(S_1\) has the largest marginal gain (5 new), pick it → covered \(\{1..5\}\). Greedy step 2: \(S_2\) adds \(\{6\}\) = 1 new; \(S_3\) adds \(\{6,7,8\}\) = 3 new → pick \(S_3\), covered \(\{1,2,3,4,5,6,7,8\}\) = all 8.

Greedy found the optimum here. The \(1-1/e\) bound is the worst-case guarantee; in practice greedy is usually much closer. This same machinery underlies influence maximization (Kempe–Kleinberg) for viral marketing, where \(f\) is the expected number of users influenced — also monotone submodular.

# greedy max-coverage in a few lines
def greedy_cover(sets, k):
    chosen, covered = [], set()
    for _ in range(k):
        best = max(sets, key=lambda s: len(set(s) - covered))   # largest marginal gain
        chosen.append(best); covered |= set(best)
    return chosen, covered

sets = {"S1":[1,2,3,4,5], "S2":[4,5,6], "S3":[6,7,8]}
print(greedy_cover(sets.values(), 2))   # picks S1 then S3 -> all 8 elements
Tip

For huge ground sets, lazy greedy (CELF) exploits submodularity: a marginal gain can only shrink as the selected set grows, so cache gains in a priority queue and re-evaluate only the top candidate. Same answer as greedy, often orders of magnitude faster.

43.13 CUR decomposition: interpretable low-rank approximation

Low-rank approximation compresses a big matrix into a small one. SVD (Linear Algebra) gives the optimal rank-\(k\) approximation \(A \approx U_k \Sigma_k V_k^\top\) — but its factors are dense linear combinations of all rows and columns. The 3rd singular vector of a customer-by-product matrix is a blend of every product with positive and negative weights: mathematically optimal, humanly meaningless.

CUR trades a little accuracy for interpretability by building the factorization from actual rows and columns of the matrix:

\[ A \approx C\,U\,R \]

In words: approximate the matrix as some of its real columns times a small glue matrix times some of its real rows — so the building blocks are actual data, not abstract directions. Also written: \(A \approx C\,U\,R\) with \(U = (C^{+} A R^{+})\) (or, in the classic construction, the pseudo-inverse of the row–column intersection \(W\)), where \(C^{+}\) denotes the Moore–Penrose pseudo-inverse.

where \(C\) is a sample of actual columns of \(A\), \(R\) is a sample of actual rows, and \(U\) is a small linking matrix (the pseudo-inverse of the intersection of the chosen rows and columns). Because \(C\) and \(R\) are real data — real products, real customers — the factors mean something. They are also sparse if \(A\) is sparse (SVD’s factors are dense even when \(A\) is sparse, a serious memory problem at web scale).

The sampling is not uniform. Columns and rows are chosen with probability proportional to their statistical leverage scores — the squared norms in the top singular directions, i.e. how much each row/column contributes to the matrix’s dominant structure. With \(O(k \log k / \epsilon^2)\) sampled columns and rows, CUR achieves a Frobenius-norm error within \((1+\epsilon)\) of the best rank-\(k\) SVD.

flowchart LR
    A["A (m×n)"] -->|sample columns by leverage| C["C: real columns"]
    A -->|sample rows by leverage| R["R: real rows"]
    C --> U["U: link matrix<br/>(pinv of intersection)"]
    R --> U
    U --> AP["A ≈ C U R<br/>interpretable, sparse-preserving"]

SVD CUR
factors abstract singular vectors actual rows/columns
accuracy optimal \((1+\epsilon)\) of optimal
sparsity dense even if \(A\) sparse preserves sparsity
interpretability low high

Use SVD when you want the best compression and the factors are just an intermediate step (PCA, latent semantic indexing). Use CUR when a human must read the factors — selecting representative genes, products, or documents — or when preserving sparsity matters for memory.

43.1 — Putting it together: index-and-retrieve

The pieces above assemble into one pipeline. Offline you build the index; online you retrieve, rank, and (for RAG) generate.

flowchart TB
    subgraph OFF["Offline: index build"]
        D[corpus] --> TOK[tokenize / shingle]
        TOK --> INV[inverted index<br/>compressed postings]
        TOK --> EMB[document encoder] --> ANN[ANN vector index]
        TOK --> DUP[MinHash + LSH<br/>dedup near-duplicates]
        DUP -.drop dupes.-> INV
    end
    subgraph ON["Online: query time"]
        Q[query] --> C1[BM25 candidate gen<br/>from inverted index]
        Q --> C2[dense retrieval<br/>from ANN index]
        C1 --> FUSE[score fusion]
        C2 --> FUSE
        FUSE --> RR[reranker<br/>cross-encoder / LambdaMART]
        RR --> TOPK[top-k results]
        TOPK --> RAG[LLM generation - RAG, Ch 23]
    end

The shape is universal: cheap recall-oriented candidate generation (BM25, ANN) narrows billions to thousands, then expensive precision-oriented reranking (cross-encoder, LambdaMART) orders the few. Mining algorithms thread throughout — LSH dedupes the corpus, HyperLogLog tracks distinct queries, Bloom filters skip seen URLs during crawl.

Hybrid fusion in a few lines. The fusion box above is often just reciprocal-rank fusion (RRF) — combine two ranked lists by summing \(1/(k+\text{rank})\), no score calibration needed:

def rrf(rankings, k=60):                  # rankings: list of ranked docID lists
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0) + 1/(k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_list  = ["d3", "d1", "d7"]
dense_list = ["d1", "d3", "d9"]
print(rrf([bm25_list, dense_list]))       # d1 and d3 rise via agreement

43.2 — Quick reference

Term / method What it is When / why
Inverted index term → sorted postings list of docIDs the backbone of all keyword search; sorted postings give \(O(\|A\|+\|B\|)\) merge-intersection
Gap + VByte compression store deltas, encode small ints cheaply shrink huge postings; frequent terms compress best
Positional index postings store term positions, not just docID phrase (“new york”) and proximity (NEAR/\(k\)) queries
TF-IDF \((1+\log\text{tf})\cdot\log(N/\text{df})\), ranked by cosine first ranking baseline; down-weights common terms
BM25 tf-saturating, length-normalized score (\(k_1,b\)) production default for 20 years; use over raw TF-IDF
BM25F pool fielded tf then saturate structured docs (title/body/anchor); never sum per-field BM25
Rocchio / PRF / HyDE move query toward relevant docs / expand terms bridge vocabulary mismatch before retrieval runs
MAP mean of average precision over queries binary-relevance ranking metric, rewards early hits
nDCG@k \(\sum (2^{\text{rel}_i}-1)/\log_2(i+1)\), normalized graded relevance + position discount; compare across queries
LambdaMART boosted trees, gradient \(=\) RankNet \(\times\,\|\Delta\text{nDCG}\|\) strong learning-to-rank; remember the group array
Dense / DPR one vector per query & doc, ANN dot-product fast semantic recall; loses fine-grained term match
ColBERT (MaxSim) per-token vectors, sum of row-maxima token-level semantics; costs storage (many vectors/doc)
SPLADE learned sparse term weights on inverted index neural semantics on classic, fast infrastructure
PageRank stationary dist. of random surfer, \(d\approx0.85\) query-independent importance from the link graph
MinHash \(\Pr[\text{collision}]=\) Jaccard compress shingle sets; estimate similarity cheaply
LSH banding candidate if any band collides; S-curve at \((1/b)^{1/r}\) sub-quadratic near-duplicate detection
Bloom filter \(h\) bits per item, no false negatives “have I seen \(x\)?” with bits, not stored elements
HyperLogLog max-leading-zeros across registers, harmonic mean count distinct in KBs; mergeable across shards
Count–Min sketch \(\min\) over \(d\) hashed counters per-item frequency; never underestimates
Submodular greedy add largest marginal gain each round \(1-1/e\) of optimal for coverage / influence
CUR \(A\approx CUR\) from real rows/columns interpretable, sparsity-preserving low-rank approx

43.3 — Key takeaways

  • The inverted index (term → sorted postings, gap-compressed) makes Boolean and ranked retrieval fast; sorted postings turn intersection into a linear merge. Positional indexes add phrase and proximity search.
  • TF-IDF ranks by term importance; BM25 fixes its flaws with tf saturation (\(k_1\)) and length normalization (\(b\)); BM25F pools fielded frequencies before saturating.
  • Query expansion / relevance feedback (Rocchio, pseudo-relevance, LLM rewriting) bridges vocabulary gaps before retrieval even runs.
  • Evaluate ranking with order-aware metrics: MAP (binary relevance), nDCG (graded relevance + position discount). Always normalize against the ideal ranking.
  • Learning to rank spans pointwise → pairwise → listwise; LambdaMART trains boosted trees with gradients scaled by \(\Delta\)nDCG — pairwise mechanics, listwise awareness.
  • Neural IR beats vocabulary mismatch: dense/DPR (one vector, ANN search), ColBERT (per-token MaxSim), SPLADE (learned sparse on an inverted index). Hybrid + rerank wins; these feed RAG (Ch. 23).
  • Link analysis: PageRank (query-independent random-surfer importance) and HITS (hubs and authorities) rank by graph structure, not just text.
  • Near-duplicate detection: shingling → MinHash (match probability = Jaccard) → LSH banding (S-curve threshold \(\approx (1/b)^{1/r}\)); tune to avoid false negatives.
  • Streaming with sublinear memory: reservoir sampling (uniform sample), Bloom filters (membership, no false negatives), HyperLogLog (distinct counts), Count–Min sketch (per-item frequency), DGIM (window counts) — and all of these merge across shards.
  • PCY prunes Apriori’s pair candidates with a free pass-1 bucket bitmap. Submodular greedy gets \(1-1/e\) of optimal for coverage/influence. CUR is interpretable, sparsity-preserving low-rank approximation when SVD’s abstract factors won’t do.

43.4 — See also

  • Ch. 03 — Linear algebra: SVD, leverage scores, the low-rank approximation CUR approximates; eigenvectors behind PageRank/HITS.
  • Ch. 06 — Evaluation: precision/recall and the classification view of relevance.
  • Ch. 11 — Classical ML / association rules: gradient-boosted trees (MART) behind LambdaMART, and the Apriori algorithm PCY refines.
  • Ch. 14 / 18 — Deep learning & Transformers: the encoders powering dense, ColBERT, and SPLADE retrieval.
  • Ch. 23 — LLMs & RAG: approximate nearest-neighbor indexes and the full retrieval-augmented generation pipeline neural IR feeds.
  • Ch. 40 — Graph ML: influence maximization on social graphs, where submodular maximization is applied; graph centrality behind PageRank/HITS.

↪ The thread continues → Chapter 44 · 🏗️ LLM Systems: Building LLMs from Scratch

Retrieval feeds the most demanding systems of all. The next chapter opens the hood on building a large language model from scratch — tokenizers, kernels, parallelism, scaling laws.


📖 All chapters  |  ← 42 · 📐 Learning Theory  |  44 · 🏗️ LLM Systems: Building LLMs from Scratch →

 

© Kader Mohideen