Kader Mohideen

In this lesson

  • Lesson 8 — Synthetic Instruction Data: Your Teacher, Your Dataset, on GitHub
    • Why generate your own data (and why ground it in Wikipedia)
    • Spin up the teacher: vLLM on your 4090
    • The task mix: five grounded task types
    • The chat format: reserved tokens, decided in Lesson 4
    • src/gen_sft_data.py — the full generator
    • Run it: cost, throughput, and a held-out split
    • Publish it: a public GitHub dataset repo
    • 🧪 Your task
    • Key takeaways
    • Coming up

📖 Build Your Own Wikipedia LLM · Lesson 8 — Synthetic Instruction Data: Your Teacher, Your Dataset, on GitHub


Lesson 8 — Synthetic Instruction Data: Your Teacher, Your Dataset, on GitHub

In Lesson 7 you walked away with something real: checkpoints/ckpt_final.pt, a base WikiGPT-124M that has read ~4B tokens of Wikipedia and can continue any prompt with plausible encyclopedic prose. But try talking to it. Prompt it with “What is photosynthesis?” and it won’t answer — it will continue, maybe with “is a question often asked by students of biology…” or a list of related questions. A base model is an autocomplete engine. It has knowledge but no concept of a conversation, an instruction, or a role.

The fix is supervised fine-tuning (SFT) on (instruction, response) pairs — and the interesting part of this course is that we don’t download someone else’s dataset. We manufacture our own: a bigger open model (Qwen2.5-7B-Instruct) acts as the teacher, your cleaned Wikipedia corpus acts as the grounding material, and a generation script turns the two into 50–80k high-quality, provenance-tracked instruction pairs. Then we publish the whole thing to a public GitHub repo so anyone can reproduce your model. This lesson builds src/gen_sft_data.py end to end.

🎯 In this lesson you will: serve Qwen2.5-7B-Instruct with vLLM on your rented 4090, build a self-instruct-style seed prompt library over five grounded task types, write src/gen_sft_data.py to generate + filter + dedup ~50–80k chat-formatted instruction pairs for about $2 of GPU time, and publish the dataset to a public GitHub repo with schema, provenance, and license documentation.

Why generate your own data (and why ground it in Wikipedia)

Three reasons we synthesize instead of downloading:

  1. Distribution match. WikiGPT’s entire world is Wikipedia. Generic instruction sets (coding help, creative writing, roleplay) ask a 124M model about material it never saw during pretraining, which teaches confident hallucination. Instruction data grounded in the exact corpus the model pretrained on teaches it to surface knowledge it actually has.
  2. Scale economics. A 7B teacher on a 4090 generates thousands of pairs per hour for pennies. Human-annotated data of this size costs tens of thousands of dollars.
  3. You own the pipeline. When Lesson 9’s SFT run exposes a weakness (say, the model can’t do extraction), you regenerate with a different task mix in an afternoon. That loop is impossible with a frozen third-party dataset.

The method is self-instruct with grounding: instead of asking the teacher to invent instructions from thin air (which drifts into repetitive, generic phrasing), every generation call feeds the teacher a real passage from your data/clean/ corpus and a task-specific seed prompt. The passage anchors the facts; the seed prompt controls the task type; sampling temperature provides diversity.

flowchart LR
    A[data/clean/<br/>Wikipedia passages] --> B[Passage sampler<br/>150–400 words]
    S[Seed prompt library<br/>5 task types] --> C
    B --> C[Teacher LLM<br/>Qwen2.5-7B-Instruct<br/>via vLLM, JSON mode]
    C --> D[Quality filters<br/>length · refusals ·<br/>passage-copy check]
    D --> E[Dedup<br/>exact sha1 +<br/>normalized prefix]
    E --> F[data/sft/sft_train.jsonl<br/>50–80k chat pairs]
    F --> G[Public GitHub repo<br/>README + shards + license]

Spin up the teacher: vLLM on your 4090

You can reuse the same vast.ai instance from Lesson 7 (the pretraining run is done; the GPU is idle) or rent a fresh one. Same workflow as always:

# On your laptop — find a 4090 if you don't have one running
vastai search offers 'gpu_name=RTX_4090 num_gpus=1 disk_space>80 inet_down>200' -o 'dph'
vastai create instance <OFFER_ID> --image pytorch/pytorch:2.4.0-cuda12.4-cudnn9-devel --disk 80

vastai show instances          # grab ssh host/port
ssh -p <PORT> root@<HOST>
tmux new -s teacher            # everything long-running lives in tmux

Install vLLM and start the server. vLLM exposes an OpenAI-compatible HTTP API, which is the whole trick: our generation script speaks the standard openai client protocol, so the teacher is swappable — point the same script at any OpenAI-compatible endpoint (a bigger model on another provider, a friend’s server) by changing one URL.

pip install vllm==0.6.3 openai

# Serve the teacher (takes ~2 min to download 15GB of weights the first time)
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.92 \
    --port 8000

Why each flag:

  • --max-model-len 4096: our prompts are (passage ≤ 400 words) + (seed template) + (response ≤ 700 tokens, the script’s max_tokens cap), comfortably under 4k. Capping the context length bounds the worst-case KV cache any single sequence can claim, which helps a 7B model and a deep request queue coexist on 24GB.
  • --gpu-memory-utilization 0.92 — Qwen2.5-7B in bf16 is ~15.2GB of weights. On a 24GB 4090 that leaves ~7GB; this flag tells vLLM to claim 92% of the card and spend everything above the weights on KV cache. More KV cache = more concurrent sequences = higher aggregate throughput.
  • If you OOM (another process holding memory, or a 4090 variant with less free VRAM): serve the AWQ-quantized variant instead — vllm serve Qwen/Qwen2.5-7B-Instruct-AWQ --quantization awq ... — weights drop to ~5.5GB with negligible quality loss for this task.
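The memory argument above can be turned into a rough calculation. This is a back-of-envelope sketch, not something vLLM prints: the architecture numbers (28 layers, 4 KV heads, head dimension 128) are assumptions taken from the published Qwen2.5-7B config, and the result is an upper bound because vLLM also needs room for activations.

```python
# Back-of-envelope KV-cache budget for the flags above.
# ASSUMPTION: architecture numbers come from the published Qwen2.5-7B
# config (28 layers, 4 KV heads via GQA, head_dim 128); bf16 = 2 bytes.
GB = 1e9

def kv_bytes_per_token(layers=28, kv_heads=4, head_dim=128, dtype_bytes=2):
    # 2x for keys and values, stored per layer and per KV head
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def kv_budget_tokens(vram_gb=24.0, utilization=0.92, weights_gb=15.2):
    # Whatever vLLM claims beyond the weights is (at most) KV cache
    budget = vram_gb * GB * utilization - weights_gb * GB
    return int(budget // kv_bytes_per_token())

per_token = kv_bytes_per_token()
tokens = kv_budget_tokens()
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Upper bound on cached tokens: {tokens:,}")
print(f"Full-length (4096-token) sequences that fit: {tokens // 4096}")
```

Since the lesson's requests are much shorter than 4096 tokens, the number of sequences that fit at once is correspondingly higher, which is why 32 concurrent requests is a comfortable default.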

Sanity-check from a second tmux window (tmux new -s work):

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct",
       "messages": [{"role": "user", "content": "Say OK."}],
       "max_tokens": 5}' | python -c "import sys,json; print(json.load(sys.stdin)['choices'][0]['message']['content'])"

If that prints something like OK., your teacher is live. With continuous batching and ~32 concurrent requests, expect 1,500–2,500 output tokens/s aggregate — that number sets the cost of this lesson.

The task mix: five grounded task types

A model SFT’d on one task type becomes a one-trick pony. We want WikiGPT to answer questions, summarize, extract, and explain — so the dataset mixes five task types, all grounded in a sampled passage. The mix is deliberately QA-heavy because QA is what people actually do with a chat model:

| Task type | Share | What the pair looks like | Passage in the user turn? |
|---|---|---|---|
| closed_qa | 35% | Question answerable from the passage; passage included in the user message; answer cites only the passage | Yes |
| summarize | 20% | “Summarize the following passage in N sentences” + passage | Yes |
| extract | 15% | “List all the dates/names/places mentioned…” + passage; structured answer | Yes |
| eli5 | 15% | “Explain in simple terms”; passage grounds the teacher’s answer but is not shown to the student model | No |
| open_qa | 15% | Standalone factual question; passage is the gold source for the teacher’s answer | No |

The last column is the subtle design decision. For closed_qa/summarize/extract, the passage appears inside the user turn, so the model learns reading comprehension over provided context (which also sets up any future RAG use). For eli5/open_qa, the user turn is just the question: the model learns to recall from its own pretrained weights, and the passage only keeps the teacher’s answer factual. A 124M model will be much weaker at the second kind; that’s expected and honest, and the 70/30 split toward passage-in-turn tasks (35% + 20% + 15% versus 15% + 15%) reflects it.
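To see the split concretely, here is a small sketch that checks the mix sums to 1.0, computes the passage-in-turn share, and samples task types the way a weighted draw would. The helper names `passage_share` and `sample_tasks` are made up for this illustration; only the mix values come from the table.

```python
import random
from collections import Counter

TASK_MIX = {"closed_qa": 0.35, "summarize": 0.20, "extract": 0.15,
            "eli5": 0.15, "open_qa": 0.15}
PASSAGE_IN_USER_TURN = {"closed_qa", "summarize", "extract"}

def passage_share(mix, passage_tasks):
    """Fraction of the mix whose user turn contains the passage."""
    return sum(w for t, w in mix.items() if t in passage_tasks)

def sample_tasks(mix, n, seed=0):
    """Draw n task types with probability proportional to their mix weight."""
    rng = random.Random(seed)
    tasks = list(mix)
    return Counter(rng.choices(tasks, weights=[mix[t] for t in tasks], k=n))

assert abs(sum(TASK_MIX.values()) - 1.0) < 1e-9
print(f"passage-in-turn share: {passage_share(TASK_MIX, PASSAGE_IN_USER_TURN):.2f}")
for task, count in sample_tasks(TASK_MIX, 60_000).most_common():
    print(f"{task:10s} {count:6d}  ({count / 60_000:.1%})")
```

Note that the shares in the finished dataset can drift from these targets, because rejection and dedup rates differ per task type.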

The chat format: reserved tokens, decided in Lesson 4

Back in Lesson 4 we reserved three special tokens in the tokenizer — <|user|>, <|assistant|>, <|end|> — precisely so that this moment requires zero tokenizer surgery. Every training example in Lesson 9 will be rendered as:

<|user|> instruction (+passage) <|end|> <|assistant|> response <|end|>
◀─────────── loss masked out (Lesson 9) ────────────▶ ◀── loss ON ───▶

The dataset itself stores structured messages, not the rendered string — rendering (and loss-masking to assistant tokens only) is train_sft.py’s job in Lesson 9. Storing structure keeps the dataset format-agnostic and reusable by other people with other tokenizers. One line of data/sft/sft_train.jsonl:

{"id": "sft-000042", "task_type": "closed_qa", "source_title": "Photosynthesis",
 "messages": [
   {"role": "user", "content": "Read the passage and answer: what pigment absorbs light?\n\nPassage:\nPhotosynthesis is..."},
   {"role": "assistant", "content": "According to the passage, chlorophyll absorbs light..."}
 ]}
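To make the relationship between the stored messages and the rendered string concrete, here is a minimal sketch of a renderer. It is not the course's train_sft.py: the single-space separators, the function name `render_with_mask`, and the choice to train on the assistant content plus its closing `<|end|>` are all assumptions for illustration, and real training masks at the token level rather than by character span.

```python
USER, ASSISTANT, END = "<|user|>", "<|assistant|>", "<|end|>"

def render_with_mask(messages):
    """Return (text, spans); spans are (start, end, loss_on) character ranges.

    ASSUMPTION: single-space separators, and loss on the assistant content
    plus its closing <|end|>. The real Lesson 9 template may differ.
    """
    text, spans = "", []
    for msg in messages:
        tag = USER if msg["role"] == "user" else ASSISTANT
        start = len(text)
        text += f"{tag} "
        spans.append((start, len(text), False))   # role tag: never trained on
        start = len(text)
        text += f"{msg['content']} {END}"
        spans.append((start, len(text), msg["role"] == "assistant"))
        text += " "                               # separator between turns
    return text.rstrip(), spans

record = {"messages": [
    {"role": "user", "content": "What pigment absorbs light?"},
    {"role": "assistant", "content": "Chlorophyll absorbs light."},
]}
text, spans = render_with_mask(record["messages"])
trained = "".join(text[s:e] for s, e, on in spans if on)
print(text)
print(trained)   # Chlorophyll absorbs light. <|end|>
```

Keeping this logic out of the dataset is the point of storing structured messages: someone with a different tokenizer can write their own renderer against the same JSONL.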

src/gen_sft_data.py — the full generator

Now the script. The architecture is a straight pipeline: sample passages → build prompts from the seed library → hit the teacher concurrently (vLLM batches internally, so client-side concurrency is just a thread pool) → parse JSON → filter → dedup → write JSONL. It’s restartable: each accepted pair is appended to the output file as it arrives, and on restart the script reloads the dedup keys and the pair count from that file and continues from there, because a 3-hour generation run will get interrupted at least once.

# src/gen_sft_data.py
"""Generate grounded synthetic SFT data from the cleaned Wikipedia corpus.

Teacher: any OpenAI-compatible endpoint (default: local vLLM serving
Qwen2.5-7B-Instruct). Output: chat-format JSONL in data/sft/.

Usage:
    python src/gen_sft_data.py --n-pairs 60000 --out data/sft/sft_train.jsonl
"""
import argparse, hashlib, json, random, re, sys, time
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from pathlib import Path

from openai import OpenAI

# ---------------------------------------------------------------- seed library
# Each template asks the teacher for a JSON object {"instruction", "response"}.
# {passage} is substituted in. Multiple phrasings per task = output diversity.
SEED_PROMPTS = {
    "closed_qa": [
        "Read this Wikipedia passage and write one specific question that can be "
        "answered ONLY from the passage, plus a correct, complete answer that "
        "cites only facts in the passage.\n\nPassage:\n{passage}\n\n"
        'Return JSON: {{"instruction": "<the question>", "response": "<the answer>"}}',
        "From the passage below, create a challenging comprehension question "
        "(who/what/when/why/how) and answer it using only the passage. Avoid "
        "trivial questions whose answer is the passage's first sentence.\n\n"
        "Passage:\n{passage}\n\n"
        'Return JSON: {{"instruction": "...", "response": "..."}}',
    ],
    "summarize": [
        "Write an instruction asking to summarize the passage below in 2-3 "
        "sentences, then write that summary. The summary must be faithful and "
        "must not copy sentences verbatim.\n\nPassage:\n{passage}\n\n"
        'Return JSON: {{"instruction": "...", "response": "..."}}',
        "Create a (instruction, response) pair where the instruction asks for "
        "a one-paragraph summary of the passage for a general reader, and the "
        "response is that summary.\n\nPassage:\n{passage}\n\n"
        'Return JSON: {{"instruction": "...", "response": "..."}}',
    ],
    "extract": [
        "Write an instruction asking to extract specific structured facts from "
        "the passage (e.g. all dates, all named people, all locations, or all "
        "numeric quantities - pick whichever the passage is rich in), then the "
        "response as a bullet list. Only include facts present in the passage."
        "\n\nPassage:\n{passage}\n\n"
        'Return JSON: {{"instruction": "...", "response": "..."}}',
    ],
    "eli5": [
        "Pick the central concept of the passage below. Write an instruction "
        "of the form 'Explain <concept> in simple terms' (do NOT mention the "
        "passage in the instruction), and a friendly, accurate explanation a "
        "curious 12-year-old would understand, grounded in the passage's facts."
        "\n\nPassage:\n{passage}\n\n"
        'Return JSON: {{"instruction": "...", "response": "..."}}',
    ],
    "open_qa": [
        "Using the passage below as your source of truth, write a standalone "
        "factual question about its topic (the question must NOT reference "
        "'the passage') and a self-contained, accurate answer of 2-5 sentences."
        "\n\nPassage:\n{passage}\n\n"
        'Return JSON: {{"instruction": "...", "response": "..."}}',
    ],
}
# task-type mix (must sum to 1.0) - see lesson table for rationale
TASK_MIX = {"closed_qa": 0.35, "summarize": 0.20, "extract": 0.15,
            "eli5": 0.15, "open_qa": 0.15}
# passage included in the student-visible user turn for these tasks only:
PASSAGE_IN_USER_TURN = {"closed_qa", "summarize", "extract"}

SYSTEM_PROMPT = (
    "You are a meticulous dataset annotator. You always respond with a single "
    "valid JSON object and nothing else. You never refuse, never add "
    "disclaimers, and never invent facts not supported by the given passage."
)

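# \b word boundaries keep "as an ai" from matching "as an aircraft" and
# "i cannot" from matching "Hawaii cannot" in ordinary Wikipedia prose.
REFUSAL_PATTERNS = re.compile(
    r"\b(?:as an ai|i cannot|i can't|i'm sorry|i am sorry|i apologize|"
    r"language model|i don't have access|cannot assist)\b", re.IGNORECASE)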

# ---------------------------------------------------------------- passages
def iter_passages(clean_dir: Path, min_words=150, max_words=400, seed=1337):
    """Yield (title, passage) sampled from Lesson 3's cleaned JSONL shards."""
    files = sorted(clean_dir.glob("*.jsonl"))
    if not files:
        sys.exit(f"no cleaned shards found in {clean_dir} - run Lesson 3 first")
    rng = random.Random(seed)
    while True:
        f = rng.choice(files)
        with open(f, encoding="utf-8") as fh:
            lines = fh.readlines()
        for line in rng.sample(lines, min(64, len(lines))):
            doc = json.loads(line)
            words = doc["text"].split()
            if len(words) < min_words:
                continue
            # random window -> different passages from the same article
            start = rng.randint(0, max(0, len(words) - max_words))
            n = rng.randint(min_words, max_words)
            yield doc.get("title", ""), " ".join(words[start:start + n])

# ---------------------------------------------------------------- filters
def passes_filters(inst: str, resp: str, passage: str) -> str | None:
    """Return None if the pair is good, else a rejection reason (for stats)."""
    if not (12 <= len(inst) <= 2000):          return "inst_length"
    if not (20 <= len(resp) <= 3000):          return "resp_length"
    if REFUSAL_PATTERNS.search(resp):          return "refusal"
    if REFUSAL_PATTERNS.search(inst):          return "refusal"
    # reject responses that just copy the passage: cheap ratio on a prefix
    sim = SequenceMatcher(None, resp[:600], passage[:600]).ratio()
    if sim > 0.85:                             return "passage_copy"
    return None

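def norm_key(user_turn: str) -> str:
    """Dedup key: lowercase, alphanumeric-only, first 12 words of the instruction.

    For passage-bearing turns, the first 12 words of the passage are appended.
    Without them, generic instructions ("Summarize the passage below ...")
    collide across different passages and get dropped as duplicates.
    """
    inst, _, passage = user_turn.partition("\n\nPassage:\n")
    def first_words(s: str) -> list[str]:
        return re.sub(r"[^a-z0-9\s]", "", s.lower()).split()[:12]
    return " ".join(first_words(inst) + first_words(passage))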

# ---------------------------------------------------------------- generation
def gen_one(client, model, task_type, title, passage, temperature):
    template = random.choice(SEED_PROMPTS[task_type])
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": template.format(passage=passage)}],
        temperature=temperature,
        max_tokens=700,
        response_format={"type": "json_object"},   # JSON mode: vLLM + OpenAI both honor it
    )
    obj = json.loads(resp.choices[0].message.content)
    inst, ans = obj["instruction"].strip(), obj["response"].strip()
    reason = passes_filters(inst, ans, passage)
    if reason:
        return None, reason
    user_content = (f"{inst}\n\nPassage:\n{passage}"
                    if task_type in PASSAGE_IN_USER_TURN else inst)
    return {"task_type": task_type, "source_title": title,
            "messages": [{"role": "user", "content": user_content},
                         {"role": "assistant", "content": ans}]}, None

def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--clean-dir", default="data/clean")
    ap.add_argument("--out", default="data/sft/sft_train.jsonl")
    ap.add_argument("--n-pairs", type=int, default=60000)
    ap.add_argument("--base-url", default="http://localhost:8000/v1")
    ap.add_argument("--api-key", default="EMPTY")   # vLLM ignores it; real APIs need it
    ap.add_argument("--model", default="Qwen/Qwen2.5-7B-Instruct")
    ap.add_argument("--concurrency", type=int, default=32)
    ap.add_argument("--temperature", type=float, default=0.8)
    args = ap.parse_args()

    client = OpenAI(base_url=args.base_url, api_key=args.api_key)
    out_path = Path(args.out); out_path.parent.mkdir(parents=True, exist_ok=True)

    # resume support: reload dedup keys from an existing output file
    seen, exact = set(), set()
    n_done = 0
    if out_path.exists():
        with open(out_path, encoding="utf-8") as fh:
            for line in fh:
                r = json.loads(line)
                inst = r["messages"][0]["content"]
                seen.add(norm_key(inst))
                exact.add(hashlib.sha1(inst.encode()).hexdigest())
                n_done += 1
        print(f"resuming: {n_done} pairs already on disk")

    passages = iter_passages(Path(args.clean_dir))
    tasks = list(TASK_MIX); weights = [TASK_MIX[t] for t in tasks]
    stats = {"ok": 0, "dup": 0, "parse_error": 0}
    t0 = time.time()

    with open(out_path, "a", encoding="utf-8") as out, \
         ThreadPoolExecutor(max_workers=args.concurrency) as pool:
        futures = set()
        while n_done < args.n_pairs:
            # keep the pool saturated so vLLM's batcher stays full
            while len(futures) < args.concurrency * 2 and n_done + len(futures) < args.n_pairs + 200:
                title, passage = next(passages)
                tt = random.choices(tasks, weights=weights)[0]
                futures.add(pool.submit(gen_one, client, args.model, tt,
                                        title, passage, args.temperature))
            done_set = {f for f in list(futures) if f.done()}
            if not done_set:
                time.sleep(0.2); continue
            for fut in done_set:
                futures.discard(fut)
                try:
                    rec, reason = fut.result()
                except Exception:
                    stats["parse_error"] += 1; continue   # bad JSON / timeout: just drop it
                if rec is None:
                    stats[reason] = stats.get(reason, 0) + 1; continue
                inst = rec["messages"][0]["content"]
                h = hashlib.sha1(inst.encode()).hexdigest()
                k = norm_key(inst)
                if h in exact or k in seen:
                    stats["dup"] += 1; continue
                exact.add(h); seen.add(k)
                rec["id"] = f"sft-{n_done:06d}"
                out.write(json.dumps(rec, ensure_ascii=False) + "\n")
                n_done += 1; stats["ok"] += 1
                if n_done % 500 == 0:
                    rate = stats["ok"] / (time.time() - t0)
                    out.flush()
                    print(f"{n_done}/{args.n_pairs}  {rate:.1f} pairs/s  stats={stats}")

    print(f"done: {n_done} pairs -> {out_path}  reject-stats={stats}")

if __name__ == "__main__":
    main()

Line-by-line, the decisions that matter:

  • response_format={"type": "json_object"} — JSON mode. vLLM constrains decoding so the output is valid JSON; the OpenAI API honors the same field, keeping the script portable. Without it, expect 5–10% of outputs to be JSON wrapped in markdown fences or chatty preamble, all wasted tokens. (vLLM also supports strict schema enforcement via extra_body={"guided_json": <schema>} if you want to pin the exact keys — JSON mode plus a try/except is enough here.)
  • temperature=0.8 — diversity knob. At 0.2 the teacher writes the same five question shapes forever; at 1.2 factuality degrades. 0.7–0.9 is the standard band for synthetic data generation.
  • Client-side concurrency = 32, queue depth 2× — vLLM’s continuous batcher only helps if requests are waiting. Sequential requests would use ~5% of the GPU; 32 in flight keeps it >90% busy, which is the difference between a 2-hour run and a 30-hour run.
  • The passage-copy filter (SequenceMatcher ratio > 0.85) kills the failure mode where a lazy teacher “summarizes” by echoing the passage — training on those teaches your model to parrot its input.
  • The refusal filter matters more than it looks: a few hundred “As an AI language model, I cannot…” strings in SFT data will make a 124M model refuse constantly, because small models latch onto high-frequency surface patterns.
  • Two dedup layers: exact sha1 on the user turn, plus a normalized 12-word-prefix key that catches near-duplicates like “What year was X founded?” vs “What year was X founded”.
  • Append + resume — the script reloads dedup state from the output file on restart, so a dropped SSH session costs you nothing. (Same philosophy as train.py’s checkpointing in Lesson 6.)
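The copy filter and the two dedup layers are easy to try in isolation. The sketch below mirrors the ideas in the script rather than importing it; `copy_ratio` and `prefix_key` are illustrative names, and the example strings are invented.

```python
import hashlib
import re
from difflib import SequenceMatcher

def copy_ratio(resp, passage, window=600):
    # Same cheap check as the generator: similarity over a 600-char prefix
    return SequenceMatcher(None, resp[:window], passage[:window]).ratio()

def prefix_key(inst, n_words=12):
    # Lowercase, strip punctuation, keep the first n_words words
    return " ".join(re.sub(r"[^a-z0-9 ]", "", inst.lower()).split()[:n_words])

def sha1(text):
    return hashlib.sha1(text.encode()).hexdigest()

passage = ("Photosynthesis is a biological process by which plants convert "
           "light energy into chemical energy stored in glucose.")
paraphrase = "Plants use sunlight to make sugar, which stores energy for later."

# An echoed passage scores 1.0 and is rejected; a paraphrase stays under 0.85
print(copy_ratio(passage, passage))
print(copy_ratio(paraphrase, passage) > 0.85)

# Near-duplicates differ in exact hash but share the normalized prefix key
a, b = "What year was X founded?", "What year was X founded"
print(sha1(a) == sha1(b), prefix_key(a) == prefix_key(b))
```

The exact hash is cheap and precise; the prefix key is deliberately lossy, so it catches punctuation and casing variants at the price of occasionally dropping two distinct questions that merely start the same way.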

Run it: cost, throughput, and a held-out split

Sync the code to the instance. Your data/clean/ shards should already be there from Lesson 3; if not, rsync them up the same way:

# from your laptop
rsync -avz -e "ssh -p <PORT>" wikillm/src/ root@<HOST>:/workspace/wikillm/src/

# on the instance, in the 'work' tmux window (teacher runs in the other one)
cd /workspace/wikillm
python src/gen_sft_data.py --n-pairs 60000 --out data/sft/sft_train.jsonl

Napkin math for the cost line: 60k accepted pairs with ~20% of calls lost to rejections and duplicates → ~75k teacher calls (60k / 0.8); each call is ~700 prompt tokens + ~350 output tokens. Output tokens dominate wall time: 75k × 350 ≈ 26M generated tokens, at ~2,000 tok/s aggregate ≈ 3.6 hours. At $0.40/hr:

Cost for this lesson: ~2–4 GPU-hours ≈ $1–2. (Add ~$0.15 if you rented a fresh instance and count the vLLM setup time.)
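The same napkin math as a function, so you can plug in the throughput you actually measure. This is only an estimate under the lesson's stated figures (75k calls, ~350 output tokens per call, $0.40/hr); the function name `generation_cost` is made up for this sketch, and it ignores prompt-processing time.

```python
def generation_cost(n_calls, out_tokens_per_call, tokens_per_s, usd_per_hour):
    """Return (total output tokens, wall-clock hours, dollars)."""
    total_tokens = n_calls * out_tokens_per_call
    hours = total_tokens / tokens_per_s / 3600
    return total_tokens, hours, hours * usd_per_hour

# Sweep the throughput band quoted earlier for ~32 concurrent requests
for tok_s in (1_500, 2_000, 2_500):
    tokens, hours, usd = generation_cost(75_000, 350, tok_s, 0.40)
    print(f"{tok_s} tok/s: {tokens / 1e6:.1f}M tokens, {hours:.1f} h, ${usd:.2f}")
```

Throughput is the sensitive input: if the progress line printed by the generator shows a much lower pairs/s rate than expected, the cost scales up proportionally.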

While it runs, spot-check quality — this is your only chance to catch a bad seed prompt before it pollutes 20k examples:

shuf -n 5 data/sft/sft_train.jsonl | python -m json.tool --json-lines --no-ensure-ascii | less

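Read them like an examiner: Is the closed-QA answer actually in the passage? Does the summary paraphrase rather than copy? If one task type looks weak, kill the run, fix its seed prompt, and resume; the script picks up where it left off. Rows already written with the old prompt stay in the file, so filter them out by task_type once the run finishes if they are bad enough.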

Finally, carve out a held-out eval split (we’ll want it for Lesson 9’s eval loss and Lesson 11’s judge eval):

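python - <<'EOF'
import random, sys
from pathlib import Path

train = Path("data/sft/sft_train.jsonl")
val = Path("data/sft/sft_val.jsonl")
if val.exists():   # a second run would shrink train again and discard the old val split
    sys.exit(f"{val} already exists - refusing to split twice")
random.seed(0)
with open(train, encoding="utf-8") as fh:
    lines = fh.readlines()
random.shuffle(lines)
n_val = 1000
with open(val, "w", encoding="utf-8") as fh:
    fh.writelines(lines[:n_val])
with open(train, "w", encoding="utf-8") as fh:
    fh.writelines(lines[n_val:])
print(f"train={len(lines)-n_val} val={n_val}")
EOF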
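After splitting, a quick check that the two files do not share examples can save a confusing eval later. A minimal sketch, assuming the record schema shown earlier (`id`, `task_type`); the helper name `split_report` is invented for this example.

```python
import json
from collections import Counter
from pathlib import Path

def split_report(train_lines, val_lines):
    """Return (ids present in both splits, task-type counts of the val split)."""
    train_ids = {json.loads(line)["id"] for line in train_lines}
    val = [json.loads(line) for line in val_lines]
    overlap = train_ids & {rec["id"] for rec in val}
    return overlap, Counter(rec["task_type"] for rec in val)

if __name__ == "__main__":
    train_path = Path("data/sft/sft_train.jsonl")
    val_path = Path("data/sft/sft_val.jsonl")
    if train_path.exists() and val_path.exists():
        with open(train_path, encoding="utf-8") as tr, \
             open(val_path, encoding="utf-8") as va:
            overlap, counts = split_report(tr, va)
        print(f"overlapping ids: {len(overlap)}  val task mix: {dict(counts)}")
```

Because the split is a plain shuffle rather than a stratified one, the val task mix will only approximate the training mix; with 1,000 examples that is usually close enough.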

Publish it: a public GitHub dataset repo

A dataset nobody can download is a dataset that doesn’t exist. We publish to a public GitHub repo — separate from wikillm/ (code and data have different lifecycles and licenses). Two file-size realities shape the layout: GitHub blocks files >100MB and warns at 50MB, and 60k pairs is roughly 80–120MB of JSONL. So: shard to <50MB plain-text files (grep-able, diff-able, no special tooling for consumers), with git lfs as the alternative if you prefer one big file.

flowchart LR
    A[sft_train.jsonl<br/>~100MB] --> B[split into<br/>&lt;50MB shards]
    B --> C[README.md<br/>schema · provenance ·<br/>license · repro command]
    C --> D[git init + commit]
    D --> E[gh repo create --public<br/>git push]
    E --> F[Anyone reproduces<br/>your Lesson 9 SFT run]

# on the instance (or rsync the data down and do this locally)
mkdir -p ~/wikigpt-sft-data && cd ~/wikigpt-sft-data

# shard train set at 20k lines/shard (~35MB each, safely under 50MB)
mkdir -p data
split -l 20000 -d --additional-suffix=.jsonl \
    /workspace/wikillm/data/sft/sft_train.jsonl data/sft_train_
cp /workspace/wikillm/data/sft/sft_val.jsonl data/

git init
git add data/

The README is not decoration — it’s the dataset’s schema contract, reproduction recipe, and license notice. Write it before the first push:

cat > README.md <<'EOF'
# WikiGPT-SFT: Synthetic Instruction Data Grounded in English Wikipedia

~60,000 (instruction, response) pairs for supervised fine-tuning of small
language models, generated from cleaned English Wikipedia passages. Built for
WikiGPT-124M (see the "Build Your Own Wikipedia LLM" course) but usable for
any chat SFT.

## Schema
One JSON object per line (`data/sft_train_*.jsonl`, `data/sft_val.jsonl`):
- `id`           unique example id (`sft-NNNNNN`)
- `task_type`    one of: closed_qa (35%), summarize (20%), extract (15%),
                 eli5 (15%), open_qa (15%)
- `source_title` title of the source Wikipedia article
- `messages`     [{role: "user", content}, {role: "assistant", content}]
                 For closed_qa/summarize/extract the user turn embeds the
                 source passage; for eli5/open_qa it does not.

## Provenance & generation
- Source text: English Wikipedia, `pages-articles-multistream` dump
  (dumps.wikimedia.org), extracted, cleaned, and deduplicated.
- Teacher model: Qwen/Qwen2.5-7B-Instruct served locally with vLLM
  (temperature 0.8, JSON mode).
- Filters: length bounds, refusal-string rejection, passage-copy rejection
  (SequenceMatcher ratio > 0.85), exact sha1 + normalized-prefix dedup.
- Generation script: `src/gen_sft_data.py` in the wikillm repo. Reproduce:
  `python src/gen_sft_data.py --n-pairs 60000`

## License
Source passages derive from Wikipedia, licensed **CC BY-SA 4.0**. Because
responses are grounded in and derived from that text, this dataset is
released under **CC BY-SA 4.0** (share-alike inherits). Attribution:
Wikipedia contributors. Generated responses were produced by
Qwen2.5-7B-Instruct; see the Qwen model license for its terms.

## Known limitations
Synthetic data inherits teacher errors: occasional hallucinated details in
eli5/open_qa answers, uneven difficulty. ~1k-pair audit found >95% of
closed_qa answers fully supported by their passage. Use accordingly.
EOF

git add README.md
git commit -m "WikiGPT-SFT v1: 60k grounded synthetic instruction pairs"

# create the public repo and push (gh CLI; or create it in the web UI and add the remote)
gh repo create wikigpt-sft-data --public --source=. --push
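Since the README is a schema contract, it may be worth checking the shards against it before the push above. This is a minimal sketch that uses only the fields the README lists; `validate_record` and `validate_file` are hypothetical helpers, not part of the course code.

```python
import json

TASK_TYPES = {"closed_qa", "summarize", "extract", "eli5", "open_qa"}

def validate_record(rec):
    """Return a list of schema problems for one record (empty list = valid)."""
    problems = [f"missing key: {key}"
                for key in ("id", "task_type", "source_title", "messages")
                if key not in rec]
    if rec.get("task_type") not in TASK_TYPES:
        problems.append(f"unknown task_type: {rec.get('task_type')!r}")
    msgs = rec.get("messages")
    if not isinstance(msgs, list) or not all(isinstance(m, dict) for m in msgs):
        problems.append("messages must be a list of objects")
        return problems
    roles = [m.get("role") for m in msgs]
    if roles != ["user", "assistant"]:
        problems.append(f"expected roles [user, assistant], got {roles}")
    if any(not str(m.get("content", "")).strip() for m in msgs):
        problems.append("empty message content")
    return problems

def validate_file(path):
    """Print every invalid line in a JSONL file and return how many were bad."""
    bad = 0
    with open(path, encoding="utf-8") as fh:
        for lineno, line in enumerate(fh, 1):
            problems = validate_record(json.loads(line))
            if problems:
                bad += 1
                print(f"{path}:{lineno}: {'; '.join(problems)}")
    return bad
```

Running `validate_file` over each shard in data/ and refusing to push when it returns a non-zero count keeps the published files in line with what the README promises. If you add a task type such as multi_hop, extend `TASK_TYPES` and the README together.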

If you’d rather keep one un-sharded file, the git lfs route is:

git lfs install
git lfs track "*.jsonl"
git add .gitattributes data/ README.md
git commit -m "WikiGPT-SFT v1 (LFS)" && git push

Plain shards are friendlier to consumers: no LFS client and no LFS bandwidth quota (GitHub’s free tier has capped LFS downloads at 1GB/month; check the current limits, since they change, but a popular dataset burns through an allowance like that fast). That’s why shards are the default here.

The license paragraph is the part most people get wrong, so to be explicit: Wikipedia text is CC BY-SA 4.0, and BY-SA is share-alike — derivatives of the text (which your passages, and arguably the grounded responses, are) must carry the same license. Publishing under CC BY-SA with attribution to Wikipedia contributors is both legally required and zero-cost. You’ll reuse this exact repo pattern in Lesson 10 for the DPO preference pairs.

One housekeeping note before you leave the instance: if you’re done generating, stop the vLLM tmux session (tmux kill-session -t teacher) — or destroy the instance if Lesson 9 is a few days away. Idle teachers bill the same as busy ones.

🧪 Your task

Add a sixth task type, multi_hop: questions that require combining two facts from different parts of the same passage (e.g., “The article says X was born in 1879 and moved to Y in 1905 — how old was X on arrival?”). Write the seed prompt, register it in the mix at 10% (rebalance the others to keep the sum at 1.0), and add one task-specific filter: reject any multi_hop pair whose response is shorter than 100 characters (single-fact answers are almost always a failed multi-hop). Then generate 200 pairs of only this type and manually grade 10.

Solution

Additions to src/gen_sft_data.py:

SEED_PROMPTS["multi_hop"] = [
    "Read the passage and write a question whose answer requires COMBINING "
    "two different facts stated in different sentences of the passage "
    "(comparison, arithmetic on two numbers, or a cause stated in one place "
    "and an effect in another). Then write the answer, explicitly walking "
    "through both facts before concluding.\n\nPassage:\n{passage}\n\n"
    'Return JSON: {{"instruction": "...", "response": "..."}}',
]

TASK_MIX = {"closed_qa": 0.32, "summarize": 0.18, "extract": 0.13,
            "eli5": 0.14, "open_qa": 0.13, "multi_hop": 0.10}
PASSAGE_IN_USER_TURN.add("multi_hop")   # the passage must be visible to the student

The task-specific filter, added at the top of passes_filters (pass task_type through from gen_one):

def passes_filters(inst, resp, passage, task_type=None):
    if task_type == "multi_hop" and len(resp) < 100:
        return "multihop_too_short"
    ...

Generate a type-only batch by temporarily setting TASK_MIX = {"multi_hop": 1.0} (or add a --only-task flag) and running:

python src/gen_sft_data.py --n-pairs 200 --out data/sft/multihop_probe.jsonl
shuf -n 10 data/sft/multihop_probe.jsonl | python -m json.tool --json-lines --no-ensure-ascii
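If you go the flag route instead of editing TASK_MIX by hand, one possible shape is sketched below. The `--only-task` flag is the hypothetical one mentioned above, and `resolve_mix` is an invented helper; neither exists in the script as written, so treat this as a starting point rather than the course's implementation.

```python
import argparse

TASK_MIX = {"closed_qa": 0.32, "summarize": 0.18, "extract": 0.13,
            "eli5": 0.14, "open_qa": 0.13, "multi_hop": 0.10}

def resolve_mix(mix, only_task=None):
    """Return the full mix, or a single-task mix when only_task is given."""
    if only_task is None:
        return dict(mix)
    if only_task not in mix:
        raise SystemExit(f"unknown task type {only_task!r}; choose from {sorted(mix)}")
    return {only_task: 1.0}

def parse_args(argv=None):
    ap = argparse.ArgumentParser()
    # Hypothetical flag: restrict generation to one task type for probing
    ap.add_argument("--only-task", default=None, choices=sorted(TASK_MIX))
    return ap.parse_args(argv)

args = parse_args(["--only-task", "multi_hop"])
print(resolve_mix(TASK_MIX, args.only_task))   # {'multi_hop': 1.0}
```

In main(), the resolved mix would replace TASK_MIX when building the `tasks` and `weights` lists, leaving the default run untouched.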

When grading, the common failure is a disguised single-hop: the question sounds compositional but one sentence of the passage answers it outright. Expect roughly 6–8 of 10 to be genuinely multi-hop with a 7B teacher — good enough at a 10% mix share, and a concrete preview of why Lesson 10 adds preference optimization on top of SFT.

Key takeaways

  • A base model completes; instruction data is what teaches it to converse. You generated your own instead of downloading, matched to the exact corpus your model pretrained on.
  • vLLM turns your rented 4090 into an OpenAI-compatible teacher endpoint (vllm serve Qwen/Qwen2.5-7B-Instruct); memory math (15GB weights + KV cache under --gpu-memory-utilization 0.92, --max-model-len 4096) makes 7B fit in 24GB. Any OpenAI-compatible API is a drop-in via --base-url.
  • Grounded self-instruct = real Wikipedia passage + task-specific seed prompt + temperature 0.8 + JSON mode. Five task types, QA-weighted, with the passage in the user turn only for comprehension-style tasks.
  • Filters are not optional: refusal strings, passage-copy similarity, length bounds, and two-layer dedup are the difference between a dataset and noise. Client-side concurrency (~32) keeps vLLM’s batcher full — a ~10× throughput difference.
  • ~60k pairs cost ~2–4 GPU-hours ≈ $1–2. The whole dataset ships to a public GitHub repo as <50MB JSONL shards with a README covering schema, generation provenance, and the CC BY-SA 4.0 share-alike license Wikipedia derivatives inherit.

Coming up

In Lesson 9 we point this dataset at your base checkpoint: src/train_sft.py renders the chat template with <|user|>/<|assistant|>/<|end|>, masks the loss to assistant tokens only, and in under two GPU-hours turns WikiGPT-124M from an autocomplete engine into a model that answers you.



© Kader Mohideen