📖 Build Your Own Wikipedia LLM

From a raw Wikipedia dump to your own instruction-tuned chat model — data cleaning, a from-scratch PyTorch pretrain, W&B monitoring, synthetic SFT data on GitHub, DPO, and serving — all on rented vast.ai GPUs for ~$15–30.

11-lesson mini-course

Like every course on this site, each lesson is deep and code-first: an intuition-first explainer, complete runnable scripts explained line by line, visuals, and a 🧪 "Your task" exercise with a hidden solution. By the final lesson you will have pretrained, instruction-tuned, and shipped your own model, WikiGPT-124M, for roughly the price of a pizza. Theory lives in the AI & ML Encyclopedia; here you build.


Syllabus

#	Lesson
Lesson 1	The Mission, the Machine, and the Map
Lesson 2	Getting Wikipedia: The Dump and the Extractor
Lesson 3	The Cleaning Pipeline: From Raw Extract to Training Corpus
Lesson 4	Tokenizer: Training BPE on Your Corpus
Lesson 5	WikiGPT-124M: The Model, Line by Line
Lesson 6	The Pretraining Engine: train.py, Fast and Restartable
Lesson 7	The Big Run: Launch, Babysit, Evaluate the Base Model
Lesson 8	Synthetic Instruction Data: Your Teacher, Your Dataset, on GitHub
Lesson 9	SFT: Teaching WikiGPT to Follow Instructions
Lesson 10	Preference Data and DPO: From Helpful to Preferred
Lesson 11	Ship It: Your Instruction Model, Served and Shared

What you will walk away with

A cleaned ~4B-token Wikipedia corpus and a streaming cleaning/dedup pipeline you wrote yourself
A custom 32k BPE tokenizer and packed training shards
WikiGPT-124M — a modern (RMSNorm · RoPE · SwiGLU) decoder you pretrained from scratch, monitored in Weights & Biases
A synthetic instruction dataset you generated and published to GitHub, an SFT model, and a DPO-aligned final instruction model, served over an API and shared on the Hugging Face Hub
The complete vast.ai playbook — renting, tmux discipline, resuming after preemption, and an end-to-end bill of ~$15–30
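
To give a feel for what "streaming cleaning/dedup" means in the first item above, here is a minimal sketch using only the Python standard library. It is not the course's actual pipeline: the function names `normalize` and `stream_dedup`, the `min_chars` threshold, and the choice of exact-match hashing (rather than near-duplicate detection) are all assumptions made for illustration.

```python
import hashlib
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants hash the same."""
    return " ".join(text.split()).lower()


def stream_dedup(docs: Iterable[str], min_chars: int = 20) -> Iterator[str]:
    """Yield each document once, skipping very short ones and exact repeats.

    Hypothetical helper: min_chars is an illustrative threshold, not a
    value taken from the course.
    """
    seen = set()  # stores fixed-size digests, not the documents themselves
    for doc in docs:
        cleaned = " ".join(doc.split())  # collapse runs of whitespace
        if len(cleaned) < min_chars:
            continue  # drop stubs and near-empty articles
        digest = hashlib.sha1(normalize(cleaned).encode("utf-8")).digest()
        if digest in seen:
            continue  # already emitted an equivalent document
        seen.add(digest)
        yield cleaned


if __name__ == "__main__":
    sample = [
        "Paris is the capital of France.",
        "paris  is the capital of   France.",  # duplicate after normalization
        "stub",                                # too short
        "Berlin is the capital of Germany.",
    ]
    for kept in stream_dedup(sample):
        print(kept)
```

Because the function is a generator, documents flow through one at a time and only the digests are held in memory, which is the property that lets this style of pipeline run over a corpus far larger than RAM.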