Kader Mohideen

📖 Build Your Own Wikipedia LLM

An 11-lesson mini-course: from a raw Wikipedia dump to your own instruction-tuned chat model. It covers data cleaning, a from-scratch PyTorch pretrain, W&B monitoring, synthetic SFT data on GitHub, DPO, and serving, all on rented vast.ai GPUs for ~$15–30.

Like every course on this site, each lesson is deep and code-first: an intuition-first explainer, complete runnable scripts with the methodology explained line by line, visuals, and a 🧪 "Your task" exercise with a hidden solution. By the final lesson you will have pretrained, instruction-tuned, and shipped WikiGPT-124M, your own model, for roughly the price of a pizza. Theory lives in the AI & ML Encyclopedia; here you build.


Syllabus

Lesson 1: The Mission, the Machine, and the Map
Lesson 2: Getting Wikipedia: The Dump and the Extractor
Lesson 3: The Cleaning Pipeline: From Raw Extract to Training Corpus
Lesson 4: Tokenizer: Training BPE on Your Corpus
Lesson 5: WikiGPT-124M: The Model, Line by Line
Lesson 6: The Pretraining Engine: train.py, Fast and Restartable
Lesson 7: The Big Run: Launch, Babysit, Evaluate the Base Model
Lesson 8: Synthetic Instruction Data: Your Teacher, Your Dataset, on GitHub
Lesson 9: SFT: Teaching WikiGPT to Follow Instructions
Lesson 10: Preference Data and DPO: From Helpful to Preferred
Lesson 11: Ship It: Your Instruction Model, Served and Shared

What you will walk away with

  • A cleaned ~4B-token Wikipedia corpus and a streaming cleaning/dedup pipeline you wrote yourself
  • A custom 32k BPE tokenizer and packed training shards
  • WikiGPT-124M: a modern (RMSNorm · RoPE · SwiGLU) decoder you pretrained from scratch, monitored in Weights & Biases
  • A synthetic instruction dataset you generated and published to GitHub, an SFT model, and a DPO-aligned final instruction model, served over an API and shared on the Hugging Face Hub
  • The complete vast.ai playbook: renting, tmux discipline, resuming after preemption, and an end-to-end bill of ~$15–30
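
To get a feel for where a bill of that size comes from, here is a rough back-of-the-envelope sketch, not the course's own calculation. It uses the common 6 × parameters × tokens approximation for training FLOPs. The parameter count (124M) and corpus size (~4B tokens) come from the list above. Everything else is an assumption made up for illustration: a single pass over the corpus, and the hypothetical GPU peak throughput, utilization, and hourly price. The estimate covers pretraining only and ignores data processing, SFT, DPO, and idle time.

```python
def pretrain_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training FLOPs with the 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


def gpu_hours(flops: float, peak_flops_per_s: float, utilization: float) -> float:
    """Hours needed at a given peak throughput and achieved utilization."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    effective_flops_per_s = peak_flops_per_s * utilization
    return flops / effective_flops_per_s / 3600.0


def rental_cost(hours: float, usd_per_hour: float) -> float:
    """Rental bill for a single GPU billed by the hour."""
    return hours * usd_per_hour


if __name__ == "__main__":
    N_PARAMS = 124e6   # from the model name, WikiGPT-124M
    N_TOKENS = 4e9     # ~4B-token corpus; one pass is an assumption

    # Hypothetical hardware and pricing, chosen only for illustration.
    PEAK_FLOPS_PER_S = 150e12
    UTILIZATION = 0.40
    USD_PER_HOUR = 0.40

    flops = pretrain_flops(N_PARAMS, N_TOKENS)
    hours = gpu_hours(flops, PEAK_FLOPS_PER_S, UTILIZATION)
    cost = rental_cost(hours, USD_PER_HOUR)
    print(f"FLOPs: {flops:.3e}  GPU-hours: {hours:.1f}  cost: ${cost:.2f}")
```

Swap in the numbers for the GPU you actually rent and the utilization you actually measure; the estimate is only as good as those inputs, and restarts after preemption add to the total.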
 

© Kader Mohideen