Build Your Own Wikipedia LLM
From a raw Wikipedia dump to your own instruction-tuned chat model: data cleaning, a from-scratch PyTorch pretrain, W&B monitoring, synthetic SFT data on GitHub, DPO, and serving, all on rented vast.ai GPUs for ~$15–30.
11-lesson mini-course
Like every course on this site, each lesson is deep and code-first: an intuition-first explainer, complete runnable scripts with the methodology explained line by line, visuals, and a 🧪 "Your task" exercise with a hidden solution. By the final lesson you will have pretrained, instruction-tuned, and shipped WikiGPT-124M, your own model, for roughly the price of a pizza. Theory lives in the AI & ML Encyclopedia; here you build.
Syllabus
What you will walk away with
- A cleaned ~4B-token Wikipedia corpus and a streaming cleaning/dedup pipeline you wrote yourself
- A custom 32k BPE tokenizer and packed training shards
- WikiGPT-124M, a modern (RMSNorm · RoPE · SwiGLU) decoder you pretrained from scratch, monitored in Weights & Biases
- A synthetic instruction dataset you generated and published to GitHub, an SFT model, and a DPO-aligned final instruction model, served over an API and shared on the Hugging Face Hub
- The complete vast.ai playbook: renting, tmux discipline, resuming after preemption, and an end-to-end bill of ~$15–30
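
To get a feel for why a model of this size fits a small budget, here is a rough back-of-envelope sketch. It is not the course's own accounting. It uses the widely quoted 6 × parameters × tokens rule of thumb for training compute and assumes a single pass over the ~4B-token corpus. The function names, the sustained throughput, and the hourly price are hypothetical placeholders rather than vast.ai quotes, so substitute the figures for the GPU you actually rent.

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via the 6 * N * D rule of thumb."""
    return 6.0 * n_params * n_tokens


def estimated_cost_usd(flops: float,
                       sustained_flops_per_sec: float,
                       usd_per_hour: float) -> float:
    """Convert a compute budget into dollars for a given GPU rental."""
    hours = flops / sustained_flops_per_sec / 3600.0
    return hours * usd_per_hour


N_PARAMS = 124e6  # taken from the model name, WikiGPT-124M
N_TOKENS = 4e9    # the ~4B-token corpus; one pass is an assumption

flops = training_flops(N_PARAMS, N_TOKENS)
tokens_per_param = N_TOKENS / N_PARAMS

# Hypothetical rental: 100 TFLOP/s sustained at $0.50/hour.
# Both numbers are placeholders, not measured or quoted values.
pretrain_cost = estimated_cost_usd(flops, 1e14, 0.50)

print(f"training compute : {flops:.3e} FLOPs")
print(f"tokens per param : {tokens_per_param:.1f}")
print(f"pretrain estimate: ${pretrain_cost:.2f}")
```

The estimate covers only the pretraining pass. Data cleaning, tokenization, SFT, DPO, idle time, and restarts after preemption all add to the final bill.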