Data Science Roadmap — From print(‘hello’) to Production LLMs

Data Science
Machine Learning
Education
Python
LLM
A 31-module open-source data-science course you can finish in a weekend or stretch over a month.
Author

Kader Mohideen

Published

May 7, 2026

From print('hello') to Production LLMs



Why I built this

Most “learn data science” courses do one of two things badly:

  • They lock the good stuff behind a subscription and split it across three separate courses that never talk to each other.
  • Or they jump from print("hello world") straight to a Kaggle notebook, with no in-between.

This repo is the in-between: 31 deep-dive notebooks, each runnable in Google Colab in one click and each paired with a colour-coded explanation document. It runs from your first variable to reading the inference code of a 671-billion-parameter LLM.

Who it’s for

  • Beginners who want a structured path that doesn’t skip steps.
  • Self-taught coders who can write Python but want to fill the gaps in Pandas, NumPy, scikit-learn — and beyond.
  • Career-switchers building a portfolio. The capstone notebook (Module 16) and the production modules (Modules 29–31) are portfolio-ready as-is.

What’s inside — six parts

Part 1 · Python for Data Science (Modules 1–5)

Variables, data structures, OOP, file I/O, NumPy, Pandas, APIs, web scraping. The alphabet of every later module.

Part 2 · Data Visualization (Modules 6–10)

Matplotlib’s object-oriented API; the seven core chart types; specialised tools (waffle, word cloud, Folium maps); animation and Plotly; building dashboards that tell one cohesive story.

Part 3 · Data Analysis & ML Foundations (Modules 11–16)

The universal workflow: import → wrangle → explore → model → evaluate → communicate. Built around a shared dataset (auto-mpg) so each step builds on the last, then validated end-to-end on California Housing.

Part 4 · Machine Learning & AI (Modules 17–22)

PyTorch fundamentals; the six core model archetypes (Linear, Logistic, K-Means, MLP, CNN, Transformer LM); self-attention from scratch with d_model = 2 so every matrix is hand-checkable; multi-head + causal attention; diffusion models on a 2D toy; time-series forecasting with ARIMA, Prophet, and LSTM.
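
To show why d_model = 2 keeps every matrix hand-checkable, here is a rough sketch of scaled dot-product self-attention in plain Python lists. It is an illustration under stated assumptions, not the module's notebook code: the two-token input `X`, the identity projection matrices, and all function names are hypothetical choices made for this example.

```python
import math


def matmul(a, b):
    # Plain nested-list matrix product: (n x k) times (k x m) gives (n x m).
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    # Subtract the row maximum for numerical stability.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v, causal=False):
    # Project the tokens into queries, keys, and values.
    q = matmul(x, w_q)
    k = matmul(x, w_k)
    v = matmul(x, w_v)
    d_k = len(q[0])

    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, transpose(k))]

    if causal:
        # Token i may only attend to positions j <= i.
        scores = [
            [s if j <= i else -math.inf for j, s in enumerate(row)]
            for i, row in enumerate(scores)
        ]

    weights = [softmax(row) for row in scores]  # each row sums to 1
    return weights, matmul(weights, v)


# Hypothetical example: two tokens, d_model = 2, identity projections.
X = [[1.0, 0.0], [0.0, 1.0]]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]

if __name__ == "__main__":
    weights, out = self_attention(X, IDENTITY, IDENTITY, IDENTITY)
    print("attention weights:", weights)
    print("output:", out)
    causal_weights, _ = self_attention(X, IDENTITY, IDENTITY, IDENTITY, causal=True)
    print("causal weights:", causal_weights)
```

With identity projections the score matrix is just the input times its own transpose, so each step can be verified on paper. In the causal case the first token can only see itself, so its weight row collapses to all weight on position 0.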

Part 5 · AI-Research Foundations (Modules 23–25)

The math under every neural network (functions, derivatives, gradients, matrices, probability) plus a deep PyTorch primer. A guided tour of DeepSeek-V3’s actual inference code (RMSNorm, RoPE, Multi-head Latent Attention, Mixture-of-Experts). Fine-tuning examples: full fine-tuning, LoRA, QLoRA, and SFT with TRL.
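
As a taste of the building blocks named above, RMSNorm is small enough to write out in full. The sketch below follows the textbook formula (divide each element by the root mean square of the vector, then multiply by a learned gain); it is not copied from DeepSeek-V3's repository, and the function name, the `eps` default, and the example vector are assumptions for illustration.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    # Root mean square of the vector; eps guards against division by zero.
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    # Rescale every element, then apply the per-dimension gain.
    return [v * scale * g for v, g in zip(x, gain)]


if __name__ == "__main__":
    # Hypothetical input: mean square is (9 + 16) / 2 = 12.5.
    print(rms_norm([3.0, 4.0], gain=[1.0, 1.0]))
```

Unlike LayerNorm, this variant does not subtract the mean, so the direction of the vector is preserved and only its scale changes.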

Part 6 · Practitioner Skills (Modules 26–31)

The day-to-day skills a working data scientist or ML engineer uses but most courses skip:

  • SQL — JOINs, CTEs, window functions, the SQL ↔︎ Pandas bridge
  • Tree-based models — Random Forest, XGBoost, LightGBM, SHAP for interpretation
  • A/B testing — proportion z-test, sample-size calc, Bonferroni / BH correction, the peeking trap
  • MLOps — FastAPI, Docker, MLflow, drift monitoring with KS + PSI
  • RAG & vector search — embeddings, Chroma, hybrid BM25 + vector, reranker, grounded answers
  • Prompt engineering & LLM eval — few-shot, chain-of-thought, ReAct, structured outputs, LLM-as-judge
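
To make the A/B-testing item concrete, here is a minimal sketch of a two-sided two-proportion z-test with a Bonferroni-adjusted threshold, using only the standard library. It assumes the usual pooled-variance form of the test; the function names and the conversion counts in the usage example are hypothetical and not taken from the course.

```python
import math
from statistics import NormalDist


def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    # Observed conversion rates in each arm.
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms share one rate.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def bonferroni_alpha(alpha, n_tests):
    # Per-test significance threshold when running n_tests comparisons.
    return alpha / n_tests


if __name__ == "__main__":
    # Made-up counts: 100/1000 conversions in A versus 150/1000 in B.
    z, p = two_proportion_ztest(100, 1000, 150, 1000)
    threshold = bonferroni_alpha(0.05, n_tests=5)
    print(f"z = {z:.2f}, p = {p:.4f}, significant at {threshold}: {p < threshold}")
```

The test assumes a sample size fixed in advance. Re-running it every time new data arrives and stopping at the first significant result is the peeking trap mentioned above, because it inflates the false-positive rate.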

What makes it different

| | This course | Typical course |
|---|---|---|
| Production architecture depth | DeepSeek-V3 dissection | “Transformers exist” |
| Math integrated with code | Yes (Module 23) | usually skipped |
| Practical skills (SQL, A/B, MLOps) | Modules 26-29 | rarely covered |
| Companion docs | Line-by-line, colour-coded, ~30 pages each | None |
| Cost | Free, MIT-licensed | $40-300/month |

How to use it

Option A — Colab. Click any badge in the README, hit Save a copy in Drive, run the cells. Zero install.

Option B — Locally.

git clone https://github.com/kader-xai/data-science-roadmap.git
cd data-science-roadmap
pip install jupyter numpy pandas scikit-learn torch transformers
jupyter notebook
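
After the install, a quick sanity check can confirm that the packages from the pip line are importable before you open a notebook. This is a small sketch, not part of the repo: the function name is made up, and the mapping from pip names to import names is an assumption (notably, scikit-learn imports as `sklearn`, and `jupyter_core` is used here as a stand-in for the `jupyter` metapackage).

```python
import importlib.util

# pip name -> import name (assumed mapping; scikit-learn imports as "sklearn").
PACKAGES = {
    "jupyter": "jupyter_core",
    "numpy": "numpy",
    "pandas": "pandas",
    "scikit-learn": "sklearn",
    "torch": "torch",
    "transformers": "transformers",
}


def check_packages(packages=PACKAGES):
    # find_spec returns None when a top-level module cannot be located,
    # so nothing is actually imported during the check.
    return {
        pip_name: importlib.util.find_spec(module_name) is not None
        for pip_name, module_name in packages.items()
    }


if __name__ == "__main__":
    for pip_name, found in check_packages().items():
        print(f"{pip_name:15s} {'ok' if found else 'MISSING'}")
```

Individual notebooks may need packages beyond this list; install those as the import errors come up.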

What you walk away with

After the 31 modules you can:

  • Write everyday Python programs and load data from files, SQL databases, APIs, and the web.
  • Build classical ML models (regression, gradient boosting) AND modern AI models (transformers, diffusion).
  • Read production LLM source code (DeepSeek, Llama, Mistral, Qwen).
  • Fine-tune any open-weight model on your own data with LoRA.
  • Ship a model behind FastAPI + Docker with MLflow tracking.
  • Build a working RAG pipeline with vector search.
  • A/B-test prompts and evaluate LLMs scientifically.

That’s effectively a 2026 ML-engineer career, built from print('hello').


Repo: github.com/kader-xai/data-science-roadmap
Live site: kader-xai.github.io/data-science-roadmap
License: MIT


Module index

For anyone scanning for a specific topic, here is the module list: Parts 1–3 get a one-liner per module, and Parts 4–6 are summarised by part.

Part 1 · Python for Data Science

| # | Module | Topic |
|---|---|---|
| 01 | Python Basics | variables, types, strings, format strings, debugging |
| 02 | Data Structures | lists, tuples, dicts, sets, comprehensions |
| 03 | Programming Fundamentals | conditionals, loops, functions, exceptions, OOP |
| 04 | Working with Data | files, CSV/JSON, NumPy arrays, Pandas DataFrames |
| 05 | APIs & Web Scraping | requests, BeautifulSoup, pd.read_html, yfinance |

Part 2 · Data Visualization

| # | Module | Topic |
|---|---|---|
| 06 | Intro to Visualization | Matplotlib OO API, line plots, styling |
| 07 | Basic Charts | bar, hist, pie, box, scatter, bubble, area |
| 08 | Specialized Viz | waffle, word cloud, regression plot, Folium |
| 09 | Advanced Viz | subplots, time-series patterns, animation, Plotly |
| 10 | Dashboards & Storytelling | composing charts to answer one question |

Part 3 · Data Analysis & ML Foundations

| # | Module | Topic |
|---|---|---|
| 11 | Importing Data | CSV, Excel, JSON, SQL, web; the 5-line inspection ritual |
| 12 | Data Wrangling | missing values, scaling, binning, encoding, outliers |
| 13 | Exploratory Data Analysis | distributions, correlations, group-bys, pivot tables |
| 14 | Model Development | linear / multiple / polynomial regression with Pipelines |
| 15 | Model Evaluation | MSE/RMSE/MAE/R², CV, Ridge & Lasso, GridSearch |
| 16 | Capstone | California Housing end-to-end with Random Forest |

Part 4 · Machine Learning & AI (deeper dive)

PyTorch fundamentals; the six core archetypes (Linear, Logistic, K-Means, MLP, CNN, Transformer LM); self-attention from scratch; multi-head + causal attention; diffusion models on a 2D toy; time-series with ARIMA, Prophet, and LSTM.

Part 5 · AI-Research Foundations

Math foundations integrated with code, a deep PyTorch primer, a guided dissection of DeepSeek-V3’s actual inference code (RMSNorm, RoPE, Multi-head Latent Attention, Mixture-of-Experts), and worked fine-tuning examples: full fine-tuning, LoRA, QLoRA, and SFT with TRL.

Part 6 · Practitioner Skills

The day-to-day skills most courses skip — SQL · tree-based models with SHAP · A/B testing · MLOps with FastAPI/Docker/MLflow · RAG with vector search · prompt engineering and LLM eval.

Try it

The fastest path in is:

  1. Open Module 1 in Colab
  2. Click File → Save a copy in Drive
  3. Run cells with Shift+Enter

If you only want the retrieval part of the AI track without training a model, jump to Modules 30–31 — the RAG and prompt-engineering notebooks stand on their own.