A self-paced journey from "software engineer who uses NumPy occasionally" to writing your own production-grade inference engine — the math, the architecture, the silicon, all of it.
No magic. No vague "exposure." A concrete arc with a concrete destination — and a complete inference engine you wrote yourself, sitting in a git repo at the end of it.
A software engineer comfortable with Python and Git. NumPy rings a bell. "Backprop" sounds vaguely familiar. You've used ChatGPT but never wondered how it actually generates a token.
You can read source code from vLLM, llama.cpp, MLX-LM, TensorRT-LLM and follow every layer — math, attention, kernels, drivers. You've written your own inference engine. You know exactly why it's slower than vLLM — and what it would take to close the gap.
Each week ends with something you've built. By the end, the artifacts compose into a complete LLM inference stack — yours.
Math fluency drills, tensors and autograd, neural networks from scratch with manual backprop, BPE tokenization, and the attention mechanism — building toward the full Transformer block.
How LLMs are pre-trained at scale. Distributed training (DP, FSDP, TP, ZeRO). Modern variations — LLaMA, Mistral, MoE. Fine-tuning with LoRA/QLoRA. Alignment via RLHF and DPO.
Prefill vs decode. GPU architecture, the roofline model, arithmetic intensity. CUDA programming. Apple Silicon and MLX. KV cache deep dive. FlashAttention v1, v2, v3.
Quantization. Speculative decoding. Continuous batching with PagedAttention. DFlash internals. Production engine comparisons. Then four days building it all into your own engine.
The original Transformer paper ("Attention Is All You Need", Vaswani et al. 2017) was written for machine translation, not language modeling. Its title is a nod to The Beatles' "All You Need Is Love." It has since been cited over 130,000 times — making it one of the most-cited scientific papers of the 21st century.
You'll work at every scale — building tiny models you fully understand, then growing into the architectural patterns that power frontier systems.
Approximate parameter counts. Bars use a square-root scale for visual clarity — true ratios are even more dramatic.
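To see how much a square-root scale compresses the ratios, here is a small sketch. The two parameter counts (GPT-2 small at roughly 124M and GPT-3 at 175B) and the helper name `bar_length` are illustrative assumptions, not the chart's actual data or code.

```python
import math

# Illustrative parameter counts (assumed for this sketch, not the chart's data).
PARAMS = {"GPT-2 small": 124e6, "GPT-3": 175e9}


def bar_length(params: float, max_params: float, max_width: float = 100.0) -> float:
    """Bar width on a square-root scale; the largest model gets max_width."""
    return max_width * math.sqrt(params / max_params)


largest = max(PARAMS.values())
for name, count in PARAMS.items():
    print(f"{name:12s} bar width {bar_length(count, largest):6.2f}")

# Compare the real size ratio with the ratio of the drawn bars.
true_ratio = PARAMS["GPT-3"] / PARAMS["GPT-2 small"]
drawn_ratio = bar_length(PARAMS["GPT-3"], largest) / bar_length(PARAMS["GPT-2 small"], largest)
print(f"true ratio ~{true_ratio:.0f}x, drawn ratio ~{drawn_ratio:.1f}x")
```

The drawn ratio is the square root of the true ratio, so a gap of three orders of magnitude shows up as a bar only a few dozen times longer.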
Every code lesson covers both NVIDIA and Apple Silicon. Cloud is for the heavier days. Pick what's in front of you and start.
CUDA · cuBLAS · cuDNN · NCCL
MLX · Metal · MPS · Unified Memory
Lambda · RunPod · Vast — approx rates, early 2026
An H100 GPU (Hopper, 2022) has 80GB of HBM3 and delivers roughly 989 trillion dense FP16 operations per second (~989 TFLOPS); it launched at around $25–40k. It was NVIDIA's flagship until Blackwell (B200, ~2.25 PFLOPS dense BF16, 192GB HBM3e) took over. Either way, a single H100 already has more arithmetic throughput than the supercomputer that trained the original BERT in 2018.
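Peak FLOPS only matters if the memory system can feed it. The sketch below applies the roofline model to the H100 figure above. The HBM3 bandwidth of about 3.35 TB/s is an assumption taken from the SXM spec sheet rather than from this page, and the function names are hypothetical.

```python
# Peak FLOPS comes from the text above; the memory bandwidth (~3.35 TB/s,
# H100 SXM) is an assumption added for this sketch.
PEAK_FLOPS = 989e12       # dense FP16 FLOP/s
MEM_BANDWIDTH = 3.35e12   # bytes/s


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) where the compute and memory limits meet."""
    return peak_flops / bandwidth


def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth)


# Batch-1 decode is dominated by matrix-vector products: each FP16 weight
# (2 bytes) is read once and used in one multiply-add (2 FLOPs).
decode_intensity = 2 / 2  # about 1 FLOP per byte

ridge = ridge_point(PEAK_FLOPS, MEM_BANDWIDTH)
decode_ceiling = attainable_flops(decode_intensity, PEAK_FLOPS, MEM_BANDWIDTH)

print(f"ridge point: ~{ridge:.0f} FLOP/byte")
print(f"batch-1 decode ceiling: {decode_ceiling / PEAK_FLOPS:.2%} of peak")
```

Under these assumptions, a kernel needs a few hundred FLOPs per byte moved before the GPU becomes compute-bound, which is why single-stream decode sits far below peak and why batching and KV-cache layout matter so much later in the curriculum.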
Copy these blocks. Verify the smoke tests pass. Then move on to Day 1.
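# --- NVIDIA (Linux + CUDA) ---
# Check NVIDIA driver and CUDA version
nvidia-smi
# Install uv, then create and activate a Python 3.11 virtual environment
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv .venv --python 3.11
source .venv/bin/activate
# Install PyTorch wheels built against CUDA 12.8
uv pip install torch torchvision \
  --index-url https://download.pytorch.org/whl/cu128
# Smoke test: should print True followed by your GPU's name
python -c "import torch; \
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# Libraries used throughout the lessons
uv pip install numpy transformers datasets tokenizers \
  tiktoken sentencepiece einops matplotlib jupyter

# --- Apple Silicon (macOS) ---
# Install Python 3.11 and uv, then create and activate a virtual environment
brew install python@3.11
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv .venv --python 3.11
source .venv/bin/activate
# Install PyTorch (MPS backend) and MLX
uv pip install torch torchvision
uv pip install mlx mlx-lm
# Smoke tests: MPS should report True; MLX should report a GPU device
python -c "import torch; print('MPS:', torch.backends.mps.is_available())"
python -c "import mlx.core as mx; print('MLX:', mx.default_device())"
# Libraries used throughout the lessons
uv pip install numpy transformers datasets tokenizers \
  tiktoken sentencepiece einops matplotlib jupyter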
All three families exist; only one dominates modern LLMs. Knowing the shape of the others helps you understand why.
Every token attends to every other token, in both directions. Trained with masked language modeling — predict random masked-out tokens.
Each token only attends to previous tokens. Trained on next-token prediction. Generates left-to-right. Simplest, most flexible — and the architecture behind every modern frontier LLM.
One stack reads input bidirectionally; another generates output causally with cross-attention to the encoder. Strong for translation and summarization, less common in modern LLMs.
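The difference between the three families comes down to which positions each token is allowed to see. The sketch below builds the two self-attention masks in plain Python; the function name `attention_mask` is hypothetical, and real implementations apply the mask to attention scores rather than printing it.

```python
def attention_mask(seq_len: int, causal: bool) -> list[list[int]]:
    """Return a mask where mask[i][j] == 1 means query i may attend to key j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]


# Encoder-only: every token sees every other token, in both directions.
encoder_mask = attention_mask(4, causal=False)

# Decoder-only: each token sees itself and earlier positions only.
decoder_mask = attention_mask(4, causal=True)

print("encoder (bidirectional):")
for row in encoder_mask:
    print(row)

print("decoder (causal):")
for row in decoder_mask:
    print(row)
```

An encoder-decoder model uses both: the encoder stack applies the full mask, the decoder's self-attention applies the causal mask, and cross-attention lets every decoder position see all encoder outputs.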
BPE (Byte-Pair Encoding), the tokenization algorithm used by GPT-2, GPT-3, GPT-4, LLaMA, Mistral, and most modern LLMs, was invented in 1994 by Philip Gage for data compression. It sat in obscurity for two decades before NLP rediscovered it in 2015.
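The core of BPE is a single repeated step: find the most frequent adjacent pair and replace it with a new symbol. Here is a minimal sketch of one such merge on a toy string. The function names are hypothetical, and production tokenizers add byte-level handling, pre-tokenization, and a learned merge table on top of this.

```python
from collections import Counter


def most_frequent_pair(tokens: list[str]) -> tuple[str, str]:
    """Count every adjacent pair and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(tokens: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace each left-to-right occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2  # skip both halves of the merged pair
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = list("aaabdaaabac")          # start from single characters
pair = most_frequent_pair(tokens)     # the pair to merge in this step
tokens_after = merge_pair(tokens, pair)

print("merged pair:", pair)
print("before:", tokens)
print("after: ", tokens_after)
```

Training a tokenizer repeats this step until the vocabulary reaches a target size, recording each merge in order so the same merges can be replayed on new text.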
Each lesson is self-contained: concept, math, code, exercise, self-check, references. Lessons marked ● are built and ready to read.
Build the minimum toolkit: notation, tensors, autograd, neural networks, tokenization, attention, and the decoder-only block that every later system optimizes.
Move from components to models: pre-training objectives, a tiny GPT, production training loops, distributed training, modern LLM architecture choices, and post-training.
Turn the trained model into a serving workload: prefill/decode, GPU rooflines, CUDA, Apple Silicon, KV-cache memory, and FlashAttention's IO-aware execution.
Finish as an inference engineer: quantization, speculative decoding, continuous batching, production engine tradeoffs, and a four-part capstone engine with benchmarking.
"The most effective form of learning is to re-derive things from scratch. The papers, the videos, the textbooks — they're scaffolding. Understanding only crystallizes when you've made it work yourself."
A curated set of free courses, key books, canonical papers, and reference repos. Lessons throughout the curriculum point back here.
The single best resource for software engineers transitioning to ML. We follow it closely through Weeks 1-2.
Live-coding GPT-2 from scratch on a single GPU. Direct relevance to the Day 27-30 capstone.
BPE tokenization, byte-level subword merges, glitch tokens. Companion to Day 5.
Percy Liang's course. The most rigorous open LLM curriculum, with lectures and assignments.
Rotating guest lectures from leading researchers. Frontier topics, recent talks.
The most beautiful neural network animations on the internet. Watch for visual intuition.
Sebastian Raschka. The closest book to this curriculum. Builds a working LLM end-to-end in PyTorch.
Goodfellow, Bengio, Courville. The reference textbook. Comprehensive coverage of foundations.
Michael Nielsen. The clearest introduction to backpropagation ever written. Read Chapters 1-2 alongside Day 3-4.
Jurafsky & Martin (3rd ed). The NLP reference textbook. Useful context for tokenization and language modeling.
Deisenroth, Faisal, Ong. Concise math primer if Day 1's pace is too fast. Chapters 2, 5, 6 are most relevant.
Hwu, Kirk, El Hajj. The CUDA bible. We reference it from Day 16 onward for systems-level GPU work.
Vaswani et al. Introduces the Transformer. Read Section 3 carefully. Day 6-7 companion.
Brown et al. The 175B model that triggered the modern LLM era. In-context learning emerges with scale.
Touvron et al. The open-weights model that catalyzed the open-source LLM ecosystem. Modern arch (RMSNorm, RoPE, SwiGLU).
Tri Dao et al. IO-aware attention. Day 21 deep dive. (V2: 2307.08691, V3: 2407.08608)
Kwon et al. Efficient memory management for LLM serving. The paper behind PagedAttention and vLLM.
Hoffmann et al. The corrected scaling laws. Most pre-Chinchilla models were dramatically under-trained.
Hu et al. Parameter-efficient fine-tuning that became the default. Day 13 companion.
Rafailov et al. RLHF without RL. Cleaner, more stable, increasingly the default for alignment.
Leviathan et al. Use a small "draft" model to propose tokens; verify with the big model. Day 23 companion.
Autograd from scratch. The whole concept of automatic differentiation in 150 lines you can read in one sitting.
GPT training, minimal. Trains a character-level Shakespeare model in ~3 minutes on a single A100. The cleanest LLM training code in existence.
GPT-2 in raw C with hand-written CUDA kernels. Outperforms PyTorch on the same hardware. Required reading before the capstone.
LLaMA inference in pure C, no dependencies. Single file. Builds the mental model of inference at the most stripped-down level.
Production-quality CPU/Metal/CUDA inference for the GGUF format. Started by Georgi Gerganov. Powers most local-first LLM apps.
The most popular open inference server. Continuous batching, PagedAttention, prefix caching. Day 24 + capstone benchmark target.
Apple's array framework, designed from scratch for unified memory architecture. Companion examples repo at mlx-examples.
The production reference implementation of every major model. Large but searchable; the place to look up "what does X actually do."
The repo we'll dissect on Day 25. We'll explain block-diffusion speculative decoding and where it fits relative to draft/verify decoding and the capstone scheduler.
The most beautiful visual walkthrough of the Transformer ever made. Read it before Day 6.
The 2017 paper presented side-by-side with working PyTorch code. Foundational walkthrough.
Lilian Weng, formerly of OpenAI, writes survey-style deep dives on every major LLM topic. Densely useful.
Monthly deep dives on LLM training, fine-tuning, and architecture. Among the most rigorous popular ML writing.
Chris Ré's group at Stanford. Source of FlashAttention, Mamba, Hyena, and many systems-for-ML innovations.
The canonical source for CUDA, TensorRT-LLM, and GPU performance writeups. Filter for "LLM" tag.
A few things worth knowing as you set out.
ML systems stack abstractions unusually deep — math, tensors, autograd, frameworks, kernels, drivers, silicon. You'll be confused often. The goal is to get productively confused.