Day 0 — A 30-Day Curriculum to Become an LLM Inference Engineer

The Roadmap

Four weeks. Four artifacts. One inference engine.

Each week ends with something you've built. By the end, the artifacts compose into a complete LLM inference stack — yours.

Week 1

Days 1 — 7

Foundations

Math fluency drills, tensors and autograd, neural networks from scratch with manual backprop, BPE tokenization, and the attention mechanism — building toward the full Transformer block.

You build: A complete decoder-only LLM forward pass

Week 2

Days 8 — 14

Training & Architecture

How LLMs are pre-trained at scale. Distributed training (DP, FSDP, TP, ZeRO). Modern variations — LLaMA, Mistral, MoE. Fine-tuning with LoRA/QLoRA. Alignment via RLHF and DPO.

You build: A 10M-parameter GPT, trained end-to-end

Week 3

Days 15 — 21

Inference & Hardware

Prefill vs decode. GPU architecture, the roofline model, arithmetic intensity. CUDA programming. Apple Silicon and MLX. KV cache deep dive. FlashAttention v1, v2, v3.

You build: Hand-written CUDA + Metal attention kernels

Week 4

Days 22 — 30

Optimization & Capstone

Quantization. Speculative decoding. Continuous batching with PagedAttention. DFlash internals. Production engine comparisons. Then four days building it all into your own engine.

You build: A complete inference engine + benchmarks

The original Transformer paper ("Attention Is All You Need", Vaswani et al. 2017) was written for machine translation, not language modeling. Its title is a nod to The Beatles' "All You Need Is Love." It has since been cited over 130,000 times — making it one of the most-cited scientific papers of the 21st century.

Pick Your Path

Three hardware paths. Same destination.

Every code lesson covers both NVIDIA and Apple Silicon. Cloud is for the heavier days. Pick what's in front of you and start.

Most flexible

⚡

NVIDIA GPU

CUDA · cuBLAS · cuDNN · NCCL

Practical minimum: RTX 3060 / 8GB
Comfortable: RTX 4080 / 16GB
Workstation: RTX 4090 / 24GB
Pro: A6000 / 48GB
Frontier: Blackwell B200 / 192GB

Best for laptops

🍎

Apple Silicon

MLX · Metal · MPS · Unified Memory

Practical minimum: M1 / 16GB
Comfortable: M2 Pro / 32GB
Workstation: M3 Max / 64GB
Pro: M3 Ultra / 128GB
Frontier: M3 Ultra / 192GB

Rent as needed

☁️

Cloud GPUs

Lambda · RunPod · Vast — approx rates, early 2026

A10 (24GB): $0.50–0.80/hr
A6000 (48GB): $0.80–1.10/hr
A100 (40GB): $1.10–1.80/hr
A100 (80GB): $1.80–2.50/hr
H100 (80GB): $2.50–4.50/hr
B200 (192GB): $3.50–6.00/hr
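To budget the heavier days, the rate table above can be turned into a quick estimate. This is a minimal sketch: the rates are copied from the table, while the function name `rental_cost` and the 10-hour example are illustrative assumptions, not part of the course.

```python
# Approximate hourly rates in USD as (low, high), copied from the table above.
RATES = {
    "A10 24GB": (0.50, 0.80),
    "A6000 48GB": (0.80, 1.10),
    "A100 40GB": (1.10, 1.80),
    "A100 80GB": (1.80, 2.50),
    "H100 80GB": (2.50, 4.50),
    "B200 192GB": (3.50, 6.00),
}


def rental_cost(gpu: str, hours: float, num_gpus: int = 1) -> tuple[float, float]:
    """Return the (low, high) cost estimate in USD for renting `num_gpus` GPUs."""
    low, high = RATES[gpu]
    return (round(low * hours * num_gpus, 2), round(high * hours * num_gpus, 2))


if __name__ == "__main__":
    hours = 10  # hypothetical amount of rented time
    for gpu in RATES:
        low, high = rental_cost(gpu, hours)
        print(f"{gpu:<11} {hours} h: ${low:.2f} to ${high:.2f}")
```

Actual bills also depend on storage, egress, and minimum billing increments, none of which are modeled here.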

An H100 GPU (Hopper, 2022) has 80GB of HBM3 and delivers ~989 trillion dense FP16 operations per second; it launched at around $25–40k. It was NVIDIA's flagship until Blackwell (B200, ~2.25 PFLOPS dense BF16, 192GB HBM3e) took over. Either way, a single H100's peak arithmetic throughput is in the same range as the Cloud TPU pod slices that trained the original BERT in 2018.
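The H100 throughput figure above becomes more useful once it is paired with memory bandwidth, which is the idea behind the roofline model covered in Week 3. The sketch below is a rough illustration: the 989 TFLOPS value comes from the text, but the 3.35 TB/s bandwidth is an assumption taken from the SXM spec sheet, and the function names are hypothetical.

```python
H100_PEAK_FLOPS = 989e12  # dense FP16 FLOP/s, from the text above
H100_MEM_BW = 3.35e12     # bytes/s; ASSUMPTION (H100 SXM spec), not stated in the text


def ridge_point(peak_flops: float, mem_bw: float) -> float:
    """Arithmetic intensity (FLOPs per byte) where compute and memory limits meet."""
    return peak_flops / mem_bw


def attainable_flops(intensity: float, peak_flops: float, mem_bw: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, intensity * mem_bw)


def matvec_intensity(bytes_per_weight: int = 2) -> float:
    """Batch-1 matrix-vector product: each weight is read once and used
    for one multiply and one add, so 2 FLOPs per weight."""
    return 2 / bytes_per_weight


if __name__ == "__main__":
    ridge = ridge_point(H100_PEAK_FLOPS, H100_MEM_BW)
    decode = attainable_flops(matvec_intensity(), H100_PEAK_FLOPS, H100_MEM_BW)
    print(f"ridge point: {ridge:.0f} FLOPs/byte")
    print(f"batch-1 FP16 decode bound: {decode / 1e12:.2f} TFLOP/s")
```

Under these assumptions, batch-1 decode sits far to the left of the ridge point, which is why the later lessons treat decode as bandwidth-bound rather than compute-bound.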

Setup, both paths.

Copy these blocks. Verify the smoke tests pass. Then move on to Day 1.

NVIDIA GPU

1 · Verify driver

# Check NVIDIA driver and CUDA version
nvidia-smi

2 · Install uv (or use conda)

curl -LsSf https://astral.sh/uv/install.sh | sh

3 · Create environment + install PyTorch with CUDA

uv venv .venv --python 3.11
source .venv/bin/activate
uv pip install torch torchvision \
    --index-url https://download.pytorch.org/whl/cu128

4 · Smoke test

python -c "import torch; \
    print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
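The one-liner above typically fails with an exception, rather than printing False, when no CUDA device is visible. A slightly more forgiving check might look like the sketch below; `pick_device` and `main` are hypothetical helper names, and the fallback order (CUDA, then MPS, then CPU) is an assumption rather than something the course prescribes.

```python
def pick_device(cuda_ok: bool, mps_ok: bool) -> str:
    """Choose a PyTorch device string, preferring CUDA, then MPS, then CPU."""
    if cuda_ok:
        return "cuda"
    if mps_ok:
        return "mps"
    return "cpu"


def main() -> None:
    try:
        import torch  # imported lazily so the helper above works without PyTorch
    except ImportError:
        print("PyTorch is not installed in this environment")
        return
    device = pick_device(torch.cuda.is_available(),
                         torch.backends.mps.is_available())
    # Run a small matmul on the chosen device to confirm kernels actually execute.
    a = torch.randn(256, 256, device=device)
    b = torch.randn(256, 256, device=device)
    c = a @ b
    print(device, tuple(c.shape), torch.isfinite(c).all().item())


if __name__ == "__main__":
    main()
```

Saved as a script, this can be run on any of the three hardware paths; seeing "cpu" printed on a GPU machine suggests the wrong PyTorch wheel was installed.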

5 · Course essentials

uv pip install numpy transformers datasets tokenizers \
    tiktoken sentencepiece einops matplotlib jupyter

PyTorch installation guide · pytorch.org

CUDA Toolkit downloads · nvidia.com

uv documentation · astral.sh

nvtop (GPU htop) · github.com

Apple Silicon

1 · Homebrew + Python

brew install python@3.11
curl -LsSf https://astral.sh/uv/install.sh | sh

2 · Create environment

uv venv .venv --python 3.11
source .venv/bin/activate

3 · Install PyTorch (with MPS) + MLX

uv pip install torch torchvision
uv pip install mlx mlx-lm

4 · Smoke tests

python -c "import torch; print('MPS:', torch.backends.mps.is_available())"
python -c "import mlx.core as mx; print('MLX:', mx.default_device())"

5 · Course essentials

uv pip install numpy transformers datasets tokenizers \
    tiktoken sentencepiece einops matplotlib jupyter
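After installing the essentials, one way to confirm they are importable from the active environment is a short check like the following. It is a sketch using only the standard library; `missing_modules` is a hypothetical helper, and `jupyter` is left out on purpose because it is mainly a command-line entry point (check it with `jupyter --version` instead).

```python
from importlib.util import find_spec

# Import names for the packages installed in the step above.
ESSENTIALS = [
    "numpy", "transformers", "datasets", "tokenizers",
    "tiktoken", "sentencepiece", "einops", "matplotlib",
]


def missing_modules(names):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(ESSENTIALS)
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("All course essentials are importable")
```

Because `find_spec` only locates modules and does not import them, the check is fast and has no side effects.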

MLX Documentation · apple

MLX Examples Repo · github.com

PyTorch MPS Notes · pytorch.org

Metal Programming Guide · apple

A Family Tree

Three Transformer family trees. We follow one.

All three families exist; only one dominates modern LLMs. Knowing the shape of the others helps you understand why.

Encoder-only

Bidirectional

Every token attends to every other token, in both directions. Trained with masked language modeling — predict random masked-out tokens.

Examples: BERT · RoBERTa · DeBERTa · ModernBERT

Decoder-only

Causal · Our path

Each token attends only to itself and the tokens before it. Trained on next-token prediction. Generates left-to-right. Simplest, most flexible, and the architecture behind every modern frontier LLM.

Examples: GPT-2/3/4 · LLaMA · Mistral · Qwen · Gemma · DeepSeek

Encoder-Decoder

Two-stack

One stack reads input bidirectionally; another generates output causally with cross-attention to the encoder. Strong for translation and summarization, less common in modern LLMs.

Examples: T5 · BART · Original Transformer · FLAN-T5

Why decoder-only won: it has a simpler architecture (one stack), trains on raw text without an input/output split, generates and "understands" with the same machinery via prompting, and scales remarkably well empirically. The whole curriculum focuses here.
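The difference between bidirectional and causal attention comes down to which positions each token is allowed to see. The sketch below builds both masks with plain Python lists and shows how next-token prediction pairs each input with its target; the function names are illustrative, and real implementations would use tensors instead of nested lists.

```python
def bidirectional_mask(n):
    """Encoder-style mask: every position may attend to every position."""
    return [[1] * n for _ in range(n)]


def causal_mask(n):
    """Decoder-style mask: position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def next_token_pairs(tokens):
    """Training pairs for next-token prediction: (input token, target token)."""
    return list(zip(tokens[:-1], tokens[1:]))


if __name__ == "__main__":
    for row in causal_mask(4):
        print(row)
    print(next_token_pairs(["the", "cat", "sat", "down"]))
```

The causal mask is lower-triangular, so a single forward pass over a sequence yields a valid training signal at every position without leaking future tokens.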

BPE (Byte-Pair Encoding) — the tokenization algorithm used by GPT-2, GPT-3, GPT-4, LLaMA, Mistral, and most modern LLMs — was originally invented in 1994 by Philip Gage for data compression. It sat in obscurity for two decades before NLP rediscovered it in 2015.
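The core of BPE is small enough to sketch: repeatedly find the most frequent adjacent pair and replace it with a new symbol. The version below operates on UTF-8 bytes and assigns new ids starting at 256; the function names and the tie-breaking rule (first pair encountered wins) are assumptions for illustration, and production tokenizers add pre-tokenization and special tokens on top of this.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair; ties go to the first one seen."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)


def merge(ids, pair, new_id):
    """Replace each left-to-right occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        if len(ids) < 2:
            break
        pair = most_frequent_pair(ids)
        new_id = 256 + step  # byte values occupy 0..255
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return ids, merges


if __name__ == "__main__":
    ids, merges = train_bpe("aaabdaaabac", num_merges=3)
    print(ids)
    print(merges)
```

Each merge shortens the sequence and grows the vocabulary by one entry, which is the same trade-off the algorithm made when it was used for compression.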

The Library

Every reference, in one place.

A curated set of free courses, key books, canonical papers, and reference repos. Lessons throughout the curriculum point back here.

🎬

Primary Video Courses

YouTube · Free · ~15 hrs

Karpathy — Neural Networks: Zero to Hero

The single best resource for software engineers transitioning to ML. We follow it closely through Weeks 1-2.

Watch on YouTube

YouTube · Free · 4 hrs

Karpathy — Reproduce GPT-2 (124M)

Live-coding GPT-2 from scratch on a single GPU. Direct relevance to the Days 27-30 capstone.

Watch on YouTube

YouTube · Free · 2 hrs

Karpathy — Build the GPT Tokenizer

BPE tokenization, byte-level subword merges, glitch tokens. Companion to Day 5.

Watch on YouTube

University · Free

Stanford CS336 — Language Modeling from Scratch

Taught by Tatsunori Hashimoto and Percy Liang. The most rigorous open LLM curriculum, with lectures and assignments.

Open course

University · Free

Stanford CS25 — Transformers United

Rotating guest lectures from leading researchers. Frontier topics, recent talks.

Open course

YouTube · Free · ~6 hrs

3Blue1Brown — Neural Networks

The most beautiful neural network animations on the internet. Watch for visual intuition.

Watch series

📚

Books

Manning · 2024 · Paid

Build a Large Language Model From Scratch

Sebastian Raschka. The closest book to this curriculum. Builds a working LLM end-to-end in PyTorch.

Buy at Manning

MIT Press · Free online

Deep Learning

Goodfellow, Bengio, Courville. The reference textbook. Comprehensive coverage of foundations.

Read online

Free online

Neural Networks and Deep Learning

Michael Nielsen. The clearest introduction to backpropagation ever written. Read Chapters 1-2 alongside Days 3-4.

Read online

Free draft online

Speech and Language Processing

Jurafsky & Martin (3rd ed). The NLP reference textbook. Useful context for tokenization and language modeling.

Read draft

Free PDF

Mathematics for Machine Learning

Deisenroth, Faisal, Ong. Concise math primer if Day 1's pace is too fast. Chapters 2, 5, 6 are most relevant.

Download PDF

Morgan Kaufmann · 4th ed.

Programming Massively Parallel Processors

Hwu, Kirk, El Hajj. The CUDA bible. We reference it from Day 16 onward for systems-level GPU work.

Find on Amazon

📄

Canonical Papers

2017 · The one

Attention Is All You Need

Vaswani et al. Introduces the Transformer. Read Section 3 carefully. Days 6-7 companion.

Read on arXiv

2020 · GPT-3

Language Models are Few-Shot Learners

Brown et al. The 175B model that triggered the modern LLM era. In-context learning emerges with scale.

Read on arXiv

2023 · LLaMA

LLaMA: Open and Efficient Foundation Models

Touvron et al. The open-weights model that catalyzed the open-source LLM ecosystem. Modern arch (RMSNorm, RoPE, SwiGLU).

Read on arXiv

2022-24 · FlashAttention

FlashAttention

Tri Dao et al. IO-aware attention. Day 21 deep dive. (V2: 2307.08691, V3: 2407.08608)

Read v1

2023 · vLLM

PagedAttention (vLLM)

Kwon et al. Efficient memory management for LLM serving. The paper behind vLLM's paged KV cache, which makes continuous batching memory-efficient.

Read on arXiv

2022 · Chinchilla

Training Compute-Optimal LLMs

Hoffmann et al. The corrected scaling laws. Most pre-Chinchilla models were dramatically under-trained.

Read on arXiv

2021 · LoRA

LoRA: Low-Rank Adaptation

Hu et al. Parameter-efficient fine-tuning that became the default. Day 13 companion.

Read on arXiv

2023 · DPO

Direct Preference Optimization

Rafailov et al. RLHF without RL. Cleaner, more stable, increasingly the default for alignment.

Read on arXiv

2022 · Speculative

Speculative Decoding

Leviathan et al. Use a small "draft" model to propose tokens; verify with the big model. Day 23 companion.

Read on arXiv
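The draft-and-verify idea from the speculative decoding entry above can be sketched with toy models. This is a greedy simplification under stated assumptions: the "models" are plain functions that return one next token, verification runs token by token rather than in a single batched forward pass, and the rejection-sampling step used for sampled decoding is omitted. All names here are hypothetical.

```python
from typing import Callable, List

Model = Callable[[List[int]], int]  # maps a context to its greedy next token


def speculative_step(target: Model, draft: Model,
                     context: List[int], k: int) -> List[int]:
    """Propose k tokens with the draft model, keep the prefix the target agrees with."""
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))

    accepted: List[int] = []
    for tok in proposal:
        expected = target(context + accepted)
        if expected != tok:
            accepted.append(expected)  # first mismatch: take the target's token and stop
            return accepted
        accepted.append(tok)
    accepted.append(target(context + accepted))  # all accepted: one bonus token
    return accepted


def generate(target: Model, draft: Model, prompt: List[int],
             num_tokens: int, k: int = 4) -> List[int]:
    """Generate num_tokens tokens; output matches plain greedy decoding with target."""
    out = list(prompt)
    while len(out) - len(prompt) < num_tokens:
        out.extend(speculative_step(target, draft, out, k))
    return out[: len(prompt) + num_tokens]


def toy_target(ctx: List[int]) -> int:
    """Toy 'large' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    """Toy 'small' model: agrees with the target except right after a 4."""
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


if __name__ == "__main__":
    print(generate(toy_target, toy_draft, [0], num_tokens=8))
```

The property worth noticing is that the output is identical to what the target model would produce on its own; the draft model only changes how many target calls are needed per accepted token.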

💾

Reference Implementations

~150 lines · Python

karpathy/micrograd

Autograd from scratch. The whole concept of automatic differentiation in 150 lines you can read in one sitting.

Open on GitHub

~600 lines · Python

karpathy/nanoGPT

GPT training, minimal. Trains a character-level GPT on Shakespeare in ~3 minutes on a single A100. The cleanest LLM training code in existence.

Open on GitHub

~5000 lines · Pure C/CUDA

karpathy/llm.c

GPT-2 in raw C with hand-written CUDA kernels. Its author has reported training throughput on par with or modestly ahead of PyTorch on the same hardware. Required reading before the capstone.

Open on GitHub

~700 lines · Pure C

karpathy/llama2.c

LLaMA inference in pure C, no dependencies. Single file. Builds the mental model of inference at the most stripped-down level.

Open on GitHub

~50K lines · C++

ggml-org/llama.cpp

Production-quality CPU/Metal/CUDA inference for the GGUF format. Started by Georgi Gerganov. Powers most local-first LLM apps.

Open on GitHub

~50K lines · Python

vllm-project/vllm

The most popular open inference server. Continuous batching, PagedAttention, prefix caching. Day 24 + capstone benchmark target.

Open on GitHub

Apple · Swift/C++/Python

ml-explore/mlx

Apple's array framework, designed from scratch for unified memory architecture. Companion examples repo at mlx-examples.

Open on GitHub

~250K lines · Python

huggingface/transformers

The production reference implementation of every major model. Large but searchable; the place to look up "what does X actually do."

Open on GitHub

Day 25 target

z-lab/dflash

The repo we'll dissect on Day 25. We'll explain block-diffusion speculative decoding and where it fits relative to draft/verify decoding and the capstone scheduler.

Open on GitHub

🌐

Blogs & Reference Sites

Visual

Jay Alammar — The Illustrated Transformer

The most beautiful visual walkthrough of the Transformer ever made. Read it before Day 6.

Open blog

Code-along

Harvard NLP — The Annotated Transformer

The 2017 paper presented side-by-side with working PyTorch code. Foundational walkthrough.

Open blog

Survey-style

Lilian Weng's Blog

Lilian Weng (formerly of OpenAI) writes survey-style deep dives on every major LLM topic. Densely useful.

Open blog

Substack

Sebastian Raschka — Magazine

Monthly deep dives on LLM training, fine-tuning, and architecture. Among the most rigorous popular ML writing.

Open substack

Stanford

HazyResearch Blog

Chris Ré's group at Stanford. Source of FlashAttention, Mamba, Hyena, and many systems-for-ML innovations.

Open blog

NVIDIA

NVIDIA Developer Blog

The canonical source for CUDA, TensorRT-LLM, and GPU performance writeups. Filter for "LLM" tag.

Open blog

Field Notes

A few things worth knowing as you set out.

GPT stands for "Generative Pre-trained Transformer." LLaMA stands for "Large Language Model Meta AI." BERT stands for "Bidirectional Encoder Representations from Transformers."

The original GPT (2018) had 117M parameters. GPT-4 (2023) reportedly has ~1.8 trillion in a mixture-of-experts setup: roughly 15,000× larger in about 5 years.

The first NVIDIA GPU with Tensor Cores (V100) shipped in 2017 — the same year the Transformer paper was published. The hardware that would power LLMs arrived simultaneously with the algorithm.

vLLM — the most popular open inference engine — was built by three PhD students at UC Berkeley's Sky Computing Lab as part of a research paper. It now serves billions of requests daily.

Karpathy's micrograd is only ~150 lines of Python, yet it re-implements PyTorch's autograd from first principles. It's the warm-up exercise for "Neural Networks: Zero to Hero."

nanoGPT trains a character-level GPT on Shakespeare in ~3 minutes on a single A100. The original GPT-2 took weeks on a fleet of V100s.

FlashAttention was developed by Tri Dao as part of his PhD thesis at Stanford. The first paper changed how every modern LLM is served.

The original Transformer was trained on 8 NVIDIA P100 GPUs over 3.5 days. Today, that compute would cost about $200 on a cloud platform.

The phrase "KV cache" wasn't widely used until ~2020. Before that, people just called it "stored states" or "saved activations" — though the technique is as old as autoregressive Transformers.
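Two of the notes above rest on simple arithmetic that is easy to check. The sketch below recomputes the GPT parameter ratio and the GPU-hours behind the original Transformer run; the parameter counts and the $200 figure come from the text, the GPT-4 count is unconfirmed, and the function names are illustrative.

```python
GPT1_PARAMS = 117e6            # from the note above
GPT4_PARAMS_REPORTED = 1.8e12  # reported, not officially confirmed


def scale_factor(old_params: float, new_params: float) -> float:
    """How many times larger the newer model is."""
    return new_params / old_params


def gpu_hours(num_gpus: int, days: float) -> float:
    """Total GPU-hours for a run using `num_gpus` GPUs for `days` days."""
    return num_gpus * days * 24


def implied_hourly_rate(total_cost: float, hours: float) -> float:
    """Hourly price per GPU that a total cost figure implies."""
    return total_cost / hours


if __name__ == "__main__":
    print(f"GPT-1 to GPT-4: {scale_factor(GPT1_PARAMS, GPT4_PARAMS_REPORTED):,.0f}x")
    hours = gpu_hours(num_gpus=8, days=3.5)
    print(f"Original Transformer: {hours:.0f} P100-hours")
    print(f"Implied rate at $200: ${implied_hourly_rate(200, hours):.2f}/GPU-hour")
```

The $200 estimate therefore corresponds to roughly $0.30 per P100-hour; the text does not say which rental rate was assumed.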
