LLM Inference Engineer · Day 0
A Daily Curriculum · Edition 01

How to become an LLM Inference Engineer in thirty days.

A self-paced journey from "software engineer who uses NumPy occasionally" to writing your own production-grade inference engine — the math, the architecture, the silicon, all of it.

30 daily lessons · 4 weeks · 3 hardware paths · ~60h total study
The Promise

Where you start, where you'll end up.

No magic. No vague "exposure." A concrete arc with a concrete destination — and a complete inference engine you wrote yourself, sitting in a git repo at the end of it.

Day 1, Hour 0

A software engineer comfortable with Python and Git. NumPy rings a bell. "Backprop" sounds vaguely familiar. You've used ChatGPT but never wondered how it actually generates a token.

Day 30, Hour Last

You can read source code from vLLM, llama.cpp, MLX-LM, and TensorRT-LLM and follow every layer: math, attention, kernels, drivers. You've written your own inference engine. You know exactly why it's slower than vLLM, and what it would take to close the gap.

The Roadmap

Four weeks. Four artifacts. One inference engine.

Each week ends with something you've built. By the end, the artifacts compose into a complete LLM inference stack — yours.

Week 1
Days 1 — 7

Foundations

Math fluency drills, tensors and autograd, neural networks from scratch with manual backprop, BPE tokenization, and the attention mechanism — building toward the full Transformer block.

You build: A complete decoder-only LLM forward pass
Week 2
Days 8 — 14

Training & Architecture

How LLMs are pre-trained at scale. Distributed training (DP, FSDP, TP, ZeRO). Modern variations — LLaMA, Mistral, MoE. Fine-tuning with LoRA/QLoRA. Alignment via RLHF and DPO.

You build: A 10M-parameter GPT, trained end-to-end
Week 3
Days 15 — 21

Inference & Hardware

Prefill vs decode. GPU architecture, the roofline model, arithmetic intensity. CUDA programming. Apple Silicon and MLX. KV cache deep dive. FlashAttention v1, v2, v3.

You build: Hand-written CUDA + Metal attention kernels
Week 4
Days 22 — 30

Optimization & Capstone

Quantization. Speculative decoding. Continuous batching with PagedAttention. DFlash internals. Production engine comparisons. Then four days building it all into your own engine.

You build: A complete inference engine + benchmarks

The original Transformer paper ("Attention Is All You Need", Vaswani et al. 2017) was written for machine translation, not language modeling. Its title is a nod to The Beatles' "All You Need Is Love." It has since been cited over 130,000 times — making it one of the most-cited scientific papers of the 21st century.

The Scale Ladder

From ten million parameters to over a trillion.

You'll work at every scale — building tiny models you fully understand, then growing into the architectural patterns that power frontier systems.

  • Day 9 · Tiny GPT (what you'll train yourself): 10M params
  • GPT-2 small, 2019 (the classic 124M baseline): 124M params
  • LLaMA 7B / Mistral 7B (runs on a single GPU): 7B params
  • LLaMA 70B (runs on 1× H100 with quantization): 70B params
  • LLaMA 3.1 405B (open-weights frontier): 405B params
  • GPT-4, estimated (mixture of experts): ~1.8T params

Approximate parameter counts. Bars use a square-root scale for visual clarity — true ratios are even more dramatic.
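
To connect the ladder to hardware, here is a rough sketch of how much memory the weights alone occupy at each rung. It assumes weight size is simply parameter count times bits per parameter; the helper name `weight_gib` and the two precisions shown are illustrative choices, and the estimate ignores KV cache, activations, and runtime overhead.

```python
# Rough weight-memory estimate: params * bits / 8.
# ASSUMPTION: ignores KV cache, activations, and framework overhead,
# so real memory usage is higher than these numbers.
GIB = 1024 ** 3

def weight_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate size of the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB

# Approximate parameter counts taken from the ladder above.
LADDER = {
    "Tiny GPT (Day 9)": 10e6,
    "GPT-2 small": 124e6,
    "LLaMA 7B": 7e9,
    "LLaMA 70B": 70e9,
    "LLaMA 3.1 405B": 405e9,
}

if __name__ == "__main__":
    for name, n in LADDER.items():
        fp16 = weight_gib(n, 16)
        int4 = weight_gib(n, 4)
        print(f"{name:18s} fp16 {fp16:9.2f} GiB   4-bit {int4:9.2f} GiB")
```

Under these assumptions LLaMA 70B needs about 130 GiB at 16 bits but about 33 GiB at 4 bits, which is why it fits on a single 80GB H100 only once quantized.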

Pick Your Path

Three hardware paths. Same destination.

Every code lesson covers both NVIDIA and Apple Silicon. Cloud is for the heavier days. Pick what's in front of you and start.

Most flexible

NVIDIA GPU

CUDA · cuBLAS · cuDNN · NCCL

  • Practical minimum: RTX 3060 / 8GB
  • Comfortable: RTX 4080 / 16GB
  • Workstation: RTX 4090 / 24GB
  • Pro: A6000 / 48GB
  • Frontier: Blackwell B200 / 192GB
Best for laptops
🍎

Apple Silicon

MLX · Metal · MPS · Unified Memory

  • Practical minimum: M1 / 16GB
  • Comfortable: M2 Pro / 32GB
  • Workstation: M3 Max / 64GB
  • Pro: M2 Ultra / 128GB
  • Frontier: M2 Ultra / 192GB
Rent as needed
☁️

Cloud GPUs

Lambda · RunPod · Vast — approx rates, early 2026

  • A10 (24GB): $0.50–0.80/hr
  • A6000 (48GB): $0.80–1.10/hr
  • A100 (40GB): $1.10–1.80/hr
  • A100 (80GB): $1.80–2.50/hr
  • H100 (80GB): $2.50–4.50/hr
  • B200 (192GB): $3.50–6.00/hr

An H100 GPU (Hopper, 2022) has 80GB of HBM3 and ~989 trillion dense FP16 operations per second; it launched around $25–40k. It was NVIDIA's flagship until Blackwell (B200, ~2.25 PFLOPS dense BF16, 192GB HBM3e) took over. Either way, a single H100 already has more arithmetic throughput than the supercomputer that trained the original BERT in 2018.
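
Peak FLOPS only matter if the memory system can feed them. The sketch below applies the roofline model (covered in Week 3) to the H100 figure quoted above. The memory bandwidth of 3.35 TB/s is an assumption not stated in this page (the commonly quoted figure for the SXM variant), and the function names are illustrative.

```python
# Roofline ridge point: the arithmetic intensity (FLOPs per byte moved)
# above which a kernel is compute-bound rather than bandwidth-bound.
PEAK_FLOPS = 989e12   # dense FP16 FLOP/s, from the text above
MEM_BW = 3.35e12      # bytes/s; ASSUMED H100 SXM HBM3 bandwidth

def ridge_point(peak_flops: float, mem_bw: float) -> float:
    """FLOPs per byte needed to saturate the compute units."""
    return peak_flops / mem_bw

def attainable_flops(intensity: float,
                     peak_flops: float = PEAK_FLOPS,
                     mem_bw: float = MEM_BW) -> float:
    """Roofline model: min(compute roof, bandwidth * intensity)."""
    return min(peak_flops, mem_bw * intensity)

def matvec_intensity(bytes_per_weight: float = 2.0) -> float:
    """y = W @ x reads each weight once and does one multiply + one add."""
    return 2.0 / bytes_per_weight

if __name__ == "__main__":
    print(f"ridge point: {ridge_point(PEAK_FLOPS, MEM_BW):.0f} FLOPs/byte")
    frac = attainable_flops(matvec_intensity()) / PEAK_FLOPS
    print(f"batch-1 FP16 matvec reaches {frac:.2%} of peak")
```

Under these assumptions the ridge point is about 295 FLOPs per byte, while a batch-1 FP16 matrix-vector product sits near 1 FLOP per byte. That gap is one way to see why single-request decode tends to be bandwidth-bound.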

Setup, both paths.

Copy these blocks. Verify the smoke tests pass. Then move on to Day 1.

NVIDIA GPU

1 · Verify driver

# Check NVIDIA driver and CUDA version
nvidia-smi

2 · Install uv (or use conda)

curl -LsSf https://astral.sh/uv/install.sh | sh

3 · Create environment + install PyTorch with CUDA

uv venv .venv --python 3.11
source .venv/bin/activate
uv pip install torch torchvision \
    --index-url https://download.pytorch.org/whl/cu128

4 · Smoke test

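# Prints True and the GPU name, or False and a note if CUDA is not visible
python -c "import torch; ok = torch.cuda.is_available(); \
    print(ok, torch.cuda.get_device_name(0) if ok else 'no CUDA device found')"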

5 · Course essentials

uv pip install numpy transformers datasets tokenizers \
    tiktoken sentencepiece einops matplotlib jupyter

Apple Silicon

1 · Homebrew + Python

brew install python@3.11
curl -LsSf https://astral.sh/uv/install.sh | sh

2 · Create environment

uv venv .venv --python 3.11
source .venv/bin/activate

3 · Install PyTorch (with MPS) + MLX

uv pip install torch torchvision
uv pip install mlx mlx-lm

4 · Smoke tests

python -c "import torch; print('MPS:', torch.backends.mps.is_available())"
python -c "import mlx.core as mx; print('MLX:', mx.default_device())"

5 · Course essentials

uv pip install numpy transformers datasets tokenizers \
    tiktoken sentencepiece einops matplotlib jupyter
A Family Tree

Three Transformer family trees. We follow one.

All three families exist; only one dominates modern LLMs. Knowing the shape of the others helps you understand why.

Encoder-only

Bidirectional

Every token attends to every other token, in both directions. Trained with masked language modeling — predict random masked-out tokens.

Examples: BERT · RoBERTa · DeBERTa · ModernBERT

Decoder-only

Causal · Our path

Each token attends only to itself and earlier tokens. Trained on next-token prediction. Generates left-to-right. Simplest, most flexible, and the architecture behind nearly every modern frontier LLM.

Examples: GPT-2/3/4 · LLaMA · Mistral · Qwen · Gemma · DeepSeek

Encoder-Decoder

Two-stack

One stack reads input bidirectionally; another generates output causally with cross-attention to the encoder. Strong for translation and summarization, less common in modern LLMs.

Examples: T5 · BART · Original Transformer · FLAN-T5
Why decoder-only won: simpler architecture (one stack), trains on raw text without an input/output split, generates and "understands" with the same machinery via prompting, and scales remarkably well empirically. The whole curriculum focuses here.
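
The difference between the bidirectional and causal families comes down to which positions each token is allowed to attend to. Here is a minimal, dependency-free sketch of the two masks and of a softmax that respects them; the function names are illustrative, and real implementations do this on tensors rather than Python lists.

```python
import math

def attention_mask(seq_len: int, causal: bool) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

def masked_softmax(scores: list[float], allowed: list[bool]) -> list[float]:
    """Softmax over one row of scores; masked positions get zero weight."""
    m = max(s for s, ok in zip(scores, allowed) if ok)  # for numerical stability
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    for name, causal in [("encoder (bidirectional)", False),
                         ("decoder (causal)", True)]:
        print(name)
        for row in attention_mask(4, causal):
            print("  " + " ".join("x" if ok else "." for ok in row))
```

The causal mask is lower-triangular: row i has allowed positions 0 through i. That is what lets a decoder-only model train on every position of a sequence at once without seeing the token it is asked to predict.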

BPE (Byte-Pair Encoding) — the tokenization algorithm used by GPT-2, GPT-3, GPT-4, LLaMA, Mistral, and most modern LLMs — was originally invented in 1994 by Philip Gage for data compression. It sat in obscurity for two decades before NLP rediscovered it in 2015.
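
The core of BPE fits in a few lines: count adjacent symbol pairs, merge the most frequent pair into a new symbol, repeat. The sketch below is a toy character-level version over whole words. It is not the byte-level variant used by GPT-2, and the function names and tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter

def pair_counts(words: dict) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(words: dict, pair: tuple) -> dict:
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus: list, num_merges: int):
    """Learn up to `num_merges` merges from a list of words."""
    words = dict(Counter(tuple(w) for w in corpus))  # word -> frequency
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(words)
        if not counts:
            break
        # Most frequent pair; ties broken by the pair itself so runs are deterministic.
        best = max(counts, key=lambda p: (counts[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

if __name__ == "__main__":
    merges, words = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
    print(merges)
    print(words)
```

On this tiny corpus the first two merges build the symbol "low", so the frequent word becomes a single token while rarer suffixes stay split into characters.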

The Lesson Plan

Thirty lessons. One per day.

Each lesson is self-contained: concept, math, code, exercise, self-check, references. Lessons marked ● are built and ready to read.

"The most effective form of learning is to re-derive things from scratch. The papers, the videos, the textbooks — they're scaffolding. Understanding only crystallizes when you've made it work yourself."

Andrej Karpathy · Neural Networks: Zero to Hero
The Library

Every reference, in one place.

A curated set of free courses, key books, canonical papers, and reference repos. Lessons throughout the curriculum point back here.

🎬

Primary Video Courses

YouTube · Free · ~15 hrs

Karpathy — Neural Networks: Zero to Hero

The single best resource for software engineers transitioning to ML. We follow it closely through Weeks 1-2.

YouTube · Free · 4 hrs

Karpathy — Reproduce GPT-2 (124M)

Live-coding GPT-2 from scratch on a single GPU. Direct relevance to the Day 27-30 capstone.

YouTube · Free · 2 hrs

Karpathy — Build the GPT Tokenizer

BPE tokenization, byte-level subword merges, glitch tokens. Companion to Day 5.

University · Free

Stanford CS336 — Language Modeling from Scratch

Percy Liang and Tatsunori Hashimoto's course. The most rigorous open LLM curriculum, with lectures and assignments.

University · Free

Stanford CS25 — Transformers United

Rotating guest lectures from leading researchers. Frontier topics, recent talks.

YouTube · Free · ~6 hrs

3Blue1Brown — Neural Networks

The most beautiful neural network animations on the internet. Watch for visual intuition.

📚

Books

Manning · 2024 · Paid

Build a Large Language Model From Scratch

Sebastian Raschka. The closest book to this curriculum. Builds a working LLM end-to-end in PyTorch.

MIT Press · Free online

Deep Learning

Goodfellow, Bengio, Courville. The reference textbook. Comprehensive coverage of foundations.

Free online

Neural Networks and Deep Learning

Michael Nielsen. The clearest introduction to backpropagation ever written. Read Chapters 1-2 alongside Day 3-4.

Free draft online

Speech and Language Processing

Jurafsky & Martin (3rd ed). The NLP reference textbook. Useful context for tokenization and language modeling.

Free PDF

Mathematics for Machine Learning

Deisenroth, Faisal, Ong. Concise math primer if Day 1's pace is too fast. Chapters 2, 5, 6 are most relevant.

Morgan Kaufmann · 4th ed.

Programming Massively Parallel Processors

Hwu, Kirk, El Hajj. The CUDA bible. We reference it from Day 16 onward for systems-level GPU work.

📄

Canonical Papers

2017 · The one

Attention Is All You Need

Vaswani et al. Introduces the Transformer. Read Section 3 carefully. Day 6-7 companion.

2020 · GPT-3

Language Models are Few-Shot Learners

Brown et al. The 175B model that triggered the modern LLM era. In-context learning emerges with scale.

2023 · LLaMA

LLaMA: Open and Efficient Foundation Models

Touvron et al. The open-weights model that catalyzed the open-source LLM ecosystem. Modern arch (RMSNorm, RoPE, SwiGLU).

2022-24 · FlashAttention

FlashAttention

Tri Dao et al. IO-aware attention. Day 21 deep dive. (V2: 2307.08691, V3: 2407.08608)

2023 · vLLM

PagedAttention (vLLM)

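Kwon et al. Efficient KV-cache memory management for LLM serving, using fixed-size blocks in the style of OS paging. The paper behind vLLM; continuous batching itself comes from the earlier Orca paper (Yu et al., 2022).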

2022 · Chinchilla

Training Compute-Optimal LLMs

Hoffmann et al. The corrected scaling laws. Most pre-Chinchilla models were dramatically under-trained.

2021 · LoRA

LoRA: Low-Rank Adaptation

Hu et al. Parameter-efficient fine-tuning that became the default. Day 13 companion.

2023 · DPO

Direct Preference Optimization

Rafailov et al. RLHF without RL. Cleaner, more stable, increasingly the default for alignment.

2022 · Speculative

Speculative Decoding

Leviathan et al. Use a small "draft" model to propose tokens; verify with the big model. Day 23 companion.

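
The draft-then-verify idea in the last entry can be sketched with toy models. This is the greedy special case only: a drafted token is accepted when it equals the target model's own greedy choice, so the output matches plain greedy decoding with the target. The paper's rule for sampled decoding is more involved and is not shown. The two "models" here are hypothetical stand-ins, not real networks.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # context -> greedy next token

def speculative_step(ctx: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Return the tokens accepted in one draft-then-verify round."""
    # 1. The cheap draft model proposes k tokens autoregressively.
    proposal, tmp = [], list(ctx)
    for _ in range(k):
        t = draft(tmp)
        proposal.append(t)
        tmp.append(t)
    # 2. The target checks each proposed position. A real engine scores all
    #    k positions in one batched forward pass; this sketch just loops.
    accepted, tmp = [], list(ctx)
    for t in proposal:
        want = target(tmp)
        if want != t:
            accepted.append(want)   # first mismatch: keep the target's token
            return accepted
        accepted.append(t)
        tmp.append(t)
    accepted.append(target(tmp))    # all k accepted: one bonus token
    return accepted

def generate(prompt: List[int], draft: NextToken, target: NextToken,
             n_new: int, k: int = 4) -> List[int]:
    out = list(prompt)
    while len(out) - len(prompt) < n_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[:len(prompt) + n_new]

def greedy(prompt: List[int], model: NextToken, n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out

# Hypothetical toy models: the target counts up mod 10; the draft agrees
# except right after a 4, where it guesses wrong.
def target_model(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10

def draft_model(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10

if __name__ == "__main__":
    print(generate([0], draft_model, target_model, n_new=8))
    print(greedy([0], target_model, n_new=8))
```

Each round yields at least one token and at most k + 1, so the speedup depends on how often the draft agrees with the target.
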
💾

Reference Implementations

~150 lines · Python

karpathy/micrograd

Autograd from scratch. The whole concept of automatic differentiation in 150 lines you can read in one sitting.

~600 lines · Python

karpathy/nanoGPT

GPT training, minimal. Trains a character-level model on Shakespeare in ~3 minutes on a single A100. Among the cleanest LLM training code available.

~5000 lines · Pure C/CUDA

karpathy/llm.c

GPT-2 in raw C with hand-written CUDA kernels. Outperforms PyTorch on the same hardware. Required reading before the capstone.

~700 lines · Pure C

karpathy/llama2.c

LLaMA inference in pure C, no dependencies. Single file. Builds the mental model of inference at the most stripped-down level.

~50K lines · C++

ggml-org/llama.cpp

Production-quality CPU/Metal/CUDA inference for the GGUF format. Started by Georgi Gerganov. Powers most local-first LLM apps.

~50K lines · Python

vllm-project/vllm

The most popular open inference server. Continuous batching, PagedAttention, prefix caching. Day 24 + capstone benchmark target.

Apple · Swift/C++/Python

ml-explore/mlx

Apple's array framework, designed from scratch for unified memory architecture. Companion examples repo at mlx-examples.

~250K lines · Python

huggingface/transformers

The production reference implementation of every major model. Large but searchable; the place to look up "what does X actually do."

Day 25 target

z-lab/dflash

The repo we'll dissect on Day 25. We'll explain block-diffusion speculative decoding and where it fits relative to draft/verify decoding and the capstone scheduler.

🌐

Blogs & Reference Sites

Visual

Jay Alammar — The Illustrated Transformer

The most beautiful visual walkthrough of the Transformer ever made. Read it before Day 6.

Code-along

Harvard NLP — The Annotated Transformer

The 2017 paper presented side-by-side with working PyTorch code. Foundational walkthrough.

Survey-style

Lilian Weng's Blog

Lilian Weng, formerly of OpenAI, writes survey-style deep dives on most major LLM topics. Densely useful.

Substack

Sebastian Raschka — Magazine

Monthly deep dives on LLM training, fine-tuning, and architecture. Among the most rigorous popular ML writing.

Stanford

HazyResearch Blog

Chris Ré's group at Stanford. Source of FlashAttention, S4 (the precursor to Mamba), Hyena, and many systems-for-ML innovations.

NVIDIA

NVIDIA Developer Blog

The canonical source for CUDA, TensorRT-LLM, and GPU performance writeups. Filter for "LLM" tag.


Field Notes

A few things worth knowing as you set out.

GPT stands for "Generative Pre-trained Transformer." LLaMA stands for "Large Language Model Meta AI." BERT stands for "Bidirectional Encoder Representations from Transformers."
The original GPT (2018) had 117M parameters. GPT-4 (2023) reportedly has ~1.8 trillion in a mixture-of-experts setup, roughly 15,000× larger in 5 years.
The first NVIDIA GPU with Tensor Cores (V100) shipped in 2017 — the same year the Transformer paper was published. The hardware that would power LLMs arrived simultaneously with the algorithm.
vLLM — the most popular open inference engine — was built by three PhD students at UC Berkeley's Sky Computing Lab as part of a research paper. It now serves billions of requests daily.
Karpathy's micrograd is only ~150 lines of Python, yet it implements a scalar-valued autograd engine with a PyTorch-like API from first principles. It's the warm-up exercise for "Neural Networks: Zero to Hero."
nanoGPT trains a character-level Shakespeare model in ~3 minutes on a single A100. The original GPT-2 took weeks on a fleet of V100s.
FlashAttention was developed by Tri Dao as part of his PhD thesis at Stanford. The first paper changed how every modern LLM is served.
The original Transformer was trained on 8 NVIDIA P100 GPUs over 3.5 days. Today, that compute would cost about $200 on a cloud platform.
The phrase "KV cache" wasn't widely used until ~2020. Before that, people just called it "stored states" or "saved activations" — though the technique is as old as autoregressive Transformers.
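
To make the last note concrete, here is a toy single-head decoder step with and without a KV cache. The `project` function is a hypothetical stand-in for the learned query, key, and value projections; everything else is plain scaled dot-product attention in dependency-free Python.

```python
import math

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z for j in range(d)]

def project(x):
    """Stand-in for W_q, W_k, W_v (hypothetical fixed maps, not learned)."""
    return [0.5 * a for a in x], [2.0 * a for a in x], [a + 1.0 for a in x]

def decode_no_cache(xs):
    """Re-project the whole prefix at every step: O(T^2) projections."""
    outs = []
    for t in range(1, len(xs) + 1):
        qkv = [project(x) for x in xs[:t]]
        keys = [k for _, k, _ in qkv]
        values = [v for _, _, v in qkv]
        outs.append(attend(qkv[-1][0], keys, values))
    return outs

def decode_with_cache(xs):
    """Project each token once and append its K and V: O(T) projections."""
    k_cache, v_cache, outs = [], [], []
    for x in xs:
        q, k, v = project(x)
        k_cache.append(k)
        v_cache.append(v)
        outs.append(attend(q, k_cache, v_cache))
    return outs

if __name__ == "__main__":
    xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(decode_with_cache(xs))
    print(decode_no_cache(xs))
```

Both functions return the same outputs. The cached version trades memory (one key and one value per past token) for not recomputing them, which is the trade every inference engine in this curriculum manages.
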
A Note on Mindset

Six tips, five pitfalls, one mindset.

ML systems stack abstractions unusually deep — math, tensors, autograd, frameworks, kernels, drivers, silicon. You'll be confused often. The goal is to get productively confused.

Tips that actually help

  1. Type, don't copy-paste. Hand-typing creates muscle memory and forces you to read every line.
  2. Print tensor shapes everywhere. Shape mismatches are among the most common ML bugs. Add print(x.shape) liberally.
  3. Reproduce before customizing. Get a known-working baseline before changing anything.
  4. One concept per session. Don't try to absorb attention + RoPE + GQA in one sitting.
  5. Re-derive from memory. Close the doc and write the equations from scratch.
  6. Read papers in three passes. Abstract+figures, then intro+conclusions, then methods+math.
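
As a small illustration of tip 2, the sketch below checks shapes before doing any arithmetic and names both shapes when they disagree. It uses nested lists so it runs without dependencies; with PyTorch or NumPy the same habit is just `x.shape` plus an assert. The helper names are illustrative.

```python
def shape(x):
    """Shape of a rectangular nested list, e.g. [[1, 2, 3], [4, 5, 6]] -> (2, 3)."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0] if x else None
    return tuple(dims)

def matmul(a, b):
    """2-D matrix multiply that fails loudly, naming both shapes, on a mismatch."""
    (n, k), (k2, m) = shape(a), shape(b)
    assert k == k2, f"matmul shape mismatch: {shape(a)} @ {shape(b)}"
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

if __name__ == "__main__":
    x = [[1, 2, 3], [4, 5, 6]]        # (2, 3)
    w = [[1, 0], [0, 1], [1, 1]]      # (3, 2)
    print(shape(x), "@", shape(w), "->", shape(matmul(x, w)))
```

An error message that includes both shapes usually points straight at the transposed or unsqueezed tensor.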

Common pitfalls

  1. Tutorial hell. Watching videos without coding. Cap video time at 30% of study time.
  2. Premature optimization. Don't write CUDA on Day 5. Get a feel for what we're optimizing first.
  3. Skipping math. Backprop and softmax are not optional. Fluency, not mastery.
  4. Big-model envy. You won't train GPT-4. You'll train a 10M-param model that produces gibberish — that's the right scale.
  5. Tooling rabbit holes. Don't spend three days configuring vim or your shell. Get a working setup and move on.