Today is the Week 2 capstone: you assemble all the pieces from Days 5–8 into a trainable character-level GPT, train it on TinyShakespeare on your own machine, and sample text that improves from random noise to Shakespearean cadence. Along the way you will learn to read the loss curve, tune sampling, estimate parameter counts and FLOPs, and understand exactly why serving this model token-by-token is expensive, and how the KV cache (Week 3) fixes it.
Day 5 gave you tokenization and embeddings. Day 6 gave you attention. Day 7 gave you the Transformer block. Day 8 gave you the pre-training objective and scaling laws. Today you put them all in a room together, add a training loop, and watch a model learn language for the first time. This is the moment that makes the entire preceding week feel real.
We deliberately stay small: a roughly 10M-parameter GPT on roughly 1MB of Shakespeare. The model trains in minutes on a laptop GPU, in tens of minutes on Apple Silicon, and in a few hours on CPU. Small scale is a feature — you can run the loop twenty times, break things intentionally, and build the intuition that no amount of reading can substitute for. The loop you write today is exactly the loop that trains GPT-3; the only difference is bigger numbers and more machines.
We also spend serious time on inference cost. After training, you will count FLOPs per token, measure memory footprint, and understand why naive autoregressive decoding re-runs the model over the whole prefix at every step, so that generating T tokens costs at least O(T²) work. That analysis sets up the KV cache, the central optimization of Week 3, and connects the training capstone to the inference theme of the whole course.
GPTConfig and its effect on model size, memory, and quality.

Before writing a single line of code, draw the whole graph in your head. A batch of token-ID sequences enters with shape (B, T), where B is batch size and T is context length. It exits as logits over the vocabulary at every position, with shape (B, T, V); a softmax turns each position's logits into a probability distribution. Everything in between is a sequence of tensor transformations, each with a known shape. Knowing the shapes makes bugs obvious.
Figure: token IDs (B,T) are looked up in two embedding tables and summed; the result travels through N pre-norm decoder blocks, shape unchanged throughout, then a final LayerNorm and a tied linear head to produce logits (B,T,V). Every box is differentiable; the whole graph is trained end-to-end by backprop.

The original 2017 Transformer paper put LayerNorm after the residual addition (post-norm). GPT-2 and most decoder-only models since put it before the sublayer (pre-norm). The difference matters. In post-norm, the LayerNorm sits on the residual path itself, so the signal and its gradient are rescaled at every layer, which makes deep networks harder to train without careful learning-rate warmup. Pre-norm normalizes only the input to each sublayer, so the residual path is a clean identity connection from the embeddings to the output and gradients flow through it unimpeded, even at depth 48 or 96. The price is that the residual stream itself is never normalized, which is why the model applies one final LayerNorm before the head. Pre-norm is now the standard; expect to see it almost everywhere.
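To make the placement difference concrete, here is a minimal pure-Python sketch of the two residual patterns. The names `layer_norm`, `pre_norm_step`, `post_norm_step`, and `silent_sublayer` are illustrative and not part of the lesson's model, and the LayerNorm here has no learned gain or bias. The sublayer deliberately contributes nothing, so whatever happens to the input is due to the skip path alone.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm_step(x, sublayer):
    """GPT-2 style: normalize only the sublayer input; the skip path is untouched."""
    update = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, update)]

def post_norm_step(x, sublayer):
    """Original Transformer style: normalize after the residual addition."""
    update = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, update)])

def silent_sublayer(x):
    """Toy sublayer that outputs zeros, to isolate what the skip path does."""
    return [0.0 for _ in x]

x = [10.0, 20.0, 30.0, 40.0]
pre_out = pre_norm_step(x, silent_sublayer)
post_out = post_norm_step(x, silent_sublayer)
print("pre-norm :", pre_out)    # identical to x: the skip path is an identity
print("post-norm:", post_out)   # rescaled: the skip path passes through LayerNorm
```

With pre-norm the input survives unchanged, while post-norm rescales it even though the sublayer did nothing. Stack many such steps and the pre-norm stream keeps a direct route from input to output.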
The LM head is a linear map from the model dimension D to the vocabulary size V. The embedding table is also a matrix of shape (V, D). Weight tying shares these two matrices: the head multiplies by the transpose of the embedding table. This saves V×D parameters. For our config that is 65×384 = 24,960, a small saving; for a 128k-token BPE vocab with D=4096 it is 128,000×4,096 ≈ 524M parameters. More importantly, it constrains the model to use the same geometry for encoding and decoding tokens, which tends to improve sample quality, especially with small vocabularies.
from dataclasses import dataclass

import torch
import torch.nn as nn
import torch.nn.functional as F


@dataclass
class GPTConfig:
    vocab_size: int = 65    # set from data
    block_size: int = 256   # context length T
    n_layer: int = 6        # depth
    n_head: int = 6         # attention heads
    n_embd: int = 384       # d_model = head_dim * n_head
    dropout: float = 0.1

class CausalSelfAttention(nn.Module):
    def __init__(self, c):
        super().__init__()
        assert c.n_embd % c.n_head == 0
        self.n_head, self.n_embd = c.n_head, c.n_embd
        self.qkv = nn.Linear(c.n_embd, 3 * c.n_embd, bias=False)
        self.proj = nn.Linear(c.n_embd, c.n_embd, bias=False)
        self.drop = nn.Dropout(c.dropout)
        self.p = c.dropout

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        dh = C // self.n_head

        def split_heads(t):
            return t.view(B, T, self.n_head, dh).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        # PyTorch 2.0+ fused attention (FlashAttention when available)
        y = F.scaled_dot_product_attention(
            q, k, v, is_causal=True,
            dropout_p=self.p if self.training else 0.0)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.drop(self.proj(y))

class MLP(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.fc = nn.Linear(c.n_embd, 4 * c.n_embd)
        self.proj = nn.Linear(4 * c.n_embd, c.n_embd)
        self.drop = nn.Dropout(c.dropout)

    def forward(self, x):
        return self.drop(self.proj(F.gelu(self.fc(x))))


class Block(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(c.n_embd), nn.LayerNorm(c.n_embd)
        self.attn, self.mlp = CausalSelfAttention(c), MLP(c)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # pre-norm residual
        x = x + self.mlp(self.ln2(x))
        return x

class GPT(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.cfg = c
        self.tok_emb = nn.Embedding(c.vocab_size, c.n_embd)
        self.pos_emb = nn.Embedding(c.block_size, c.n_embd)
        self.drop = nn.Dropout(c.dropout)
        self.blocks = nn.ModuleList([Block(c) for _ in range(c.n_layer)])
        self.ln_f = nn.LayerNorm(c.n_embd)
        self.head = nn.Linear(c.n_embd, c.vocab_size, bias=False)
        self.head.weight = self.tok_emb.weight  # weight tying

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.drop(self.tok_emb(idx) + self.pos_emb(pos))
        for blk in self.blocks:
            x = blk(x)
        logits = self.head(self.ln_f(x))  # (B, T, V)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss
A GPTConfig dataclass is not just convenience — it is a contract. Every architectural decision lives in one place, is passed to every submodule, and is saved in the checkpoint alongside the weights. You cannot load weights without knowing the config. Here is what each field controls.
| Field | Default | Effect on model | Effect on memory / speed |
|---|---|---|---|
| n_layer | 6 | Depth (number of blocks). The main lever for quality. | Linear in memory; roughly linear in FLOPs/token. |
| n_embd | 384 | Model width D. Quadratic in parameters per block. | Dominates parameter count and activation memory. |
| n_head | 6 | Head dimension = D/n_head = 64. More heads = more parallel views of the sequence. | No effect on total params; affects fused-attention efficiency. |
| block_size | 256 | Context length T. The maximum sequence length at train time. | KV matrices scale as T×D; attention compute as T²×D. |
| dropout | 0.1 | Regularization. Drop 10% of activations during training. | Zero at inference (disabled by model.eval()). |
| vocab_size | 65 | Character count for this corpus. BPE models use 50k–128k. | Affects embedding table and LM head size. |
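You should be able to compute the parameter count for any GPT config on the back of an envelope. One block has an attention sublayer and an MLP sublayer. Attention holds the fused QKV projection (D×3D = 3D² weights) and the output projection (D² weights), for 4D². The MLP holds two matrices of shape D×4D and 4D×D, for 8D². Biases and LayerNorm parameters add only a few thousand more per block, and the embeddings add (V + T)×D once for the whole model.
The rule of thumb is therefore 12D² per block, or about 12 × n_layer × D² overall: 12 × 6 × 384² ≈ 10.6M for our config. Double D, and parameters quadruple. Halve n_layer, and parameters halve. Because width enters quadratically and depth only linearly, scaling-law practice (Day 8) grows D and n_layer together rather than pushing one alone.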
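The back-of-envelope count can be checked against an exact tally. The sketch below assumes the layer layout used in this lesson's classes: bias-free attention projections, MLP layers with biases, LayerNorms with gain and bias, and a head tied to the token embedding. The helper name `gpt_param_count` is hypothetical.

```python
def gpt_param_count(vocab_size, block_size, n_layer, n_embd, tie_weights=True):
    """Exact parameter tally for the GPT layout assumed in this lesson."""
    D = n_embd
    tok_emb = vocab_size * D                 # token embedding table (V, D)
    pos_emb = block_size * D                 # learned position table (T, D)
    attn = 3 * D * D + D * D                 # fused QKV + output projection, no biases
    mlp = (D * 4 * D + 4 * D) + (4 * D * D + D)   # two Linear layers, with biases
    layer_norms = 2 * (2 * D)                # ln1 and ln2: gain + bias each
    block = attn + mlp + layer_norms
    final_ln = 2 * D
    head = 0 if tie_weights else vocab_size * D   # tied head adds nothing
    total = tok_emb + pos_emb + n_layer * block + final_ln + head
    return {"block": block, "rule_of_thumb_block": 12 * D * D, "total": total}

counts = gpt_param_count(vocab_size=65, block_size=256, n_layer=6, n_embd=384)
untied = gpt_param_count(vocab_size=65, block_size=256, n_layer=6, n_embd=384,
                         tie_weights=False)
print(f"per block (exact):   {counts['block']:,}")
print(f"per block (12 D^2):  {counts['rule_of_thumb_block']:,}")
print(f"total (tied head):   {counts['total']:,}")
print(f"total (untied head): {untied['total']:,}")
```

The exact per-block count exceeds 12D² by only the biases and LayerNorm parameters, and the tied total should match `sum(p.numel() for p in model.parameters())` on the model built from the default config.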
Before you run your first training step, there is one check you must always do: verify that the loss at initialization is approximately ln(vocab_size). This is not a nice-to-have — it is the first line of defense against a wide class of bugs. Here is the intuition.
A freshly initialized model with small, well-scaled weights will assign roughly equal logits to each vocabulary token. The softmax of equal logits is a uniform distribution. For a uniform distribution over V tokens, every target has probability 1/V, so the cross-entropy loss is exactly L = −ln(1/V) = ln V. For our character vocabulary, ln 65 ≈ 4.17 nats.
If your step-0 loss is far from ln(vocab_size), you have a bug before any training has happened. Common culprits: a target-shift bug (using x as targets instead of x[:, 1:]), a data normalization error that produces constant inputs, a weight initialization that is wildly off-scale, or (for BPE tokenizers) a mismatch between the tokenizer's vocabulary size and the model's vocab_size.
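The ln(vocab_size) baseline can be reproduced without a model. This is a pure-Python sketch with hypothetical helper names (`cross_entropy`, `mean_loss`); the random logits stand in for an untrained network and are not drawn from the real model. It shows that equal logits give exactly ln V, that small random logits stay close to it, and that large random logits (confident but uninformed predictions) land far above it.

```python
import math
import random

def cross_entropy(logits, target):
    """Cross-entropy in nats for one position: logsumexp(logits) - logits[target]."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return lse - logits[target]

V = 65
uniform_loss = cross_entropy([0.0] * V, target=7)

rng = random.Random(0)

def mean_loss(logit_std):
    """Average loss over every possible target for random logits of a given scale."""
    logits = [rng.gauss(0.0, logit_std) for _ in range(V)]
    return sum(cross_entropy(logits, t) for t in range(V)) / V

small_scale_loss = mean_loss(0.1)    # nearly uniform prediction
large_scale_loss = mean_loss(20.0)   # confident but uninformed prediction

print(f"ln(V)              = {math.log(V):.4f}")
print(f"equal logits       = {uniform_loss:.4f}")
print(f"logit std 0.1      = {small_scale_loss:.4f}")
print(f"logit std 20       = {large_scale_loss:.4f}")
```

A step-0 loss well above ln V therefore usually points at logits that are too large at initialization, not at the data.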
This check is called out explicitly by Karpathy in his "A Recipe for Training Neural Networks" and has probably saved thousands of hours of wasted training compute. A wrong init loss means something is broken structurally — no amount of training will fix it. Always check before you train.
import math

xb, yb = get_batch(train_data)
with torch.no_grad():
    _, loss0 = model(xb, yb)
expected = math.log(cfg.vocab_size)
print(f"loss at init: {loss0.item():.4f} expected: {expected:.4f}")
assert abs(loss0.item() - expected) < 0.5, \
    "Init loss is off! Check model/data pipeline."
print("Sanity check passed.")
The tolerance of 0.5 nats is generous — a well-initialized model usually lands within 0.1 of the theoretical value. If you are off by more than 0.5, investigate before training.
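PyTorch's default initialization (Kaiming uniform for Linear, unit-variance normal for Embedding) needs one adjustment here. Because the head is tied to the embedding table, the N(0, 1) embedding weights also serve as the output projection, so the initial logits have a standard deviation on the order of sqrt(n_embd) instead of near zero, and the step-0 loss lands far above ln(vocab_size). The nanoGPT convention, borrowed from GPT-2, draws every Linear and Embedding weight from a normal distribution with standard deviation 0.02, which brings the init loss back to about ln(vocab_size). It also applies an additional scaling to the output projections of each residual block: multiply the initial weights by 1/sqrt(2 * n_layer). This prevents the residual stream from growing uncontrollably with depth at initialization, which matters most when n_layer is large. Our small 6-layer model trains fine without the extra residual scaling, but it is good practice to know about.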
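A minimal sketch of GPT-2 style initialization follows. The base standard deviation of 0.02 is the value GPT-2 and nanoGPT use; `init_gpt2_style` is a hypothetical helper name, and matching output projections by the suffix `proj.weight` is an assumption that fits the attribute names in this lesson's classes (nanoGPT itself names them `c_proj`). The `TinyBlock` demo module exists only to exercise the helper.

```python
import math
import torch
import torch.nn as nn

def init_gpt2_style(model: nn.Module, n_layer: int, std: float = 0.02) -> None:
    """Re-initialize Linear/Embedding weights in place, GPT-2 / nanoGPT style."""
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Embedding)):
            nn.init.normal_(module.weight, mean=0.0, std=std)
        if isinstance(module, nn.Linear) and module.bias is not None:
            nn.init.zeros_(module.bias)
    # Projections that write into the residual stream get a smaller std,
    # so the stream's variance does not grow with depth at initialization.
    for name, p in model.named_parameters():
        if name.endswith("proj.weight"):   # assumption: naming used in this lesson
            nn.init.normal_(p, mean=0.0, std=std / math.sqrt(2 * n_layer))

class TinyBlock(nn.Module):
    """Demo-only stand-in for an MLP sublayer (not the lesson's Block)."""
    def __init__(self, d):
        super().__init__()
        self.fc = nn.Linear(d, 4 * d)
        self.proj = nn.Linear(4 * d, d)

torch.manual_seed(0)
demo = nn.ModuleDict({"emb": nn.Embedding(65, 384), "blk": TinyBlock(384)})
init_gpt2_style(demo, n_layer=6)
emb_std = demo["emb"].weight.std().item()
proj_std = demo["blk"].proj.weight.std().item()
print(f"embedding std ~ {emb_std:.4f}, residual projection std ~ {proj_std:.4f}")
```

On the lesson's model the call would be `init_gpt2_style(model, cfg.n_layer)` right after construction. Because the head and the token embedding share one tensor, initializing either one initializes both.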
The core loop is four operations. Everything else — learning rate schedules, gradient clipping, mixed precision, gradient accumulation — is engineering layered on top of those four operations. Let us build up from the bare minimum to something production-adjacent.
zero_grad(set_to_none=True) frees the gradient memory rather than zeroing it in place, which is slightly faster. Gradient clipping happens after backward but before step. The eval branch runs every few hundred steps to track generalization without affecting training state.

AdamW uses per-parameter adaptive learning rates based on running estimates of the first moment (mean of gradients) and second moment (mean of squared gradients). The "W" stands for decoupled weight decay: the decay shrinks the parameters directly instead of being added to the gradient as an L2 term, so it is not rescaled by the adaptive step size. The standard GPT settings, which the code in this lesson uses, are lr = 3e-4, betas = (0.9, 0.95), and weight_decay = 0.1.
The weight_decay of 0.1 is applied selectively: weight matrices get it; bias terms and LayerNorm parameters do not. This is standard practice. Biases and LayerNorm gains are a tiny fraction of the parameters, and pulling them toward zero (the gains start at 1) does not regularize the model in a useful way. In PyTorch:
decay, no_decay = [], []
for pn, p in model.named_parameters():
    if p.ndim < 2:        # bias, LN gamma/beta
        no_decay.append(p)
    else:
        decay.append(p)   # weight matrices
opt = torch.optim.AdamW([
    {"params": decay, "weight_decay": 0.1},
    {"params": no_decay, "weight_decay": 0.0},
], lr=3e-4, betas=(0.9, 0.95))
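To see what one AdamW update does to a single weight, here is a scalar sketch in pure Python. It follows the usual formulation (decay the parameter, update the two moment estimates, apply bias correction, take the adaptive step); `adamw_step` is a hypothetical helper, not PyTorch's implementation, and the default hyperparameters mirror the values used in this lesson.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=3e-4, betas=(0.9, 0.95),
               eps=1e-8, weight_decay=0.1):
    """One AdamW update for a scalar parameter; t is the 1-based step count."""
    b1, b2 = betas
    theta = theta * (1.0 - lr * weight_decay)   # decoupled decay: acts on the weight itself
    m = b1 * m + (1.0 - b1) * grad              # running mean of gradients
    v = b2 * v + (1.0 - b2) * grad * grad       # running mean of squared gradients
    m_hat = m / (1.0 - b1 ** t)                 # bias correction for the zero init
    v_hat = v / (1.0 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# First step with a gradient: the adaptive step has magnitude ~lr, whatever the gradient scale.
theta1, m1, v1 = adamw_step(theta=1.0, grad=0.5, m=0.0, v=0.0, t=1)
theta_big, _, _ = adamw_step(theta=1.0, grad=500.0, m=0.0, v=0.0, t=1)

# Zero gradient: only the weight decay moves the parameter.
theta_decay_only, _, _ = adamw_step(theta=1.0, grad=0.0, m=0.0, v=0.0, t=1)

print(theta1, theta_big, theta_decay_only)
```

The first two results are nearly identical even though the gradients differ by a factor of 1000, which is the per-parameter normalization at work. The third shows the decay term acting on its own, independent of the gradient.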
Gradient clipping bounds the L2 norm of the gradient vector before the optimizer step. If the norm exceeds max_norm, all gradients are scaled down proportionally. This prevents a single bad batch from causing a large destabilizing parameter update — especially important early in training when gradients can be chaotic.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
A clip value of 1.0 is the standard; higher values (5.0) are used for some RNN-style training. If you see occasional loss spikes but overall good training, clipping (or reducing LR) is often the fix.
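The rule that clipping applies can be written in a few lines. This pure-Python sketch treats each parameter's gradient as a flat list and mirrors the logic of norm-based clipping (one global L2 norm across all gradients, one shared scale factor); `clip_by_global_norm` is a hypothetical helper, not the PyTorch function.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one factor so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale gradients up
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

grads = [[3.0, 4.0], [12.0]]                 # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for tensor in clipped for g in tensor))

small = [[0.3, 0.4]]                         # norm 0.5, already under the limit
small_clipped, _ = clip_by_global_norm(small, max_norm=1.0)

print(norm_before, norm_after, small_clipped)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length changes. Gradients already under the limit pass through untouched.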
Constant learning rate works for small models, but the production practice is a two-phase schedule: linear warmup from near zero to peak LR over the first few hundred steps, then cosine decay back down to a small fraction of peak LR. Warmup prevents the large early updates that happen when AdamW's running estimates are unreliable (they are initialized to zero). Cosine decay allows aggressive use of the full LR during most of training, then a smooth landing.
import math

def get_lr(step, warmup_steps, max_steps, lr_peak, lr_min):
    # Linear warmup; (step + 1) keeps the very first update from using lr = 0.
    if step < warmup_steps:
        return lr_peak * (step + 1) / warmup_steps
    if step > max_steps:
        return lr_min
    # Cosine decay from lr_peak down to lr_min.
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return lr_min + 0.5 * (lr_peak - lr_min) * (1 + math.cos(math.pi * progress))

# Apply inside the loop:
for step in range(max_steps):
    lr = get_lr(step, warmup_steps=100, max_steps=5000,
                lr_peak=3e-4, lr_min=3e-5)
    for param_group in opt.param_groups:
        param_group["lr"] = lr
    ...
Training in float16 or bfloat16 roughly halves activation memory and speeds up matrix multiplications on modern hardware. The standard approach is PyTorch's torch.autocast context manager, which runs eligible operations in the lower-precision type automatically while the parameters themselves, and therefore the optimizer step, stay in float32. With float16 you also need loss scaling (a gradient scaler) to avoid underflow; bfloat16 usually does not. Day 10 covers this in detail; for today's small run it is optional.
model = GPT(cfg).to(device)

# Selective weight decay (weight matrices only)
decay, no_decay = [], []
for pn, p in model.named_parameters():
    (no_decay if p.ndim < 2 else decay).append(p)
opt = torch.optim.AdamW([
    {"params": decay, "weight_decay": 0.1},
    {"params": no_decay, "weight_decay": 0.0},
], lr=3e-4, betas=(0.9, 0.95))

max_steps = 5000
warmup_steps = 100
eval_interval = 500
eval_iters = 50

@torch.no_grad()
def estimate_loss():
    model.eval()
    out = {}
    for name, d in [("train", train_data), ("val", val_data)]:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(d)
            _, loss = model(xb, yb)
            losses[k] = loss.item()
        out[name] = losses.mean().item()
    model.train()
    return out

history = []
for step in range(max_steps + 1):
    # LR schedule
    lr = get_lr(step, warmup_steps, max_steps, 3e-4, 3e-5)
    for g in opt.param_groups:
        g["lr"] = lr

    # Periodic eval
    if step % eval_interval == 0:
        m = estimate_loss()
        history.append((step, m["train"], m["val"]))
        print(f"step {step:>5} train {m['train']:.3f} val {m['val']:.3f} lr {lr:.2e}")

    # Training step
    xb, yb = get_batch(train_data)
    _, loss = model(xb, yb)
    opt.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    opt.step()
The model produces logits (V,) at each step. Converting those logits to a single next token is a decision that dramatically shapes the output. You have four main strategies, each with distinct tradeoffs.
| Strategy | How it works | Effect on output | When to use |
|---|---|---|---|
| Greedy | Always pick argmax(logits). | Deterministic, often repetitive. May loop. | Debugging, benchmarks where reproducibility matters. |
| Temperature (τ) | Divide logits by τ before softmax. τ<1 sharpens, τ>1 flattens. | τ→0 approaches greedy; τ=1 samples the model distribution; τ>1 is more random. | Creative text (τ=0.7–1.2); use as your main dial. |
| Top-k | Mask out (set to −∞) all but the k highest-logit tokens, then sample. | Prevents sampling very unlikely tokens regardless of τ. | k=40–200 is a common default for BPE vocabularies; with our 65-character vocabulary use a much smaller k. Combine with τ. |
| Top-p (nucleus) | Sort tokens by probability; keep the smallest set whose cumulative probability ≥ p; sample from that set. | Adapts the candidate set size to the model's confidence. Wider when uncertain, narrower when confident. | p=0.9 or 0.95 is more principled than fixed top-k. |
@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0, top_k=None, top_p=None):
    for _ in range(max_new_tokens):
        # Crop to context window
        idx_cond = idx[:, -model.cfg.block_size:]
        logits, _ = model(idx_cond)
        logits = logits[:, -1, :] / temperature  # (B, V), last position only

        # Top-k filter
        if top_k is not None:
            v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
            logits[logits < v[:, [-1]]] = -float("inf")

        # Top-p (nucleus) filter
        if top_p is not None:
            sorted_logits, sorted_idx = torch.sort(logits, descending=True)
            sorted_probs = F.softmax(sorted_logits, dim=-1)
            cumprobs = torch.cumsum(sorted_probs, dim=-1)
            # Remove tokens with cumulative prob above top_p (shift by 1 to keep first)
            sorted_logits[cumprobs - sorted_probs > top_p] = -float("inf")
            logits = torch.zeros_like(logits).scatter(1, sorted_idx, sorted_logits)

        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx
Run this at several points in training and you will watch quality climb. At step 0: completely random characters. At step 500: spaces and punctuation are roughly correct. At step 2000: word shapes appear. At step 5000: recognizable character names, line breaks, and Shakespearean rhythm. The model will not be good — it is 10M parameters on 1MB of text — but the trajectory is unmistakable and deeply satisfying.
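The effect of each sampling knob is easiest to see on a toy distribution. The sketch below is pure Python with hypothetical helper names and a made-up five-token logit vector; it filters probabilities directly instead of masking logits as the `generate` function does, which gives the same distribution after renormalization.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def apply_temperature(logits, temperature):
    """Divide logits by the temperature: <1 sharpens, >1 flattens."""
    return [v / temperature for v in logits]

def top_k_filter(probs, k):
    """Keep the k most likely tokens, zero the rest, renormalize."""
    keep = set(sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    z = sum(kept)
    return [p / z for p in kept]

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    z = sum(kept)
    return [q / z for q in kept]

logits = [2.0, 1.0, 0.5, 0.0, -1.0]          # illustrative values only
base = softmax(logits)
sharp = softmax(apply_temperature(logits, 0.5))
flat = softmax(apply_temperature(logits, 2.0))
top2 = top_k_filter(base, k=2)
nucleus = top_p_filter(base, p=0.8)

for name, dist in [("base", base), ("tau=0.5", sharp), ("tau=2.0", flat),
                   ("top-k=2", top2), ("top-p=0.8", nucleus)]:
    print(f"{name:10s}", [round(q, 3) for q in dist])
```

Lower temperature moves mass toward the top token, top-k keeps a fixed number of candidates, and top-p keeps however many candidates are needed to cover the requested probability mass (three of the five here).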
The loss curve is your most powerful diagnostic tool. Every failure mode has a characteristic shape. Here is how to read it.
Expected final loss for a 10M-parameter GPT on TinyShakespeare after 5,000 steps: roughly 1.4–1.6 nats, corresponding to a perplexity of ~4–5. That means the model is on average as uncertain as choosing uniformly among 4–5 equally likely characters. Not great literature, but definitive proof that the model learned the structure of the corpus.
TinyShakespeare is ~1MB — small enough for a 10M-parameter model to partially memorize. You will typically see a small but stable train-val gap. If you reduce the dataset to a few kilobytes and train long enough, you can drive train loss to near zero while val loss rises sharply. This is pedagogically useful: it shows exactly what overfitting looks like and why regularization (dropout, weight decay) matters.
Perplexity is just exp(loss). A loss of 1.5 nats gives perplexity exp(1.5) ≈ 4.5. The interpretation: on average, the model is as uncertain as if it were choosing uniformly among 4–5 characters. A character-level model on English text trained to its limit reaches roughly perplexity 2–3 (because English has low character-level entropy). Our tiny model with limited data sits at 4–5, which is respectable.
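The conversion is a one-liner, shown here with a second unit, bits per character, that is often used for character-level models (an addition for orientation, not something the lesson relies on). Both helper names are hypothetical.

```python
import math

def perplexity(loss_nats):
    """Perplexity is exp(loss) when the loss is measured in nats."""
    return math.exp(loss_nats)

def bits_per_char(loss_nats):
    """Convert a per-character loss from nats to bits."""
    return loss_nats / math.log(2)

for label, loss in [("init, ln(65)", math.log(65)), ("loss 1.6", 1.6),
                    ("loss 1.5", 1.5), ("loss 1.4", 1.4)]:
    print(f"{label:14s} perplexity {perplexity(loss):6.2f}   "
          f"bits/char {bits_per_char(loss):.2f}")
```

At initialization the perplexity equals the vocabulary size, 65, which is another way to read the ln(vocab_size) sanity check: the untrained model is choosing uniformly among all characters.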
Training this model once costs a few GPU-minutes. Serving it for inference (generating one token at a time, possibly for many concurrent users) is a different and ongoing cost. Understanding the cost structure now sets up everything in Weeks 3–4.
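For a GPT-style model with N parameters, the approximate number of floating-point operations for a single forward pass on a sequence of length T is FLOPs ≈ 2·N·T + 4·n_layer·T²·D. The first term is the linear layers (one multiply and one add per weight per token); the second is the QKᵀ and attention-times-V products, before any savings from causal masking.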
The general rule of thumb: approximately 6 × (parameter count) FLOPs per token for the linear layers (factor of 2 for multiply-add, factor of 3 for forward + backward in training). For inference, it is roughly 2 × params FLOPs per token from linear layers, plus the quadratic attention cost.
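Plugging this lesson's config into the rule of thumb gives concrete numbers. The sketch assumes the common approximation of 2 FLOPs per parameter per token for the linear layers plus 4 × n_layer × T × D for the attention products of one query over a context of T tokens; it ignores softmax, LayerNorm, and any saving from causal masking. The helper name is hypothetical, and the parameter count is the tied-weight total for the default config.

```python
def forward_flops_per_token(n_params, n_layer, n_embd, context_len):
    """Approximate forward FLOPs for one token at a given context length."""
    linear = 2 * n_params                            # multiply + add per weight
    attention = 4 * n_layer * context_len * n_embd   # QK^T and attn @ V for one query
    return linear, attention

N = 10_761_600            # parameters of the default config with a tied head
linear, attention = forward_flops_per_token(N, n_layer=6, n_embd=384, context_len=256)
train_flops_per_token = 6 * N   # forward (2N) + backward (about 4N), linear layers only

print(f"linear layers, forward: {linear / 1e6:.1f} MFLOPs per token")
print(f"attention at T=256:     {attention / 1e6:.1f} MFLOPs per token")
print(f"training, fwd + bwd:    {train_flops_per_token / 1e6:.1f} MFLOPs per token")
```

At this scale the attention term is roughly a tenth of the linear term. It grows linearly with context length per token, so at long contexts it eventually dominates.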
For inference, you need to hold two things in memory: the weights (static) and the activations of the current forward pass (dynamic, grows with batch size and sequence length).
| Item | Our tiny model | GPT-3 (175B) | Notes |
|---|---|---|---|
| Weights (float32) | ~43 MB | ~700 GB | 10.76M × 4 bytes |
| Weights (float16/bf16) | ~22 MB | ~350 GB | 2 bytes per param |
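| KV cache (float16, T=2048, B=1) | ~19 MB total (~3 MB per layer) | ~9.7 GB per sequence; hundreds of GB at large batch sizes | 2 (K and V) × T × D × n_layer × 2 bytes; our model's own block_size of 256 needs only ~2.4 MB |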
| Activations (training, float32) | ~several GB | Terabytes | Needed for backward pass; not stored at inference |
This is why serving large models requires specialized hardware. A 70B-parameter model in float16 needs ~140 GB, more than a single 80GB A100/H100, so it requires multiple GPUs or quantization (an 80GB GPU holds roughly a 30–40B model in fp16). The KV cache alone for a long context can consume as much memory as the weights. Weeks 3 and 4 cover the techniques (quantization, KV cache management, paged attention, speculative decoding) that make this tractable.
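The KV cache size follows directly from the shapes: one K and one V vector of width D per layer, per position, per sequence. The sketch below assumes standard multi-head attention (no MQA/GQA sharing) and 2-byte elements; `kv_cache_bytes` is a hypothetical helper, and the GPT-3 dimensions are the n_layer=96, n_embd=12288 figures quoted in this lesson.

```python
def kv_cache_bytes(n_layer, n_embd, seq_len, batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer, position, and sequence."""
    return 2 * n_layer * seq_len * n_embd * batch_size * bytes_per_elem

tiny = kv_cache_bytes(n_layer=6, n_embd=384, seq_len=256)
gpt3_one = kv_cache_bytes(n_layer=96, n_embd=12288, seq_len=2048)
gpt3_batch = kv_cache_bytes(n_layer=96, n_embd=12288, seq_len=2048, batch_size=32)

print(f"tiny model, T=256, B=1:   {tiny / 1e6:.1f} MB")
print(f"GPT-3 dims, T=2048, B=1:  {gpt3_one / 1e9:.1f} GB")
print(f"GPT-3 dims, T=2048, B=32: {gpt3_batch / 1e9:.0f} GB")
```

The cache grows linearly in both context length and batch size, which is why serving many concurrent long sequences can need more memory for the cache than for the weights.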
In naive generation, to produce token t+1 you run the full forward pass on tokens 0..t. Every linear layer reprocesses all t+1 positions, and the attention mechanism recomputes QK^T/sqrt(d_h) for all pairs, which costs O(t²) per layer. Summed over the T tokens you generate, that is O(T²) token-forward passes through the linear layers and O(T³) attention work. The KV cache short-circuits this: once you have computed the K and V matrices for positions 0..t-1, you store them. For the next token, you compute Q, K, and V for the new position only, append the new K and V to the cache, and attend over the cached K and V. The marginal cost per new token drops to a single-token pass through the linear layers plus O(t) attention work, so generating T tokens costs O(T) linear-layer passes and O(T²) attention in total. The catch: you now need memory proportional to T (to store the cache), and the memory bandwidth to read that cache dominates latency, which motivates quantization, multi-query attention (MQA), and grouped-query attention (GQA), all of which you will see in Week 3.
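Counting the work directly makes the saving tangible. This pure-Python sketch counts, per layer and ignoring any prompt, how many token positions pass through the linear layers and how many query-key scores are computed while generating T tokens, with and without a cache. The function names are hypothetical, and causal masking is assumed, so a prefix of length t has t(t+1)/2 query-key pairs.

```python
def naive_counts(T):
    """Work when every step re-runs the model over the whole prefix."""
    token_passes = sum(t for t in range(1, T + 1))                   # T(T+1)/2
    qk_scores = sum(t * (t + 1) // 2 for t in range(1, T + 1))       # T(T+1)(T+2)/6
    return token_passes, qk_scores

def cached_counts(T):
    """Work when K and V for earlier positions are read from a cache."""
    token_passes = T                                   # one new position per step
    qk_scores = sum(t for t in range(1, T + 1))        # new query vs. t cached keys
    return token_passes, qk_scores

T = 256
naive_passes, naive_scores = naive_counts(T)
cached_passes, cached_scores = cached_counts(T)

print(f"linear-layer token passes: naive {naive_passes:,} vs cached {cached_passes:,}")
print(f"query-key scores:          naive {naive_scores:,} vs cached {cached_scores:,}")
```

For a full 256-token window the cache removes about two orders of magnitude of redundant work in both columns, at the price of keeping every past K and V in memory.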
A trained model you cannot reload is a model you must retrain. Checkpointing is trivial but discipline-forming: it makes you think clearly about which state is essential for training versus inference, and it foreshadows the weight-loading pipeline in your Day 27 capstone.
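from dataclasses import asdict

# Save: everything needed to resume training.
torch.save({
    "model": model.state_dict(),
    "optimizer": opt.state_dict(),
    "config": asdict(cfg),  # plain dict, so weights_only=True can load it
    "step": step,
    "stoi": stoi,  # character-to-index map
    "itos": itos,  # index-to-character map
}, "tinygpt.pt")

# Load for resuming training (need optimizer state + step).
ckpt = torch.load("tinygpt.pt", map_location=device, weights_only=True)
model = GPT(GPTConfig(**ckpt["config"])).to(device)
model.load_state_dict(ckpt["model"])
# Rebuild the optimizer over the new model's parameters, with the same
# param groups as at save time, before restoring its state.
decay = [p for p in model.parameters() if p.ndim >= 2]
no_decay = [p for p in model.parameters() if p.ndim < 2]
opt = torch.optim.AdamW([
    {"params": decay, "weight_decay": 0.1},
    {"params": no_decay, "weight_decay": 0.0},
], lr=3e-4, betas=(0.9, 0.95))
opt.load_state_dict(ckpt["optimizer"])
step = ckpt["step"]

# Load for inference only (just weights + config).
ckpt = torch.load("tinygpt.pt", map_location=device, weights_only=True)
model = GPT(GPTConfig(**ckpt["config"])).to(device)
model.load_state_dict(ckpt["model"])
model.eval()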
Save the optimizer state (including AdamW's running moment estimates) if you intend to resume training. Without it, the moment estimates restart from zero, so the first updates after reload are as large and noisy as at the start of training and the loss typically jumps before recovering. For pure inference, only the model weights and config are needed. We will return to weight formats (safetensors, sharded checkpoints for models that do not fit in one file) in Week 4.
| Path | Device string | Expected wall-clock (5k steps) | Notes |
|---|---|---|---|
| NVIDIA GPU (Ampere/Ada/Hopper) | "cuda" | ~2–5 min (RTX 4090); longer on older GPUs | FlashAttention via F.scaled_dot_product_attention; add autocast for bf16. |
| Apple Silicon (M-series) | "mps" | ~10–20 min on M2/M3 | MPS backend stable as of PyTorch 2.1. MLX is faster for Apple-native models. |
| CPU | "cpu" | Hours for full config; reduce to n_layer=4, n_embd=128, block_size=128 | Still produces a working model — the lesson survives the smaller scale. |
device = ("cuda" if torch.cuda.is_available()
          else "mps" if torch.backends.mps.is_available()
          else "cpu")
print("using device:", device)
If you are on CPU and time-constrained, shrink the config: n_layer=4, n_embd=128, block_size=128, batch_size=32. Run for 1,000 steps instead of 5,000. The val loss will be worse (~2.0) and the samples rougher, but the loop is identical and the learning is real.
Companion notebook: day-9-tiny-gpt.ipynb.
- […] ln(vocab_size). Then break the target shift on purpose (use x as targets instead of x[:, 1:]) and observe the wrong init loss.
- […] train_data = data[:2000] and train for 3,000 steps. Plot the train vs val gap. At what step does val loss start rising?
- […] sum(p.numel() for p in model.parameters()) and compare to the 12D² formula. Then compute the FLOPs-per-token estimate. How does it compare to a single A100's ~310 TFLOPS?
- […] tiktoken and the datasets library, then retrain on TinyStories. With the same 10M-parameter model you get coherent short sentences, evidence that the architecture scales gracefully to richer data.

Close the page and answer from memory. If you can't, re-read the relevant section.

- […] (B,T) through every stage of the model to the logits. What shape exits each major component?
- […] zero_grad use set_to_none=True?
- […] n_layer=12, n_embd=768 (GPT-2 small). What about n_layer=96, n_embd=12288 (GPT-3)?

"The loop is four lines: forward, backward, step, repeat. GPT-3 is this loop with bigger numbers and a thousand GPUs. The difference is scale, not kind."
The canonical "build and train a GPT" references, plus inference-cost essentials.
The definitive walkthrough of exactly today's build, on TinyShakespeare. The single best 2-hour investment for this lesson.
The reference our notebook mirrors. Read train.py and model.py. The most legible production-quality GPT training code.
The single-file version from the video: ~300 lines, fully legible. The ideal reference when you want to check your implementation.
"Verify the loss at init" and other hard-won debugging wisdom. Read this once a year.
The scaling law paper that changed how LLMs are trained. Directly relevant to the FLOPs and parameter-count discussion in this lesson.
What actually happens inside a small GPT as it learns. A mechanistic look at the model you just built.
Tiny models produce coherent text on it. Use it for Exercise 8: swapping in BPE tokenization is the natural next step after char-level.
Apple Silicon-native training. The optional MLX track in the notebook follows this implementation.