The capstone starts with the smallest honest engine: load a real decoder-only model configuration, bind weights by name, run one forward pass, and verify logits against a reference implementation.
You have studied tokenization, attention, Transformer blocks, modern LLaMA-style architecture, and inference loops. Day 27 turns that knowledge into source code. The first milestone is intentionally narrow: one prompt, one sequence, no KV cache yet, and correctness before speed.
config.json into a typed LLMConfig.
hidden_size is model width, such as 2048.
n_heads is query attention head count.
n_kv_heads is key/value head count under GQA.
head_dim = hidden_size / n_heads unless explicitly set.
max_abs_diff is the largest absolute logit difference against the reference.
The course assumes TinyLlama-1.1B as the default capstone target because it is LLaMA-like, small enough for accessible hardware, and still structurally realistic. Example values are n_layers = 22, n_heads = 32, n_kv_heads = 4, head_dim = 64, hidden_size = 2048, and intermediate_size = 5632.
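The field definitions above can be sketched as a small typed config. This is a minimal illustration, not the course's reference code: the function name `config_from_dict` is hypothetical, the raw key names assume a Hugging Face LLaMA-style config.json, and `vocab_size = 32000` is an assumed value that the lesson text does not state.

```python
from dataclasses import dataclass

# Raw keys assumed to follow the Hugging Face LLaMA-style config.json layout.
REQUIRED = ("hidden_size", "intermediate_size", "num_hidden_layers",
            "num_attention_heads", "vocab_size")


@dataclass(frozen=True)
class LLMConfig:
    hidden_size: int
    intermediate_size: int
    n_layers: int
    n_heads: int
    n_kv_heads: int
    head_dim: int
    vocab_size: int


def config_from_dict(raw: dict) -> LLMConfig:
    """Build a typed config and fail loudly on missing or inconsistent fields."""
    missing = [k for k in REQUIRED if k not in raw]
    if missing:
        raise KeyError(f"config.json is missing required fields: {missing}")

    hidden = raw["hidden_size"]
    n_heads = raw["num_attention_heads"]
    # Without a key/value head count, fall back to plain multi-head attention.
    n_kv_heads = raw.get("num_key_value_heads", n_heads)
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be a multiple of n_kv_heads under GQA")

    # head_dim = hidden_size / n_heads unless explicitly set.
    head_dim = raw.get("head_dim")
    if head_dim is None:
        if hidden % n_heads != 0:
            raise ValueError("hidden_size must be divisible by n_heads")
        head_dim = hidden // n_heads

    return LLMConfig(hidden, raw["intermediate_size"], raw["num_hidden_layers"],
                     n_heads, n_kv_heads, head_dim, raw["vocab_size"])


# Example values from the lesson; vocab_size is an assumption.
tiny = config_from_dict({
    "hidden_size": 2048, "intermediate_size": 5632, "num_hidden_layers": 22,
    "num_attention_heads": 32, "num_key_value_heads": 4, "vocab_size": 32000,
})
```

With these values each group of 8 query heads shares one key/value head, which is why the K and V projections are narrower than the Q projection.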
A professional engine begins with boring correctness: load config, list tensors, assert shapes, and fail loudly when names do not match. Safetensors is ideal because it stores tensor metadata and avoids pickle execution.
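One way to make "assert shapes and fail loudly" concrete is to derive the expected tensor table from the config and diff it against what the checkpoint actually contains. The sketch below assumes Hugging Face LLaMA-style tensor names and a separate (untied) `lm_head.weight`; the helper names `expected_shapes` and `check_checkpoint` are hypothetical, and `found` stands for a name-to-shape mapping you would read from the safetensors file.

```python
def expected_shapes(n_layers, hidden, inter, n_heads, n_kv_heads, head_dim, vocab):
    """Name -> shape table, assuming HF LLaMA-style names and [out, in] weights."""
    q_out = n_heads * head_dim      # query projection width
    kv_out = n_kv_heads * head_dim  # key/value projection width (smaller under GQA)
    shapes = {
        "model.embed_tokens.weight": (vocab, hidden),
        "model.norm.weight": (hidden,),
        "lm_head.weight": (vocab, hidden),  # assumes embeddings are not tied
    }
    for i in range(n_layers):
        p = f"model.layers.{i}."
        shapes.update({
            p + "input_layernorm.weight": (hidden,),
            p + "self_attn.q_proj.weight": (q_out, hidden),
            p + "self_attn.k_proj.weight": (kv_out, hidden),
            p + "self_attn.v_proj.weight": (kv_out, hidden),
            p + "self_attn.o_proj.weight": (hidden, q_out),
            p + "post_attention_layernorm.weight": (hidden,),
            p + "mlp.gate_proj.weight": (inter, hidden),
            p + "mlp.up_proj.weight": (inter, hidden),
            p + "mlp.down_proj.weight": (hidden, inter),
        })
    return shapes


def check_checkpoint(found, expected):
    """Raise one error listing every missing, unexpected, or mis-shaped tensor."""
    missing = sorted(set(expected) - set(found))
    unexpected = sorted(set(found) - set(expected))
    wrong = {name: (tuple(found[name]), expected[name])
             for name in expected
             if name in found and tuple(found[name]) != expected[name]}
    if missing or unexpected or wrong:
        raise ValueError(
            f"checkpoint mismatch: missing={missing}, "
            f"unexpected={unexpected}, wrong_shape={wrong}")
```

Treating unexpected tensors as errors is a deliberately strict choice for a first milestone; some checkpoints carry extra buffers or tie the LM head to the embeddings, in which case the table would need adjusting.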
Single-sequence forward is token ids -> embeddings -> repeated transformer blocks -> final RMSNorm -> LM head -> logits. The logits shape is [B, T, vocab_size], where B is the batch size and T is the sequence length. For generation you only need the last position, logits[:, -1, :], but for verification you compare the full output against the Hugging Face Transformers reference.
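A quick way to internalize the pipeline is to trace the shape each stage should produce before writing any math. The sketch below is only a shape walk-through: the function name `trace_shapes` is hypothetical, the [B, heads, T, head_dim] layout for attention is one common convention rather than a requirement, and `vocab = 32000` in the example call is an assumed value.

```python
def trace_shapes(B, T, hidden, n_heads, n_kv_heads, head_dim, inter, vocab):
    """Return (stage, shape) pairs for one forward pass without a KV cache."""
    return [
        ("token ids", (B, T)),
        ("embeddings", (B, T, hidden)),
        # Inside each transformer block:
        ("q heads", (B, n_heads, T, head_dim)),
        ("k/v heads", (B, n_kv_heads, T, head_dim)),  # fewer heads under GQA
        ("attention scores", (B, n_heads, T, T)),
        ("attention output", (B, T, hidden)),         # after the output projection
        ("ffn inner", (B, T, inter)),
        ("block output", (B, T, hidden)),
        # After the last block:
        ("final RMSNorm", (B, T, hidden)),
        ("logits", (B, T, vocab)),
        ("last-position logits", (B, vocab)),         # logits[:, -1, :]
    ]


# One prompt of 5 tokens with the lesson's example dimensions.
for stage, shape in trace_shapes(1, 5, 2048, 32, 4, 64, 5632, 32000):
    print(f"{stage:22s} {shape}")
```

Asserting these shapes at each stage of a real forward pass tends to catch transposed weights and wrong head counts before any numerical comparison is needed.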
Correctness means same token IDs, same dtype policy, same position IDs, same attention mask, same model mode (evaluation, not training), and no accidental dropout. Debug by comparing intermediate tensors in order: embeddings first, then norm, Q/K/V projections, attention, FFN, and final logits. The first stage that diverges is where the bug lives.
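The stage-by-stage comparison can be sketched as a small helper that reports the first stage whose max_abs_diff exceeds a tolerance. This is an illustrative sketch: it assumes you have already captured each stage's activations from both engines as flat sequences of floats, the stage keys and the `first_divergence` helper are hypothetical, and a sensible tolerance depends on your dtype policy.

```python
import math

# Comparison order from the lesson; the dictionary keys are hypothetical labels.
STAGES = ["embeddings", "norm", "qkv", "attention", "ffn", "logits"]


def max_abs_diff(ours, ref):
    """Largest absolute element-wise difference between two flat float sequences."""
    if len(ours) != len(ref):
        raise ValueError(f"length mismatch: {len(ours)} vs {len(ref)}")
    worst = 0.0
    for a, b in zip(ours, ref):
        d = abs(a - b)
        if math.isnan(d):
            return math.nan  # surface NaNs instead of letting max() hide them
        worst = max(worst, d)
    return worst


def first_divergence(ours, ref, tol):
    """Return (stage, diff) for the first stage over tolerance, or None if all match."""
    for stage in STAGES:
        diff = max_abs_diff(ours[stage], ref[stage])
        if math.isnan(diff) or diff > tol:
            return stage, diff
    return None
```

Checking stages in forward order matters because every later stage inherits the error of the first broken one; a mismatch in the logits alone says little about where it started.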
After logits match, greedy decode is simple: take argmax of the last logits, append the token, and run forward again. This is intentionally inefficient because Day 28 adds KV cache.
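The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `forward` is any callable that maps the full list of token ids to per-position logits of shape [T][vocab] for a single sequence, `toy_forward` is a stand-in rather than a real model, and the optional `eos_id` stop condition is an addition not mentioned in the lesson.

```python
def greedy_decode(forward, prompt_ids, max_new_tokens, eos_id=None):
    """Greedy decoding without a KV cache: re-run the full forward pass each step."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = forward(ids)   # recomputes every position: O(T) work per new token
        last = logits[-1]       # only the last position predicts the next token
        next_id = max(range(len(last)), key=last.__getitem__)  # argmax, lowest index on ties
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return ids


def toy_forward(ids, vocab=5):
    """Stand-in model: at each position, favor (token + 1) mod vocab."""
    return [[1.0 if v == (t + 1) % vocab else 0.0 for v in range(vocab)]
            for t in ids]


print(greedy_decode(toy_forward, [0], max_new_tokens=3))  # [0, 1, 2, 3]
```

Every step recomputes attention over the whole sequence, so the cost of generating N tokens grows quadratically; that redundancy is exactly what a KV cache removes.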
config.json and assert every required field.
n_kv_heads head_dim, smaller than Q.
"The first engine milestone is not fast text; it is one prompt whose logits you can defend."
Primary references and the companion notebook for today's exercise.