LLM Inference Engineer · Day 27
Day 27 · Week 4 · Optimization & Capstone

Capstone Part 1: Model Loader, Weights, and Single-Sequence Forward

The capstone starts with the smallest honest engine: load a real decoder-only model configuration, bind weights by name, run one forward pass, and verify logits against a reference implementation.

Time: ~240 min
Difficulty: Hard
Prerequisite: Days 5-12
Notebook: day-27-capstone-pt1
Why This Lesson

Why this milestone matters.

You have studied tokenization, attention, Transformer blocks, modern LLaMA-style architecture, and inference loops. Day 27 turns that knowledge into source code. The first milestone is intentionally narrow: one prompt, one sequence, no KV cache yet, and correctness before speed.

Learning Objectives

What you should be able to do today.

  1. Read config.json into a typed LLMConfig.
  2. Map safetensors weight names to model submodules.
  3. Build embeddings, RMSNorm, RoPE, attention, SwiGLU FFN, blocks, and LM head.
  4. Run a single forward pass and compare logits to HuggingFace Transformers.
  5. Implement greedy decoding for a short continuation.
Notation Cheatsheet

Decode the symbols before using them.

  • hidden_size is model width, such as 2048.
  • n_heads is query attention head count.
  • n_kv_heads is key/value head count under GQA.
  • head_dim = hidden_size / n_heads unless explicitly set.
  • max_abs_diff is the largest absolute logit difference against the reference.
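To make these symbols concrete, the short sketch below derives head_dim and the GQA grouping for the TinyLlama-style values used in this lesson. The function names are illustrative only, and the assumption that consecutive query heads share one K/V head should be checked against the reference implementation you verify against.

```python
def derive_head_dim(hidden_size: int, n_heads: int) -> int:
    """head_dim when the config does not set it explicitly."""
    if hidden_size % n_heads != 0:
        raise ValueError("hidden_size must be divisible by n_heads")
    return hidden_size // n_heads


def kv_head_for_query_head(q_head: int, n_heads: int, n_kv_heads: int) -> int:
    """Assumed GQA layout: consecutive groups of query heads share one K/V head."""
    if n_heads % n_kv_heads != 0:
        raise ValueError("n_heads must be divisible by n_kv_heads")
    group_size = n_heads // n_kv_heads
    return q_head // group_size


head_dim = derive_head_dim(2048, 32)   # 2048 / 32
kv_width = 4 * head_dim                # n_kv_heads * head_dim
shared = [kv_head_for_query_head(h, 32, 4) for h in range(32)]
```

With 32 query heads and 4 K/V heads, each K/V head serves a group of 8 query heads, which is why the K/V projections are narrower than the Q projection.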
Target Model

Choose a realistic but reachable architecture.

The course assumes TinyLlama-1.1B as the default capstone target because it is LLaMA-like, small enough for accessible hardware, and still structurally realistic. Example values are n_layers = 22, n_heads = 32, n_kv_heads = 4, head_dim = 64, hidden_size = 2048, and intermediate_size = 5632.
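A typed config can be sketched as follows. The JSON key names (num_hidden_layers, num_attention_heads, num_key_value_heads, and so on) are the ones used by HuggingFace-style Llama configs; the LLMConfig field names, the fallback rules, and the vocab_size and rms_norm_eps values in the example dict are assumptions for illustration, so check them against the config.json you actually load.

```python
import json
from dataclasses import dataclass

# Keys as they appear in a HuggingFace-style Llama config.json.
REQUIRED_KEYS = (
    "hidden_size", "intermediate_size", "num_hidden_layers",
    "num_attention_heads", "vocab_size", "rms_norm_eps",
)


@dataclass(frozen=True)
class LLMConfig:
    hidden_size: int
    intermediate_size: int
    n_layers: int
    n_heads: int
    n_kv_heads: int
    head_dim: int
    vocab_size: int
    rms_norm_eps: float
    rope_theta: float

    @classmethod
    def from_dict(cls, raw: dict) -> "LLMConfig":
        missing = [k for k in REQUIRED_KEYS if k not in raw]
        if missing:
            raise KeyError(f"config is missing required fields: {missing}")
        n_heads = int(raw["num_attention_heads"])
        hidden = int(raw["hidden_size"])
        # Assumption: fall back to n_heads (plain MHA) and hidden / n_heads when unset.
        n_kv_heads = int(raw.get("num_key_value_heads") or n_heads)
        head_dim = int(raw.get("head_dim") or hidden // n_heads)
        if n_heads % n_kv_heads != 0:
            raise ValueError("n_heads must be a multiple of n_kv_heads")
        return cls(
            hidden_size=hidden,
            intermediate_size=int(raw["intermediate_size"]),
            n_layers=int(raw["num_hidden_layers"]),
            n_heads=n_heads,
            n_kv_heads=n_kv_heads,
            head_dim=head_dim,
            vocab_size=int(raw["vocab_size"]),
            rms_norm_eps=float(raw["rms_norm_eps"]),
            rope_theta=float(raw.get("rope_theta", 10000.0)),  # assumed default
        )

    @classmethod
    def from_json_file(cls, path: str) -> "LLMConfig":
        with open(path, "r", encoding="utf-8") as f:
            return cls.from_dict(json.load(f))


# Illustrative values only; the real file is the source of truth.
example = {
    "hidden_size": 2048, "intermediate_size": 5632, "num_hidden_layers": 22,
    "num_attention_heads": 32, "num_key_value_heads": 4,
    "vocab_size": 32000, "rms_norm_eps": 1e-5,
}
cfg = LLMConfig.from_dict(example)
```

Making the dataclass frozen keeps the config immutable after load, so no module can quietly change a dimension halfway through model construction.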

Figure: Capstone Module Boundary (loader, model, engine, tokenizer, verifier). Day 27 owns correctness for one sequence before cache or batching.
Loader Before Model Cleverness

Loading is a correctness problem first.

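A professional engine begins with boring correctness: load the config, list the tensors, assert their shapes, and fail loudly when names do not match. Safetensors suits this workflow because each file starts with a header describing every tensor's name, dtype, and shape, and because loading it does not unpickle (and so cannot execute) code.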
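As a minimal sketch of "list tensors, assert shapes, fail loudly", the code below reads only the header of a safetensors file: an 8-byte little-endian length followed by a JSON table of tensor names, dtypes, and shapes. In practice you would likely use the safetensors library itself; parsing the header by hand here just keeps the example dependency-free. The helper names read_safetensors_header and assert_shapes are hypothetical.

```python
import json
import struct


def read_safetensors_header(path: str) -> dict:
    """Return {tensor_name: {"dtype", "shape"}} without loading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata, not a tensor
    return {
        name: {"dtype": info["dtype"], "shape": info["shape"]}
        for name, info in header.items()
    }


def assert_shapes(found: dict, expected: dict) -> None:
    """Fail loudly on missing names or mismatched shapes."""
    missing = sorted(set(expected) - set(found))
    if missing:
        raise KeyError(f"missing tensors: {missing}")
    for name, shape in expected.items():
        got = tuple(found[name]["shape"])
        if got != tuple(shape):
            raise ValueError(f"{name}: expected {tuple(shape)}, got {got}")
```

Running this check before constructing any module turns a confusing downstream matmul error into a single message that names the offending tensor.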

Layer Weight Map (weight, shape, role):

  • q_proj [2048, 2048]: query heads
  • k_proj [256, 2048]: 4 KV heads
  • v_proj [256, 2048]: 4 KV heads
  • o_proj [2048, 2048]: attention output
  • down_proj [2048, 5632]: FFN projection

GQA changes K/V shapes but not Q/O hidden width.
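The shapes in the weight map can be derived from the config rather than typed by hand. The sketch below assumes HuggingFace-style Llama tensor names (model.layers.N.self_attn.q_proj.weight and so on) and the [out_features, in_features] layout for linear weights; confirm both against the key listing of your checkpoint. The function name expected_layer_shapes is hypothetical.

```python
def expected_layer_shapes(layer: int, hidden_size: int, n_heads: int,
                          n_kv_heads: int, intermediate_size: int) -> dict:
    """Expected weight shapes for one decoder layer, keyed by tensor name."""
    head_dim = hidden_size // n_heads
    q_width = n_heads * head_dim        # equals hidden_size here
    kv_width = n_kv_heads * head_dim    # narrower than Q under GQA
    p = f"model.layers.{layer}"
    return {
        f"{p}.input_layernorm.weight": (hidden_size,),
        f"{p}.self_attn.q_proj.weight": (q_width, hidden_size),
        f"{p}.self_attn.k_proj.weight": (kv_width, hidden_size),
        f"{p}.self_attn.v_proj.weight": (kv_width, hidden_size),
        f"{p}.self_attn.o_proj.weight": (hidden_size, q_width),
        f"{p}.post_attention_layernorm.weight": (hidden_size,),
        f"{p}.mlp.gate_proj.weight": (intermediate_size, hidden_size),
        f"{p}.mlp.up_proj.weight": (intermediate_size, hidden_size),
        f"{p}.mlp.down_proj.weight": (hidden_size, intermediate_size),
    }


layer0 = expected_layer_shapes(0, hidden_size=2048, n_heads=32,
                               n_kv_heads=4, intermediate_size=5632)
```

Generating this dict for every layer and comparing it with the checkpoint's tensor listing covers most name and shape mistakes before any math runs.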
Forward Pass Shape Trace

Trace every tensor shape.

Single-sequence forward is token ids -> embeddings -> repeated transformer blocks -> final RMSNorm -> LM head -> logits. The logits shape is [B, T, vocab_size]. For generation you only need logits[:, -1, :], but for verification you compare the full output against Transformers.
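Two of the building blocks in this pipeline, RMSNorm and the SwiGLU FFN, are small enough to sketch in plain Python on a single token vector. This is a readability-first illustration using nested lists, not how a real engine would store tensors, and the default eps is an assumption that should come from the config.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """x * rsqrt(mean(x^2) + eps) * weight, applied to one hidden vector."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def matvec(W, x):
    """W is [out, in] as nested lists, matching the stored Linear layout."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def swiglu_ffn(x, w_gate, w_up, w_down):
    """down_proj(silu(gate_proj(x)) * up_proj(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])
```

Keeping each piece this small makes it straightforward to feed the same input to the reference implementation and compare outputs one component at a time.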

Forward Shape Trace: ids [1,T] -> embed [1,T,D] -> blocks [1,T,D] -> lm head [1,T,V] -> last logits [1,V]
Keep shape annotations in the code until verification passes.
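One way to keep shape annotations honest is to compute the expected shapes once and check real tensors against them. The sketch below extends the trace with the intermediate attention shapes; the [B, heads, T, head_dim] layout is one common convention and is an assumption here, as are the helper names trace_shapes and check_shape.

```python
def trace_shapes(T, hidden_size, n_heads, n_kv_heads, vocab_size, B=1):
    """Expected tensor shapes for one forward pass over T tokens."""
    head_dim = hidden_size // n_heads
    return {
        "ids": (B, T),
        "embed": (B, T, hidden_size),
        "q": (B, n_heads, T, head_dim),
        "k": (B, n_kv_heads, T, head_dim),   # fewer heads under GQA
        "v": (B, n_kv_heads, T, head_dim),
        "scores": (B, n_heads, T, T),        # after K/V heads are shared across groups
        "attn_out": (B, T, hidden_size),
        "block_out": (B, T, hidden_size),
        "logits": (B, T, vocab_size),
        "last_logits": (B, vocab_size),
    }


def check_shape(name, actual, expected):
    """Raise with the stage name so failures point at the right component."""
    if tuple(actual) != tuple(expected[name]):
        raise ValueError(f"{name}: expected {expected[name]}, got {tuple(actual)}")


shapes = trace_shapes(T=7, hidden_size=2048, n_heads=32,
                      n_kv_heads=4, vocab_size=32000)
check_shape("embed", (1, 7, 2048), shapes)
```

The vocab_size of 32000 is an illustrative value; read the real one from the config.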
Verification Contract

Compare logits component by component.

A logit comparison is only meaningful when both sides use the same token IDs, the same dtype policy, the same position IDs, the same attention mask, and the same model mode (eval, so there is no accidental dropout). When the logits still disagree, debug by comparing embeddings first, then norm, Q/K/V projections, attention, FFN, and final logits.

Debug Diff Ladder: embedding (first) -> norm/proj (then) -> attention (hard) -> full logits (final)
Compare small component outputs before blaming the whole model.
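The ladder can be sketched as a function that walks the stages in order and reports the first one whose max_abs_diff exceeds a tolerance. This version works on nested Python lists to stay dependency-free; the tolerance of 1e-3 is an assumed placeholder that depends on your dtype policy, and first_divergence is a hypothetical helper name.

```python
def flatten(x):
    """Yield every scalar in an arbitrarily nested list or tuple."""
    if isinstance(x, (list, tuple)):
        for item in x:
            yield from flatten(item)
    else:
        yield float(x)


def max_abs_diff(ours, ref):
    """Largest absolute elementwise difference between two same-sized outputs."""
    a, b = list(flatten(ours)), list(flatten(ref))
    if len(a) != len(b):
        raise ValueError(f"size mismatch: {len(a)} vs {len(b)}")
    return max((abs(p - q) for p, q in zip(a, b)), default=0.0)


def first_divergence(stages, atol=1e-3):
    """stages: ordered (name, ours, ref) triples, earliest component first.

    Returns (name, diff) for the first stage above atol, or None if all pass.
    """
    for name, ours, ref in stages:
        diff = max_abs_diff(ours, ref)
        if diff > atol:
            return name, diff
    return None


stages = [
    ("embedding", [[0.1, 0.2]], [[0.1, 0.2]]),
    ("attention", [[0.5, 0.7]], [[0.5, 0.2]]),
    ("logits", [[1.0, 2.0]], [[9.0, 2.0]]),
]
result = first_divergence(stages)
```

Here the ladder stops at attention even though the logits differ more, because the earliest failing stage is the one worth debugging first.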
Minimal Greedy Decode

Generate only after logits match.

After logits match, greedy decode is simple: take argmax of the last logits, append the token, and run forward again. This is intentionally inefficient because Day 28 adds KV cache.

Greedy Decode Without Cache: forward prompt -> argmax -> append token -> forward all tokens -> repeat
This is correct but slow; Day 28 replaces repeated prefix work with cache.
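The loop above can be sketched independently of any model by passing the forward function in as a callable. The signature below (a function from token ids to a [T, V] list of logits) and the optional eos_id stop condition are assumptions for illustration; toy_forward is a stand-in model, not a real one.

```python
from typing import Callable, List, Optional


def greedy_decode(forward: Callable[[List[int]], List[List[float]]],
                  prompt_ids: List[int],
                  max_new_tokens: int,
                  eos_id: Optional[int] = None) -> List[int]:
    """Cache-free greedy decoding: re-run the full sequence at every step."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = forward(ids)      # [T, V], recomputed for the whole prefix
        last = logits[-1]          # only the final position picks the next token
        next_id = max(range(len(last)), key=last.__getitem__)  # argmax
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return ids


def toy_forward(ids: List[int]) -> List[List[float]]:
    """Stand-in model over a 5-token vocab: always predicts (token + 1) mod 5."""
    vocab = 5
    return [[1.0 if v == (t + 1) % vocab else 0.0 for v in range(vocab)]
            for t in ids]


generated = greedy_decode(toy_forward, [0], max_new_tokens=3)
```

Each step costs a full forward over all tokens so far, so total work grows roughly quadratically with the number of generated tokens. That cost is acceptable while the goal is correctness.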
Did You Know?

A detail worth remembering.

Safetensors is popular not just because it is fast, but because it avoids arbitrary code execution during model loading.
Exercise

Build the habit with code.

  1. Download or point to a small LLaMA-style checkpoint.
  2. Load config.json and assert every required field.
  3. List all safetensors keys and map layer 0 shapes.
  4. Run one forward pass and compare logits to Transformers.
  5. Generate 20 greedy tokens and save the prompt, output, and logit diff.
Self-Check

Answer these from memory.

  1. Why verify logits instead of just generated text? Text can match by accident; logits expose numeric component bugs.
  2. What shape does GQA change? K/V projection output width is n_kv_heads * head_dim, smaller than Q.
  3. Why no KV cache today? Correct single-forward math must be proven before optimizing the decode loop.
  4. Common source of RoPE bugs? Wrong position offset or wrong head_dim pairing.
  5. Why use config-driven construction? The same engine can load model variants without hard-coded dimensions.
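The RoPE answer above names two failure modes, position offset and pairing, and both are visible in a small sketch. The version below rotates one head vector using the half-split pairing (element i with element i + head_dim / 2), which is the convention used by HuggingFace's Llama implementation; an interleaved pairing (2i with 2i + 1) is also common and is not interchangeable with the same weights. The default theta of 10000 is an assumption that should come from the config.

```python
import math


def rope_rotate(x, position, theta=10000.0):
    """Apply rotary position embedding to one head vector at one position.

    Half-split pairing: x[i] is rotated together with x[i + d/2].
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("head_dim must be even for RoPE")
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        angle = position * theta ** (-2.0 * i / d)  # per-pair frequency
        c, s = math.cos(angle), math.sin(angle)
        out[i] = x[i] * c - x[i + half] * s
        out[i + half] = x[i + half] * c + x[i] * s
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]
# Attention scores depend only on the relative offset between positions.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_b = dot(rope_rotate(q, 12), rope_rotate(k, 10))
```

The two scores agree because both pairs of positions are 2 apart. The same property is a useful test for an engine: if shifting every position by a constant changes attention scores, the offset or the pairing is wrong.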

"The first engine milestone is not fast text; it is one prompt whose logits you can defend."

Further Reading

Go deeper.

Primary references and the companion notebook for today's exercise.

  • safetensors (repo): Safe tensor serialization.
  • TinyLlama (repo): Default capstone target.
  • HuggingFace Llama model (source): Reference implementation.
  • Day 27 notebook (notebook): Runnable companion notebook for the lesson.