Day 27 · Week 4 · Optimization & Capstone


Capstone Part 1: Model Loader, Weights, and Single-Sequence Forward

The capstone starts with the smallest honest engine: load a real decoder-only model configuration, bind weights by name, run one forward pass, and verify logits against a reference implementation.

Time: ~240 min

Difficulty: Hard

Prerequisite: Days 5-12

Notebook: day-27-capstone-pt1

Why This Lesson

Why this milestone matters.

You have studied tokenization, attention, Transformer blocks, modern LLaMA-style architecture, and inference loops. Day 27 turns that knowledge into source code. The first milestone is intentionally narrow: one prompt, one sequence, no KV cache yet, and correctness before speed.

Learning Objectives

What you should be able to do today.

Read config.json into a typed LLMConfig.
Map safetensors weight names to model submodules.
Build embeddings, RMSNorm, RoPE, attention, SwiGLU FFN, blocks, and LM head.
Run a single forward pass and compare logits to HuggingFace Transformers.
Implement greedy decoding for a short continuation.

Notation Cheatsheet

Decode the symbols before using them.

hidden_size is model width, such as 2048.
n_heads is query attention head count.
n_kv_heads is key/value head count under GQA.
head_dim = hidden_size / n_heads unless explicitly set.
max_abs_diff is the largest absolute logit difference against the reference.

Target Model

Choose a realistic but reachable architecture.

The course assumes TinyLlama-1.1B as the default capstone target because it is LLaMA-like, small enough for accessible hardware, and still structurally realistic. Example values are n_layers = 22, n_heads = 32, n_kv_heads = 4, head_dim = 64, hidden_size = 2048, and intermediate_size = 5632.

Day 27 owns correctness for one sequence before cache or batching.
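
A typed config makes the target dimensions explicit. The sketch below assumes the checkpoint's config.json uses the HuggingFace Llama key names (num_hidden_layers, num_attention_heads, num_key_value_heads, and so on). The vocab_size, rms_norm_eps, and rope_theta values in the demo dict are illustrative assumptions, as are the fallback defaults; read the real values from your checkpoint.

```python
import json
from dataclasses import dataclass

# Key names assume the HuggingFace Llama config.json layout.
REQUIRED_KEYS = (
    "hidden_size", "intermediate_size", "num_hidden_layers",
    "num_attention_heads", "num_key_value_heads", "vocab_size",
)


@dataclass(frozen=True)
class LLMConfig:
    hidden_size: int
    intermediate_size: int
    n_layers: int
    n_heads: int
    n_kv_heads: int
    vocab_size: int
    head_dim: int
    rms_norm_eps: float
    rope_theta: float

    @classmethod
    def from_dict(cls, raw: dict) -> "LLMConfig":
        missing = [k for k in REQUIRED_KEYS if k not in raw]
        if missing:
            raise KeyError(f"config.json is missing required fields: {missing}")
        hidden = int(raw["hidden_size"])
        n_heads = int(raw["num_attention_heads"])
        n_kv_heads = int(raw["num_key_value_heads"])
        if n_heads % n_kv_heads != 0:
            raise ValueError("n_heads must be a multiple of n_kv_heads for GQA")
        # head_dim = hidden_size / n_heads unless the config sets it explicitly.
        head_dim = raw.get("head_dim")
        if head_dim is None:
            if hidden % n_heads != 0:
                raise ValueError("hidden_size is not divisible by n_heads")
            head_dim = hidden // n_heads
        return cls(
            hidden_size=hidden,
            intermediate_size=int(raw["intermediate_size"]),
            n_layers=int(raw["num_hidden_layers"]),
            n_heads=n_heads,
            n_kv_heads=n_kv_heads,
            vocab_size=int(raw["vocab_size"]),
            head_dim=int(head_dim),
            # Fallback defaults are assumptions; prefer the checkpoint's values.
            rms_norm_eps=float(raw.get("rms_norm_eps", 1e-5)),
            rope_theta=float(raw.get("rope_theta", 10000.0)),
        )

    @classmethod
    def from_json_file(cls, path: str) -> "LLMConfig":
        with open(path, encoding="utf-8") as f:
            return cls.from_dict(json.load(f))


# Demo dict using the lesson's example dimensions; vocab_size is assumed.
tinyllama_like = {
    "hidden_size": 2048, "intermediate_size": 5632, "num_hidden_layers": 22,
    "num_attention_heads": 32, "num_key_value_heads": 4, "vocab_size": 32000,
}
cfg = LLMConfig.from_dict(tinyllama_like)
print(cfg.head_dim, cfg.n_heads // cfg.n_kv_heads)  # head_dim, query heads per KV head
```

Raising on a missing field, rather than silently defaulting a dimension, keeps a mismatched checkpoint from producing plausible but wrong logits later.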

Loader Before Model Cleverness

Loading is a correctness problem first.

A professional engine begins with boring correctness: load the config, list the tensors, assert their shapes, and fail loudly when names do not match. Safetensors suits this because its header records each tensor's name, dtype, and shape, and loading it never executes pickled code.
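
The "list tensors, assert shapes, fail loudly" step can be sketched with the standard library alone, because a safetensors file starts with an 8-byte little-endian header length followed by a JSON header. The helper names read_safetensors_header and assert_shapes are hypothetical, and the demo file is a tiny synthetic one written only so the sketch runs without a checkpoint.

```python
import json
import os
import struct
import tempfile


def read_safetensors_header(path):
    """Return {tensor_name: {"dtype", "shape", "data_offsets"}} without reading tensor data."""
    with open(path, "rb") as f:
        prefix = f.read(8)
        if len(prefix) != 8:
            raise ValueError(f"{path}: too short to be a safetensors file")
        (header_len,) = struct.unpack("<Q", prefix)  # little-endian u64
        header = json.loads(f.read(header_len).decode("utf-8"))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header


def assert_shapes(header, expected):
    """Collect every missing name and shape mismatch, then fail once with the full list."""
    problems = []
    for name, shape in expected.items():
        if name not in header:
            problems.append(f"missing tensor: {name}")
        elif tuple(header[name]["shape"]) != tuple(shape):
            problems.append(
                f"{name}: expected {tuple(shape)}, found {tuple(header[name]['shape'])}"
            )
    if problems:
        raise ValueError("checkpoint does not match config:\n" + "\n".join(problems))


# Synthetic one-tensor file (4 float32 values = 16 bytes) for demonstration.
demo_header = {"model.norm.weight": {"dtype": "F32", "shape": [4], "data_offsets": [0, 16]}}
blob = json.dumps(demo_header).encode("utf-8")
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.safetensors")
    with open(demo_path, "wb") as f:
        f.write(struct.pack("<Q", len(blob)) + blob + bytes(16))
    header = read_safetensors_header(demo_path)

assert_shapes(header, {"model.norm.weight": (4,)})
print(sorted(header))
```

In the real loader you would more likely use the safetensors library's safe_open to list keys and fetch tensors; reading the header by hand is shown here only to make the metadata check concrete.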

GQA changes K/V shapes but not Q/O hidden width.
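
To make the K/V shape difference concrete, the sketch below lists the expected weight shapes for one block. It assumes the checkpoint follows HuggingFace Llama tensor naming (model.layers.N.self_attn.q_proj.weight and so on) and PyTorch's [out_features, in_features] layout for linear weights; verify both against the keys in your own file. The function name expected_layer_shapes is hypothetical.

```python
def expected_layer_shapes(layer, hidden_size, n_heads, n_kv_heads, head_dim, intermediate_size):
    """Expected weight shapes for one decoder block, keyed by assumed HF Llama names."""
    p = f"model.layers.{layer}."
    q_width = n_heads * head_dim       # query/output width
    kv_width = n_kv_heads * head_dim   # narrower under GQA
    return {
        p + "input_layernorm.weight": (hidden_size,),
        p + "self_attn.q_proj.weight": (q_width, hidden_size),
        p + "self_attn.k_proj.weight": (kv_width, hidden_size),
        p + "self_attn.v_proj.weight": (kv_width, hidden_size),
        p + "self_attn.o_proj.weight": (hidden_size, q_width),
        p + "post_attention_layernorm.weight": (hidden_size,),
        p + "mlp.gate_proj.weight": (intermediate_size, hidden_size),
        p + "mlp.up_proj.weight": (intermediate_size, hidden_size),
        p + "mlp.down_proj.weight": (hidden_size, intermediate_size),
    }


# Lesson's example dimensions: 32 query heads, 4 KV heads, head_dim 64.
shapes = expected_layer_shapes(0, 2048, 32, 4, 64, 5632)
for name, shape in shapes.items():
    print(f"{name:55s} {shape}")
```

With these values the K and V projections are 256 rows tall (4 x 64) while Q and O keep the full 2048 width, which is exactly the mismatch a loader should expect rather than flag.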

Forward Pass Shape Trace

Trace every tensor shape.

Single-sequence forward is token ids -> embeddings -> repeated transformer blocks -> final RMSNorm -> LM head -> logits. The logits shape is [B, T, vocab_size], where B is the batch size (1 today) and T is the sequence length. For generation you only need logits[:, -1, :], but for verification you compare the full output against Transformers.

Keep shape annotations in the code until verification passes.
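
One way to keep those annotations honest is to write the expected shapes down as data and compare against them while debugging. The sketch below is a bookkeeping aid, not a forward pass: it only computes the shapes each stage should produce. The function name trace_shapes, the [B, heads, T, head_dim] attention layout, and vocab_size = 32000 are assumptions for illustration.

```python
def trace_shapes(B, T, hidden_size, n_heads, n_kv_heads, head_dim,
                 intermediate_size, vocab_size, n_layers):
    """Return (stage, shape) pairs for a single forward pass of a LLaMA-style decoder."""
    rows = [
        ("token_ids", (B, T)),
        ("embeddings", (B, T, hidden_size)),
    ]
    block = [
        ("rmsnorm (pre-attention)", (B, T, hidden_size)),
        ("q", (B, n_heads, T, head_dim)),
        ("k, v", (B, n_kv_heads, T, head_dim)),
        ("k, v repeated to match q heads", (B, n_heads, T, head_dim)),
        ("attention scores", (B, n_heads, T, T)),
        ("attention output (heads merged)", (B, T, n_heads * head_dim)),
        ("o_proj + residual", (B, T, hidden_size)),
        ("rmsnorm (pre-FFN)", (B, T, hidden_size)),
        ("SwiGLU gate and up", (B, T, intermediate_size)),
        ("down_proj + residual", (B, T, hidden_size)),
    ]
    # Every block has identical shapes, so list one and note the repeat count.
    rows += [(f"block (x{n_layers}): {name}", shape) for name, shape in block]
    rows += [
        ("final rmsnorm", (B, T, hidden_size)),
        ("logits", (B, T, vocab_size)),
    ]
    return rows


trace = trace_shapes(B=1, T=5, hidden_size=2048, n_heads=32, n_kv_heads=4,
                     head_dim=64, intermediate_size=5632, vocab_size=32000,
                     n_layers=22)
for stage, shape in trace:
    print(f"{stage:45s} {shape}")
```

Asserting each intermediate tensor's shape against the matching row catches transposed or unrepeated K/V tensors before any numeric comparison is needed.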

Verification Contract

Compare logits component by component.

Correctness means same token IDs, same dtype policy, same position IDs, same attention mask, same model mode, and no accidental dropout. Debug by comparing embeddings first, then norm, Q/K/V projections, attention, FFN, and final logits.

Compare small component outputs before blaming the whole model.
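
The component-by-component comparison can be sketched as a search for the first stage whose max_abs_diff exceeds a tolerance. This version works on nested Python lists so it runs anywhere; with real tensors you would compute the same quantity with your array library. The helper names, the stage names, and the 1e-3 tolerance are assumptions, and lower-precision dtypes typically need a looser threshold.

```python
def max_abs_diff(a, b):
    """Largest absolute elementwise difference between two equally shaped nested lists."""
    if isinstance(a, (list, tuple)):
        if not isinstance(b, (list, tuple)) or len(a) != len(b):
            raise ValueError("shape mismatch between ours and reference")
        return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)
    return abs(a - b)


def first_divergence(ours, reference, order, tol=1e-3):
    """Walk stages in forward order; return (stage, diff) for the first one above tol."""
    for name in order:
        diff = max_abs_diff(ours[name], reference[name])
        if diff > tol:
            return name, diff
    return None  # every captured stage agrees within tol


# Toy captured activations: embeddings and norm agree, q_proj does not.
order = ["embeddings", "norm", "q_proj"]
ours = {"embeddings": [[0.1, 0.2]], "norm": [[1.0, 2.0]], "q_proj": [[0.5, 0.7]]}
reference = {"embeddings": [[0.1, 0.2]], "norm": [[1.0, 2.0]], "q_proj": [[0.5, 0.9]]}

print(first_divergence(ours, reference, order))
```

Stopping at the first divergent stage matters because every later stage inherits the error; a large final-logit diff on its own says nothing about where the bug lives.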

Minimal Greedy Decode

Generate only after logits match.

After logits match, greedy decode is simple: take the argmax of the last position's logits, append that token, and run the forward pass again on the full extended sequence. Recomputing the whole prefix at every step is intentionally inefficient; Day 28 adds the KV cache.

This is correct but slow; Day 28 replaces repeated prefix work with cache.
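
The loop itself is only a few lines. The sketch below assumes a forward_fn that takes the full list of token ids and returns one row of logits per position; toy_forward is a stand-in invented so the loop can run without a model, and eos_id is an optional, hypothetical stop token.

```python
def argmax(row):
    """Index of the largest value; the first index wins on ties."""
    return max(range(len(row)), key=row.__getitem__)


def greedy_decode(forward_fn, prompt_ids, max_new_tokens, eos_id=None):
    """Cache-free greedy decoding: re-run the full sequence for every new token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = forward_fn(ids)        # shape [T][vocab]; the prefix is recomputed
        next_id = argmax(logits[-1])    # only the last position decides the next token
        ids.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return ids


def toy_forward(ids, vocab=5):
    """Stand-in model: at each position, favors (token + 1) % vocab."""
    return [[1.0 if v == (t + 1) % vocab else 0.0 for v in range(vocab)] for t in ids]


out = greedy_decode(toy_forward, [0, 1], max_new_tokens=4)
print(out)
```

Each step here costs a forward pass over a sequence one token longer than the last, which is the repeated prefix work the cache removes.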

Exercise

Build the habit with code.

Download or point to a small LLaMA-style checkpoint.
Load config.json and assert every required field.
List all safetensors keys and map layer 0 shapes.
Run one forward pass and compare logits to Transformers.
Generate 20 greedy tokens and save the prompt, output, and logit diff.

Self-Check

Answer these from memory.

Why verify logits instead of just generated text? Text can match by accident; logits expose numeric component bugs.
What shape does GQA change? K/V projection output width is n_kv_heads * head_dim, smaller than Q.
Why no KV cache today? Correct single-forward math must be proven before optimizing the decode loop.
Common source of RoPE bugs? Wrong position offset or wrong head_dim pairing.
Why use config-driven construction? The same engine can load model variants without hard-coded dimensions.
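
The RoPE answer above is easier to remember with a small sketch. This one rotates a single head vector using the half-split pairing, where element i is paired with element i + head_dim/2, which is the convention the HuggingFace Llama implementation uses; some other codebases pair adjacent elements (2i, 2i+1) instead, and mixing the two against the same weights is a classic source of mismatched logits. The function name rope_rotate and the default theta of 10000 are assumptions.

```python
import math


def rope_rotate(x, position, theta=10000.0):
    """Apply rotary position embedding to one head vector using half-split pairing."""
    d = len(x)
    if d % 2 != 0:
        raise ValueError("head_dim must be even for RoPE")
    half = d // 2
    out = [0.0] * d
    for i in range(half):
        # Pair i rotates by position * theta^(-2i / head_dim).
        angle = position * theta ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = b * c + a * s
    return out


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


q = [1.0, 0.0, 0.0, 1.0]
k = [0.5, -1.0, 2.0, 0.25]

# Position 0 is the identity rotation.
print(rope_rotate(q, 0) == q)

# Attention scores depend only on the relative offset (here 2 in both cases),
# so shifting both positions by the same amount leaves the dot product unchanged.
print(dot(rope_rotate(q, 3), rope_rotate(k, 1)))
print(dot(rope_rotate(q, 5), rope_rotate(k, 3)))
```

The relative-offset property also explains the position-offset bug: if queries and keys are rotated with inconsistent starting positions, every score shifts even though each rotation looks correct in isolation.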

Go deeper.

Primary references and the companion notebook for today's exercise.

safetensors (repo): Safe tensor serialization.
TinyLlama (repo): Default capstone target.
HuggingFace Llama model (source): Reference implementation.
Day 27 notebook (notebook): Runnable companion notebook for the lesson.
