Capstone Part 2: KV Cache, Masks, and Sampling Loop

A single forward pass becomes an inference engine when it can prefill once, decode one token at a time from cache, sample controllably, and stop correctly.

Time: ~230 min

Difficulty: Hard

Prerequisites: Days 15, 20, 27

Notebook: day-28-capstone-pt2

Why This Lesson

Why this optimization matters.

Day 27 proved the model can run. Day 28 makes it usable. A KV cache is the difference between recomputing the whole prefix for every generated token and appending only the new key/value rows. Sampling turns raw logits into product behavior.

Learning Objectives

What you should be able to do today.

Add optional per-layer KV cache to attention.
Distinguish prefill causal masks from decode masks.
Implement generate(prompt, max_new_tokens, sampling_params).
Support greedy, temperature, top-k, and top-p sampling.
Benchmark no-cache versus cached decode.

Notation Cheatsheet

Decode the symbols before using them.

prefill is the full prompt forward that initializes cache.
decode is one new-token step using existing cache.
cache_len is the number of valid cached positions.
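position_offset is the absolute RoPE position assigned to the first new token; in an unpadded single-sequence decode it equals cache_len.
eos_token_id is the token id that stops generation when it is sampled.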
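To make position_offset concrete, here is a minimal NumPy sketch of RoPE with an explicit offset. The function names, the interleaved (even, odd) pairing, and base=10000.0 are assumptions for illustration; your Day 27 model may pair features differently, in which case only the offset logic carries over.

```python
import numpy as np

def rope_angles(seq_len, head_dim, position_offset=0, base=10000.0):
    """Angles [seq_len, head_dim // 2] for absolute positions starting at position_offset."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    positions = np.arange(position_offset, position_offset + seq_len)
    return np.outer(positions, inv_freq)

def apply_rope(x, angles):
    """Rotate consecutive (even, odd) feature pairs of x [seq_len, head_dim]."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))

# Reference: rotate all 5 positions in one pass (positions 0..4).
full = apply_rope(x, rope_angles(5, 8))

# Decode step for the 5th token: 4 positions are cached, so the offset is 4.
cache_len = 4
step = apply_rope(x[4:5], rope_angles(1, 8, position_offset=cache_len))

# The bug to avoid: rotating every decode token as if it sat at position 0.
drifted = apply_rope(x[4:5], rope_angles(1, 8, position_offset=0))

print(np.allclose(full[4:5], step), np.allclose(full[4:5], drifted))
```

The first comparison holds and the second does not, which is the RoPE drift described in the self-check.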

Cache API

Attention must accept reusable state.

Update attention to accept cache state. Prefill appends many positions at once. Decode appends one. The cache object should track valid length so uninitialized preallocated memory is never attended to.

Prefill initializes the cache; decode extends it one position at a time.
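One possible shape for the cache object is sketched below in NumPy. The class name KVCache, its method names, and the [n_layers, n_heads, max_len, head_dim] layout are assumptions for illustration, not the notebook's API. The point it shows is that reads are always sliced to the valid length, so preallocated but unwritten slots are never attended to.

```python
import numpy as np

class KVCache:
    """Preallocated per-layer K/V buffers plus a valid-length counter (illustrative)."""

    def __init__(self, n_layers, max_len, n_heads, head_dim, dtype=np.float32):
        shape = (n_layers, n_heads, max_len, head_dim)
        # np.empty leaves memory uninitialized; only slots below the valid length are read.
        self.k = np.empty(shape, dtype=dtype)
        self.v = np.empty(shape, dtype=dtype)
        self.max_len = max_len
        self.cache_len = 0  # number of valid cached positions

    def append(self, layer, k_new, v_new):
        """Write k_new/v_new of shape [n_heads, t, head_dim]; return views of the valid K/V."""
        t = k_new.shape[1]
        end = self.cache_len + t
        if end > self.max_len:
            raise ValueError(f"cache overflow: {end} > max_len {self.max_len}")
        self.k[layer, :, self.cache_len:end] = k_new
        self.v[layer, :, self.cache_len:end] = v_new
        return self.k[layer, :, :end], self.v[layer, :, :end]

    def advance(self, t):
        """Commit t new positions once every layer has appended them."""
        self.cache_len += t

cache = KVCache(n_layers=2, max_len=8, n_heads=1, head_dim=4)

# Prefill: append 3 positions per layer, then commit them once.
prompt_kv = np.ones((1, 3, 4), dtype=np.float32)
for layer in range(2):
    cache.append(layer, prompt_kv, prompt_kv)
cache.advance(3)

# Decode: append a single position; attention sees 4 valid rows.
new_kv = np.full((1, 1, 4), 2.0, dtype=np.float32)
k_valid, v_valid = cache.append(0, new_kv, new_kv)
print(k_valid.shape)
```

Advancing cache_len once per forward pass, rather than inside append, keeps every layer writing to the same slots during a step. That is a design choice in this sketch; tracking a length per layer would also work.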

Masks Change Shape

Prefill and decode masks are different objects.

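Prefill for a prompt of length T uses a causal [T, T] mask. Decode for one new token uses a single query row against all valid cached keys plus the token's own key; with an additive mask that row is usually all zeros with shape [1, cache_len + 1], where cache_len counts the positions cached before the new token is appended. Most KV-cache bugs are off-by-one mask or RoPE offset bugs.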

Prefill needs triangular masking; decode has one current query.
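The two mask shapes can be sketched as follows, assuming an additive convention (0 means visible, -inf means blocked) and assuming cache_len counts positions cached before the new token's key is appended. The helper names are hypothetical.

```python
import numpy as np

def prefill_mask(T):
    """Additive causal mask [T, T]: 0 where key j <= query i, -inf above the diagonal."""
    mask = np.zeros((T, T), dtype=np.float32)
    mask[np.triu_indices(T, k=1)] = -np.inf
    return mask

def decode_mask(cache_len):
    """Additive mask [1, cache_len + 1]: the one new query sees every valid key."""
    return np.zeros((1, cache_len + 1), dtype=np.float32)

print(prefill_mask(4))
print(decode_mask(3))

# The decode row is just the last row of the causal mask for the extended sequence.
print(np.array_equal(prefill_mask(4)[-1:], decode_mask(3)))
```

Checking the decode mask against the last row of the prefill mask is a cheap way to catch off-by-one errors in cache_len.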

Generation Loop

Prefill once, then decode from state.

The loop is: tokenize the prompt, prefill it and store K/V, sample the first new token, then repeatedly run a one-token decode, sample, and append, stopping on EOS or after max_new_tokens.

Every generated token appends one K row and one V row per layer.
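The control flow of that loop can be sketched independently of any model. In the version below, prefill, decode_step, and sample are hypothetical callables passed in by the caller, standing in for the lesson's generate(prompt, max_new_tokens, sampling_params); the toy "model" exists only to make the sketch runnable.

```python
def generate(prompt_ids, max_new_tokens, prefill, decode_step, sample, eos_token_id=None):
    """Prefill once, then decode one token at a time until EOS or the token budget."""
    if max_new_tokens <= 0:
        return []
    logits, cache = prefill(prompt_ids)  # one forward pass over the whole prompt
    generated = []
    while True:
        token = sample(logits)
        generated.append(token)  # in this sketch the EOS token is included in the output
        if token == eos_token_id or len(generated) >= max_new_tokens:
            break  # the final token is never fed back, saving one forward pass
        logits, cache = decode_step(token, cache)  # one-token forward against the cache
    return generated

# Toy stand-ins: the "logits" are simply the predicted next id (last token + 1),
# and the "cache" is a list of every token seen so far.
def toy_prefill(ids):
    return ids[-1] + 1, list(ids)

def toy_decode(token, cache):
    cache.append(token)
    return token + 1, cache

print(generate([5, 6], 4, toy_prefill, toy_decode, sample=lambda z: z))
# -> [7, 8, 9, 10]
print(generate([5, 6], 4, toy_prefill, toy_decode, sample=lambda z: z, eos_token_id=9))
# -> [7, 8, 9]
```

Checking the stop condition before calling decode_step means N new tokens cost one prefill plus N - 1 decode steps.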

Sampling Contract

Sampling is part of the engine contract.

Greedy decode is deterministic. Temperature divides logits before softmax. Top-k removes all but the largest k logits. Top-p sorts probabilities in descending order, keeps the smallest set whose cumulative mass is at least p, and renormalizes over that set. For reference matching, start with greedy.

Sampling controls happen after logits and before token selection.
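A minimal NumPy sketch of these four controls in a single function follows. The name sample_token, the parameter defaults, and the convention that temperature == 0 means greedy are assumptions of this sketch, not a fixed API. Filters are applied in the order temperature, top-k, top-p.

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    """Pick one token id from a 1-D logits vector. temperature == 0 means greedy."""
    logits = np.asarray(logits, dtype=np.float64)
    if temperature == 0.0:
        return int(np.argmax(logits))
    logits = logits / temperature
    if 0 < top_k < logits.size:
        kth = np.sort(logits)[-top_k]  # value of the k-th largest logit
        # Ties with the k-th value survive, so slightly more than k may remain.
        logits = np.where(logits < kth, -np.inf, logits)
    probs = np.exp(logits - np.max(logits))  # numerically stable softmax
    probs /= probs.sum()
    if top_p < 1.0:
        order = np.argsort(-probs)  # token ids by descending probability
        cum = np.cumsum(probs[order])
        # Smallest prefix whose cumulative mass reaches top_p.
        cutoff = int(np.searchsorted(cum, top_p)) + 1
        keep = order[:cutoff]
        filtered = np.zeros_like(probs)
        filtered[keep] = probs[keep]
        probs = filtered / filtered.sum()
    if rng is None:
        rng = np.random.default_rng()
    return int(rng.choice(probs.size, p=probs))

logits = [1.0, 3.0, 2.0]
print(sample_token(logits, temperature=0.0))  # greedy: index of the largest logit
print(sample_token(logits, temperature=0.7, top_k=2, top_p=0.9,
                   rng=np.random.default_rng(0)))
```

Passing a seeded generator makes sampled runs reproducible, which helps when comparing cached and uncached paths beyond greedy.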

Benchmark Shape

Benchmark the exact shape split.

Measure prompt lengths [32, 128, 512] and output lengths [32, 128, 256]. Compare no-cache and cached generation over the full grid. Expect the cache speedup to grow as context length grows, because the no-cache path re-runs the entire prefix on every step.
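Before timing anything, it helps to know roughly what to expect. The sketch below counts how many token positions each strategy pushes through the model over the grid, and adds a small timing helper you could wrap around your own generate calls. The helper names are hypothetical, and the position count is only a rough proxy: it assumes one prefill plus one decode step per additional token, and it ignores that attention cost per step still grows with context.

```python
import time

PROMPT_LENS = [32, 128, 512]
OUTPUT_LENS = [32, 128, 256]

def positions_processed(prompt_len, new_tokens, use_cache):
    """Count token positions run through the model to emit new_tokens tokens."""
    if use_cache:
        # One prefill over the prompt, then one position per remaining token.
        return prompt_len + (new_tokens - 1)
    # Without a cache, step i re-runs the whole prefix of length prompt_len + i.
    return sum(prompt_len + i for i in range(new_tokens))

def best_time(fn, repeats=3):
    """Best wall-clock seconds over several runs of a zero-argument callable."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

for T in PROMPT_LENS:
    for N in OUTPUT_LENS:
        no_cache = positions_processed(T, N, use_cache=False)
        cached = positions_processed(T, N, use_cache=True)
        print(f"T={T:4d} N={N:4d} no-cache={no_cache:7d} cached={cached:4d} "
              f"ratio={no_cache / cached:6.1f}x")
```

Measured speedups will usually be smaller than these ratios, since prefill processes many positions in parallel while decode steps are often memory-bound. Treat the counts as a sanity check on the trend, not as a latency prediction.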

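Token mismatches between cached and uncached runs are symptoms; common root causes are off-by-one masks, wrong RoPE offsets, and attending to stale or uninitialized cache slots.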

Did You Know?

A detail worth remembering.

A KV cache changes latency, not the mathematical output. With the same weights, positions, and masks, cached and uncached logits should match up to floating-point noise.
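This equivalence is easy to check on a toy. The sketch below uses a single attention head with random weights and no RoPE, so attention outputs stand in for logits; all names and sizes are illustrative. It compares one causal pass over the full sequence against a prefill of the first 3 tokens followed by one-token decode steps.

```python
import numpy as np

def attention(q, k, v, mask):
    """Scaled dot-product attention with an additive mask. q: [Tq, d], k and v: [Tk, d]."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + mask
    scores = scores - scores.max(axis=-1, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
T, d, P = 6, 8, 3  # sequence length, head dim, prefill length
x = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

# Uncached reference: one pass over all T tokens with a causal [T, T] mask.
causal = np.zeros((T, T))
causal[np.triu_indices(T, k=1)] = -np.inf
full = attention(x @ Wq, x @ Wk, x @ Wv, causal)

# Cached path: prefill the first P tokens, then decode the rest one at a time.
k_cache, v_cache = x[:P] @ Wk, x[:P] @ Wv
outs = [attention(x[:P] @ Wq, k_cache, v_cache, causal[:P, :P])]
for t in range(P, T):
    xt = x[t:t + 1]
    k_cache = np.vstack([k_cache, xt @ Wk])  # append one K row
    v_cache = np.vstack([v_cache, xt @ Wv])  # append one V row
    outs.append(attention(xt @ Wq, k_cache, v_cache, np.zeros((1, t + 1))))
cached = np.vstack(outs)

print("max abs difference:", np.abs(full - cached).max())
```

Any difference should be at the level of floating-point rounding. A larger gap in a real model usually points at the mask shape, the cache length, or the position offset.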

Exercise

Build the habit with code.

Add cache state to attention and verify cached logits match uncached logits on a toy sequence.
Implement greedy and temperature/top-k/top-p sampling.
Generate 100 greedy tokens twice and verify determinism.
Benchmark no-cache versus cache for the prompt/output grid.

Self-Check

Answer these from memory.

Why no triangular mask for one-token decode? The single new query can see all valid past keys; there are no future query positions in that row.
What does cache_len protect against? Attending to uninitialized or stale cache slots.
Why compare greedy first? It removes sampling randomness from correctness tests.
What causes RoPE drift? Using position 0 for every decode step instead of the absolute token position.
What is the speedup source? Avoiding repeated K/V projection and prefix computation during decode.