A single forward pass becomes an inference engine when it can prefill once, decode one token at a time from cache, sample controllably, and stop correctly.
Day 27 proved the model can run. Day 28 makes it usable. KV cache is the difference between recomputing the whole prefix every token and appending only the new key/value rows. Sampling turns raw logits into product behavior.
generate(prompt, max_new_tokens, sampling_params)

prefill is the full prompt forward that initializes cache.
decode is one new-token step using existing cache.
cache_len is the number of valid cached positions.
position_offset is the RoPE position where the new token starts.
eos_token_id stops generation.

Update attention to accept cache state. Prefill appends many positions at once. Decode appends one. The cache object should track valid length so uninitialized preallocated memory is never attended to.
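A minimal sketch of such a cache object, assuming NumPy arrays and a single layer with a [n_heads, max_len, head_dim] layout. The class name KVCache, its append method, and the layout are illustrative assumptions, not the course's reference implementation.

```python
import numpy as np

class KVCache:
    """Preallocated K/V buffer for one layer that tracks how many positions are valid."""

    def __init__(self, max_len, n_heads, head_dim, dtype=np.float32):
        # np.empty leaves the memory uninitialized; only [:cache_len] is ever returned.
        self.k = np.empty((n_heads, max_len, head_dim), dtype=dtype)
        self.v = np.empty((n_heads, max_len, head_dim), dtype=dtype)
        self.cache_len = 0

    @property
    def position_offset(self):
        # The next incoming token starts at the position right after the cached ones.
        return self.cache_len

    def append(self, k_new, v_new):
        """Append [n_heads, t, head_dim] rows: t = T for prefill, t = 1 for decode."""
        t = k_new.shape[1]
        end = self.cache_len + t
        if end > self.k.shape[1]:
            raise ValueError("KV cache overflow")
        self.k[:, self.cache_len:end] = k_new
        self.v[:, self.cache_len:end] = v_new
        self.cache_len = end
        # Return views over the valid prefix only, so attention never sees stale memory.
        return self.k[:, :end], self.v[:, :end]
```

Attention would call append with the new keys and values, then attend over the returned slices. Reading position_offset before the append gives the RoPE start position for the incoming tokens.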
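Prefill for a prompt of length T uses a causal [T, T] mask. Decode for one new token uses a single query row against all valid cached keys plus its own key: usually an additive mask of all zeros with shape [1, cache_len + 1], where cache_len counts the positions cached before the new token's K/V are appended. Most KV-cache bugs are off-by-one errors in the mask or in the RoPE offset.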
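The two mask shapes and the RoPE offset can be sketched as follows, assuming additive masks (0 keeps a key, -inf removes it) and a cache_len measured before the new token is appended. The function names are hypothetical.

```python
import numpy as np

def prefill_mask(T):
    """Additive causal mask [T, T]: 0 where key <= query, -inf for future keys."""
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    return np.where(future, -np.inf, 0.0)

def decode_mask(cache_len):
    """Additive mask [1, cache_len + 1] for one new query; nothing is masked."""
    return np.zeros((1, cache_len + 1))

def rope_positions(cache_len, t):
    """Positions for t incoming tokens: they start at the offset, not at 0."""
    return np.arange(cache_len, cache_len + t)
```

A useful sanity check is that the decode mask for cache_len cached positions equals the last row of the prefill mask for a prompt of length cache_len + 1. If those two disagree, the cached and uncached paths will diverge.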
The loop is: tokenize, prefill the prompt and store K/V, sample the first new token from the prefill logits, then repeatedly run a one-token decode, sample, append, and stop on EOS or max_new_tokens.
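A sketch of that loop, working on token ids after tokenization. Here prefill, decode, and sample are hypothetical callables passed in by the caller, and including the EOS token in the returned list is an assumed convention.

```python
def generate(prompt_ids, max_new_tokens, prefill, decode, sample, eos_token_id=None):
    """Return the list of newly generated token ids (EOS included if it is produced).

    prefill(prompt_ids) -> (logits, cache)
    decode(token, cache) -> (logits, cache)
    sample(logits) -> token id
    All three are placeholders for the model-specific pieces.
    """
    if max_new_tokens <= 0:
        return []
    logits, cache = prefill(prompt_ids)        # full prompt forward, fills the cache
    token = sample(logits)                     # first new token comes from prefill logits
    out = [token]
    while token != eos_token_id and len(out) < max_new_tokens:
        logits, cache = decode(token, cache)   # one-token step against cached K/V
        token = sample(logits)
        out.append(token)
    return out
```

Note that max_new_tokens tokens cost one prefill plus max_new_tokens - 1 decode steps, because the first token is sampled from the prefill output.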
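Greedy decoding takes the argmax of the logits and is deterministic. Temperature divides logits before softmax: values below 1 sharpen the distribution, values above 1 flatten it. Top-k removes all but the largest k logits. Top-p sorts probabilities in descending order, keeps the smallest set whose cumulative mass is at least p, and renormalizes. For reference matching, start with greedy.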
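These four behaviors fit in one small function. The sketch below assumes NumPy, a 1-D logits vector, and two conventions that are choices made here rather than requirements: temperature <= 0 means greedy, and top_k = 0 disables top-k. The name sample_token is illustrative.

```python
import numpy as np

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=None):
    logits = np.asarray(logits, dtype=np.float64)
    if temperature <= 0.0:
        return int(np.argmax(logits))            # greedy: deterministic argmax
    logits = logits / temperature                # temperature scales logits before softmax
    if 0 < top_k < logits.size:
        kth = np.sort(logits)[-top_k]            # k-th largest logit
        logits = np.where(logits < kth, -np.inf, logits)  # ties at the threshold are kept
    probs = np.exp(logits - logits.max())        # numerically stable softmax
    probs /= probs.sum()
    if top_p < 1.0:
        order = np.argsort(-probs)               # descending probability
        cum = np.cumsum(probs[order])
        # First index where cumulative mass reaches top_p, inclusive.
        cutoff = int(np.searchsorted(cum, top_p)) + 1
        keep = order[:cutoff]
        filtered = np.zeros_like(probs)
        filtered[keep] = probs[keep]
        probs = filtered / filtered.sum()        # renormalize the kept set
    if rng is None:
        rng = np.random.default_rng()
    return int(rng.choice(probs.size, p=probs))
```

Passing a seeded np.random.default_rng(seed) as rng makes sampled runs repeatable, which helps when comparing cached and uncached generation.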
Measure prompt lengths [32, 128, 512] and output lengths [32, 128, 256]. Compare no-cache and cached generation. Expect cache speedup to grow as context length grows.
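Before timing anything, a rough count of token positions pushed through the model shows why the gap should widen. This is only a proxy under stated assumptions: it ignores that each cached decode step still attends over all cached keys, and the helper name positions_processed is hypothetical. Measured wall-clock ratios will differ.

```python
def positions_processed(prompt_len, new_tokens, use_cache):
    """Token positions run through the model to emit new_tokens tokens (a cost proxy)."""
    if use_cache:
        # Prefill sees the prompt once; each later step feeds only the previous token.
        return prompt_len + (new_tokens - 1)
    # Without a cache, step i re-runs the prompt plus the i tokens generated so far.
    return sum(prompt_len + i for i in range(new_tokens))

for T in (32, 128, 512):
    for N in (32, 128, 256):
        ratio = positions_processed(T, N, False) / positions_processed(T, N, True)
        print(f"prompt={T:4d} new={N:4d} no-cache/cached positions = {ratio:.1f}x")
```

The uncached count grows roughly with the product of context length and output length, while the cached count grows with their sum.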
"KV cache turns generation from replaying history into extending state."
Primary references and the companion notebook for today's exercise.