Production serving is not one request at a time. Continuous batching keeps the GPU busy between token steps, and PagedAttention stores KV cache like virtual memory so many uneven requests can share one pool.
Day 20 made KV cache memory concrete. Day 21 made attention IO concrete. Day 24 adds the serving problem: requests arrive and finish at different times. Static batches waste slots, while naive KV allocation fragments memory. PagedAttention is the conceptual bridge from a model forward pass to a real inference engine scheduler.
block_size is tokens per physical KV block, often 16 or 32.
logical block is a sequence-local block index.
physical block is an actual slot in the global KV memory pool.
page table maps logical blocks to physical blocks.
free list contains physical blocks not currently used by any sequence.
Four requests have output lengths [32, 128, 64, 256]. A static batch runs until the longest request reaches 256 steps. It reserves 4 * 256 = 1024 token slots. Useful tokens are only 32 + 128 + 64 + 256 = 480, so efficiency is 480 / 1024 = 46.9%. More than half the decode slots are padding or already-finished requests.
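The static-batch arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function name static_batch_efficiency is made up for illustration and ignores prefill and prompt tokens.

```python
def static_batch_efficiency(output_lengths):
    """Return (reserved_slots, useful_tokens, efficiency) for one static batch."""
    steps = max(output_lengths)             # the batch runs until the longest request ends
    reserved = len(output_lengths) * steps  # every request holds a slot on every step
    useful = sum(output_lengths)            # slots that actually produce a wanted token
    return reserved, useful, useful / reserved


reserved, useful, eff = static_batch_efficiency([32, 128, 64, 256])
print(f"reserved={reserved} useful={useful} efficiency={eff:.1%}")
# reserved=1024 useful=480 efficiency=46.9%
```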
Continuous batching schedules at the iteration level. After every decode step, completed sequences leave. New waiting sequences enter immediately if memory and token budget allow. The batch is no longer a fixed group that starts and ends together; it is a changing set of running sequences.
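A toy simulation makes the difference between the two scheduling styles visible. This is a sketch under simplifying assumptions that are not in the lesson: all requests are already waiting at step 0, the only admission budget is a fixed number of running slots (max_running), and prefill cost is ignored. Both function names are hypothetical.

```python
from collections import deque


def continuous_batching_steps(output_lengths, max_running):
    """Count decode iterations when finished sequences are replaced every step."""
    waiting = deque(output_lengths)  # remaining tokens per queued request
    running = []                     # remaining tokens per active sequence
    steps = 0
    while waiting or running:
        # Admit waiting requests while a slot is free (the only budget modeled here).
        while waiting and len(running) < max_running:
            running.append(waiting.popleft())
        # One decode iteration: every running sequence emits one token.
        running = [r - 1 for r in running]
        # Completed sequences leave before the next iteration.
        running = [r for r in running if r > 0]
        steps += 1
    return steps


def static_batching_steps(output_lengths, batch_size):
    """Count decode iterations when each fixed batch runs until its longest member ends."""
    return sum(
        max(output_lengths[i:i + batch_size])
        for i in range(0, len(output_lengths), batch_size)
    )


lengths = [32, 128, 64, 256, 32, 128, 64, 256]
static_steps = static_batching_steps(lengths, 4)
continuous_steps = continuous_batching_steps(lengths, 4)
print(static_steps, continuous_steps)
```

With these eight requests and four slots, the static schedule needs 512 decode iterations and the iteration-level schedule needs 384, because freed slots are refilled instead of idling until the 256-token request ends.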
PagedAttention borrows from operating systems. Instead of requiring one contiguous KV slab per sequence, the engine divides KV memory into fixed-size physical blocks. Each sequence owns a page table from logical block index to physical block index. Position 37 with block_size = 16 maps to logical block 2 and offset 5.
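The page-table idea can be sketched in plain Python. This is one possible shape for a free-list pool and a per-sequence page table, not the layout of any particular engine; append_token and release are hypothetical helpers added for the demo, and the companion notebook may structure things differently.

```python
class BlockPool:
    """Global pool of physical KV blocks tracked with a free list."""

    def __init__(self, num_blocks):
        # Reversed so that pop() hands out block ids 0, 1, 2, ...
        self.free_list = list(range(num_blocks - 1, -1, -1))

    def alloc(self):
        if not self.free_list:
            raise MemoryError("no free KV blocks")
        return self.free_list.pop()

    def free(self, block_id):
        self.free_list.append(block_id)


class PageTable:
    """Per-sequence map from logical block index to physical block id."""

    def __init__(self, pool, block_size=16):
        self.pool = pool
        self.block_size = block_size
        self.blocks = []  # list index is the logical block index

    def append_token(self, position):
        # Allocate a physical block only when a token starts a new logical block.
        if position // self.block_size == len(self.blocks):
            self.blocks.append(self.pool.alloc())

    def position_to_block(self, position):
        logical, offset = divmod(position, self.block_size)
        return logical, self.blocks[logical], offset

    def release(self):
        for block_id in self.blocks:
            self.pool.free(block_id)
        self.blocks.clear()


pool = BlockPool(num_blocks=8)
other = PageTable(pool)
other.append_token(0)        # another sequence takes physical block 0 first
seq = PageTable(pool)
for pos in range(38):        # positions 0..37
    seq.append_token(pos)
print(seq.position_to_block(37))  # (2, 3, 5)
```

Position 37 lands in logical block 2 at offset 5, as in the text, while the physical block id (3 here) depends only on what the free list happened to hand out.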
For a GQA model with 32 layers, 8 KV heads, head_dim = 128, FP16, and block_size = 16, one block costs 32 * 16 * 2 * 8 * 128 * 2 = 2,097,152 bytes, about 2 MB. The factors are layers, tokens per block, K and V, KV heads, head_dim, and bytes per FP16 value. If a GPU has 60 GB free after weights, it can hold roughly 30,000 such blocks.
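The block-size arithmetic is easy to parameterize. In this sketch kv_block_bytes is a hypothetical helper, and "60 GB" is read as 60 GiB, which is an assumption; with decimal gigabytes the count comes out near 28,600 instead.

```python
def kv_block_bytes(num_layers, num_kv_heads, head_dim, block_size, bytes_per_elem):
    """Bytes for one physical KV block across all layers."""
    # The factor 2 covers the separate K and V tensors.
    return num_layers * block_size * 2 * num_kv_heads * head_dim * bytes_per_elem


block_bytes = kv_block_bytes(
    num_layers=32, num_kv_heads=8, head_dim=128, block_size=16, bytes_per_elem=2
)
free_bytes = 60 * 1024**3          # assumption: 60 GiB free after weights
num_blocks = free_bytes // block_bytes
print(block_bytes, num_blocks)     # 2097152 30720
```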
Many chat requests share the same system prompt. Without sharing, 16 conversations with a 200-token system prompt duplicate about 16 * ceil(200/16) = 208 blocks. With copy-on-write, they point to the same 13 physical blocks until user-specific tokens diverge.
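Prefix sharing is usually described in terms of reference counts on physical blocks. The sketch below assumes that scheme; SharedBlockPool and its methods are hypothetical names, freeing is omitted, and a real engine would also copy the KV data when a shared block is split.

```python
import math


class SharedBlockPool:
    """Free list plus reference counts so several sequences can share a block."""

    def __init__(self, num_blocks):
        self.free_list = list(range(num_blocks - 1, -1, -1))
        self.refcount = {}

    def alloc(self):
        block_id = self.free_list.pop()
        self.refcount[block_id] = 1
        return block_id

    def share(self, block_id):
        self.refcount[block_id] += 1
        return block_id

    def copy_on_write(self, block_id):
        """Return a block the caller may write to, copying only if it is shared."""
        if self.refcount[block_id] == 1:
            return block_id
        self.refcount[block_id] -= 1
        return self.alloc()  # a real engine would copy the block's KV data here

    def used(self):
        return len(self.refcount)


block_size, prompt_len, num_chats = 16, 200, 16
prompt_blocks = math.ceil(prompt_len / block_size)      # 13 blocks for 200 tokens
unshared_blocks = num_chats * prompt_blocks             # 208 without sharing

pool = SharedBlockPool(num_blocks=512)
prefix = [pool.alloc() for _ in range(prompt_blocks)]
chats = [[pool.share(b) for b in prefix] for _ in range(num_chats)]
shared_before = pool.used()                             # 13 physical blocks

# Chat 0 writes its first user token at position 200, inside the half-full last block.
chats[0][-1] = pool.copy_on_write(chats[0][-1])
shared_after = pool.used()                              # 14: one private copy
print(unshared_blocks, shared_before, shared_after)
```

Only the partially filled last prompt block ever needs a private copy; the 12 full blocks stay shared for the lifetime of the conversations.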
PagedAttention raises the effective batch size because memory follows actual tokens, not maximum configured sequence length. Continuous batching raises throughput because the GPU sees a steady stream of work. Together they explain why production engines can beat simple generation loops under concurrency.
Implement BlockPool.alloc() and BlockPool.free() with a free list.
Implement PageTable.position_to_block(position) for block_size = 16.
Take the output lengths [32, 128, 64, 256] and compare used blocks with static allocation.
"Continuous batching keeps compute full; PagedAttention keeps memory flexible."
Primary references and the companion notebook for today's exercise.