Continuous Batching and PagedAttention: vLLM Internals

Production serving is not one request at a time. Continuous batching keeps the GPU busy between token steps, and PagedAttention stores KV cache like virtual memory so many uneven requests can share one pool.

Time~230 min

DifficultyHard

PrerequisiteDays 20, 21

Notebookday-24-block-pool-and-scheduler

Why This Lesson

Why this optimization matters.

Day 20 made KV cache memory concrete. Day 21 made attention IO concrete. Day 24 adds the serving problem: requests arrive and finish at different times. Static batches waste slots, while naive KV allocation fragments memory. PagedAttention is the conceptual bridge from a model forward pass to a real inference engine scheduler.

Learning Objectives

What you should be able to do today.

Compute static batching efficiency for a mix of sequence lengths.
Explain iteration-level continuous batching.
Implement the ideas behind a block pool and page table.
Describe how an attention kernel reads KV through a block table.
Use copy-on-write to reason about shared prefixes and beam search.

Notation Cheatsheet

Decode the symbols before using them.

block_size is tokens per physical KV block, often 16 or 32.
logical block is a sequence-local block index.
physical block is an actual slot in the global KV memory pool.
page table maps logical blocks to physical blocks.
free list contains physical blocks not currently used by any sequence.

Static Batching Wastes Slots

The longest sequence quietly taxes the whole batch.

Four requests have output lengths [32, 128, 64, 256]. A static batch runs until the longest request reaches 256 steps. It reserves 4 * 256 = 1024 token slots. Useful tokens are only 480, so efficiency is 480 / 1024 = 46.9%. More than half the decode slots are padding or already-finished requests.

Continuous batching fills slots after each iteration instead of waiting for the longest request.

Continuous Batching

Admit and retire work every iteration.

Continuous batching schedules at the iteration level. After every decode step, completed sequences leave. New waiting sequences enter immediately if memory and token budget allow. The batch is no longer a fixed group that starts and ends together; it is a changing set of running sequences.

The scheduler mutates the running batch every token step.

PagedAttention

Make KV cache look contiguous without storing it contiguously.

PagedAttention borrows from operating systems. Instead of requiring one contiguous KV slab per sequence, the engine divides KV memory into fixed-size physical blocks. Each sequence owns a page table from logical block index to physical block index. Position 37 with block_size = 16 maps to logical block 2 and offset 5.

A sequence can be logically contiguous while its KV blocks are scattered.

Memory Math

A block is a concrete memory unit.

For a GQA model with 32 layers, 8 KV heads, head_dim = 128, FP16, and block_size = 16, one block costs 32 16 2 8 128 * 2 = 2,097,152 bytes, about 2 MB. If a GPU has 60 GB free after weights, it can hold roughly 30,000 such blocks.

The block pool is a free-list-managed cache arena.

Copy-on-Write Prefixes

Shared prefixes should not duplicate cache.

Many chat requests share the same system prompt. Without sharing, 16 conversations with a 200-token system prompt duplicate about 16 * ceil(200/16) = 208 blocks. With copy-on-write, they point to the same 13 physical blocks until user-specific tokens diverge.

Shared prefixes save memory until a sequence writes a divergent block.

What an Inference Engineer Cares About

Throughput and memory have to be solved together.

PagedAttention raises the effective batch size because memory follows actual tokens, not maximum configured sequence length. Continuous batching raises throughput because the GPU sees a steady stream of work. Together they explain why production engines can beat simple generation loops under concurrency.

The savings come from allocating actual blocks, not theoretical maximum length.

Did You Know?

A detail worth remembering.

PagedAttention made KV cache feel like virtual memory: the model sees a long contiguous context, while the allocator sees small blocks that can be reused and shared.

Exercise

Build the habit with code.

Implement BlockPool.alloc() and BlockPool.free() with a free list.
Implement PageTable.position_to_block(position) for block_size = 16.
Simulate continuous batching for lengths [32, 128, 64, 256] and compare used blocks with static allocation.
Add a simple copy-on-write prefix counter and estimate shared-block savings.

Self-Check

Answer these from memory.

Why does static batching waste throughput? It reserves slots until the longest sequence finishes, so short sequences become idle padding.
What does a page table store? A mapping from logical sequence block to physical KV block.
Why does PagedAttention reduce fragmentation? It avoids large contiguous per-sequence allocations and reuses fixed-size blocks.
How does CoW save memory? Multiple sequences share prefix blocks until one diverges and needs a new block.
Why is Day 24 central to the capstone? Days 29-30 need a scheduler and block pool to move beyond single-sequence generation.