LLM Inference Engineer · Day 24
Day 24 · Week 4 · Optimization & Capstone

Continuous Batching and PagedAttention: vLLM Internals

Production serving is not one request at a time. Continuous batching keeps the GPU busy between token steps, and PagedAttention stores KV cache like virtual memory so many uneven requests can share one pool.

Time~230 min
DifficultyHard
PrerequisiteDays 20, 21
Notebookday-24-block-pool-and-scheduler
Why This Lesson

Why this optimization matters.

Day 20 made KV cache memory concrete. Day 21 made attention IO concrete. Day 24 adds the serving problem: requests arrive and finish at different times. Static batches waste slots, while naive KV allocation fragments memory. PagedAttention is the conceptual bridge from a model forward pass to a real inference engine scheduler.

Learning Objectives

What you should be able to do today.

  1. Compute static batching efficiency for a mix of sequence lengths.
  2. Explain iteration-level continuous batching.
  3. Implement the ideas behind a block pool and page table.
  4. Describe how an attention kernel reads KV through a block table.
  5. Use copy-on-write to reason about shared prefixes and beam search.
Notation Cheatsheet

Decode the symbols before using them.

  • block_size is tokens per physical KV block, often 16 or 32.
  • logical block is a sequence-local block index.
  • physical block is an actual slot in the global KV memory pool.
  • page table maps logical blocks to physical blocks.
  • free list contains physical blocks not currently used by any sequence.
Static Batching Wastes Slots

The longest sequence quietly taxes the whole batch.

Four requests have output lengths [32, 128, 64, 256]. A static batch runs until the longest request reaches 256 steps. It reserves 4 * 256 = 1024 token slots. Useful tokens are only 480, so efficiency is 480 / 1024 = 46.9%. More than half the decode slots are padding or already-finished requests.

Static vs Continuous Batching static A done idle wait straggler continuous A new E new F steady work Continuous batching fills slots after each iteration instead of waiting for the longest request.
Continuous batching fills slots after each iteration instead of waiting for the longest request.
Continuous Batching

Admit and retire work every iteration.

Continuous batching schedules at the iteration level. After every decode step, completed sequences leave. New waiting sequences enter immediately if memory and token budget allow. The batch is no longer a fixed group that starts and ends together; it is a changing set of running sequences.

Scheduler Loop free done admit waiting run one step update blocks preempt if needed The scheduler mutates the running batch every token step.
The scheduler mutates the running batch every token step.
PagedAttention

Make KV cache look contiguous without storing it contiguously.

PagedAttention borrows from operating systems. Instead of requiring one contiguous KV slab per sequence, the engine divides KV memory into fixed-size physical blocks. Each sequence owns a page table from logical block index to physical block index. Position 37 with block_size = 16 maps to logical block 2 and offset 5.

Page Table Lookup logical block 0 physical 17 logical block 1 physical 04 logical block 2 physical 29 logical block 3 physical 08 A sequence can be logically contiguous while its KV blocks are scattered.
A sequence can be logically contiguous while its KV blocks are scattered.
Memory Math

A block is a concrete memory unit.

For a GQA model with 32 layers, 8 KV heads, head_dim = 128, FP16, and block_size = 16, one block costs 32 16 2 8 128 * 2 = 2,097,152 bytes, about 2 MB. If a GPU has 60 GB free after weights, it can hold roughly 30,000 such blocks.

Global Block Pool seq A 6/16 active seq B 9/16 active free 16/16 active The block pool is a free-list-managed cache arena.
The block pool is a free-list-managed cache arena.
Copy-on-Write Prefixes

Shared prefixes should not duplicate cache.

Many chat requests share the same system prompt. Without sharing, 16 conversations with a 200-token system prompt duplicate about 16 * ceil(200/16) = 208 blocks. With copy-on-write, they point to the same 13 physical blocks until user-specific tokens diverge.

Copy-on-Write Prefix Sharing system prompt shared blocks seq A reads seq B reads B diverges allocate new block Shared prefixes save memory until a sequence writes a divergent block.
Shared prefixes save memory until a sequence writes a divergent block.
What an Inference Engineer Cares About

Throughput and memory have to be solved together.

PagedAttention raises the effective batch size because memory follows actual tokens, not maximum configured sequence length. Continuous batching raises throughput because the GPU sees a steady stream of work. Together they explain why production engines can beat simple generation loops under concurrency.

Block Allocation Example static worst case 64 blocks paged actual lengths 30 blocks shared prefix CoW 17 + shared The savings come from allocating actual blocks, not theoretical maximum length.
The savings come from allocating actual blocks, not theoretical maximum length.
Did You Know?

A detail worth remembering.

PagedAttention made KV cache feel like virtual memory: the model sees a long contiguous context, while the allocator sees small blocks that can be reused and shared.
Exercise

Build the habit with code.

  1. Implement BlockPool.alloc() and BlockPool.free() with a free list.
  2. Implement PageTable.position_to_block(position) for block_size = 16.
  3. Simulate continuous batching for lengths [32, 128, 64, 256] and compare used blocks with static allocation.
  4. Add a simple copy-on-write prefix counter and estimate shared-block savings.
Self-Check

Answer these from memory.

  1. Why does static batching waste throughput? It reserves slots until the longest sequence finishes, so short sequences become idle padding.
  2. What does a page table store? A mapping from logical sequence block to physical KV block.
  3. Why does PagedAttention reduce fragmentation? It avoids large contiguous per-sequence allocations and reuses fixed-size blocks.
  4. How does CoW save memory? Multiple sequences share prefix blocks until one diverges and needs a new block.
  5. Why is Day 24 central to the capstone? Days 29-30 need a scheduler and block pool to move beyond single-sequence generation.

"Continuous batching keeps compute full; PagedAttention keeps memory flexible."

Day 24 · Week 4
Further Reading

Go deeper.

Primary references and the companion notebook for today's exercise.

Paper

PagedAttention / vLLM

The core virtual-memory idea for KV cache.

Open
Paper

ORCA

Iteration-level scheduling for LLM serving.

Open
Repo

vLLM block manager

Production allocator code to study.

Open
Docs

HuggingFace TGI

Production serving reference.

Open
Notebook

Day 24 notebook

Runnable companion notebook for the lesson.

Open notebook