Day 29 · Week 4 · Optimization & Capstone

▣

Capstone Part 3: Continuous Batching and Paged KV Cache

Single-sequence generation is a demo. A serving engine needs request state, admission, block allocation, preemption, and batched decode over variable-length sequences.

Time~250 min

DifficultyHard

PrerequisiteDays 24, 28

Notebookday-29-capstone-pt3

Why This Lesson

Why this optimization matters.

Day 24 explained the idea. Day 29 turns it into your capstone scheduler. The goal is not to reproduce vLLM; it is to write the smallest implementation that proves you understand sequences, page tables, block pools, and continuous batching.

Learning Objectives

What you should be able to do today.

Define Sequence and Scheduler data structures.
Implement BlockPool and per-sequence PageTable.
Admit waiting requests when enough blocks are free.
Run multiple sequences through one scheduler loop.
Track block utilization and compare to static allocation.

Notation Cheatsheet

Decode the symbols before using them.

waiting means request has arrived but lacks memory or batch budget.
prefill means prompt tokens are being processed.
running means decode steps are active.
done means EOS or max_new_tokens reached.
preempt means free a sequence and requeue it under memory pressure.

Data Structures

Separate requests, scheduler, and memory.

A minimal sequence record needs ID, prompt tokens, generated tokens, page table, state, max_new_tokens, and stop status. The scheduler owns waiting, running, and done lists. The block pool owns physical memory.

Requests move between queues as memory and tokens become available.

Admission and Preemption

Admission is constrained by blocks.

Each iteration starts by freeing completed sequences. Then the scheduler admits waiting requests if the pool has enough blocks and the token budget allows it. If memory pressure appears during decode, the simple policy is to preempt the longest-running or lowest-priority sequence.

Each sequence owns a small logical-to-physical table.

Paged KV in the Capstone

Simulate page lookup before writing kernels.

For position pos, compute logical = pos // block_size and offset = pos % block_size. The page table returns the physical block. A real PagedAttention kernel uses this table inside GPU memory reads. Your simulation proves the allocator before kernel optimization.

Continuous batching changes the active set over time.

Variable-Length Batching

Padding is acceptable for the first batched engine.

The simple batched path pads shorter sequences to T_max and masks padding. That is acceptable for the capstone because the focus is scheduler behavior. Correctness check: scheduler outputs should match solo greedy outputs.

The capstone should report block savings, not just tokens/sec.

Utilization Evidence

The allocator must produce evidence.

For lengths [32, 128, 64, 256] and block size 16, actual blocks are 2 + 8 + 4 + 16 = 30. Static allocation at max length for batch 4 reserves 64 blocks. Your chart should show blocks in use rising and falling as sequences arrive and finish.

One iteration corresponds to one generated-token step per running sequence.

Exercise

Build the habit with code.

Implement BlockPool, PageTable, Sequence, and Scheduler in the notebook or engine package.
Simulate arrivals at times 0, 2, 4, and 6 with varied output lengths.
Plot used blocks over time and record peak usage.
Compare scheduler outputs with solo greedy generation for the same prompts.

Self-Check

Answer these from memory.

How does the scheduler know a sequence is done? EOS token or max_new_tokens reached.
What happens when the block pool is exhausted? Do not admit more work, or preempt an existing sequence according to policy.
Why pad in the simple batched path? It keeps matmul shapes rectangular while masks hide padding.
What is the page table growth rule? Allocate a new logical block when the next token would cross a block boundary.
What proves paged cache is working? Peak used blocks are far below static worst case and outputs match solo decoding.

Go deeper.

Primary references and the companion notebook for today's exercise.

Lesson

Day 24 PagedAttention

Conceptual base for this build.

Open notebook

Source

vLLM scheduler

Production scheduler reference.

Open

Notebook

Day 29 notebook

Runnable companion notebook for the lesson.