LLM Inference Engineer · Day 29
Day 29 · Week 4 · Optimization & Capstone

Capstone Part 3: Continuous Batching and Paged KV Cache

Single-sequence generation is a demo. A serving engine needs request state, admission, block allocation, preemption, and batched decode over variable-length sequences.

Time~250 min
DifficultyHard
PrerequisiteDays 24, 28
Notebookday-29-capstone-pt3
Why This Lesson

Why this optimization matters.

Day 24 explained the idea. Day 29 turns it into your capstone scheduler. The goal is not to reproduce vLLM; it is to write the smallest implementation that proves you understand sequences, page tables, block pools, and continuous batching.

Learning Objectives

What you should be able to do today.

  1. Define Sequence and Scheduler data structures.
  2. Implement BlockPool and per-sequence PageTable.
  3. Admit waiting requests when enough blocks are free.
  4. Run multiple sequences through one scheduler loop.
  5. Track block utilization and compare to static allocation.
Notation Cheatsheet

Decode the symbols before using them.

  • waiting means request has arrived but lacks memory or batch budget.
  • prefill means prompt tokens are being processed.
  • running means decode steps are active.
  • done means EOS or max_new_tokens reached.
  • preempt means free a sequence and requeue it under memory pressure.
Data Structures

Separate requests, scheduler, and memory.

A minimal sequence record needs ID, prompt tokens, generated tokens, page table, state, max_new_tokens, and stop status. The scheduler owns waiting, running, and done lists. The block pool owns physical memory.

Sequence State Machine waiting prefill running done preempted Requests move between queues as memory and tokens become available.
Requests move between queues as memory and tokens become available.
Admission and Preemption

Admission is constrained by blocks.

Each iteration starts by freeing completed sequences. Then the scheduler admits waiting requests if the pool has enough blocks and the token budget allows it. If memory pressure appears during decode, the simple policy is to preempt the longest-running or lowest-priority sequence.

Sequence Page Tables seq A block 0 phys 03 seq A block 1 phys 19 seq B block 0 phys 07 seq B block 1 phys 21 seq B block 2 phys 22 Each sequence owns a small logical-to-physical table.
Each sequence owns a small logical-to-physical table.
Paged KV in the Capstone

Simulate page lookup before writing kernels.

For position pos, compute logical = pos // block_size and offset = pos % block_size. The page table returns the physical block. A real PagedAttention kernel uses this table inside GPU memory reads. Your simulation proves the allocator before kernel optimization.

Arrivals and Completion seq A run done seq B wait run seq C wait run Continuous batching changes the active set over time.
Continuous batching changes the active set over time.
Variable-Length Batching

Padding is acceptable for the first batched engine.

The simple batched path pads shorter sequences to T_max and masks padding. That is acceptable for the capstone because the focus is scheduler behavior. Correctness check: scheduler outputs should match solo greedy outputs.

Static vs Paged Blocks static max 64 blocks paged actual 30 blocks with sharing example The capstone should report block savings, not just tokens/sec.
The capstone should report block savings, not just tokens/sec.
Utilization Evidence

The allocator must produce evidence.

For lengths [32, 128, 64, 256] and block size 16, actual blocks are 2 + 8 + 4 + 16 = 30. Static allocation at max length for batch 4 reserves 64 blocks. Your chart should show blocks in use rising and falling as sequences arrive and finish.

Scheduler Iteration free done admit build batch decode step allocate if full One iteration corresponds to one generated-token step per running sequence.
One iteration corresponds to one generated-token step per running sequence.
Did You Know?

A detail worth remembering.

The scheduler is where model math meets operations policy: fairness, latency, memory, and throughput all become one loop.
Exercise

Build the habit with code.

  1. Implement BlockPool, PageTable, Sequence, and Scheduler in the notebook or engine package.
  2. Simulate arrivals at times 0, 2, 4, and 6 with varied output lengths.
  3. Plot used blocks over time and record peak usage.
  4. Compare scheduler outputs with solo greedy generation for the same prompts.
Self-Check

Answer these from memory.

  1. How does the scheduler know a sequence is done? EOS token or max_new_tokens reached.
  2. What happens when the block pool is exhausted? Do not admit more work, or preempt an existing sequence according to policy.
  3. Why pad in the simple batched path? It keeps matmul shapes rectangular while masks hide padding.
  4. What is the page table growth rule? Allocate a new logical block when the next token would cross a block boundary.
  5. What proves paged cache is working? Peak used blocks are far below static worst case and outputs match solo decoding.

"The scheduler is the inference engine part users never see and always feel."

Day 29 · Week 4
Further Reading

Go deeper.

Primary references and the companion notebook for today's exercise.

Lesson

Day 24 PagedAttention

Conceptual base for this build.

Open notebook
Source

vLLM scheduler

Production scheduler reference.

Open
Notebook

Day 29 notebook

Runnable companion notebook for the lesson.

Open notebook