Single-sequence generation is a demo. A serving engine needs request state, admission, block allocation, preemption, and batched decode over variable-length sequences.
Day 24 explained the idea. Day 29 turns it into your capstone scheduler. The goal is not to reproduce vLLM; it is to write the smallest implementation that proves you understand sequences, page tables, block pools, and continuous batching.
Sequence and Scheduler data structures.
BlockPool and per-sequence PageTable.

waiting means request has arrived but lacks memory or batch budget.
prefill means prompt tokens are being processed.
running means decode steps are active.
done means EOS or max_new_tokens reached.
preempt means free a sequence and requeue it under memory pressure.

A minimal sequence record needs ID, prompt tokens, generated tokens, page table, state, max_new_tokens, and stop status. The scheduler owns waiting, running, and done lists. The block pool owns physical memory.
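The records described above can be sketched as a few small Python classes. This is a minimal sketch, not a reference implementation: the field names (seq_id, stopped), the State enum, and the allocate/release method names are assumptions chosen for illustration, and the page table is modeled as a plain list that maps a logical block index to a physical block ID.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    WAITING = "waiting"
    PREFILL = "prefill"
    RUNNING = "running"
    DONE = "done"


class BlockPool:
    """Owns physical block IDs and knows nothing about sequences."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("block pool exhausted")
        return self.free.pop()

    def release(self, block_id: int) -> None:
        self.free.append(block_id)


@dataclass
class Sequence:
    seq_id: int
    prompt: list[int]
    max_new_tokens: int
    generated: list[int] = field(default_factory=list)
    # Assumed representation: index = logical block, value = physical block.
    page_table: list[int] = field(default_factory=list)
    state: State = State.WAITING
    stopped: bool = False  # set when EOS is produced

    @property
    def num_tokens(self) -> int:
        return len(self.prompt) + len(self.generated)


@dataclass
class Scheduler:
    pool: BlockPool
    waiting: list[Sequence] = field(default_factory=list)
    running: list[Sequence] = field(default_factory=list)
    done: list[Sequence] = field(default_factory=list)
```

Keeping the pool ignorant of sequences makes the ownership split from the text explicit: only the scheduler decides who gets a block, and only the pool tracks which blocks are free.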
Each iteration starts by freeing completed sequences. Then the scheduler admits waiting requests if the pool has enough blocks and the token budget allows it. If memory pressure appears during decode, the simple policy is to preempt the longest-running or lowest-priority sequence.
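One iteration of that loop can be sketched as follows. This is a simplified simulation under stated assumptions: blocks are tracked as counts rather than IDs, prefill is treated as happening at admission, the token budget charges one token per running sequence plus the full length of each admitted sequence, and "longest-running" is approximated by the sequence with the most generated tokens. The names Seq, step, and blocks_needed are hypothetical.

```python
import math

BLOCK_SIZE = 16


def blocks_needed(num_tokens):
    return math.ceil(num_tokens / BLOCK_SIZE)


class Seq:
    def __init__(self, seq_id, prompt_len, max_new_tokens):
        self.seq_id = seq_id
        self.prompt_len = prompt_len
        self.max_new_tokens = max_new_tokens
        self.generated = 0  # tokens decoded so far
        self.blocks = 0     # physical blocks currently held

    @property
    def length(self):
        return self.prompt_len + self.generated

    @property
    def finished(self):
        return self.generated >= self.max_new_tokens


def step(waiting, running, done, free_blocks, token_budget):
    """Run one scheduler iteration and return the new free-block count."""
    # 1. Free completed sequences.
    for seq in [s for s in running if s.finished]:
        running.remove(seq)
        free_blocks += seq.blocks
        seq.blocks = 0
        done.append(seq)

    # 2. Admit waiting requests in FIFO order while memory and budget allow.
    budget = token_budget - len(running)
    while waiting:
        head = waiting[0]
        need = blocks_needed(head.length)
        if need > free_blocks or head.length > budget:
            break
        waiting.pop(0)
        head.blocks = need
        free_blocks -= need
        budget -= head.length
        running.append(head)

    # 3. Decode one token per running sequence, preempting under pressure.
    for seq in list(running):
        if seq not in running:
            continue  # preempted earlier in this loop
        need = blocks_needed(seq.length + 1) - seq.blocks
        while need > free_blocks and len(running) > 1:
            victim = max((s for s in running if s is not seq),
                         key=lambda s: s.generated)
            running.remove(victim)
            free_blocks += victim.blocks
            victim.blocks = 0           # its KV blocks must be rebuilt later
            waiting.insert(0, victim)   # requeue at the front
        if need > free_blocks:
            continue  # nothing left to preempt; stall this sequence
        seq.blocks += need
        free_blocks -= need
        seq.generated += 1
    return free_blocks
```

A preempted sequence keeps its generated-token count but gives up its blocks, so readmission has to pay for its full length again. That is one possible recompute-style policy; swapping blocks out to host memory would be another.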
For position pos, compute logical = pos // block_size and offset = pos % block_size. The page table returns the physical block. A real PagedAttention kernel uses this table inside GPU memory reads. Your simulation proves the allocator before kernel optimization.
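The lookup can be written in a few lines. This sketch assumes the page table is a list indexed by logical block number; the function name translate and the example table contents are made up for illustration.

```python
BLOCK_SIZE = 16


def translate(page_table, pos, block_size=BLOCK_SIZE):
    """Map a token position to (physical block, offset within the block)."""
    logical, offset = divmod(pos, block_size)
    physical = page_table[logical]
    return physical, offset


# Logical blocks 0, 1, 2 live in physical blocks 7, 2, 9.
page_table = [7, 2, 9]
print(translate(page_table, 37))  # (9, 5): 37 // 16 = 2, 37 % 16 = 5
```

If the KV cache is stored as one flat array of slots, the slot index for that token would be physical * block_size + offset.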
The simple batched path pads shorter sequences to T_max and masks padding. That is acceptable for the capstone because the focus is scheduler behavior. Correctness check: scheduler outputs should match solo greedy outputs.
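The padding step might look like the sketch below. It assumes right-padding with a pad token ID of 0 and a 1/0 mask; the helper name pad_batch is hypothetical, and a real engine may pad on the left or use a tensor library instead of plain lists.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token lists to T_max and build a matching 1/0 mask."""
    t_max = max(len(s) for s in sequences)
    tokens = [s + [pad_id] * (t_max - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (t_max - len(s)) for s in sequences]
    return tokens, mask


tokens, mask = pad_batch([[5, 6, 7], [8]])
print(tokens)  # [[5, 6, 7], [8, 0, 0]]
print(mask)    # [[1, 1, 1], [1, 0, 0]]
```

The mask is what keeps padded positions out of attention, which is why the batched outputs can be compared token for token against solo greedy runs.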
For lengths [32, 128, 64, 256] and block size 16, actual blocks are 2 + 8 + 4 + 16 = 30. Static allocation at the max length of 256 reserves 16 blocks per sequence, so batch 4 reserves 4 × 16 = 64 blocks, more than twice what paging uses. Your chart should show blocks in use rising and falling as sequences arrive and finish.
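The arithmetic is easy to check in code. This sketch assumes static allocation means reserving the blocks for the longest sequence in the batch for every slot; the variable names are illustrative.

```python
import math

lengths = [32, 128, 64, 256]
block_size = 16

# Paged: each sequence holds only the blocks its tokens occupy.
paged_blocks = sum(math.ceil(n / block_size) for n in lengths)

# Static: every slot reserves enough blocks for the longest sequence.
static_blocks = len(lengths) * math.ceil(max(lengths) / block_size)

print(paged_blocks, static_blocks)  # 30 64
```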
BlockPool, PageTable, Sequence, and Scheduler in the notebook or engine package.

"The scheduler is the inference engine part users never see and always feel."
Primary references and the companion notebook for today's exercise.