The final day turns a correct educational engine into a measured artifact: add weight-only quantization, fuse one hot operation, run benchmarks, compare against a production engine, and write the postmortem.
An inference engine is not finished because it generates text. It is finished when you can measure latency, throughput, memory, quality impact, and the next bottleneck. Day 30 makes the capstone honest.
torch.compile as a baseline.TTFT is time to first token.ITL is inter-token latency.TPOT is time per output token.peak_mem is maximum device memory used during the run.speedup is baseline time divided by optimized time.For a linear weight matrix [out_features, in_features], compute one scale per output row. Store int8_weight and fp16_scale. A [2048, 2048] FP16 matrix is 8,388,608 bytes; INT8 plus per-row FP16 scales is about 4,198,400 bytes, nearly 2x smaller.
Fusion reduces memory traffic and kernel launches. Good capstone targets are RMSNorm plus residual bookkeeping, or SwiGLU gate/up projection fusion. If custom kernels are too much, use torch.compile and measure eager versus compiled.
| Setup | Batch/concurrency | Prompt tokens | Output tokens | TTFT ms | tok/s | Peak memory | Notes |
|---|---|---|---|---|---|---|---|
| yours FP16 | 1 | TBD | TBD | TBD | TBD | TBD | baseline |
| yours INT8 | 1 | TBD | TBD | TBD | TBD | TBD | weight-only |
| yours fused | 1 | TBD | TBD | TBD | TBD | TBD | fusion target |
| production engine | 1 | TBD | TBD | TBD | TBD | TBD | same model/task |
Do not fill this table from memory. The notebook writes the schema; the student fills it with measured values.
After comparison, identify the largest gap. Likely causes: no FlashAttention decode kernel, Python scheduler overhead, no CUDA Graphs, inefficient dequantization, padding waste, sampling overhead, or tokenizer/server overhead. A professional postmortem separates measured facts from guesses.
torch.compile is not a magic production engine, but it is a useful capstone tool because it can expose which Python-level operations are preventing fusion.torch.compile and compare eager versus optimized."The capstone ends when the engine has numbers, not when it has vibes."
Primary references and the companion notebook for today's exercise.