Capstone Part 4: Quantization, Fusion, Benchmarks, and Postmortem

The final day turns a correct educational engine into a measured artifact: add weight-only quantization, fuse one hot operation, run benchmarks, compare against a production engine, and write the postmortem.

Time~260 min

DifficultyHard

PrerequisiteDays 22, 27-29

Notebookday-30-capstone-pt4

Why This Lesson

Why this optimization matters.

An inference engine is not finished because it generates text. It is finished when you can measure latency, throughput, memory, quality impact, and the next bottleneck. Day 30 makes the capstone honest.

Learning Objectives

What you should be able to do today.

Add per-channel INT8 weight-only quantization to linear layers.
Optionally add group INT4 for weights.
Fuse one operation path or use torch.compile as a baseline.
Run a benchmark matrix for TTFT, TPOT, throughput, and peak memory.
Compare with a production engine on the same task.
Write a one-page engineering postmortem.

Notation Cheatsheet

Decode the symbols before using them.

TTFT is time to first token.
ITL is inter-token latency.
TPOT is time per output token.
peak_mem is maximum device memory used during the run.
speedup is baseline time divided by optimized time.

Weight-Only INT8

Start with the simplest useful quantizer.

For a linear weight matrix [out_features, in_features], compute one scale per output row. Store int8_weight and fp16_scale. A [2048, 2048] FP16 matrix is 8,388,608 bytes; INT8 plus per-row FP16 scales is about 4,198,400 bytes, nearly 2x smaller.

The activation path stays precise while stored weights shrink.

Fuse One Hot Path

Fuse one measured hot path.

Fusion reduces memory traffic and kernel launches. Good capstone targets are RMSNorm plus residual bookkeeping, or SwiGLU gate/up projection fusion. If custom kernels are too much, use torch.compile and measure eager versus compiled.

Fusion wins by removing intermediate memory traffic and launches.

Benchmark Matrix

Leave unknown benchmark cells empty until measured.

Setup	Batch/concurrency	Prompt tokens	Output tokens	TTFT ms	tok/s	Peak memory	Notes
yours FP16	1	TBD	TBD	TBD	TBD	TBD	baseline
yours INT8	1	TBD	TBD	TBD	TBD	TBD	weight-only
yours fused	1	TBD	TBD	TBD	TBD	TBD	fusion target
production engine	1	TBD	TBD	TBD	TBD	TBD	same model/task

Do not fill this table from memory. The notebook writes the schema; the student fills it with measured values.

A speedup without quality and memory context is incomplete.

Gap Analysis

Explain the gap with evidence.

After comparison, identify the largest gap. Likely causes: no FlashAttention decode kernel, Python scheduler overhead, no CUDA Graphs, inefficient dequantization, padding waste, sampling overhead, or tokenizer/server overhead. A professional postmortem separates measured facts from guesses.

The postmortem should map measured symptoms to plausible components.

Final Deliverable Checklist

A final system needs a final rubric.

Model loader works on real weights.
Single-sequence forward verifies against reference logits.
KV cache decode matches uncached outputs.
Sampling loop supports greedy and at least one stochastic mode.
Block pool and page table run multi-sequence scheduling.
INT8 quantization reduces weight memory.
One fusion or compile path is benchmarked.
Benchmark table includes real hardware values.
Postmortem states the biggest bottleneck and next optimization.

The final artifact is a measured system and an honest engineering analysis.

Did You Know?

A detail worth remembering.

torch.compile is not a magic production engine, but it is a useful capstone tool because it can expose which Python-level operations are preventing fusion.

Exercise

Build the habit with code.

Implement per-channel INT8 quantization for all linear weights or a representative subset.
Benchmark FP16 versus INT8 memory and latency.
Fuse one hot path or run torch.compile and compare eager versus optimized.
Run the same model/task on one production engine available on your hardware.
Write a one-page postmortem with benchmark table and next optimization.

Self-Check

Answer these from memory.

Why weight-only INT8 first? It is simple, gives about 2x weight memory reduction, and keeps activations precise.
What does fusion reduce? Intermediate memory traffic and kernel launch overhead.
Why compare to a production engine? It calibrates how far the educational engine is from real serving stacks.
What makes a benchmark professional? Pinned versions, same workload, warmups, real hardware metrics, and honest notes.
What is the final goal of the capstone? A correct, measured engine plus a clear explanation of remaining bottlenecks.