LLM Inference Engineer · Day 30
Day 30 · Week 4 · Optimization & Capstone

Capstone Part 4: Quantization, Fusion, Benchmarks, and Postmortem

The final day turns a correct educational engine into a measured artifact: add weight-only quantization, fuse one hot operation, run benchmarks, compare against a production engine, and write the postmortem.

Time: ~260 min
Difficulty: Hard
Prerequisite: Days 22, 27-29
Notebook: day-30-capstone-pt4

Why This Lesson

Why this optimization matters.

An inference engine is not finished because it generates text. It is finished when you can measure latency, throughput, memory, quality impact, and the next bottleneck. Day 30 makes the capstone honest.

Learning Objectives

What you should be able to do today.

  1. Add per-channel INT8 weight-only quantization to linear layers.
  2. Optionally add group-wise INT4 weight quantization.
  3. Fuse one operation path or use torch.compile as a baseline.
  4. Run a benchmark matrix for TTFT, TPOT, throughput, and peak memory.
  5. Compare with a production engine on the same task.
  6. Write a one-page engineering postmortem.
Notation Cheatsheet

Decode the symbols before using them.

  • TTFT is time to first token.
  • ITL is inter-token latency.
  • TPOT is time per output token.
  • peak_mem is maximum device memory used during the run.
  • speedup is baseline time divided by optimized time.
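The definitions above can be sketched as a small helper that turns timestamps into metrics. This is a minimal illustration, not the notebook's code: the function names and the `request_start` / `token_times` arguments are hypothetical, and TPOT is taken here as the mean gap between output tokens after the first, which is one common convention among several.

```python
from statistics import mean


def latency_metrics(request_start, token_times):
    """Derive TTFT, ITL, and TPOT (all in seconds) from wall-clock timestamps.

    request_start: time the request was submitted.
    token_times:   time each output token became available, in order.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    # TTFT: submission until the first output token.
    ttft = token_times[0] - request_start
    # ITL: gaps between consecutive output tokens.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    # TPOT (assumed convention): mean decode time per token after the first.
    tpot = mean(itl) if itl else 0.0
    return {"ttft": ttft, "itl": itl, "tpot": tpot}


def speedup(baseline_time, optimized_time):
    """Baseline time divided by optimized time; values above 1 mean faster."""
    return baseline_time / optimized_time
```

Whichever TPOT convention you pick, state it next to the benchmark table so that your numbers and the production engine's numbers mean the same thing.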
Weight-Only INT8

Start with the simplest useful quantizer.

For a linear weight matrix [out_features, in_features], compute one scale per output row. Store int8_weight and fp16_scale. A [2048, 2048] FP16 matrix is 8,388,608 bytes; INT8 plus per-row FP16 scales is about 4,198,400 bytes, nearly 2x smaller.

[Figure: Weight-Only Quantized Linear. FP16 activations and an INT8 weight with per-row scale feed dequant/matmul, producing FP16 output. The activation path stays precise while stored weights shrink.]
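A minimal NumPy sketch of the per-row scheme described above, assuming symmetric quantization to the range [-127, 127]. The function names are hypothetical, and the forward pass is a plain dequantize-then-matmul reference rather than the fused kernel a real engine would use.

```python
import numpy as np


def quantize_per_row_int8(weight):
    """Symmetric per-output-row INT8 quantization of an [out, in] weight matrix."""
    w = weight.astype(np.float32)
    # One scale per output row, stored in FP16 as the lesson describes.
    scale = (np.abs(w).max(axis=1, keepdims=True) / 127.0).astype(np.float16)
    # Guard all-zero rows against division by zero; their codes stay 0.
    safe = np.where(scale == 0, 1, scale).astype(np.float32)
    q = np.clip(np.rint(w / safe), -127, 127).astype(np.int8)
    return q, scale


def int8_linear(x, q, scale):
    """Reference forward pass: dequantize the weight, then matmul. x is [tokens, in]."""
    w = q.astype(np.float32) * scale.astype(np.float32)  # [out, in]
    return (x.astype(np.float32) @ w.T).astype(np.float16)


rng = np.random.default_rng(0)
weight = (rng.standard_normal((2048, 2048)) * 0.02).astype(np.float16)  # synthetic weights
q, scale = quantize_per_row_int8(weight)

fp16_bytes = weight.nbytes           # 2048 * 2048 * 2
int8_bytes = q.nbytes + scale.nbytes  # 2048 * 2048 * 1 + 2048 * 2
```

The byte counts reproduce the arithmetic in the paragraph above. On real weights, also record the output difference between the FP16 and INT8 paths, since memory savings without a quality delta are only half the measurement.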
Fuse One Hot Path

Fuse one measured hot path.

Fusion reduces memory traffic and kernel launches. Good capstone targets are RMSNorm plus residual bookkeeping, or SwiGLU gate/up projection fusion. If custom kernels are too much, use torch.compile and measure eager versus compiled.

[Figure: Unfused vs Fused Work. Unfused: read, norm, write, read, add. Fused: read once, compute, write once. Fusion wins by removing intermediate memory traffic and launches.]
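To make the traffic argument concrete, here is a back-of-the-envelope estimator. It is a deliberately simplified model under stated assumptions, not a profiler: every unfused elementwise op reads its input from device memory and writes its output back, a fused kernel reads once and writes once, and secondary operands such as the residual tensor or norm weights are ignored. The function name and parameters are hypothetical.

```python
def elementwise_chain_traffic(num_elements, bytes_per_element, num_ops, fused):
    """Estimate device-memory traffic for a chain of elementwise ops.

    Assumed model: each unfused op does one full read and one full write of the
    tensor; a fused kernel does one read and one write in total and keeps the
    intermediates in registers.
    """
    if num_ops < 1:
        raise ValueError("need at least one op")
    tensor_bytes = num_elements * bytes_per_element
    passes = 2 if fused else 2 * num_ops
    launches = 1 if fused else num_ops
    return {"bytes_moved": passes * tensor_bytes, "kernel_launches": launches}


# Example shape (illustrative): 256 tokens x 2048 hidden, FP16, norm followed by add.
unfused = elementwise_chain_traffic(256 * 2048, 2, num_ops=2, fused=False)
fused = elementwise_chain_traffic(256 * 2048, 2, num_ops=2, fused=True)
```

Under this model the two-op chain moves half as many bytes when fused and needs one launch instead of two. Treat that as an upper bound on the benefit to check against a measurement, not as a predicted speedup.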
Benchmark Matrix

Leave unknown benchmark cells as TBD until measured.

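| Setup | Batch/concurrency | Prompt tokens | Output tokens | TTFT (ms) | tok/s | Peak memory | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| yours FP16 | 1 | TBD | TBD | TBD | TBD | TBD | baseline |
| yours INT8 | 1 | TBD | TBD | TBD | TBD | TBD | weight-only |
| yours fused | 1 | TBD | TBD | TBD | TBD | TBD | fusion target |
| production engine | 1 | TBD | TBD | TBD | TBD | TBD | same model/task |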

Do not fill this table from memory. The notebook writes the schema; the student fills it with measured values.
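One way to keep the table honest is to represent each row as a record whose measured fields default to "not measured yet", and to time every configuration with the same warmup-then-median loop. The sketch below is an assumption about how such a harness could look; `BenchmarkRow` and `time_call` are hypothetical names, not the notebook's schema.

```python
import statistics
import time
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class BenchmarkRow:
    """One row of the benchmark matrix; None means the cell is not measured yet."""
    setup: str
    batch: int
    prompt_tokens: Optional[int] = None
    output_tokens: Optional[int] = None
    ttft_ms: Optional[float] = None
    tok_per_s: Optional[float] = None
    peak_mem_bytes: Optional[int] = None
    notes: str = ""


def time_call(fn, warmup=3, iters=10):
    """Median wall-clock seconds of fn() over iters runs, after warmup runs."""
    for _ in range(warmup):
        fn()  # warmup absorbs one-time costs such as caching or compilation
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


rows = [
    BenchmarkRow("yours FP16", 1, notes="baseline"),
    BenchmarkRow("yours INT8", 1, notes="weight-only"),
]
```

On a GPU, kernels launch asynchronously, so the callable passed to `time_call` should synchronize the device before it returns (for example with `torch.cuda.synchronize()`); otherwise the timer measures launch overhead rather than execution time.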

[Figure: Benchmark Axes. TTFT: lower is better (latency). tok/s: higher is better (throughput). Memory: lower is better (capacity). Quality delta: small (correctness). A speedup without quality and memory context is incomplete.]
Gap Analysis

Explain the gap with evidence.

After comparison, identify the largest gap. Likely causes: no FlashAttention decode kernel, Python scheduler overhead, no CUDA Graphs, inefficient dequantization, padding waste, sampling overhead, or tokenizer/server overhead. A professional postmortem separates measured facts from guesses.

[Figure: Likely Gap Causes. Attention: no flash kernel. Scheduler: Python overhead. Quant matmul: slow dequant. Sampling: CPU sync. Tokenizer/API: outside model. The postmortem should map measured symptoms to plausible components.]
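A small helper can keep the "measured facts" half of the postmortem mechanical. The sketch below assumes you already have per-component times from your own profiler run; the function name and the shape of its input are hypothetical, and it only sorts and normalizes what you measured.

```python
def rank_components(component_ms):
    """Sort measured per-component times by share of the total, largest first.

    component_ms maps a component name (for example "attention" or "sampling")
    to milliseconds reported by your profiler; nothing here is estimated.
    Returns a list of (name, milliseconds, fraction_of_total) tuples.
    """
    total = sum(component_ms.values())
    if total <= 0:
        raise ValueError("need positive measured time")
    ranked = sorted(component_ms.items(), key=lambda item: item[1], reverse=True)
    return [(name, ms, ms / total) for name, ms in ranked]
```

The top entry is the measured symptom. Which of the likely causes explains it remains a hypothesis until a follow-up experiment confirms it, and the postmortem should label it as such.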
Final Deliverable Checklist

A final system needs a final rubric.

  • Model loader works on real weights.
  • Single-sequence forward verifies against reference logits.
  • KV cache decode matches uncached outputs.
  • Sampling loop supports greedy and at least one stochastic mode.
  • Block pool and page table run multi-sequence scheduling.
  • INT8 quantization reduces weight memory.
  • One fusion or compile path is benchmarked.
  • Benchmark table includes real hardware values.
  • Postmortem states the biggest bottleneck and next optimization.
[Figure: Capstone Finish Loop. Measure, compare, profile, explain, next step. The final artifact is a measured system and an honest engineering analysis.]
Did You Know?

A detail worth remembering.

torch.compile is not a magic production engine, but it is a useful capstone tool: its graph breaks expose which Python-level operations are preventing fusion.
Exercise

Build the habit with code.

  1. Implement per-channel INT8 quantization for all linear weights or a representative subset.
  2. Benchmark FP16 versus INT8 memory and latency.
  3. Fuse one hot path or run torch.compile and compare eager versus optimized.
  4. Run the same model/task on one production engine available on your hardware.
  5. Write a one-page postmortem with benchmark table and next optimization.
Self-Check

Answer these from memory.

  1. Why weight-only INT8 first? It is simple, gives about 2x weight memory reduction relative to FP16, and keeps activations precise.
  2. What does fusion reduce? Intermediate memory traffic and kernel launch overhead.
  3. Why compare to a production engine? It calibrates how far the educational engine is from real serving stacks.
  4. What makes a benchmark professional? Pinned versions, same workload, warmups, real hardware metrics, and honest notes.
  5. What is the final goal of the capstone? A correct, measured engine plus a clear explanation of remaining bottlenecks.

"The capstone ends when the engine has numbers, not when it has vibes."

Further Reading

Go deeper.

Primary references and the companion notebook for today's exercise.

  • Repo: vLLM. Production comparison target.
  • Repo: llama.cpp. Local comparison target.
  • Docs: TensorRT-LLM best practices. Benchmarking and optimization context.
  • Blog: Tim Dettmers, LLM.int8. Quantization intuition.
  • Notebook: Day 30 notebook. Runnable companion notebook for the lesson.