Quantization is the practical reason a model that wants a server can fit on a workstation, a laptop, or a phone. Today you will reduce memory with arithmetic you can compute by hand, then connect the equations to GPTQ, AWQ, GGUF, FP8, and KV-cache quantization.
Day 20 showed that inference is a memory game: weights must be read for every decode step, and KV cache grows with sequence length and batch size. Quantization attacks both costs by storing numbers in fewer bits while keeping enough information for the model to behave almost the same. The point is not to memorize quantization brand names. The point is to know which error each method introduces, where that error shows up, and when the serving win is worth it.
- `b` is bit width. INT8 has b = 8; INT4 has b = 4.
- `q` is the stored integer after quantization.
- `scale` converts an integer step back into model-value units.
- `zero_point` is the integer value that represents real value 0 in asymmetric quantization.
- MSE means mean squared error: the average of (original - dequantized)^2.
- `group_size` is the number of weights that share one scale, often 32, 64, or 128.

Quantization is just a ruler. Suppose a row of weights is [0.10, -0.30, 0.70, -1.00, 0.50]. The minimum is -1.00, the maximum is 0.70, and INT8 has 256 representable levels.
$$
\text{scale} = \frac{\max - \min}{2^8 - 1} = \frac{0.70 - (-1.00)}{255} = \frac{1.70}{255} \approx 0.00667
$$

$$
q = \operatorname{round}\!\left(\frac{x - \min}{\text{scale}}\right)
$$
For x = 0.10, the integer is round((0.10 + 1.00) / scale) = round(165.0) = 165, using the unrounded scale 1.70 / 255. Dequantization reverses the map: x_hat = min + q × scale = -1.00 + 165 × 0.00667 ≈ 0.10. The worst possible rounding error is half a step: scale / 2 ≈ 0.00333.
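The hand arithmetic above can be checked with a few lines of Python. This is a minimal sketch of min/max (asymmetric) quantization, not any particular library's implementation; the function names `quantize_minmax` and `dequantize` are made up for this example.

```python
def quantize_minmax(values, bits=8):
    """Map the range [min, max] onto the integers 0 .. 2**bits - 1."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    if scale == 0:
        # All values are identical: every value maps to integer 0.
        return [0 for _ in values], 1.0, lo
    q = [round((x - lo) / scale) for x in values]
    return q, scale, lo


def dequantize(q, scale, lo):
    """Reverse the map: x_hat = min + q * scale."""
    return [lo + qi * scale for qi in q]


weights = [0.10, -0.30, 0.70, -1.00, 0.50]
q, scale, lo = quantize_minmax(weights, bits=8)
recovered = dequantize(q, scale, lo)

# Largest absolute rounding error; it can never exceed half a step.
max_error = max(abs(x - y) for x, y in zip(weights, recovered))
print(q, round(scale, 5), max_error <= scale / 2)
```

Calling `quantize_minmax(weights, bits=4)` on the same row shows how the step grows when only 16 levels are available.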
| Format | Bytes/param | Range intuition | Typical inference role |
|---|---|---|---|
| FP32 | 4.0 | Huge range, high precision | Training reference, CPU math |
| FP16 | 2.0 | Smaller range, good precision | Common GPU inference weights |
| BF16 | 2.0 | FP32-like range, fewer mantissa bits | Safer training/inference activations |
| FP8 | 1.0 | Small precision, hardware-dependent | H100/B200-style throughput path |
| INT8 | 1.0 | 256 uniform levels | Weight-only inference, KV cache |
| INT4 | 0.5 | 16 levels | Consumer deployment, GGUF/GPTQ/AWQ |
| NF4 | 0.5 | 16 nonuniform normal levels | QLoRA-style Gaussian weights |
A float format spends bits on sign, exponent, and mantissa. An integer format spends all bits on discrete levels inside a scale range. That is why INT4 can be excellent for frozen weights but awkward for activations whose range changes every batch. Weight-only quantization keeps activations in FP16/BF16 and stores weights as integers.
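The Bytes/param column turns into a memory estimate with one multiplication. The sketch below assumes a hypothetical 7-billion-parameter model, counts weights only, and optionally adds one FP16 scale per group (an assumption for illustration; real formats may also store zero points or use other scale widths).

```python
# Bytes per parameter, copied from the table above.
BYTES_PER_PARAM = {
    "FP32": 4.0, "FP16": 2.0, "BF16": 2.0,
    "FP8": 1.0, "INT8": 1.0, "INT4": 0.5, "NF4": 0.5,
}


def weight_memory_gb(num_params, fmt, group_size=None, scale_bytes=2):
    """Weight storage in GB (1 GB = 1e9 bytes), ignoring activations and KV cache.

    If group_size is given, add one scale of scale_bytes per group of weights.
    """
    total_bytes = num_params * BYTES_PER_PARAM[fmt]
    if group_size is not None:
        total_bytes += (num_params / group_size) * scale_bytes
    return total_bytes / 1e9


num_params = 7e9  # assumed parameter count, for illustration only
for fmt in ("FP16", "INT8", "INT4"):
    print(fmt, weight_memory_gb(num_params, fmt))

# INT4 with one FP16 scale per 128 weights: about 4.125 effective bits per weight.
print("INT4 g128", weight_memory_gb(num_params, "INT4", group_size=128))
```

The per-group scales are a small overhead at group size 128, which is part of why group quantization is affordable.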
Per-tensor quantization uses one scale for a whole matrix. If one row has an outlier, every row pays for that outlier with a coarse step. Per-channel quantization uses one scale per output row. Per-group quantization uses one scale per small chunk within a row.
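Concrete split: if [0.10, -0.30, 0.70, -1.00, 0.50] shares one scale, the range is 1.70. If you split it into [0.10, -0.30] and [0.70, -1.00, 0.50], the first group's range drops to 0.40, so its step is about four times finer, while the second group keeps the 1.70 range because it still contains both extremes. A group's ruler is never coarser than the shared one and is usually tighter. Real LLM matrices have many rows with very different ranges, so per-channel and per-group are usually large wins.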
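The effect of scale granularity can be measured directly. The sketch below is a toy comparison, assuming a made-up 2x4 matrix with one small-valued row and one large-valued row, quantized to 4 bits with the same min/max scheme as the worked example. The helper names are hypothetical.

```python
def fake_quantize(values, bits=4):
    """Quantize with a min/max scale, then dequantize straight away."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2 ** bits - 1)
    if scale == 0:
        return list(values)
    return [lo + round((x - lo) / scale) * scale for x in values]


def mse(original, recovered):
    return sum((a - b) ** 2 for a, b in zip(original, recovered)) / len(original)


def per_tensor_mse(rows, bits=4):
    """One scale shared by every value in the matrix."""
    flat = [x for row in rows for x in row]
    return mse(flat, fake_quantize(flat, bits))


def per_group_mse(rows, group_size, bits=4):
    """One scale per chunk of group_size values inside each row."""
    flat, recovered = [], []
    for row in rows:
        for start in range(0, len(row), group_size):
            chunk = row[start:start + group_size]
            flat.extend(chunk)
            recovered.extend(fake_quantize(chunk, bits))
    return mse(flat, recovered)


rows = [
    [0.01, -0.02, 0.03, -0.01],  # small-valued row
    [8.0, -6.0, 4.0, -2.0],      # row with a much larger range
]
mse_tensor = per_tensor_mse(rows)
mse_row = per_group_mse(rows, group_size=4)    # group = whole row, i.e. per-channel
mse_group = per_group_mse(rows, group_size=2)  # two groups per row
print(mse_tensor, mse_row, mse_group)
```

On this toy matrix the shared scale is several times worse than per-row scaling, because the small row is forced onto the large row's coarse ruler. Groups of two are exact here only because a min/max ruler always represents its two endpoints; real group sizes trade that accuracy against the cost of storing more scales.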
GPTQ is post-training quantization that treats quantization as an error-compensation problem. It estimates which columns matter more using approximate second-order information. You can think of the Hessian as a sensitivity table: if changing a weight causes loss to move a lot, quantize it carefully and compensate nearby weights.
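One way to make the sensitivity-table picture concrete is the layer-wise objective from the GPTQ paper, given here as a sketch. The symbols are notation introduced for this note: W is one layer's weight matrix, X is a batch of calibration inputs to that layer, w_q is the weight currently being rounded, and F is the set of weights in the same row that are not yet quantized.

$$
\hat{W} = \arg\min_{\hat{W}} \lVert W X - \hat{W} X \rVert_2^2, \qquad H = 2 X X^{\top}
$$

$$
\delta_F = -\frac{w_q - \operatorname{quant}(w_q)}{[H_F^{-1}]_{qq}} \, (H_F^{-1})_{:,q}
$$

The first line says the Hessian measures how much the layer's output moves, on calibration data, when weights change; it is not the end-to-end training loss. The second line is the compensation step: after w_q is rounded, the remaining weights shift by δ_F to absorb as much of the rounding error as they can.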
AWQ starts from a serving observation: a small number of weight channels, the ones fed by large-magnitude activations, dominate output quality. It scales those weight channels up before quantization so they receive more effective resolution, and applies the inverse scale to the incoming activations (typically folded into the preceding operation) so the layer's full-precision output is unchanged. GGUF is the file format and quantization family used by llama.cpp; names like Q4_K_M encode group quantization choices tuned for local runtimes.
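The AWQ rescaling idea can be seen on a single output row. This is a toy sketch, not the AWQ algorithm: the weights, activations, and the channel scale of 2.0 are hand-picked assumptions, whereas real AWQ searches for scales using calibration data. In this sketch the inverse scale is applied to the input activations, which keeps the full-precision output identical.

```python
def fake_quantize(values, bits=4):
    """Quantize with a min/max scale, then dequantize straight away."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2 ** bits - 1)
    if scale == 0:
        return list(values)
    return [lo + round((v - lo) / scale) * scale for v in values]


def dot(weights, activations):
    return sum(w * a for w, a in zip(weights, activations))


weights = [1.0, 0.26, -1.0]        # one output row of a linear layer
activations = [1.0, 20.0, 1.0]     # input channel 1 has a large activation
channel_scales = [1.0, 2.0, 1.0]   # hand-picked: boost the salient channel

# Scale weights up per input channel and activations down by the same factor.
scaled_weights = [w * s for w, s in zip(weights, channel_scales)]
scaled_activations = [a / s for a, s in zip(activations, channel_scales)]

y_reference = dot(weights, activations)
y_scaled = dot(scaled_weights, scaled_activations)  # same value before quantization

err_plain = abs(dot(fake_quantize(weights), activations) - y_reference)
err_awq = abs(dot(fake_quantize(scaled_weights), scaled_activations) - y_reference)
print(y_reference, y_scaled, err_plain, err_awq)
```

The scaled version has the smaller output error on this input because the rounding error on the salient weight is multiplied by a smaller activation. On a single example rounding can go either way, which is why the scales are chosen over a calibration set in practice.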
Weights are not the only memory consumer. From Day 20, a LLaMA-2 7B-style FP16 KV cache at context 4096 is about 2 GB per batch item. INT8 KV halves that. FP8 KV can also halve it while mapping naturally to newer NVIDIA hardware.
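The 2 GB figure can be reproduced from the model shape. The sketch below uses the LLaMA-2 7B configuration (32 layers, 32 KV heads, head dimension 128) and ignores the small extra storage that quantization scales would need; the function name is made up for this example.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value):
    """Keys and values for every layer, head, and token position."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_value


GIB = 2 ** 30

fp16_bytes = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1, bytes_per_value=2)
int8_bytes = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1, bytes_per_value=1)

print(fp16_bytes / GIB, "GiB in FP16")
print(int8_bytes / GIB, "GiB in INT8 or FP8")
```

Because the formula is linear in `seq_len` and `batch`, doubling either one doubles the cache, while the weight footprint stays fixed.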
The caution is that KV cache stores activations, not fixed weights. Its distribution changes with prompt, position, and layer. That means KV quantization usually needs per-head or per-token scaling and a careful quality check.
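Per-token scaling can be sketched as follows. This is an illustration with symmetric absmax INT8 and one scale per token row, assuming toy values; it is not the scheme of any specific serving engine, and the function names are hypothetical.

```python
def quantize_per_token(kv_rows, bits=8):
    """Symmetric quantization with one absmax-based scale per token row."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8
    quantized, scales = [], []
    for row in kv_rows:
        absmax = max(abs(v) for v in row)
        scale = absmax / qmax if absmax > 0 else 1.0
        quantized.append([round(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales


def dequantize_per_token(quantized, scales):
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Two tokens whose cached values differ in magnitude by roughly 60x.
kv_rows = [
    [0.02, -0.05, 0.01],
    [3.0, -1.4, 0.75],
]
quantized, scales = quantize_per_token(kv_rows)
recovered = dequantize_per_token(quantized, scales)
print(quantized, scales)
```

Each token gets a step matched to its own magnitude, so the small-valued token is not rounded on the large token's ruler. The price is one stored scale per token, plus the quality check mentioned above.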
Use FP16/BF16 when correctness is the baseline or the model is small enough. Use INT8 weight-only when you want an easy 2x memory cut with minimal quality risk. Use GPTQ or AWQ INT4 when weights dominate memory and you have calibrated the exact model. Use GGUF when the runtime is llama.cpp or Apple/CPU local inference. Use FP8 when the hardware and engine have first-class support.
"Quantization is not smaller math; it is the same model stored with a carefully chosen ruler."