Day 22 · Week 4 · Optimization & Capstone


Quantization Deep Dive: FP8, INT8, INT4, GPTQ, AWQ, GGUF

Quantization is the practical reason a model that wants a server can fit on a workstation, a laptop, or a phone. Today you will reduce memory with arithmetic you can compute by hand, then connect the equations to GPTQ, AWQ, GGUF, FP8, and KV-cache quantization.

Time: ~210 min

Difficulty: Medium-Hard

Prerequisite: Days 13, 20

Notebook: day-22-int8-int4-quantization

Why This Lesson

Why this optimization matters.

Day 20 showed that inference is a memory game: weights must be read for every decode step, and KV cache grows with sequence length and batch size. Quantization attacks both costs by storing numbers in fewer bits while keeping enough information for the model to behave almost the same. The point is not to memorize quantization brand names. The point is to know which error each method introduces, where that error shows up, and when the serving win is worth it.

Learning Objectives

What you should be able to do today.

Compute scale, zero point, quantized values, dequantized values, and round-trip error for a concrete vector.
Compare FP32, FP16, BF16, FP8, INT8, INT4, and NF4 by byte cost, dynamic range, and inference use case.
Explain why per-channel and per-group quantization usually beat one scale for an entire tensor.
Describe GPTQ, AWQ, GGUF K-quants, and KV-cache quantization without hand-waving.
Choose a quantization scheme for CPU, Apple Silicon, NVIDIA throughput, and quality-sensitive deployments.

Notation Cheatsheet

Decode the symbols before using them.

b is bit width. INT8 has b = 8; INT4 has b = 4.
q is the stored integer after quantization.
scale converts an integer step back into model-value units.
zero_point is the integer value that represents real value 0 in asymmetric quantization.
MSE means mean squared error: average of (original - dequantized)^2.
group_size is the number of weights that share one scale, often 32, 64, or 128.
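
To connect scale, zero_point, and q, here is a minimal sketch of asymmetric (affine) quantization on a small example vector. The function names `affine_quantize` and `affine_dequantize` are illustrative, not from any library, and the min/max calibration is only one of several ways to choose the range.

```python
import numpy as np

def affine_quantize(x, bits=8):
    """Asymmetric quantization to unsigned integers in [0, 2**bits - 1] (bits <= 8 here)."""
    qmax = 2**bits - 1
    scale = (x.max() - x.min()) / qmax            # real-value width of one integer step
    zero_point = int(round(-x.min() / scale))     # integer that represents real 0.0
    q = np.clip(np.round(x / scale) + zero_point, 0, qmax).astype(np.uint8)
    return q, scale, zero_point

def affine_dequantize(q, scale, zero_point):
    # Shift back by zero_point, then convert integer steps to model-value units.
    return scale * (q.astype(np.float64) - zero_point)

x = np.array([0.10, -0.30, 0.70, -1.00, 0.50])
q, scale, zero_point = affine_quantize(x)
x_hat = affine_dequantize(q, scale, zero_point)
mse = float(np.mean((x - x_hat) ** 2))

print(q.tolist(), zero_point)   # [165, 105, 255, 0, 225] 150
```

Symmetric quantization is the special case where the range is centered on zero and zero_point is fixed at 0, which is why it is common for weights.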

One Row First

Start with one row of weights.

Quantization is just a ruler. Suppose a row of weights is [0.10, -0.30, 0.70, -1.00, 0.50]. The minimum is -1.00, the maximum is 0.70, and INT8 has 256 representable levels.

scale = (max - min) / (2^8 - 1)
      = (0.70 - (-1.00)) / 255
      = 1.70 / 255
      = 0.00667

q = round((x - min) / scale)

For x = 0.10, the integer is round((0.10 + 1.00) / 0.00667) = round(165.0) = 165. Dequantization reverses the map: x_hat = min + q * scale = -1.00 + 165 * 0.00667 = 0.10. The worst possible rounding error is half a step: scale / 2 = 0.00333.

The operation is a lossy ruler: store an integer, recover an approximation.
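
The hand calculation above can be checked in a few lines. This is a sketch of the same min-offset formula, with hypothetical helper names `quantize_row` and `dequantize_row`; it also tests the half-step error bound on 1000 random values, which are an assumption added for illustration.

```python
import numpy as np

def quantize_row(x, bits=8):
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2**bits - 1)                 # one ruler tick
    q = np.round((x - lo) / scale).astype(np.int64)   # nearest tick index
    return q, scale, lo

def dequantize_row(q, scale, lo):
    return lo + q * scale

row = np.array([0.10, -0.30, 0.70, -1.00, 0.50])
q, scale, lo = quantize_row(row)
row_hat = dequantize_row(q, scale, lo)

# Random values in the same range rarely land exactly on a tick,
# so they exercise the worst-case bound of scale / 2.
rng = np.random.default_rng(0)
noisy = rng.uniform(-1.0, 0.7, size=1000)
qn, sn, ln = quantize_row(noisy)
max_err = float(np.max(np.abs(noisy - dequantize_row(qn, sn, ln))))

print(q.tolist())            # [165, 105, 255, 0, 225]
print(max_err <= sn / 2 + 1e-12)
```

The example row happens to sit almost exactly on the INT8 grid, so its round-trip error is near zero; the random row is the more honest picture of typical error.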

Formats Are Not Interchangeable

Bytes, range, and precision are separate decisions.

Format	Bytes/param	Range intuition	Typical inference role
FP32	4.0	Huge range, high precision	Training reference, CPU math
FP16	2.0	Smaller range, good precision	Common GPU inference weights
BF16	2.0	FP32-like range, fewer mantissa bits	Safer training/inference activations
FP8	1.0	Small precision, hardware-dependent	H100/B200-style throughput path
INT8	1.0	256 uniform levels	Weight-only inference, KV cache
INT4	0.5	16 levels	Consumer deployment, GGUF/GPTQ/AWQ
NF4	0.5	16 nonuniform normal levels	QLoRA-style Gaussian weights

A float format spends bits on sign, exponent, and mantissa. An integer format spends all bits on discrete levels inside a scale range. That is why INT4 can be excellent for frozen weights but awkward for activations whose range changes every batch. Weight-only quantization keeps activations in FP16/BF16 and stores weights as integers.

Weight-only quantization changes the first-order memory budget immediately.
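
The memory budget follows directly from the Bytes/param column. The sketch below uses hypothetical helpers `weight_memory_gb` and `effective_bits`; the second one assumes one FP16 scale per group and no stored zero point, which real formats may not match.

```python
BYTES_PER_PARAM = {
    "FP32": 4.0, "FP16": 2.0, "BF16": 2.0,
    "FP8": 1.0, "INT8": 1.0, "INT4": 0.5, "NF4": 0.5,
}

def weight_memory_gb(n_params, fmt):
    """Weight storage in GB (10^9 bytes), ignoring scales and runtime overhead."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9

def effective_bits(bits, group_size, scale_bits=16):
    """Bits per weight once each group also stores one scale (assumed FP16)."""
    return bits + scale_bits / group_size

for fmt in ("FP16", "INT8", "INT4"):
    print(f"7B at {fmt}: {weight_memory_gb(7e9, fmt):.1f} GB")
# 7B at FP16: 14.0 GB
# 7B at INT8: 7.0 GB
# 7B at INT4: 3.5 GB

print(effective_bits(4, 128))   # 4.125
```

The scale overhead is small at group_size 128 but grows as groups shrink, which is the cost side of finer granularity.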

Granularity Buys Accuracy

One scale is rarely enough.

Per-tensor quantization uses one scale for a whole matrix. If one row has an outlier, every row pays for that outlier with a coarse step. Per-channel quantization uses one scale per output row. Per-group quantization uses one scale per small chunk within a row.

Concrete split: if [0.10, -0.30, 0.70, -1.00, 0.50] shares one scale, the range is 1.70. Split it into [0.10, -0.30, 0.70] and [-1.00, 0.50] and the ranges drop to 1.00 and 1.50, so each group gets a tighter ruler at the cost of storing one extra scale. Real LLM matrices have many rows with very different ranges, so per-channel and per-group are usually large wins.

Smaller groups use tighter ranges, so each integer step is more precise.
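
The three granularities can be compared on a synthetic matrix. This is a sketch under stated assumptions: the matrix is random, its rows are deliberately given ranges that differ by orders of magnitude, and `quant_dequant` is an illustrative min/max helper rather than a production kernel.

```python
import numpy as np

def quant_dequant(x, axis=None, bits=8):
    """Min/max quantize then dequantize, with one scale per slice along `axis`."""
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)   # guard against constant slices
    q = np.round((x - lo) / scale)
    return lo + q * scale

def mse(a, b):
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
# 4 output rows x 256 inputs; row magnitudes span 0.01 to 10.
W = rng.normal(size=(4, 256)) * np.array([[0.01], [0.1], [1.0], [10.0]])

per_tensor = quant_dequant(W, axis=None)                 # one scale for everything
per_channel = quant_dequant(W, axis=1)                   # one scale per row
groups = W.reshape(4, -1, 64)                            # group_size = 64
per_group = quant_dequant(groups, axis=2).reshape(W.shape)

print("per-tensor :", mse(W, per_tensor))
print("per-channel:", mse(W, per_channel))
print("per-group  :", mse(W, per_group))
```

On data like this, per-channel error should be clearly below per-tensor error because the small rows no longer inherit the large row's step size; per-group typically improves further, though the gain depends on the data.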

GPTQ, AWQ, GGUF

The named methods protect different failure modes.

GPTQ is post-training quantization that treats quantization as an error-compensation problem. It estimates which columns matter more using approximate second-order information. You can think of the Hessian as a sensitivity table: if changing a weight causes loss to move a lot, quantize it carefully and compensate nearby weights.

AWQ starts from a serving observation: a small number of activation channels dominate output quality. It scales up the weights on those important input channels before quantization so they receive more effective resolution, and applies the inverse scale to the incoming activations (usually folded into the preceding operation) so the layer's output is mathematically unchanged.

GGUF is the file format and quantization family used by llama.cpp; names like Q4_K_M encode group quantization choices tuned for local runtimes.

The method is only useful when it matches the runtime and bottleneck.
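
The scaling idea behind AWQ can be illustrated with a toy linear layer. This is only a sketch: the scale vector `s` is hand-picked here, whereas AWQ searches for scales using calibration activations, and `quant_dequant_rows` is a plain round-to-nearest stand-in, not the paper's kernel.

```python
import numpy as np

def quant_dequant_rows(W, bits=4):
    """Symmetric round-to-nearest with one scale per output row (illustrative)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.clip(np.round(W / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))     # (out_features, in_features)
x = rng.normal(size=16)
x[3] *= 20.0                     # pretend input channel 3 is salient

s = np.ones(16)
s[3] = 2.0                       # hypothetical scale for the salient channel

y_ref = W @ x
y_scaled = (W * s) @ (x / s)     # same output before any quantization

err_plain = float(np.abs(quant_dequant_rows(W) @ x - y_ref).mean())
err_scaled = float(np.abs(quant_dequant_rows(W * s) @ (x / s) - y_ref).mean())
print(np.allclose(y_ref, y_scaled), err_plain, err_scaled)
```

The identity holds exactly before quantization; whether `err_scaled` beats `err_plain` depends on the data and the chosen scale, which is why the scale has to be calibrated per model rather than guessed.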

KV Cache Quantization

Long context makes cache quantization matter.

Weights are not the only memory consumer. From Day 20, a LLaMA-2 7B-style FP16 KV cache at context 4096 is about 2 GB per batch item. INT8 KV halves that. FP8 KV can also halve it while mapping naturally to newer NVIDIA hardware.
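
The 2 GB figure can be reproduced from the model shape. The sketch below assumes the LLaMA-2 7B configuration of 32 layers, 32 KV heads, and head dimension 128; `kv_cache_bytes` is a hypothetical helper, and the INT8 number ignores the small extra cost of storing scales.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value, batch=1):
    # Factor 2: one cached tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch

llama2_7b = dict(n_layers=32, n_kv_heads=32, head_dim=128)

fp16 = kv_cache_bytes(**llama2_7b, seq_len=4096, bytes_per_value=2)
int8 = kv_cache_bytes(**llama2_7b, seq_len=4096, bytes_per_value=1)

print(fp16 / 1024**3, "GiB per batch item at FP16")   # 2.0
print(int8 / 1024**3, "GiB per batch item at INT8")   # 1.0
```

Because the result is linear in both seq_len and batch, doubling either one doubles the cache, which is why long-context serving feels this cost first.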

The caution is that KV cache stores activations, not fixed weights. Its distribution changes with prompt, position, and layer. That means KV quantization usually needs per-head or per-token scaling and a careful quality check.
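
Per-token scaling can be sketched as follows. The tensor layout (heads, tokens, head_dim), the random data, and the helper names are assumptions for illustration; serving engines use their own layouts and fused kernels.

```python
import numpy as np

def quantize_per_token(kv, bits=8):
    """Symmetric INT8 with one scale per (head, token) vector."""
    qmax = 2 ** (bits - 1) - 1                        # 127 for INT8
    scale = np.abs(kv).max(axis=-1, keepdims=True) / qmax
    scale = np.maximum(scale, 1e-12)                  # avoid division by zero
    q = np.clip(np.round(kv / scale), -qmax, qmax).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 16, 64)).astype(np.float32)   # (heads, tokens, head_dim)
keys[:, 10, :] *= 50.0          # one token with much larger activations

q, scale = quantize_per_token(keys)
keys_hat = dequantize(q, scale)
print(q.dtype, scale.shape)     # int8 (4, 16, 1)
```

With one scale per token, the outlier token gets a coarse step but the other 15 tokens keep fine resolution; a single scale for the whole tensor would make every token pay for the outlier.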

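In weight-only quantization, activations stay high precision and only stored weights shrink; KV-cache quantization is the separate step that compresses cached activations.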

Decision Rule

Pick the scheme from the bottleneck.

Use FP16/BF16 when correctness is the baseline or the model is small enough. Use INT8 weight-only when you want an easy 2x memory cut with minimal quality risk. Use GPTQ or AWQ INT4 when weights dominate memory and you have calibrated the exact model. Use GGUF when the runtime is llama.cpp or Apple/CPU local inference. Use FP8 when the hardware and engine have first-class support.
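
The rule of thumb above can be written as a small lookup. `pick_scheme` and its parameters are hypothetical, and the order in which the conditions are checked is an assumption, since the lesson does not rank them.

```python
def pick_scheme(fits_in_fp16, runtime, fp8_supported=False, calibrated_int4=False):
    """Toy encoding of the decision rule; the precedence is an assumption."""
    if runtime in ("llama.cpp", "cpu", "apple"):
        return "GGUF"                    # local runtimes with mature GGUF kernels
    if fits_in_fp16:
        return "FP16/BF16"               # correctness baseline, no quantization risk
    if fp8_supported:
        return "FP8"                     # only with first-class hardware and engine support
    if calibrated_int4:
        return "GPTQ/AWQ INT4"           # weights dominate and the model was calibrated
    return "INT8 weight-only"            # easy 2x cut with minimal quality risk

print(pick_scheme(fits_in_fp16=False, runtime="gpu-server", calibrated_int4=True))
# GPTQ/AWQ INT4
```

Treat the output as a starting point for measurement, not a verdict: the final choice still needs a quality check on the exact model and workload.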

Exercise

Build the habit with code.

Implement per-tensor INT8 quantize/dequantize for a NumPy matrix and compute MSE.
Switch to per-channel scales and verify MSE drops on rows with different ranges.
Compute the weight memory for 7B and 13B models at FP16, INT8, and INT4.
Optional hardware path: run a small GGUF or AWQ model and record tokens/sec, peak memory, and output quality notes.

Self-Check

Answer these from memory.

Why does per-channel quantization usually beat per-tensor quantization? Each output row gets its own range, so one outlier row does not force coarse steps for every row.
What does weight-only quantization leave in FP16/BF16? The activations.
Why is INT4 harder than INT8? Sixteen levels leave little room for outliers, so grouping, calibration, and good kernels matter more.
What problem does AWQ solve? It protects activation-important channels by scaling their weights up before quantization and applying the inverse scale to the activations, so the layer output is unchanged.
When is GGUF the practical answer? Local inference with llama.cpp, CPU, Apple Silicon, or consumer hardware where GGUF kernels are mature.

Go deeper.

Primary references and the companion notebook for today's exercise.

Paper: GPTQ. Hessian-aware post-training quantization.
Paper: AWQ. Activation-aware weight quantization.
Paper: QLoRA / NF4. NormalFloat and double quantization.
Repo: llama.cpp GGUF. The dominant local-inference format family.
Notebook: Day 22 notebook. Runnable companion notebook for the lesson.
