LLM Inference Engineer · Day 22
Day 22 · Week 4 · Optimization & Capstone

Quantization Deep Dive: FP8, INT8, INT4, GPTQ, AWQ, GGUF

Quantization is the practical reason a model that wants a server can fit on a workstation, a laptop, or a phone. Today you will reduce memory with arithmetic you can compute by hand, then connect the equations to GPTQ, AWQ, GGUF, FP8, and KV-cache quantization.

Time~210 min
DifficultyMedium-Hard
PrerequisiteDays 13, 20
Notebookday-22-int8-int4-quantization
Why This Lesson

Why this optimization matters.

Day 20 showed that inference is a memory game: weights must be read for every decode step, and KV cache grows with sequence length and batch size. Quantization attacks both costs by storing numbers in fewer bits while keeping enough information for the model to behave almost the same. The point is not to memorize quantization brand names. The point is to know which error each method introduces, where that error shows up, and when the serving win is worth it.

Learning Objectives

What you should be able to do today.

  1. Compute scale, zero point, quantized values, dequantized values, and round-trip error for a concrete vector.
  2. Compare FP32, FP16, BF16, FP8, INT8, INT4, and NF4 by byte cost, dynamic range, and inference use case.
  3. Explain why per-channel and per-group quantization usually beat one scale for an entire tensor.
  4. Describe GPTQ, AWQ, GGUF K-quants, and KV-cache quantization without hand-waving.
  5. Choose a quantization scheme for CPU, Apple Silicon, NVIDIA throughput, and quality-sensitive deployments.
Notation Cheatsheet

Decode the symbols before using them.

  • b is bit width. INT8 has b = 8; INT4 has b = 4.
  • q is the stored integer after quantization.
  • scale converts an integer step back into model-value units.
  • zero_point is the integer value that represents real value 0 in asymmetric quantization.
  • MSE means mean squared error: average of (original - dequantized)^2.
  • group_size is the number of weights that share one scale, often 32, 64, or 128.
One Row First

Start with one row of weights.

Quantization is just a ruler. Suppose a row of weights is [0.10, -0.30, 0.70, -1.00, 0.50]. The minimum is -1.00, the maximum is 0.70, and INT8 has 256 representable levels.

scale = (max - min) / (2^8 - 1)
      = (0.70 - (-1.00)) / 255
      = 1.70 / 255
      = 0.00667

q = round((x - min) / scale)

For x = 0.10, the integer is round((0.10 + 1.00) / 0.00667) = round(165.0) = 165. Dequantization reverses the map: x_hat = min + q scale = -1.00 + 165 0.00667 = 0.10. The worst possible rounding error is half a step: scale / 2 = 0.00333.

Quantize and Dequantize One Weight x = 0.10 real weight subtract min 1.10 divide scale 165.0 round/store q = 165 dequantize 0.10 The operation is a lossy ruler: store an integer, recover an approximation.
The operation is a lossy ruler: store an integer, recover an approximation.
Formats Are Not Interchangeable

Bytes, range, and precision are separate decisions.

FormatBytes/paramRange intuitionTypical inference role
FP324.0Huge range, high precisionTraining reference, CPU math
FP162.0Smaller range, good precisionCommon GPU inference weights
BF162.0FP32-like range, fewer mantissa bitsSafer training/inference activations
FP81.0Small precision, hardware-dependentH100/B200-style throughput path
INT81.0256 uniform levelsWeight-only inference, KV cache
INT40.516 levelsConsumer deployment, GGUF/GPTQ/AWQ
NF40.516 nonuniform normal levelsQLoRA-style Gaussian weights

A float format spends bits on sign, exponent, and mantissa. An integer format spends all bits on discrete levels inside a scale range. That is why INT4 can be excellent for frozen weights but awkward for activations whose range changes every batch. Weight-only quantization keeps activations in FP16/BF16 and stores weights as integers.

7B Model Weight Memory by Format FP32 28 GB FP16/BF16 14 GB INT8 7 GB INT4/NF4 3.5 GB Weight-only quantization changes the first-order memory budget immediately.
Weight-only quantization changes the first-order memory budget immediately.
Granularity Buys Accuracy

One scale is rarely enough.

Per-tensor quantization uses one scale for a whole matrix. If one row has an outlier, every row pays for that outlier with a coarse step. Per-channel quantization uses one scale per output row. Per-group quantization uses one scale per small chunk within a row.

Concrete split: if [0.10, -0.30, 0.70, -1.00, 0.50] shares one scale, the range is 1.70. If you split it into groups, each group gets a tighter ruler. Real LLM matrices have many rows with very different ranges, so per-channel and per-group are usually large wins.

Scale Granularity per tensor 16/16 active per channel 8/16 active per group 4/16 active Smaller groups use tighter ranges, so each integer step is more precise.
Smaller groups use tighter ranges, so each integer step is more precise.
GPTQ, AWQ, GGUF

The named methods protect different failure modes.

GPTQ is post-training quantization that treats quantization as an error-compensation problem. It estimates which columns matter more using approximate second-order information. You can think of the Hessian as a sensitivity table: if changing a weight causes loss to move a lot, quantize it carefully and compensate nearby weights.

AWQ starts from a serving observation: a small number of activation channels dominate output quality. It rescales important channels before quantization so those channels receive more effective resolution, then rescales after the linear layer. GGUF is the file format and quantization family used by llama.cpp; names like Q4_K_M encode group quantization choices tuned for local runtimes.

Method to Deployment Fit GPTQ offline 4-bit quality AWQ fast calibrated 4-bit GGUF llama.cpp local runtime FP8 new NVIDIA throughput INT8 KV long-context batch size The method is only useful when it matches the runtime and bottleneck.
The method is only useful when it matches the runtime and bottleneck.
KV Cache Quantization

Long context makes cache quantization matter.

Weights are not the only memory consumer. From Day 20, a LLaMA-2 7B-style FP16 KV cache at context 4096 is about 2 GB per batch item. INT8 KV halves that. FP8 KV can also halve it while mapping naturally to newer NVIDIA hardware.

The caution is that KV cache stores activations, not fixed weights. Its distribution changes with prompt, position, and layer. That means KV quantization usually needs per-head or per-token scaling and a careful quality check.

Weight-Only Linear Layer int weight q_w scale per row s dequantize q_w * s matmul x @ w FP16 output Activations stay high precision; only stored weights shrink.
Activations stay high precision; only stored weights shrink.
Decision Rule

Pick the scheme from the bottleneck.

Use FP16/BF16 when correctness is the baseline or the model is small enough. Use INT8 weight-only when you want an easy 2x memory cut with minimal quality risk. Use GPTQ or AWQ INT4 when weights dominate memory and you have calibrated the exact model. Use GGUF when the runtime is llama.cpp or Apple/CPU local inference. Use FP8 when the hardware and engine have first-class support.

Did You Know?

A detail worth remembering.

NF4 levels are not evenly spaced. They are chosen for normally distributed weights, which is why QLoRA can use 4-bit storage while keeping surprisingly strong fine-tuning quality.
Exercise

Build the habit with code.

  1. Implement per-tensor INT8 quantize/dequantize for a NumPy matrix and compute MSE.
  2. Switch to per-channel scales and verify MSE drops on rows with different ranges.
  3. Compute the weight memory for 7B and 13B models at FP16, INT8, and INT4.
  4. Optional hardware path: run a small GGUF or AWQ model and record tokens/sec, peak memory, and output quality notes.
Self-Check

Answer these from memory.

  1. Why does per-channel quantization usually beat per-tensor quantization? Each output row gets its own range, so one outlier row does not force coarse steps for every row.
  2. What does weight-only quantization leave in FP16/BF16? The activations.
  3. Why is INT4 harder than INT8? Sixteen levels leave little room for outliers, so grouping, calibration, and good kernels matter more.
  4. What problem does AWQ solve? It protects activation-important channels by scaling them before quantization and undoing that scale later.
  5. When is GGUF the practical answer? Local inference with llama.cpp, CPU, Apple Silicon, or consumer hardware where GGUF kernels are mature.

"Quantization is not smaller math; it is the same model stored with a carefully chosen ruler."

Day 22 · Week 4
Further Reading

Go deeper.

Primary references and the companion notebook for today's exercise.

Paper

GPTQ

Hessian-aware post-training quantization.

Open
Paper

AWQ

Activation-aware weight quantization.

Open
Paper

QLoRA / NF4

NormalFloat and double quantization.

Open
Repo

llama.cpp GGUF

The dominant local-inference format family.

Open
Notebook

Day 22 notebook

Runnable companion notebook for the lesson.

Open notebook