Day 16 - GPU Architecture, Roofline Model & Arithmetic Intensity

Why This Lesson

Hardware limits shape inference behavior.

Day 15 showed that prefill and decode are different runtime regimes. Day 16 explains why. A prompt prefill is a large dense computation with enough work to feed Tensor Cores. Batch-1 decode is often a stream of small matrix-vector operations that repeatedly read weights from memory.

The roofline model gives you a simple engineering test: count the FLOPs, count the bytes, divide. If the arithmetic intensity is far below the GPU ridge point, better math kernels will not save you; you need batching, fusion, caching, or less memory movement.

Learning Objectives

What you should be able to do today.

Sketch a GPU memory hierarchy from registers to HBM and explain what changes at each level.
Compute the ridge point as peak FLOP/s / memory bandwidth.
Calculate arithmetic intensity in FLOPs/byte for decode attention, prefill attention, and FFN layers.
Place operations on a roofline plot and predict memory-bound versus compute-bound behavior.
Explain how batching and kernel fusion move an operation toward the useful part of the roofline.
Connect Tensor Cores to the matmul shapes from Day 1 and the Transformer block from Day 7.

Math Notation Cheatsheet

Decode the symbols before using them.

FLOP means one floating-point operation, usually one multiply or one add. A multiply-add counts as two.
BW means bandwidth: bytes moved per second from memory.
AI means arithmetic intensity: FLOPs / bytes moved.
peak FLOP/s is the fastest arithmetic throughput the GPU advertises for a dtype.
ridge point is peak FLOP/s / BW. To the left, bandwidth limits you. To the right, compute limits you.
SM means streaming multiprocessor, the unit that schedules warps and owns registers/shared memory.
HBM is high-bandwidth memory, the large memory pool on data-center GPUs.

Memory Hierarchy

A GPU is fast only when data is close enough to the math.

Objective

By the end of this section, you should be able to explain why a kernel can have plenty of theoretical FLOPs and still run slowly.

Start concrete. Suppose a single multiply-add needs two numbers. If those numbers are already in registers, the GPU can keep arithmetic units busy. If every multiply-add waits on HBM, the arithmetic units sit idle. This is the same lesson as Day 15 decode: batch-1 decode does a small amount of useful work per byte of weights loaded.

Think of the memory hierarchy as a pyramid:

Registers: private to a thread, tiny, fastest.
Shared memory / L1: per-SM SRAM. Shared memory is managed explicitly and reused by the threads in a block; L1 is a hardware-managed cache.
L2 cache: shared across SMs.
HBM: huge, high bandwidth, still much slower than on-chip memory.

The inference engineer's job is usually not to make the GPU "do more math." It is to arrange the program so the same bytes are reused before they leave the fast levels.

The GPU memory hierarchy. Tiling and fusion are ways of keeping data in the upper levels longer.

Roofline

The roofline tells you which limit you are hitting.

Objective

By the end of this section, you should be able to compute a ridge point and classify an operation.

Use A100 dense FP16 Tensor Core numbers as the reference (80 GB model, bandwidth rounded):

peak compute = 312 TFLOP/s = 312e12 FLOP/s
HBM bandwidth = 2 TB/s = 2e12 bytes/s

ridge = peak compute / bandwidth
      = 312e12 / 2e12
      = 156 FLOPs/byte

Decoded: if an operation does fewer than about 156 FLOPs for every byte it reads or writes, the A100 cannot reach peak FP16 throughput and the memory system is the limiter. If the operation does much more than 156 FLOPs/byte, compute becomes the limiter.
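
The ridge calculation is small enough to check in a few lines. This is a minimal sketch; the function name ridge_point is an illustrative choice, and the inputs are the A100 reference numbers from the text.

```python
def ridge_point(peak_flops_per_s: float, bandwidth_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOPs/byte) at which the roofline bends."""
    return peak_flops_per_s / bandwidth_bytes_per_s


# A100 dense FP16 reference values used in the text.
a100_ridge = ridge_point(312e12, 2e12)
print(f"A100 ridge: {a100_ridge:.0f} FLOPs/byte")
```

Swapping in another GPU's peak and bandwidth gives that GPU's ridge with no other change.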

Now repeat with approximate H100 dense FP16 numbers:

peak compute ~= 990 TFLOP/s
HBM bandwidth ~= 3.35 TB/s
ridge ~= 990e12 / 3.35e12
ridge ~= 295 FLOPs/byte

The H100 has much more compute relative to bandwidth, so the ridge moves right. Some operations that are merely memory-bound on A100 are extremely memory-bound on H100. The Blackwell B200 stays in the same range: ~2.25 PFLOP/s dense BF16 over ~8 TB/s HBM3e puts its ridge near 280 FLOPs/byte, close to the H100 and nearly double the A100.

A roofline is a first-order model, not a profiler. It tells you what kind of optimization has a chance.
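
As a first-order model, the roofline reduces to one line: attainable throughput is the smaller of the compute roof and arithmetic intensity times bandwidth. The sketch below applies that to the A100 and H100 numbers from this section. The helper names attainable_flops and classify are illustrative, and the result is an upper bound, not a prediction of measured speed.

```python
def attainable_flops(ai: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth slope."""
    return min(peak_flops, ai * bandwidth)


def classify(ai: float, peak_flops: float, bandwidth: float) -> str:
    """Label an operation by which side of the ridge point it falls on."""
    return "memory-bound" if ai < peak_flops / bandwidth else "compute-bound"


# (peak FLOP/s, bytes/s) reference values from the text.
GPUS = {"A100": (312e12, 2e12), "H100": (990e12, 3.35e12)}

for name, (peak, bw) in GPUS.items():
    for ai in (0.5, 32.0, 1024.0):
        bound_tflops = attainable_flops(ai, peak, bw) / 1e12
        print(f"{name} AI={ai:>6}: at most {bound_tflops:7.1f} TFLOP/s, {classify(ai, peak, bw)}")
```

At an intensity of 0.5 FLOPs/byte the A100 bound is 1 TFLOP/s, well under 1% of its peak, which is why faster math units alone do not help a memory-bound operation.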

Arithmetic Intensity

Decode has tiny AI. Prefill has large AI.

Objective

By the end of this section, you should be able to estimate arithmetic intensity with pencil-and-paper math.

Single-token decode attention, simplified to one dot product against a key row of width d = 4096:

FLOPs ~= 2 * d = 8192
bytes ~= read one Q row and one K row in FP16 = 2 vectors * d elements * 2 bytes = 16384 bytes
AI ~= 8192 / 16384 = 0.5 FLOPs/byte

That is nowhere near the A100 ridge of 156. It is memory-bound.
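
The same pencil-and-paper count can be written as a short function. This is a sketch of the simplified model above (one query row against one key row), not a model of a full attention kernel; dot_product_ai is an illustrative name.

```python
def dot_product_ai(d: int, bytes_per_element: int = 2) -> float:
    """Arithmetic intensity of one length-d dot product with both vectors read from memory."""
    flops = 2 * d                              # one multiply and one add per element
    bytes_moved = 2 * d * bytes_per_element    # read the query row and the key row
    return flops / bytes_moved


decode_ai = dot_product_ai(4096)               # FP16: 2 bytes per element
print(f"decode dot-product AI: {decode_ai} FLOPs/byte")
```

Note that d cancels: in this model the intensity is 1 / bytes_per_element, so a wider head does not raise it. Only reuse of the loaded bytes does.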

Now a prefill-style matmul has much more reuse. A rough attention prefill estimate at long sequence length gives AI on the order of d / 4. With d = 4096:

AI ~= 4096 / 4 = 1024 FLOPs/byte

That is right of the A100 ridge. It can be compute-bound if the implementation uses Tensor Cores well.

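For an FFN matrix multiply at batch size B, weight traffic dominates the bytes. Multiplying a B x d activation by a d x h weight matrix costs about 2 * B * d * h FLOPs and reads about d * h * s bytes of weights, where s is the bytes per weight. A useful first estimate is:

AI ~= 2 * B / s

With 4-byte FP32 weights that is B / 2: about 0.5 at B = 1 and about 32 at B = 64. With 2-byte FP16 weights it is B: about 1 at B = 1 and about 64 at B = 64. Batching helps because one weight read serves many sequences, but in either case even batch 64 is still left of the A100 ridge of 156.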
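
A minimal sketch for checking FFN estimates like this one. It assumes the matmul multiplies a B x d_model activation by a d_model x d_hidden weight matrix and that weights and activations each cross the memory bus once. The function name and the bytes_per_element parameter are illustrative; the element size is a parameter because the estimate scales with it.

```python
def ffn_matmul_ai(batch: int, d_model: int, d_hidden: int, bytes_per_element: int = 2) -> float:
    """Arithmetic intensity of (batch x d_model) @ (d_model x d_hidden)."""
    flops = 2 * batch * d_model * d_hidden                    # multiply + add per weight per row
    weight_bytes = d_model * d_hidden * bytes_per_element     # read once, shared by the whole batch
    activation_bytes = batch * (d_model + d_hidden) * bytes_per_element  # input read + output write
    return flops / (weight_bytes + activation_bytes)


for b in (1, 64):
    fp16 = ffn_matmul_ai(b, 4096, 16384, bytes_per_element=2)
    fp32 = ffn_matmul_ai(b, 4096, 16384, bytes_per_element=4)
    print(f"B={b:>2}: AI ~= {fp16:.2f} (2-byte elements), {fp32:.2f} (4-byte elements)")
```

With 4-byte elements the result lands close to B / 2, and with 2-byte elements close to B. The small shortfall comes from activation traffic, which grows with the batch while weight traffic does not.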

Tiled matmul is the practical bridge from roofline math to kernel design.
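
To see why tiling matters, here is an idealized traffic count for a square N x N matmul computed in T x T tiles. It assumes each tile is loaded from HBM once per use and then fully reused on chip, and it ignores caches and register-level blocking. It is a sketch of the reasoning rather than a model of any particular kernel, and tiled_matmul_ai is an illustrative name.

```python
def tiled_matmul_ai(n: int, tile: int, bytes_per_element: int = 2) -> float:
    """Arithmetic intensity of an n x n matmul using tile x tile blocks held on chip."""
    if n % tile != 0:
        raise ValueError("this sketch assumes n is a multiple of tile")
    flops = 2 * n**3
    tiles_per_side = n // tile
    # Each of the tiles_per_side**2 output tiles streams tiles_per_side pairs of input tiles.
    loaded_elements = tiles_per_side**3 * 2 * tile * tile
    stored_elements = n * n                      # each output element is written once
    return flops / ((loaded_elements + stored_elements) * bytes_per_element)


for t in (1, 16, 32, 128):
    print(f"tile={t:>3}: AI ~= {tiled_matmul_ai(4096, t):.1f} FLOPs/byte")
```

In this model every input element is loaded N / T times, so intensity grows roughly in proportion to the tile size until on-chip memory capacity limits how large a tile can be.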

Hardware Note

Tensor Cores reward aligned shapes.

Tensor Cores are specialized matrix-multiply units. PyTorch, cuBLAS, cuDNN, and MLX will use them automatically when dtype and shape line up. The important engineering habit is to keep dimensions multiples of 8, 16, 64, or 128 depending on the kernel and dtype.

You do not need to write Tensor Core assembly for this course. You do need to know when your shape prevents a library from using the fast path. This is why production models often choose widths and head dimensions that look suspiciously round.
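
A small helper makes the alignment habit concrete. This is a sketch: round_up and is_aligned are illustrative names, and 64 is only an example multiple, since the right value depends on the kernel and dtype.

```python
def is_aligned(n: int, multiple: int) -> bool:
    """True if a dimension is already a multiple of the chosen alignment."""
    return n % multiple == 0


def round_up(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= n, for padding a dimension."""
    return ((n + multiple - 1) // multiple) * multiple


vocab = 50257                      # an example of an unaligned dimension
padded = round_up(vocab, 64)
print(vocab, is_aligned(vocab, 64), "->", padded, is_aligned(padded, 64))
```

Padding costs a little extra memory and compute on the unused entries, which is usually a good trade if it lets the library stay on its fast path.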

Exercise

Do the arithmetic, then run the notebook.

Compute arithmetic intensity for three operations and classify them on an A100 with ridge 156 FLOPs/byte:

Single-token decode attention with d = 4096.
FFN matmul at B = 1, d = 4096, hidden size 16384.
The same FFN at B = 64.

Then run the companion notebook. It calculates the same values and includes an optional PyTorch timing cell. On a CUDA machine, replace CPU timers with torch.cuda.Event and compare measured throughput with the roofline prediction.
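
When comparing a measured time with the roofline prediction, it helps to turn the roofline into a lower bound on runtime. The sketch below does that in plain Python. The function names are illustrative, and measured_s is a made-up placeholder that you would replace with your own timing; it is not a benchmark result.

```python
def roofline_time_s(flops: float, bytes_moved: float, peak_flops: float, bandwidth: float) -> float:
    """Lower bound on runtime: the slower of the compute time and the memory time."""
    return max(flops / peak_flops, bytes_moved / bandwidth)


def roofline_efficiency(measured_s: float, flops: float, bytes_moved: float,
                        peak_flops: float, bandwidth: float) -> float:
    """Fraction of the roofline bound achieved (1.0 means the kernel sits on the roof)."""
    return roofline_time_s(flops, bytes_moved, peak_flops, bandwidth) / measured_s


# FFN matmul at B = 1, d = 4096, hidden size 16384, FP16, on the A100 reference numbers.
flops = 2 * 1 * 4096 * 16384
bytes_moved = (4096 * 16384 + 4096 + 16384) * 2
bound_s = roofline_time_s(flops, bytes_moved, 312e12, 2e12)

measured_s = 100e-6                # hypothetical placeholder, replace with a real measurement
eff = roofline_efficiency(measured_s, flops, bytes_moved, 312e12, 2e12)
print(f"roofline bound: {bound_s * 1e6:.1f} us, efficiency vs bound: {eff:.0%}")
```

An efficiency far below 1.0 on a memory-bound operation suggests the kernel moves more bytes than the simple count assumes, or that launch overhead dominates at this size.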

Self-Check

Answer these from memory.

What is the ridge point? Peak FLOP/s divided by memory bandwidth; it is the arithmetic intensity where the roofline bends.
Why is batch-1 decode memory-bound? Each token rereads a large amount of model state for little work, so FLOPs per byte is low.
How does batching help decode? The same weight bytes feed multiple sequences, increasing arithmetic intensity.
What optimization helps a memory-bound op? Move fewer bytes per FLOP: fuse kernels, tile, cache, quantize, or batch.
What optimization helps a compute-bound op? Improve math utilization: Tensor Cores, better tiling, fewer non-matmul operations.

Go deeper.

Primary references and the companion notebook for today's exercise.

Paper

Roofline model

The original Berkeley report introducing the roofline performance model.


Whitepaper

NVIDIA A100 architecture

Reference numbers for SMs, memory bandwidth, Tensor Cores, and sparsity.


Article

How to Optimize a CUDA GEMM

A practical kernel-by-kernel path from naive matmul to high-performance GEMM.


Notebook · Day 16

GPU Architecture & Roofline notebook

Companion Jupyter notebook with runnable calculations and optional hardware-specific cells.

