A GPU is not just a faster CPU. It is a machine that can do enormous amounts of arithmetic, but only when data arrives in the right shape. Today you learn the memory hierarchy, compute the roofline ridge point, and decide whether an inference operation is compute-bound or memory-bound before reaching for an optimization.
Day 15 showed that prefill and decode are different runtime regimes. Day 16 explains why. A prompt prefill is a large dense computation with enough work to feed Tensor Cores. Batch-1 decode is often a stream of small matrix-vector operations that repeatedly read weights from memory.
The roofline model gives you a simple engineering test: count the FLOPs, count the bytes, divide. If the arithmetic intensity is far below the GPU ridge point, better math kernels will not save you; you need batching, fusion, caching, or less memory movement.
FLOP means one floating-point operation, usually one multiply or one add.
BW means bandwidth: bytes moved per second from memory.
AI means arithmetic intensity: FLOPs / bytes moved.
peak FLOP/s is the fastest arithmetic throughput the GPU advertises for a dtype.
ridge point is peak FLOP/s / BW, that is, peak FLOP/s / memory bandwidth. To the left, bandwidth limits you. To the right, compute limits you.
SM means streaming multiprocessor, the unit that schedules warps and owns registers/shared memory.
HBM is high-bandwidth memory, the large memory pool on data-center GPUs.
By the end of this section, you should be able to explain why a kernel can have plenty of theoretical FLOPs and still run slowly.
Start concrete. Suppose a single multiply-add needs two numbers. If those numbers are already in registers, the GPU can keep arithmetic units busy. If every multiply-add waits on HBM, the arithmetic units sit idle. This is the same lesson as Day 15 decode: batch-1 decode does a small amount of useful work per byte of weights loaded.
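Think of the memory hierarchy as a pyramid: registers at the top (smallest and fastest), then shared memory and L1 cache inside each SM, then the L2 cache shared across SMs, and HBM at the base (largest and slowest).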
The inference engineer's job is usually not to make the GPU "do more math." It is to arrange the program so the same bytes are reused before they leave the fast levels.
By the end of this section, you should be able to compute a ridge point and classify an operation.
Use an A100 FP16 dense reference:
peak compute = 312 TFLOP/s = 312e12 FLOP/s
HBM bandwidth = 2 TB/s = 2e12 bytes/s
ridge = peak compute / bandwidth
= 312e12 / 2e12
= 156 FLOPs/byte
Decoded: if an operation does less than about 156 FLOPs for every byte it reads or writes, the A100 cannot reach peak FP16 throughput. The memory system is the limiter. If the operation does much more than 156 FLOPs/byte, compute becomes the limiter.
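The division above can be sketched in a few lines of Python. The names `ridge_point` and `classify` are illustrative and not taken from the companion notebook; the constants are the A100 reference figures quoted above.

```python
def ridge_point(peak_flops: float, bandwidth_bytes: float) -> float:
    """Arithmetic intensity (FLOPs/byte) at which the roofline bends."""
    return peak_flops / bandwidth_bytes


def classify(ai: float, ridge: float) -> str:
    """Label an operation by which side of the ridge its AI falls on."""
    return "memory-bound" if ai < ridge else "compute-bound"


# A100 FP16 dense reference numbers from the text.
A100_PEAK = 312e12  # FLOP/s
A100_BW = 2e12      # bytes/s

a100_ridge = ridge_point(A100_PEAK, A100_BW)
print(f"A100 ridge: {a100_ridge:.0f} FLOPs/byte")
print("AI = 10   ->", classify(10, a100_ridge))
print("AI = 1000 ->", classify(1000, a100_ridge))
```

Operations whose AI sits close to the ridge are not cleanly one or the other, so treat the label as a first guess to confirm with a profiler.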
Now repeat with approximate H100 dense FP16 numbers:
peak compute ~= 990 TFLOP/s
HBM bandwidth ~= 3.35 TB/s
ridge ~= 990e12 / 3.35e12
ridge ~= 295 FLOPs/byte
The H100 has much more compute relative to bandwidth, so the ridge moves right. Some operations that are merely memory-bound on A100 are extremely memory-bound on H100. The Blackwell B200 stays in the same regime: ~2.25 PFLOP/s dense BF16 over ~8 TB/s HBM3e puts its ridge near 280 FLOPs/byte, slightly left of the H100 but still far to the right of the A100.
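To compare the GPUs side by side, here is a small sketch that uses only the approximate figures quoted in this section. The dictionary layout is an illustrative choice, and vendor numbers vary by SKU and dtype, so treat the results as rough.

```python
# name -> (approximate dense 16-bit peak in FLOP/s, HBM bandwidth in bytes/s),
# copied from the figures quoted in the text.
GPUS = {
    "A100": (312e12, 2e12),
    "H100": (990e12, 3.35e12),
    "B200": (2.25e15, 8e12),
}

# Ridge point = peak compute / memory bandwidth, in FLOPs per byte.
ridges = {name: peak / bw for name, (peak, bw) in GPUS.items()}

for name, ridge in sorted(ridges.items(), key=lambda item: item[1]):
    print(f"{name}: ridge ~= {ridge:.1f} FLOPs/byte")
```

Sorting by ridge makes it easy to see which part needs the most arithmetic per byte before compute becomes the limit.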
By the end of this section, you should be able to estimate arithmetic intensity with pencil-and-paper math.
Single-token decode attention, simplified to one dot product against a key row of width d = 4096:
FLOPs ~= 2 * d = 8192
bytes ~= read K and Q in FP16 ~= 2 * d * 2 bytes = 16384 bytes
AI ~= 8192 / 16384 = 0.5 FLOPs/byte
That is nowhere near the A100 ridge of 156. It is memory-bound.
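The same pencil-and-paper count can be written as a function. This is a minimal sketch under the simplification used above (one dot product, both vectors read once, nothing cached); `dot_product_ai` is a hypothetical helper name.

```python
def dot_product_ai(d: int, bytes_per_element: int = 2) -> float:
    """Arithmetic intensity of one dot product between two length-d vectors."""
    flops = 2 * d                            # one multiply and one add per element
    bytes_moved = 2 * d * bytes_per_element  # read both vectors once from memory
    return flops / bytes_moved


ai_fp16 = dot_product_ai(4096)                       # FP16: 2 bytes per element
ai_fp32 = dot_product_ai(4096, bytes_per_element=4)  # FP32: 4 bytes per element
print(f"FP16 AI = {ai_fp16} FLOPs/byte, FP32 AI = {ai_fp32} FLOPs/byte")
```

Note that `d` cancels out: under these assumptions a longer vector does not raise the intensity, only a narrower dtype or more reuse does.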
By contrast, a prefill-style matmul has much more reuse, because each value loaded from memory is applied to many tokens. A rough attention prefill estimate at long sequence length gives AI on the order of d / 4. With d = 4096:
AI ~= 4096 / 4 = 1024 FLOPs/byte
That is right of the A100 ridge. It can be compute-bound if the implementation uses Tensor Cores well.
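To turn an arithmetic intensity into a throughput ceiling, the standard roofline bound is min(peak, AI x bandwidth). The sketch below applies it to the decode and prefill estimates from this section using the A100 reference numbers; `attainable_flops` is an illustrative name, and real kernels typically land below this ceiling.

```python
def attainable_flops(ai: float, peak_flops: float, bandwidth_bytes: float) -> float:
    """Roofline upper bound on throughput (FLOP/s) at a given arithmetic intensity."""
    return min(peak_flops, ai * bandwidth_bytes)


A100_PEAK = 312e12  # FLOP/s, FP16 dense reference from the text
A100_BW = 2e12      # bytes/s

decode_bound = attainable_flops(0.5, A100_PEAK, A100_BW)    # decode dot product
prefill_bound = attainable_flops(1024, A100_PEAK, A100_BW)  # prefill estimate

for label, bound in [("decode", decode_bound), ("prefill", prefill_bound)]:
    share = 100 * bound / A100_PEAK
    print(f"{label}: at most {bound / 1e12:.0f} TFLOP/s ({share:.2f}% of peak)")
```

Under these numbers the decode estimate is capped by bandwidth at a fraction of a percent of peak, while the prefill estimate is capped by the compute roof itself.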
For an FFN matrix multiply at batch size B, a useful first estimate is:
AI ~= B / 2
At B = 1, AI is about 0.5. At B = 64, AI is about 32. Batching helps because one weight read serves many sequences, but even batch 64 can still be left of a modern GPU ridge.
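Taking the first estimate AI ~= B / 2 at face value, you can ask how large a batch would need to be to reach a given ridge. This sketch assumes the estimate holds at every batch size, which ignores activation traffic and cache effects; both helper names are hypothetical.

```python
import math


def ffn_ai_estimate(batch_size: int) -> float:
    """First-order FFN arithmetic intensity from the text: AI ~= B / 2."""
    return batch_size / 2


def min_batch_for_ridge(ridge: float) -> int:
    """Smallest batch size whose estimated AI reaches the ridge point."""
    return math.ceil(2 * ridge)


for b in (1, 64):
    print(f"B = {b}: AI ~= {ffn_ai_estimate(b)} FLOPs/byte")

a100_batch = min_batch_for_ridge(156)  # A100 ridge from the text
print(f"Batch needed to reach the A100 ridge: {a100_batch}")
```

The result is a planning number, not a guarantee: KV-cache memory and latency targets often cap the batch size well before this point.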
Tensor Cores are specialized matrix-multiply units. PyTorch, cuBLAS, cuDNN, and MLX will use them automatically when dtype and shape line up. The important engineering habit is to keep dimensions multiples of 8, 16, 64, or 128 depending on the kernel and dtype.
You do not need to write Tensor Core assembly for this course. You do need to know when your shape prevents a library from using the fast path. This is why production models often choose widths and head dimensions that look suspiciously round.
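A quick way to act on this habit is to check dimensions before launching a job and pad the ones that are off. The helpers below are a sketch: the right multiple depends on the kernel and dtype, so the default of 8 is only an assumption, and the vocabulary size 50257 is an illustrative example of an awkward dimension.

```python
def is_aligned(dim: int, multiple: int = 8) -> bool:
    """True if dim is an exact multiple of the chosen alignment."""
    return dim % multiple == 0


def pad_to_multiple(dim: int, multiple: int = 8) -> int:
    """Round dim up to the next multiple (unchanged if already aligned)."""
    return -(-dim // multiple) * multiple  # ceiling division using integers only


hidden = 4096  # already a multiple of 128
vocab = 50257  # illustrative awkward size

print(hidden, "aligned to 128:", is_aligned(hidden, 128))
print(vocab, "aligned to 64:", is_aligned(vocab, 64))
print(vocab, "padded to a multiple of 64:", pad_to_multiple(vocab, 64))
```

Padding costs a little extra memory and arithmetic, so it is worth it only when it moves the matmul onto a faster kernel; measure before and after.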
Compute arithmetic intensity for three operations and classify them on an A100 with ridge 156 FLOPs/byte:
d = 4096.
B = 1, d = 4096, hidden size 16384.
B = 64.
Then run the companion notebook. It calculates the same values and includes an optional PyTorch timing cell. On a CUDA machine, replace CPU timers with torch.cuda.Event and compare measured throughput with the roofline prediction.
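For the timing step, here is one possible sketch of CUDA event timing for an FP16 matmul. It is not the notebook's cell: the function names, warm-up count, and iteration count are assumptions, and `time_matmul_cuda` needs PyTorch with a CUDA device to run.

```python
def achieved_tflops(m: int, n: int, k: int, elapsed_ms: float) -> float:
    """TFLOP/s for an (m x k) @ (k x n) matmul that took elapsed_ms milliseconds."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    return flops / (elapsed_ms * 1e-3) / 1e12


def time_matmul_cuda(m: int, n: int, k: int, iters: int = 20) -> float:
    """Average milliseconds per FP16 matmul, measured with CUDA events."""
    import torch  # imported lazily so achieved_tflops works without PyTorch

    a = torch.randn(m, k, device="cuda", dtype=torch.float16)
    b = torch.randn(k, n, device="cuda", dtype=torch.float16)
    for _ in range(3):  # warm-up so one-time setup cost is not timed
        a @ b

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        a @ b
    end.record()
    torch.cuda.synchronize()  # wait for the GPU before reading the timer
    return start.elapsed_time(end) / iters  # elapsed_time reports milliseconds
```

On a CUDA machine, `achieved_tflops(4096, 4096, 4096, time_matmul_cuda(4096, 4096, 4096))` gives a measured figure to set against the roofline ceiling. CPU wall-clock timers can mislead here because kernel launches are asynchronous.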
The ridge point is peak FLOP/s / memory bandwidth; it is the AI where the roofline bends.
"A roofline estimate will not replace profiling, but it will keep you from optimizing the wrong resource."
Primary references and the companion notebook for today's exercise.
Reference numbers for SMs, memory bandwidth, Tensor Cores, and sparsity.
A practical kernel-by-kernel path from naive matmul to high-performance GEMM.
Companion Jupyter notebook with runnable calculations and optional hardware-specific cells.