Most production inference code does not hand-write matmul. It calls libraries that already know the GPU. Today you learn what cuBLAS, cuDNN, NCCL, fused kernels, and CUDA Graphs are doing under the PyTorch calls.
Day 17 gave you the kernel model. Day 18 puts that model back inside the library stack. torch.matmul calls cuBLAS. scaled_dot_product_attention can call fused attention kernels. Distributed strategies lean on NCCL. Decode loops often lean on CUDA Graphs to remove launch overhead.
The practical skill is knowing when the library path is already good and when your shapes, launch pattern, or communication pattern have knocked you off the fast path.
torch.matmul usually means cuBLAS GEMM on CUDA.
GEMM means general matrix multiply, usually C = alpha A B + beta C.
cuBLAS is NVIDIA's dense linear algebra library.
cuDNN is NVIDIA's deep learning primitive library; today it also includes attention fast paths.
NCCL is NVIDIA's collective communication library for multi-GPU work.
all-reduce means every GPU contributes a tensor and every GPU receives the reduced result.
cudaGraph_t is a captured CUDA work graph; cudaGraphExec_t is the executable replay object.
TF32 is a Tensor Core format with FP32 range and reduced mantissa precision.
By the end of this section, you should know what sits underneath common PyTorch calls.
When you write:
y = x @ w
on CUDA tensors, PyTorch generally dispatches to cuBLAS or cuBLASLt. You do not get extra credit for bypassing it unless you have an unusual layout, quantization scheme, fused epilogue, or batched-GEMM trick that the library path cannot express.
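To make the GEMM definition C = alpha A B + beta C concrete, here is a minimal reference sketch in plain Python. The function name gemm and the list-of-rows layout are illustrative assumptions; this shows only what the operation computes, not how cuBLAS implements it.

```python
def gemm(alpha, a, b, beta, c):
    """Reference GEMM: returns alpha * (a @ b) + beta * c for matrices given as lists of rows."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    assert len(c) == m and all(len(row) == n for row in c), "c must be m x n"
    return [
        [
            # Dot product of row i of a with column j of b, then scale and accumulate.
            alpha * sum(a[i][p] * b[p][j] for p in range(k)) + beta * c[i][j]
            for j in range(n)
        ]
        for i in range(m)
    ]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
result = gemm(2.0, a, b, 1.0, c)
print(result)
```

With alpha = 1 and beta = 0 this reduces to the plain product that y = x @ w asks for. The alpha and beta terms are what let a library fold scaling and accumulation into the same call.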
For attention, the library story is similar. PyTorch's torch.nn.functional.scaled_dot_product_attention can dispatch to FlashAttention-style kernels when dtype, mask, shape, and device allow it. If it cannot, it falls back to a math or memory-efficient backend. Day 21 explains the algorithm; today the point is dispatch awareness.
By the end of this section, you should be able to count why fusion helps memory-bound work.
Consider a small decode path:
x -> RMSNorm -> matmul -> SwiGLU -> matmul -> residual
If each arrow is a separate kernel, intermediate tensors are written to HBM and then read back. If the operation is memory-bound, those extra round-trips are expensive. A fused kernel keeps intermediate values in registers or shared memory and writes the final result once.
This is the same roofline lesson again: fusion does not change the mathematical FLOPs much. It changes the bytes moved.
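The byte counting can be sketched in a few lines of Python. This is an idealized estimate under stated assumptions: a hypothetical hidden size of 4096, FP16 activations, every intermediate tensor the same size, and the whole five-step chain treated as fusable into one kernel. It counts activation traffic only and ignores weight reads, which fusion does not remove.

```python
def activation_traffic_bytes(num_kernels, num_elements, bytes_per_element=2):
    """Activation bytes moved through HBM if each kernel reads one tensor and writes one."""
    return num_kernels * 2 * num_elements * bytes_per_element


hidden = 4096   # hypothetical hidden size, one token in decode
stages = 5      # one kernel per arrow in the chain above

unfused = activation_traffic_bytes(stages, hidden)  # every intermediate round-trips HBM
fused = activation_traffic_bytes(1, hidden)         # read the input once, write the output once

print(f"unfused: {unfused} bytes, fused: {fused} bytes, ratio: {unfused / fused:.1f}x")
```

Real fusions are usually partial (for example a norm folded into a matmul prologue, or an activation folded into an epilogue), so the actual saving sits somewhere between these two numbers.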
By the end of this section, you should be able to estimate launch overhead in a token budget.
Concrete budget:
model layers = 32
kernels per layer in decode = 8
launch overhead per kernel = 10 microseconds
overhead = 32 * 8 * 10 us
= 2560 us
= 2.56 ms per token
If your target is 50 tokens/sec, the budget is:
1 / 50 = 0.02 seconds = 20 ms per token
Launch overhead alone is about 2.56 / 20 = 12.8% of the token budget. CUDA Graphs capture the repeated launch sequence once and replay it with much lower CPU overhead.
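The same budget as a small Python sketch, so you can plug in your own layer count and target rate. The function names are illustrative, and the 10 microsecond launch cost is the assumed figure from the budget above, not a measured one.

```python
def launch_overhead_ms(layers, kernels_per_layer, overhead_us):
    """Total per-token launch overhead in milliseconds."""
    return layers * kernels_per_layer * overhead_us / 1000.0


def token_budget_ms(tokens_per_sec):
    """Time available per token, in milliseconds, at a target decode rate."""
    return 1000.0 / tokens_per_sec


overhead = launch_overhead_ms(layers=32, kernels_per_layer=8, overhead_us=10)
budget = token_budget_ms(tokens_per_sec=50)
fraction = overhead / budget

print(f"{overhead:.2f} ms of {budget:.0f} ms = {fraction:.1%}")
```

Raising the target rate shrinks the budget while the launch overhead stays fixed, so the fraction grows as you push for more tokens per second.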
By the end of this section, you should be able to estimate a ring all-reduce.
For a ring all-reduce with p GPUs, message size M, and link bandwidth BW, a common first estimate is:
time ~= 2 * (p - 1) / p * M / BW
For p = 8, M = 1 GB, BW = 600 GB/s:
time ~= 2 * 7/8 * 1 / 600
~= 0.0029 s
~= 2.9 ms
This formula ignores latency and topology details, but it explains why NVLink matters and why cross-node communication is a different regime.
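A small Python sketch of the estimate makes it easy to compare link speeds. The function name is illustrative, and the 50 GB/s cross-node figure is a hypothetical value chosen only to show the contrast; like the formula, the sketch ignores latency and topology.

```python
def ring_allreduce_seconds(p, message_bytes, bandwidth_bytes_per_sec):
    """First-order ring all-reduce time: 2 * (p - 1) / p * M / BW."""
    if p < 2:
        return 0.0  # a single GPU has nothing to exchange
    return 2 * (p - 1) / p * message_bytes / bandwidth_bytes_per_sec


GB = 1e9

# The worked example above: 8 GPUs, 1 GB message, 600 GB/s links.
fast = ring_allreduce_seconds(8, 1 * GB, 600 * GB)

# Hypothetical slower cross-node link at 50 GB/s, for contrast only.
slow = ring_allreduce_seconds(8, 1 * GB, 50 * GB)

print(f"600 GB/s: {fast * 1e3:.1f} ms")
print(f" 50 GB/s: {slow * 1e3:.1f} ms")
```

The factor 2 * (p - 1) / p approaches 2 as p grows, so in this model the time is set almost entirely by M / BW rather than by the number of GPUs.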
Run the notebook in three modes:
On consumer NVIDIA hardware, the graph path should reduce CPU launch overhead noticeably for small repeated decode shapes.
"The fastest CUDA code is often the code that lets the right library do the right thing."
Primary references and the companion notebook for today's exercise.
Companion Jupyter notebook with runnable calculations and optional hardware-specific cells.