LLM Inference Engineer · Day 18
Day 18 · Week 3 · Inference & Hardware

CUDA for ML: cuBLAS, cuDNN, NCCL & CUDA Graphs

Most production inference code does not hand-write matmul. It calls libraries that already know the GPU. Today you learn what cuBLAS, cuDNN, NCCL, fused kernels, and CUDA Graphs are doing under the PyTorch calls.

Time: ~150 min
Difficulty: Hard
Prerequisite: Day 17
Notebook: day-18
Why This Lesson

Hardware limits shape inference behavior.

Day 17 gave you the kernel model. Day 18 puts that model back inside the library stack. On CUDA, torch.matmul usually dispatches to cuBLAS. scaled_dot_product_attention can call fused attention kernels. Distributed strategies lean on NCCL. Decode loops often lean on CUDA Graphs to remove launch overhead.

The practical skill is knowing when the library path is already good and when your shapes, launch pattern, or communication pattern have knocked you off the fast path.

Learning Objectives

What you should be able to do today.

  1. Explain why torch.matmul usually means cuBLAS GEMM on CUDA.
  2. Describe what a fused operation saves in memory traffic.
  3. Estimate kernel launch overhead in a decode loop.
  4. Sketch NCCL all-reduce and estimate ring communication time.
  5. Explain why CUDA Graph replay helps batch-1 decode more than prefill.
  6. Name the MLX equivalent: lazy graphs plus mx.compile.
Math Notation Cheatsheet

Decode the symbols before using them.

  • GEMM means general matrix multiply, usually C = alpha A B + beta C.
  • cuBLAS is NVIDIA's dense linear algebra library.
  • cuDNN is NVIDIA's deep learning primitive library; recent versions also include fused attention fast paths.
  • NCCL is NVIDIA's collective communication library for multi-GPU work.
  • all-reduce means every GPU contributes a tensor and every GPU receives the reduced result.
  • cudaGraph_t is a captured CUDA work graph; cudaGraphExec_t is the executable replay object.
  • TF32 is a Tensor Core format that keeps FP32's 8-bit exponent (same range) but cuts the mantissa to 10 bits (reduced precision).
cuBLAS and cuDNN

Use the library until you can prove it is the bottleneck.

Objective

By the end of this section, you should know what sits underneath common PyTorch calls.

When you write:

y = x @ w

on CUDA tensors, PyTorch generally dispatches to cuBLAS or cuBLASLt. You do not get extra credit for bypassing it unless you have an unusual layout, quantization scheme, fused epilogue, or batched-GEMM trick that the library path cannot express.
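As a point of reference, the GEMM behind x @ w can be written as three nested loops. The sketch below is a pure-Python illustration of C = alpha A B + beta C, not how cuBLAS implements it; the name reference_gemm is hypothetical and exists only for this example.

```python
def reference_gemm(alpha, a, b, beta, c):
    """Return alpha * (a @ b) + beta * c for row-major nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for t in range(k):
                acc += a[i][t] * b[t][j]
            # The alpha/beta scaling is applied once per output element.
            out[i][j] = alpha * acc + beta * c[i][j]
    return out


x = [[1.0, 2.0]]                 # 1 x 2 activation row
w = [[3.0, 4.0], [5.0, 6.0]]     # 2 x 2 weight matrix
y = reference_gemm(1.0, x, w, 0.0, [[0.0, 0.0]])
print(y)
```

With alpha = 1 and beta = 0 this reduces to the plain matmul. A nonzero beta accumulates into an existing C, the simplest kind of epilogue a GEMM call can absorb.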

For attention, the library story is similar. PyTorch's torch.nn.functional.scaled_dot_product_attention can dispatch to FlashAttention-style kernels when dtype, mask, shape, and device allow it. If it cannot, it falls back to a math or memory-efficient backend. Day 21 explains the algorithm; today the point is dispatch awareness.
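One way to build dispatch awareness is to force each backend in turn and see which ones accept your inputs. The sketch below assumes a recent PyTorch that provides torch.nn.attention.sdpa_kernel and SDPBackend; the helper name try_sdpa_backends and the shapes are hypothetical. It returns None when PyTorch or CUDA is not available.

```python
try:
    import torch
    import torch.nn.functional as F
    from torch.nn.attention import SDPBackend, sdpa_kernel
except ImportError:  # no PyTorch, or a version without torch.nn.attention
    torch = None


def try_sdpa_backends(seq_len=128, heads=8, head_dim=64):
    """Return {backend name: True/False} for one FP16 causal attention call."""
    if torch is None or not torch.cuda.is_available():
        return None
    q = torch.randn(1, heads, seq_len, head_dim,
                    device="cuda", dtype=torch.float16)
    k, v = torch.randn_like(q), torch.randn_like(q)
    results = {}
    for backend in (SDPBackend.FLASH_ATTENTION,
                    SDPBackend.EFFICIENT_ATTENTION,
                    SDPBackend.MATH):
        try:
            # Restrict dispatch to one backend; PyTorch raises a
            # RuntimeError if that backend cannot handle these inputs.
            with sdpa_kernel(backend):
                F.scaled_dot_product_attention(q, k, v, is_causal=True)
            results[backend.name] = True
        except RuntimeError:
            results[backend.name] = False
    return results


print(try_sdpa_backends())
```

Changing the dtype, adding an arbitrary mask, or using an unusual head dimension and rerunning is a quick way to see which inputs knock you off the fused path.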

Fusion

Fused kernels save reads and writes.

Objective

By the end of this section, you should be able to count why fusion helps memory-bound work.

Consider a small decode path:

x -> RMSNorm -> matmul -> SwiGLU -> matmul -> residual

If each arrow is a separate kernel, intermediate tensors are written to HBM and then read back. If the operation is memory-bound, those extra round-trips are expensive. A fused kernel keeps intermediate values in registers or shared memory and writes the final result once.

This is the same roofline lesson again: fusion does not change the mathematical FLOPs much. It changes the bytes moved.
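The byte counting can be sketched in a few lines. This is a simplified model that assumes a chain of elementwise operations on one activation tensor, where each unfused kernel reads its input from HBM and writes its output back; matmul weight traffic is deliberately left out. The function name hbm_bytes and the sizes are hypothetical.

```python
def hbm_bytes(num_elements, bytes_per_element, num_ops, fused):
    """Bytes moved through HBM by a chain of elementwise ops on one tensor."""
    tensor_bytes = num_elements * bytes_per_element
    if fused:
        # One read of the input, one write of the final result.
        return 2 * tensor_bytes
    # Every op reads its input and writes its output.
    return 2 * num_ops * tensor_bytes


hidden = 4096  # assumed hidden size, FP16 activations
unfused = hbm_bytes(hidden, 2, num_ops=3, fused=False)
fused = hbm_bytes(hidden, 2, num_ops=3, fused=True)
print(unfused, fused, unfused / fused)
```

Under these assumptions a three-op chain moves three times the bytes when unfused, while the FLOP count is identical in both cases.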

[Figure: Fused Kernels Avoid HBM Round-Trips. Unfused: RMSNorm, matmul, and SwiGLU run as separate kernels with a write/read/write between them. Fused: RMSNorm + matmul + activation run in one kernel with one read/write boundary.]
Fusion matters most when intermediate tensors would otherwise bounce through HBM.
Launch Overhead

Decode exposes every microsecond of CPU launch cost.

Objective

By the end of this section, you should be able to estimate launch overhead in a token budget.

Concrete budget:

model layers = 32
kernels per layer in decode = 8
launch overhead per kernel = 10 microseconds

overhead = 32 * 8 * 10 us
         = 2560 us
         = 2.56 ms per token

If your target is 50 tokens/sec, the budget is:

1 / 50 = 0.02 seconds = 20 ms per token

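Launch overhead alone is 2.56 / 20 = 12.8% of the token budget if none of it overlaps with GPU work, which is a reasonable worst-case assumption when each decode kernel finishes faster than the CPU can launch the next one. CUDA Graphs capture the repeated launch sequence once and replay it with much lower CPU overhead.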
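The same budget as a small calculator, so you can plug in your own layer count and measured launch cost. The function names are hypothetical, and the 10 microsecond figure is the assumption from the worked example, not a measurement.

```python
def launch_overhead_ms(layers, kernels_per_layer, launch_us):
    """Total CPU launch overhead per decoded token, in milliseconds."""
    return layers * kernels_per_layer * launch_us / 1000.0


def token_budget_ms(tokens_per_second):
    """Time available per token at a target decode rate, in milliseconds."""
    return 1000.0 / tokens_per_second


overhead = launch_overhead_ms(layers=32, kernels_per_layer=8, launch_us=10)
budget = token_budget_ms(50)
fraction = overhead / budget
print(f"{overhead:.2f} ms of {budget:.0f} ms = {fraction:.1%}")
```

Raising the target to 200 tokens/sec shrinks the budget to 5 ms, and the same 2.56 ms of launches becomes about half of it.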

[Figure: Kernel Launch Overhead in Decode. Eager launches: the gaps between kernels are CPU launch overhead. CUDA Graph replay: the same logical kernels, one replay launch.]
CUDA Graphs help when the same small-shape decode work repeats many times.
NCCL

Distributed training and inference are communication problems too.

Objective

By the end of this section, you should be able to estimate a ring all-reduce.

For a ring all-reduce with p GPUs, message size M, and link bandwidth BW, a common first estimate is:

time ~= 2 * (p - 1) / p * M / BW

For p = 8, M = 1 GB, BW = 600 GB/s:

time ~= 2 * 7/8 * 1 / 600
     ~= 0.0029 s
     ~= 2.9 ms

This formula ignores latency and topology details, but it shows that time scales with M / BW. That is why NVLink matters and why cross-node communication, where links are typically much slower, is a different regime.
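The first-order estimate is easy to turn into a helper. The sketch below implements only the bandwidth term from the formula above; the function name is hypothetical, and the 50 GB/s figure is an assumed slower link used for contrast, not a quoted spec.

```python
def ring_allreduce_seconds(num_gpus, message_bytes, link_bytes_per_second):
    """Bandwidth-only ring all-reduce estimate: 2 * (p - 1) / p * M / BW."""
    if num_gpus < 2:
        return 0.0  # nothing to communicate
    return 2 * (num_gpus - 1) / num_gpus * message_bytes / link_bytes_per_second


fast = ring_allreduce_seconds(8, 1e9, 600e9)  # the worked example above
slow = ring_allreduce_seconds(8, 1e9, 50e9)   # assumed 50 GB/s link
print(f"{fast * 1e3:.1f} ms vs {slow * 1e3:.1f} ms")
```

The factor 2 * (p - 1) / p approaches 2 as p grows, so under this model adding GPUs barely changes the time; bandwidth dominates.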

[Figure: NCCL Ring All-Reduce across GPU 0, GPU 1, GPU 2, and GPU 3. Reduce-scatter moves chunks around the ring; all-gather sends the reduced chunks back.]
NCCL collectives are the plumbing behind distributed training and tensor-parallel inference.
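To see where the 2 * (p - 1) / p factor comes from, here is a toy single-process simulation of the two ring phases. It is an illustration of the textbook ring algorithm with one number per chunk, not NCCL's implementation; the function name ring_all_reduce is hypothetical.

```python
def ring_all_reduce(buffers):
    """Simulate a sum all-reduce over p ranks, each holding p one-element chunks.

    Returns (final buffers, chunks sent per rank).
    """
    p = len(buffers)
    data = [list(b) for b in buffers]
    sent_per_rank = 0

    # Reduce-scatter: at each step rank r sends one chunk to rank r + 1,
    # which adds it to its own copy of that chunk.
    for step in range(p - 1):
        outgoing = [(r, (r - step) % p, data[r][(r - step) % p])
                    for r in range(p)]  # snapshot: all sends are simultaneous
        for r, idx, value in outgoing:
            data[(r + 1) % p][idx] += value
        sent_per_rank += 1

    # All-gather: rank r now owns the fully reduced chunk (r + 1) % p and
    # passes completed chunks around the ring, overwriting stale copies.
    for step in range(p - 1):
        outgoing = [(r, (r + 1 - step) % p, data[r][(r + 1 - step) % p])
                    for r in range(p)]
        for r, idx, value in outgoing:
            data[(r + 1) % p][idx] = value
        sent_per_rank += 1

    return data, sent_per_rank


ranks = [[1, 2, 3, 4], [10, 20, 30, 40],
         [100, 200, 300, 400], [1000, 2000, 3000, 4000]]
reduced, chunks_sent = ring_all_reduce(ranks)
print(reduced[0], chunks_sent)
```

Each rank sends 2 * (p - 1) chunks of size M / p, which is exactly the 2 * (p - 1) / p * M bytes in the estimate.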
Did You Know?

A systems detail worth remembering.

CUDA Graphs are not an ML-specific feature. They are a general CUDA mechanism, but LLM decode made them newly important because decode repeats the same tiny launch pattern token after token.
Exercise

Do the arithmetic, then run the notebook.

Run the notebook in three modes:

  1. CPU or MPS: calculate launch overhead and all-reduce estimates; run small PyTorch matmul timing if available.
  2. CUDA eager: benchmark repeated small matmuls or one tiny Transformer layer.
  3. CUDA graph: capture the repeated work and replay it, then compare launch overhead.

On consumer NVIDIA hardware, the graph path should reduce CPU launch overhead noticeably for small repeated decode shapes.
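For the third mode, a minimal capture-and-replay sketch using torch.cuda.CUDAGraph and torch.cuda.graph might look like the following. It is a starting point under assumptions, not the notebook's code: the function name, the stack of eight small linear layers, and the sizes are hypothetical stand-ins for a decode step. On machines without PyTorch or CUDA it returns None.

```python
import time

try:
    import torch
except ImportError:  # lets the file run on machines without PyTorch
    torch = None


def compare_eager_and_graph(iters=200, hidden=1024, depth=8):
    """Time a stack of small linears eagerly and via CUDA Graph replay."""
    if torch is None or not torch.cuda.is_available():
        return None  # CPU or MPS: nothing to capture

    layers = [torch.nn.Linear(hidden, hidden) for _ in range(depth)]
    model = torch.nn.Sequential(*layers).to("cuda", torch.float16).eval()
    static_x = torch.randn(1, hidden, device="cuda", dtype=torch.float16)

    with torch.no_grad():
        # Warm up on a side stream before capture.
        side = torch.cuda.Stream()
        side.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(side):
            for _ in range(3):
                model(static_x)
        torch.cuda.current_stream().wait_stream(side)

        # Capture one forward pass; static_y is overwritten by each replay.
        graph = torch.cuda.CUDAGraph()
        with torch.cuda.graph(graph):
            static_y = model(static_x)

        def timed(fn):
            torch.cuda.synchronize()
            start = time.perf_counter()
            for _ in range(iters):
                fn()
            torch.cuda.synchronize()
            return (time.perf_counter() - start) * 1000.0 / iters

        eager_ms = timed(lambda: model(static_x))
        graph_ms = timed(graph.replay)

        # New inputs must be copied into the captured input buffer.
        new_x = torch.randn_like(static_x)
        static_x.copy_(new_x)
        graph.replay()
        torch.cuda.synchronize()
        matches = torch.allclose(static_y, model(new_x), atol=1e-2, rtol=1e-2)

    return {"eager_ms": eager_ms, "graph_ms": graph_ms, "matches": bool(matches)}


result = compare_eager_and_graph()
print(result if result is not None else "CUDA not available; skipping capture")
```

The key constraint is that replay reuses the captured memory addresses, so inputs go in through static_x.copy_ and outputs are read from static_y. How large the gap between eager_ms and graph_ms is depends on your GPU, driver, and shapes.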

Self-Check

Answer these from memory.

  1. When should you call cuBLAS directly? Rarely; PyTorch already does. Direct calls are for special layouts, batching tricks, or custom fused paths.
  2. What does fused mean? Multiple logical operations run in one kernel so intermediates avoid HBM round-trips.
  3. Why do CUDA Graphs help decode? Decode repeats many small kernels where CPU launch overhead is visible.
  4. What does NCCL all-reduce do? Combines tensors across GPUs and returns the result to every GPU.
  5. Why is prefill less sensitive to launch overhead? Prefill launches fewer, larger kernels with much more work per launch.

"The fastest CUDA code is often the code that lets the right library do the right thing."

Further Reading

Go deeper.

Primary references and the companion notebook for today's exercise.

  • cuBLAS (Docs): Dense linear algebra documentation, including GEMM variants and compute types.
  • cuDNN (Docs): NVIDIA deep-learning primitive library and fused operation guide.
  • NCCL (Docs): Collective communication reference for multi-GPU workloads.
  • CUDA Graphs (Blog): NVIDIA's introduction to capture and replay for repeated launch patterns.

Notebook · Day 18

CUDA for ML Libraries notebook

Companion Jupyter notebook with runnable calculations and optional hardware-specific cells.