LLM Inference Engineer · Day 17
Day 17 · Week 3 · Inference & Hardware

CUDA Programming Basics: Write Your First Kernel

Today you leave library calls for one level and write kernels directly. The goal is not to outdo cuBLAS. The goal is to read CUDA code without fear: grids, blocks, threads, coalescing, shared memory, and tiled matmul.

Time~180 min
DifficultyHard
PrerequisiteDay 16
Notebookday-17
Why This Lesson

Hardware limits shape inference behavior.

Production inference engines rely on custom kernels. FlashAttention, paged KV cache updates, fused RMSNorm, rotary embedding, and quantized matmul all live below Python. You do not need to be a full-time kernel engineer, but you do need to recognize the programming model.

This lesson starts with vector add because the indexing is transparent. It then moves to naive matmul and tiled matmul, the smallest useful example of trading global-memory traffic for shared-memory reuse.

Learning Objectives

What you should be able to do today.

  1. Compute a global thread index from blockIdx, blockDim, and threadIdx.
  2. Explain host memory, device memory, and the host-to-device launch sequence.
  3. Write and bound-check a vector-add CUDA kernel.
  4. Distinguish coalesced from strided memory access.
  5. Implement naive matmul conceptually and explain why it rereads data.
  6. Explain how shared-memory tiling reduces global memory traffic.
Math Notation Cheatsheet

Decode the symbols before using them.

  • __global__ marks a CUDA function callable from the CPU and executed on the GPU.
  • blockIdx.x is the block's x-coordinate inside the launched grid.
  • threadIdx.x is the thread's x-coordinate inside its block.
  • blockDim.x is the number of threads in one block along x.
  • kernel<<<grid, block>>>(args) is CUDA launch syntax.
  • __shared__ declares per-block SRAM.
  • __syncthreads() waits until every thread in the block reaches the barrier.
Programming Model

A kernel is one function run by many threads.

Objective

By the end of this section, you should be able to map a 1-D array index to a CUDA thread.

Start with N = 1024 numbers and 256 threads per block. You need:

blocks = ceil(1024 / 256) = 4
threads_per_block = 256
total launched threads = 4 * 256 = 1024

Inside the kernel, each thread calculates:

int idx = blockIdx.x * blockDim.x + threadIdx.x;

Concrete thread examples:

blockIdx.x=0, threadIdx.x=0   -> idx = 0 * 256 + 0   = 0
blockIdx.x=0, threadIdx.x=255 -> idx = 0 * 256 + 255 = 255
blockIdx.x=1, threadIdx.x=0   -> idx = 1 * 256 + 0   = 256
blockIdx.x=3, threadIdx.x=10  -> idx = 3 * 256 + 10  = 778

Always bound-check with if (idx < N). Real launches are often rounded up, and extra threads must do nothing.

Grid -> Blocks -> Threads blockIdx.x = 00123blockIdx.x = 10123blockIdx.x = 20123blockIdx.x = 30123 idx = blockIdx.x * blockDim.x + threadIdx.x
The grid is made of blocks; each block is made of threads. The global index formula is the first CUDA habit.
Memory Access

Consecutive threads should read consecutive addresses.

Objective

By the end of this section, you should be able to spot a coalescing bug.

A warp is 32 NVIDIA threads that execute together. If thread 0 reads a[0], thread 1 reads a[1], and so on, the memory system can combine the warp's loads into wide transactions.

If thread 0 reads a[0], thread 1 reads a[2], thread 2 reads a[4], the warp touches a wider address span. The same arithmetic now needs more memory transactions. Day 16 would describe this as lower effective bandwidth.

Coalescing is why data layout matters. A logically equivalent layout can be much slower if adjacent threads walk memory with a stride.

Coalesced Access Keeps the Bus Full Coalesced: thread 0 reads address 0, thread 1 reads address 1, ... Strided: thread 0 reads 0, thread 1 reads 2, thread 2 reads 4, ... 01234567891011121314150123456789101112131415 one contiguous transaction more transactions, lower bandwidth
Coalesced access is a bandwidth optimization before it is a coding style preference.
Matmul

Naive matmul is correct, tiled matmul is reusable.

Objective

By the end of this section, you should be able to explain why shared memory helps matmul.

Naive matmul assigns one output element C[row, col] to one thread:

float acc = 0.0f;
for (int k = 0; k < K; k++) {
    acc += A[row * K + k] * B[k * N + col];
}
C[row * N + col] = acc;

This is exactly the Day 1 dot product repeated for every output cell. It is also wasteful: nearby threads reread overlapping rows and columns from global memory.

Tiled matmul changes the data movement:

  1. A block cooperatively loads a tile of A and a tile of B from HBM into shared memory.
  2. Threads multiply those tiles many times while the data is on-chip.
  3. The block moves to the next K tile.
  4. The final output tile is written once.

The math is the same. The memory traffic is not.

Tiled Matmul: Move Less, Reuse More A tile: BM x BK B tile: BK x BN C tile accumulates x = Tiles are loaded from HBM once, reused many times from shared memory.
Tiling converts repeated HBM reads into shared-memory reuse.
Apple Path

What MLX hides from you.

MLX does not ask you to allocate device memory or write cudaMemcpy calls. Apple Silicon uses unified memory, and MLX builds lazy computation graphs that lower to Metal kernels. That is convenient, but the same principles still apply: coalesced access, on-chip reuse, and shape alignment decide whether the hardware is fed well.

If you are on Apple Silicon only, still read this lesson. Then run the notebook's non-CUDA simulations and revisit Day 19 for the MLX-native path.

Did You Know?

A systems detail worth remembering.

The NVIDIA warp size has been 32 threads since the original Tesla architecture. Warp size is why so many CUDA performance rules talk about groups of 32.
Exercise

Do the arithmetic, then run the notebook.

On an NVIDIA machine:

  1. Compile and run the vector-add kernel from the notebook.
  2. Time host-to-device copy, kernel time, and device-to-host copy separately.
  3. Implement naive matmul for M = N = K = 1024.
  4. Implement a tiled shared-memory matmul and compare speed.

On Apple Silicon or CPU-only machines, run the notebook's indexing, coalescing, and traffic estimators. The conceptual checks still execute without CUDA.

Self-Check

Answer these from memory.

  1. What is threadIdx.x? The thread's index inside its block along the x dimension.
  2. Why do kernels need bounds checks? Launches are often rounded up; extra threads must avoid out-of-range memory.
  3. What is coalesced access? Adjacent threads read or write adjacent addresses so the warp uses fewer transactions.
  4. What is shared memory? Fast per-block SRAM explicitly managed by the kernel.
  5. Why does tiled matmul help? A and B tiles are loaded from HBM once and reused many times on-chip.

"A CUDA kernel is just a function plus a mapping from threads to data."

Day 17 · Week 3
Further Reading

Go deeper.

Primary references and the companion notebook for today's exercise.

Docs

CUDA C++ Programming Guide

The canonical source for CUDA syntax, memory hierarchy, and execution model.

Open
Book

Programming Massively Parallel Processors

The standard practical text for CUDA fundamentals.

Open
Article

How to Optimize a CUDA GEMM

A readable progression from naive to high-performance matmul.

Open
Notebook · Day 17

CUDA Programming Basics notebook

Companion Jupyter notebook with runnable calculations and optional hardware-specific cells.

Open notebook