Today you leave library calls for one level and write kernels directly. The goal is not to outdo cuBLAS. The goal is to read CUDA code without fear: grids, blocks, threads, coalescing, shared memory, and tiled matmul.
Production inference engines rely on custom kernels. FlashAttention, paged KV cache updates, fused RMSNorm, rotary embedding, and quantized matmul all live below Python. You do not need to be a full-time kernel engineer, but you do need to recognize the programming model.
This lesson starts with vector add because the indexing is transparent. It then moves to naive matmul and tiled matmul, the smallest useful example of trading global-memory traffic for shared-memory reuse.
blockIdx, blockDim, and threadIdx.__global__ marks a CUDA function callable from the CPU and executed on the GPU.blockIdx.x is the block's x-coordinate inside the launched grid.threadIdx.x is the thread's x-coordinate inside its block.blockDim.x is the number of threads in one block along x.kernel<<<grid, block>>>(args) is CUDA launch syntax.__shared__ declares per-block SRAM.__syncthreads() waits until every thread in the block reaches the barrier.By the end of this section, you should be able to map a 1-D array index to a CUDA thread.
Start with N = 1024 numbers and 256 threads per block. You need:
blocks = ceil(1024 / 256) = 4
threads_per_block = 256
total launched threads = 4 * 256 = 1024
Inside the kernel, each thread calculates:
int idx = blockIdx.x * blockDim.x + threadIdx.x;
Concrete thread examples:
blockIdx.x=0, threadIdx.x=0 -> idx = 0 * 256 + 0 = 0
blockIdx.x=0, threadIdx.x=255 -> idx = 0 * 256 + 255 = 255
blockIdx.x=1, threadIdx.x=0 -> idx = 1 * 256 + 0 = 256
blockIdx.x=3, threadIdx.x=10 -> idx = 3 * 256 + 10 = 778
Always bound-check with if (idx < N). Real launches are often rounded up, and extra threads must do nothing.
By the end of this section, you should be able to spot a coalescing bug.
A warp is 32 NVIDIA threads that execute together. If thread 0 reads a[0], thread 1 reads a[1], and so on, the memory system can combine the warp's loads into wide transactions.
If thread 0 reads a[0], thread 1 reads a[2], thread 2 reads a[4], the warp touches a wider address span. The same arithmetic now needs more memory transactions. Day 16 would describe this as lower effective bandwidth.
Coalescing is why data layout matters. A logically equivalent layout can be much slower if adjacent threads walk memory with a stride.
By the end of this section, you should be able to explain why shared memory helps matmul.
Naive matmul assigns one output element C[row, col] to one thread:
float acc = 0.0f;
for (int k = 0; k < K; k++) {
acc += A[row * K + k] * B[k * N + col];
}
C[row * N + col] = acc;
This is exactly the Day 1 dot product repeated for every output cell. It is also wasteful: nearby threads reread overlapping rows and columns from global memory.
Tiled matmul changes the data movement:
A and a tile of B from HBM into shared memory.K tile.The math is the same. The memory traffic is not.
MLX does not ask you to allocate device memory or write cudaMemcpy calls. Apple Silicon uses unified memory, and MLX builds lazy computation graphs that lower to Metal kernels. That is convenient, but the same principles still apply: coalesced access, on-chip reuse, and shape alignment decide whether the hardware is fed well.
If you are on Apple Silicon only, still read this lesson. Then run the notebook's non-CUDA simulations and revisit Day 19 for the MLX-native path.
On an NVIDIA machine:
M = N = K = 1024.On Apple Silicon or CPU-only machines, run the notebook's indexing, coalescing, and traffic estimators. The conceptual checks still execute without CUDA.
threadIdx.x? The thread's index inside its block along the x dimension."A CUDA kernel is just a function plus a mapping from threads to data."
Primary references and the companion notebook for today's exercise.
The canonical source for CUDA syntax, memory hierarchy, and execution model.
OpenCompanion Jupyter notebook with runnable calculations and optional hardware-specific cells.
Open notebook