Apple Silicon is not a small NVIDIA GPU. It has a different memory model, a different software stack, and a surprisingly strong batch-1 inference profile. Today you learn about unified memory, MLX's lazy evaluation model, PyTorch MPS, Metal, and when to choose MLX-LM or llama.cpp.
The course treats NVIDIA and Apple Silicon as first-class targets because inference engineers often use both: NVIDIA for throughput and custom CUDA kernels, Apple Silicon for local iteration and large unified-memory capacity.
The key mental shift is memory. A discrete GPU has CPU RAM and GPU VRAM separated by PCIe or NVLink. Apple Silicon uses one physical memory pool shared by CPU and GPU. This removes explicit host-to-device copies, and because the GPU addresses the same pool as the CPU, model capacity is bounded by system memory rather than by a separate VRAM budget.
.to('cuda')-style copies.
mx.eval, mx.compile, mx.vmap, and mx.grad conceptually.
mx.array is the MLX array type, similar to a NumPy array.
mx.eval(x) forces queued MLX work to execute and materialize its result.
mx.compile(f) JIT-compiles a function for repeated execution.
mx.vmap(f) vectorizes a function over a batch dimension.
mx.grad(f) returns a gradient function.
MPS means Metal Performance Shaders; in PyTorch, "mps" is the Apple GPU backend built on it.
GGUF is the llama.cpp model file format used for quantized local inference.
By the end of this section, you should be able to explain why a Mac can run models that do not fit on many discrete GPUs.
Concrete model memory:
70B parameters at 4 bits
= 70e9 * 0.5 bytes
= 35e9 bytes
~= 35 GB
That does not include metadata, activations, or KV cache, but it explains the appeal: a 64 GB or 128 GB unified-memory Mac can hold the weights in one machine. A 40 GB discrete GPU may not have enough VRAM after overhead.
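The arithmetic above generalizes to any parameter count and bit width. The following is a minimal sketch; the function name `weight_gb` is hypothetical, and it counts weights only, so metadata, activations, and KV cache still come on top.

```python
def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Return the size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = n_params * bits_per_weight / 8  # 8 bits per byte
    return bytes_total / 1e9


# The 70B example from the text, plus two other common bit widths for comparison.
for bits in (16, 8, 4):
    print(f"70B at {bits}-bit: {weight_gb(70e9, bits):.0f} GB")
```

At 4 bits this reproduces the 35 GB figure; at 16 bits the same model needs 140 GB for weights alone.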
Decode throughput is still bandwidth-bound. If a 35 GB quantized model is reread for each generated token and memory bandwidth is 800 GB/s:
upper bound ~= (800 GB/s) / (35 GB per token)
~= 22.9 tokens/sec
That is a bandwidth roofline estimate, not a promise. It is still the right first-order model to reason with.
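The same roofline estimate can be written as a one-line function. This is a sketch under the assumption stated above, namely that every weight byte is read once per generated token; the function name is hypothetical, and the 400 GB/s row is an illustrative value rather than a measurement of any particular chip.

```python
def decode_tokens_per_sec_upper_bound(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Bandwidth roofline: tokens/sec if all weights are reread once per token."""
    if weight_gb <= 0:
        raise ValueError("weight_gb must be positive")
    return bandwidth_gb_s / weight_gb


# 800 GB/s is the figure used in the text; 400 GB/s is an illustrative comparison.
for bandwidth in (800, 400):
    bound = decode_tokens_per_sec_upper_bound(35, bandwidth)
    print(f"{bandwidth} GB/s, 35 GB of weights: at most {bound:.1f} tokens/sec")
```

Halving the bandwidth halves the bound, which is why bandwidth matters more than peak compute for batch-1 decode.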
By the end of this section, you should understand why mx.eval() exists.
PyTorch eager execution runs each operation as Python asks for it. MLX records operations lazily and executes when results are needed. That lets MLX fuse and compile work more aggressively.
Minimal MLX mental model:
import mlx.core as mx
x = mx.array([1.0, 2.0, 3.0])
y = x * 2 + 1 # queued
mx.eval(y) # materialized
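To make "queued" versus "materialized" concrete, here is a toy lazy array in plain Python. It is not how MLX is implemented and does no fusion or compilation; the class `LazyArray` and its methods are hypothetical and only illustrate the idea of recording operations and running them on demand.

```python
class LazyArray:
    """Toy lazy value: records operations and computes only when eval() is called."""

    def __init__(self, compute, inputs=()):
        self._compute = compute  # function that produces a list from the input values
        self._inputs = inputs    # upstream LazyArray nodes this node depends on
        self._value = None       # filled in by eval()

    @classmethod
    def from_list(cls, values):
        data = list(values)
        return cls(lambda: data)

    def __mul__(self, scalar):
        # Record the multiply; nothing is computed yet.
        return LazyArray(lambda v: [e * scalar for e in v], (self,))

    def __add__(self, scalar):
        # Record the add; nothing is computed yet.
        return LazyArray(lambda v: [e + scalar for e in v], (self,))

    @property
    def is_materialized(self):
        return self._value is not None

    def eval(self):
        # Evaluate upstream nodes first, then this node, caching the result.
        if self._value is None:
            args = [node.eval() for node in self._inputs]
            self._value = self._compute(*args)
        return self._value


x = LazyArray.from_list([1.0, 2.0, 3.0])
y = x * 2 + 1               # queued: only the graph exists
print(y.is_materialized)    # False
print(y.eval())             # [3.0, 5.0, 7.0]
print(y.is_materialized)    # True
```

Because the whole expression is known before anything runs, a real framework has the chance to combine the multiply and add into one pass over the data.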
Function transforms wrap functions rather than modules:
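# f and loss_fn are placeholders for your own functions of mx.array inputs.
compiled_f = mx.compile(f)   # compiled on first call, reused on later calls
batched_f = mx.vmap(f)       # maps f over the leading axis of its input by default
grad_f = mx.grad(loss_fn)    # loss_fn must return a scalar; differentiates w.r.t. its first argument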
You saw the same mathematical idea in Day 2 autograd: build a computation, then differentiate or execute it. MLX just makes transforms a first-class part of the API.
By the end of this section, you should know how to choose the Apple backend.
PyTorch MPS is useful when you want one PyTorch codebase to run on Mac:
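import torch
device = "mps" if torch.backends.mps.is_available() else "cpu"  # Apple GPU if present, else CPU
x = torch.ones(3)  # placeholder tensor for illustration
x = x.to(device)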
The caveat is coverage. Some operations may fall back to CPU or have different performance behavior than CUDA. This matters for benchmarks: a silent fallback can make a GPU path look mysteriously slow.
MLX is usually the better fit for Mac-native LLM inference. llama.cpp is often the best fit for GGUF quantized local serving, especially when you want broad model compatibility and a mature CLI/server.
Use this practical rule:
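A minimal sketch of that choice, based only on the guidance in the paragraphs above. The function name, its two boolean inputs, and the order in which they are checked are assumptions made for illustration.

```python
def choose_backend(needs_single_pytorch_codebase: bool, model_is_gguf: bool) -> str:
    """Pick an Apple Silicon inference path from two yes/no questions."""
    if needs_single_pytorch_codebase:
        # One PyTorch codebase that also runs on a Mac.
        return "pytorch-mps"
    if model_is_gguf:
        # GGUF quantized local serving with a mature CLI/server.
        return "llama.cpp"
    # Default for Mac-native LLM inference.
    return "mlx-lm"


print(choose_backend(needs_single_pytorch_codebase=True, model_is_gguf=False))
print(choose_backend(needs_single_pytorch_codebase=False, model_is_gguf=True))
print(choose_backend(needs_single_pytorch_codebase=False, model_is_gguf=False))
```

Treat the result as a starting point; operator coverage and model availability can still push the decision the other way.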
Day 20 and Day 21 still apply on Apple Silicon: KV cache memory and attention IO are hardware-agnostic constraints. Only the implementation path changes.
Run the notebook:
For a real hardware exercise, install MLX-LM and llama.cpp separately, run the same prompt through one 4-bit model, and record tokens/sec plus memory used.
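For the tokens/sec half of that exercise, a small timing wrapper keeps the measurement identical across tools. This is a sketch: `measure_tokens_per_sec` is a hypothetical helper, and `fake_generate` is a stand-in that only sleeps, so replace it with a call into whichever runtime you are testing. It does not measure memory.

```python
import time


def measure_tokens_per_sec(generate, prompt, max_tokens):
    """Time one generation call and return tokens generated per second.

    `generate` is any callable that takes (prompt, max_tokens) and returns
    the number of tokens it actually produced.
    """
    start = time.perf_counter()
    n_generated = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    return n_generated / elapsed


def fake_generate(prompt, max_tokens):
    # Stand-in for a real model: pretend each token takes about 1 ms.
    for _ in range(max_tokens):
        time.sleep(0.001)
    return max_tokens


rate = measure_tokens_per_sec(fake_generate, "Hello", 20)
print(f"{rate:.1f} tokens/sec")
```

Use the same prompt and token budget for both runtimes, and discard the first run if it includes model loading.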
tensor.to('cuda')? There is no separate discrete VRAM copy target in the same way.
"On Apple Silicon, local LLM inference is often a memory-capacity story first and a compute story second."
Primary references and the companion notebook for today's exercise.
Companion Jupyter notebook with runnable calculations and optional hardware-specific cells.