LLM Inference Engineer · Day 19
Day 19 · Week 3 · Inference & Hardware

Apple Silicon: MLX, Metal, MPS & Unified Memory

Apple Silicon is not a small NVIDIA GPU. It has a different memory model, different software stack, and a surprisingly strong batch-1 inference profile. Today you learn unified memory, MLX's lazy model, PyTorch MPS, Metal, and when to choose MLX-LM or llama.cpp.

Time: ~150 min
Difficulty: Medium
Prerequisite: Day 18
Notebook: day-19
Why This Lesson

Hardware limits shape inference behavior.

The course treats NVIDIA and Apple Silicon as first-class targets because inference engineers often use both: NVIDIA for throughput and custom CUDA kernels, Apple Silicon for local iteration and large unified-memory capacity.

The key mental shift is memory. A discrete-GPU system keeps CPU RAM and GPU VRAM as separate pools connected by PCIe or NVLink. Apple Silicon uses one physical memory pool shared by CPU and GPU. This removes explicit host-to-device copies and makes model capacity unusually attractive.

Learning Objectives

What you should be able to do today.

  1. Explain unified memory and why MLX does not use .to('cuda')-style copies.
  2. Contrast MLX lazy evaluation with PyTorch eager execution.
  3. Use mx.eval, mx.compile, mx.vmap, and mx.grad conceptually.
  4. State when PyTorch MPS is convenient and when MLX is likely better.
  5. Estimate model footprint for FP16 and 4-bit local inference.
  6. Choose between MLX-LM and llama.cpp for a local Mac workload.
Math Notation Cheatsheet

Decode the symbols before using them.

  • mx.array is the MLX array type, similar to a NumPy array.
  • mx.eval(x) forces queued MLX work to execute and materialize.
  • mx.compile(f) JIT-compiles a function for repeated execution.
  • mx.vmap(f) vectorizes a function over a batch dimension.
  • mx.grad(f) returns a gradient function.
  • MPS means Metal Performance Shaders, Apple's GPU kernel library; PyTorch's "mps" device is the backend built on it.
  • GGUF is the llama.cpp model file format used for quantized local inference.
Unified Memory

Capacity is the Apple Silicon superpower.

Objective

By the end of this section, you should be able to explain why a Mac can run models that do not fit on many discrete GPUs.

Concrete model memory:

70B parameters at 4 bits
= 70e9 * 0.5 bytes
= 35e9 bytes
~= 35 GB

That does not include metadata, activations, or KV cache, but it explains the appeal: a 64 GB or 128 GB unified-memory Mac can hold the weights in one machine. A 40 GB discrete GPU may not have enough VRAM after overhead.
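The same arithmetic can be wrapped in a small helper. This is a minimal sketch, not code from the course notebook: the function name `weight_footprint_gb` is made up for illustration, it counts weights only, and it uses decimal gigabytes (1 GB = 1e9 bytes) to match the numbers above.

```python
def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Weights-only footprint in decimal GB (1 GB = 1e9 bytes).

    Hypothetical helper: ignores metadata, activations, and KV cache.
    """
    bytes_total = num_params * bits_per_weight / 8  # bits -> bytes
    return bytes_total / 1e9


fp16_gb = weight_footprint_gb(70e9, 16)
int4_gb = weight_footprint_gb(70e9, 4)
print(f"70B FP16: {fp16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB")
# prints: 70B FP16: 140 GB, 4-bit: 35 GB
```

Real 4-bit formats usually store per-group scales alongside the weights, so the effective bits per weight is somewhat above 4; treat the result as a lower bound.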

Decode throughput is still bandwidth-bound: at batch size 1, every generated token requires reading all the weights once. If a 35 GB quantized model is reread for each token and memory bandwidth is 800 GB/s:

upper bound ~= 800 / 35
            ~= 22.9 tokens/sec

That is a bandwidth roofline estimate, not a promise. It is still the right first model.
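The roofline estimate is a one-line division, sketched here as a helper so other model sizes and bandwidths can be tried. The function name `decode_upper_bound_tps` is an assumption for illustration, and the model is the simple one used above: each token rereads all the weights once and nothing else limits throughput.

```python
def decode_upper_bound_tps(weight_gb: float, bandwidth_gb_s: float) -> float:
    """Bandwidth roofline for batch-1 decode: tokens/sec <= bandwidth / bytes read per token.

    Hypothetical helper: assumes every token rereads all weights exactly once.
    """
    return bandwidth_gb_s / weight_gb


print(round(decode_upper_bound_tps(35, 800), 1))  # prints 22.9
```

Measured throughput will normally land below this bound, since KV cache reads and kernel overheads also consume time.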

[Figure: Discrete VRAM vs Unified Memory. Discrete: CPU RAM and GPU VRAM linked by a PCIe copy. Apple: one unified memory pool in which CPU and GPU share physical RAM.]
Unified memory removes explicit CPU-to-GPU copies, but bandwidth still bounds decode.
MLX

MLX is NumPy-like at the surface and JAX-like underneath.

Objective

By the end of this section, you should understand why mx.eval() exists.

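PyTorch eager execution runs each operation as soon as Python issues it. MLX records operations lazily into a graph and executes them only when a result is needed, for example when you call mx.eval() or print an array. Deferring execution gives MLX a view of the whole computation, which is what lets it fuse and compile work more aggressively.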

Minimal MLX mental model:

import mlx.core as mx

x = mx.array([1.0, 2.0, 3.0])
y = x * 2 + 1      # queued
mx.eval(y)         # materialized
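To make "queued" versus "materialized" concrete without needing MLX installed, here is a pure-Python toy. It is not how MLX is implemented; the `Lazy` class and its methods are invented for illustration and only mimic the idea that arithmetic builds a recipe and `eval()` runs it.

```python
class Lazy:
    """Toy lazy value: records an operation and runs it only on eval()."""

    def __init__(self, compute, label):
        self._compute = compute  # zero-argument function that produces the value
        self._value = None
        self._done = False
        self.label = label       # human-readable description of the recorded graph

    @staticmethod
    def constant(values):
        return Lazy(lambda: list(values), "const")

    def __mul__(self, k):
        # Nothing is computed here; we only record how to compute it later.
        return Lazy(lambda: [v * k for v in self.eval()], f"({self.label} * {k})")

    def __add__(self, k):
        return Lazy(lambda: [v + k for v in self.eval()], f"({self.label} + {k})")

    def eval(self):
        if not self._done:
            self._value = self._compute()
            self._done = True
        return self._value

    @property
    def materialized(self):
        return self._done


x = Lazy.constant([1.0, 2.0, 3.0])
y = x * 2 + 1                    # queued: only the recipe exists
print(y.label, y.materialized)   # prints: ((const * 2) + 1) False
print(y.eval(), y.materialized)  # prints: [3.0, 5.0, 7.0] True
```

The point of the toy is the ordering: the multiply and add run only when `eval()` is called, which is the role mx.eval(y) plays in the MLX snippet.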

Function transforms wrap functions rather than modules:

compiled_f = mx.compile(f)
batched_f = mx.vmap(f)
grad_f = mx.grad(loss_fn)

You saw the same mathematical idea in Day 2 autograd: build a computation, then differentiate or execute it. MLX just makes transforms a first-class part of the API.
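A small runnable sketch of the three transforms follows. The function `sum_of_squares` and the wrapper `demo_transforms` are illustrative names, not part of MLX or the lesson notebook, and the import is guarded so the script still runs (returning None) on machines without MLX.

```python
try:
    import mlx.core as mx
except ImportError:  # MLX not installed, for example on a non-Apple machine
    mx = None


def sum_of_squares(x):
    # Illustrative scalar-valued function: f(x) = sum(x_i^2), so df/dx = 2x.
    return mx.sum(x * x)


def demo_transforms():
    """Apply grad, vmap, and compile to the same plain function."""
    if mx is None:
        return None
    x = mx.array([1.0, 2.0, 3.0])
    batch = mx.array([[1.0, 2.0], [3.0, 4.0]])

    grad_f = mx.grad(sum_of_squares)         # returns a function computing df/dx
    batched_f = mx.vmap(sum_of_squares)      # maps f over the leading axis
    compiled_f = mx.compile(sum_of_squares)  # traces and compiles f for reuse

    g, b, c = grad_f(x), batched_f(batch), compiled_f(x)
    mx.eval(g, b, c)                         # force the queued work to run
    return {"grad": g.tolist(), "vmap": b.tolist(), "compiled": c.item()}


print(demo_transforms())
```

Each transform takes a function and returns a function, so they can be nested, for example compiling a gradient function.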

[Figure: Eager Execution vs Lazy Evaluation. PyTorch eager: op 1, op 2, op 3 each run immediately. MLX lazy: build graph op1 -> op2 -> op3, then mx.eval().]
Lazy evaluation is why MLX code can look NumPy-like while still compiling efficient Metal kernels.
[Figure: Function Transforms in MLX. f(x): plain function; mx.grad(f): differentiate; mx.vmap(f): batch it; mx.compile(f): fuse/JIT it. MLX feels closer to JAX than to PyTorch: transforms wrap functions.]
MLX transforms compose around Python functions.
PyTorch MPS

MPS is convenient, but watch for fallback.

Objective

By the end of this section, you should know how to choose the Apple backend.

PyTorch MPS is useful when you want one PyTorch codebase to run on Mac:

import torch

device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.ones(2, 3).to(device)  # example tensor moved to the chosen device

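The caveat is coverage. Some operations are not implemented for MPS: by default they raise an error, and with the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 set they run on the CPU instead. Others run on the GPU but with different performance behavior than CUDA. This matters for benchmarks: an overlooked CPU fallback can make a GPU path look mysteriously slow.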
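Before trusting an MPS benchmark, it helps to record which device was actually chosen and whether CPU fallback was switched on. The sketch below is one way to do that; `pick_device` and `fallback_enabled` are hypothetical helper names, and the torch import is guarded so the script also runs where PyTorch is not installed.

```python
import os

try:
    import torch
except ImportError:  # PyTorch not installed
    torch = None


def pick_device() -> str:
    """Prefer CUDA, then MPS, then CPU. Hypothetical helper for illustration."""
    if torch is None:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent in old PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


def fallback_enabled() -> bool:
    """True if the process was started with PYTORCH_ENABLE_MPS_FALLBACK=1."""
    return os.environ.get("PYTORCH_ENABLE_MPS_FALLBACK", "0") == "1"


device = pick_device()
print(f"device={device}, cpu_fallback_enabled={fallback_enabled()}")
```

Logging both values next to every timing makes a slow run easier to explain later. GPU work is also queued asynchronously, so a timing loop should synchronize (recent PyTorch releases provide torch.mps.synchronize()) before reading the clock.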

MLX is usually the better fit for Mac-native LLM inference. llama.cpp is often the best fit for GGUF quantized local serving, especially when you want broad model compatibility and a mature CLI/server.

Local Inference Choice

Pick the stack by artifact and workload.

Use this practical rule:

  • Choose MLX-LM when you want Apple's native Python path, model conversion, and fast local experimentation.
  • Choose llama.cpp when you want GGUF models, broad quantization support, CPU+Metal portability, and a battle-tested local server.
  • Choose PyTorch MPS when you are prototyping PyTorch code that should also run on CUDA later.

Day 20 and Day 21 still apply on Apple Silicon: KV cache memory and attention IO are hardware-agnostic constraints. Only the implementation path changes.

Did You Know?

A systems detail worth remembering.

Apple's unified memory changes the capacity story more than the compute story. A large Mac can hold very large quantized models locally, but decode throughput is still governed by bandwidth and cache behavior.
Exercise

Do the arithmetic, then run the notebook.

Run the notebook:

  1. Compute FP16 and 4-bit model footprints for 7B, 13B, 70B, and 405B parameter counts.
  2. Execute the NumPy Transformer-block reference.
  3. If PyTorch is installed, run the same block on CPU or MPS.
  4. If MLX is installed, run the MLX setup cell and compare the API shape.

For a real hardware exercise, install MLX-LM and llama.cpp separately, run the same prompt through one 4-bit model, and record tokens/sec plus memory used.
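For the tokens/sec measurement, a backend-agnostic timing wrapper keeps the comparison fair. This is a sketch under stated assumptions: `measure_tokens_per_sec` is a made-up helper, and `generate` is any callable you supply that takes a prompt and a token budget and returns the generated tokens. The `fake_generate` function is a stand-in so the example runs without any model; it is not an MLX-LM or llama.cpp API.

```python
import time
from typing import Callable, Sequence


def measure_tokens_per_sec(
    generate: Callable[[str, int], Sequence],
    prompt: str,
    max_tokens: int = 64,
) -> dict:
    """Time one generation call and report end-to-end tokens/sec."""
    start = time.perf_counter()
    tokens = generate(prompt, max_tokens)
    elapsed = time.perf_counter() - start
    n = len(tokens)
    tps = n / elapsed if elapsed > 0 else float("inf")
    return {"tokens": n, "seconds": elapsed, "tokens_per_sec": tps}


def fake_generate(prompt: str, max_tokens: int) -> list:
    # Stand-in backend: pretends each token takes about 1 ms to produce.
    out = []
    for i in range(max_tokens):
        time.sleep(0.001)
        out.append(i)
    return out


stats = measure_tokens_per_sec(fake_generate, "Hello", max_tokens=20)
print(stats["tokens"], f"{stats['tokens_per_sec']:.1f} tok/s")
```

The number includes prompt processing as well as decode, so use the same prompt and the same token budget for both stacks, and record memory use separately.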

Self-Check

Answer these from memory.

  1. What is unified memory? CPU and GPU share the same physical memory pool.
  2. Why does MLX not need tensor.to('cuda')? Arrays live in the one memory pool that both CPU and GPU can address, so there is no separate VRAM to copy into.
  3. What does lazy evaluation mean? Operations are queued into a graph and executed when materialized.
  4. When should you use llama.cpp? For GGUF quantized local inference and a mature local server/CLI.
  5. What is the main caveat with PyTorch MPS? Op coverage and possible CPU fallback/performance surprises.

"On Apple Silicon, local LLM inference is often a memory-capacity story first and a compute story second."

Further Reading

Go deeper.

Primary references and the companion notebook for today's exercise.

  • Docs · MLX: Apple's MLX documentation and API reference.
  • Repo · MLX examples: Reference implementations including MLX-LM.
  • Docs · PyTorch MPS: PyTorch's Apple GPU backend notes and caveats.
  • Docs · llama.cpp Metal build: Build notes for llama.cpp with Metal acceleration.
  • Notebook · Day 19 · Apple Silicon & MLX notebook: Companion Jupyter notebook with runnable calculations and optional hardware-specific cells.