Tensors as the universal data structure — shape, dtype, device, strides, memory layout. How autograd actually works: forward DAG, backward walk, chain rule on graphs. PyTorch and MLX internals compared side by side. Every symbol decoded, every gradient verified by hand.
Yesterday you saw the math: matmuls, softmax, chain rule, cross-entropy. Today we wire all of that into the two abstractions every modern ML framework is built on:
dtype, shape, device. Where the numbers live.

Get fluent with these and every framework (PyTorch, MLX, JAX, TensorFlow, NumPy + a hand-rolled autograd) feels familiar. We focus on PyTorch (cross-platform standard, what NVIDIA users mostly write) and MLX (Apple Silicon native, what we'll use whenever Mac performance matters). They look 90% alike; the 10% difference matters when you read source code.
By the end of the lesson you should be able to read a forward pass written in either framework, predict its shapes at every line, explain in plain English what loss.backward() does under the hood, and connect dtype choices to real inference memory costs.
- …shape, dtype, device, stride) and create one in PyTorch and MLX.
- …einsum operations, and verify by running code.
- … .view() fails, and explain why a transposed tensor is non-contiguous.
- …numel × bytes_per_element.
- …grad_fn is, why gradients accumulate at branch points, and why inference should use torch.inference_mode().

…look at any line of transformer forward-pass code, say exactly what shape comes out, how many bytes that tensor occupies in memory, whether it holds a gradient, and what device it lives on. That's the entire mental model for inference engineering.
Pin this picture before diving in:
Every concept today either lives on the tensor side of this picture (shape, dtype, device, strides, broadcasting) or on the autograd side (graph, grad_fn, accumulation, no-grad).
A tensor is an n-dimensional array of numbers with three pieces of metadata: dtype, shape, device.
- 0-D (scalar): 5.0, shape ()
- 1-D (vector): [1, 2, 3], shape (3,)
- 2-D (matrix): [[1, 2], [3, 4]], shape (2, 2)
- 3-D: (B, T, D), image (C, H, W)
- 4-D: (B, C, H, W), attention scores (B, h, T, T)
- 5-D: (B, T, C, H, W)

A scalar is a special case of a vector, which is a special case of a matrix, which is a special case of a tensor. Every number in deep learning lives in a tensor.
import torch
x = torch.tensor([[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0]])
print(x.shape, x.dtype, x.device)
# torch.Size([2, 3]) torch.float32 cpu
A tensor is a view over a flat block of memory. Two pieces of metadata interpret that flat block as multi-dimensional:
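- Shape: the size along each dimension, e.g. (2, 3).
- Strides: how many elements to step in the flat block to advance by one along each dimension.

For a contiguous (2, 3) matrix in row-major order, the flat layout is x[0,0], x[0,1], x[0,2], x[1,0], x[1,1], x[1,2]. The strides are (3, 1): element (i, j) sits at flat offset 3·i + 1·j.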
Strides are why transpose, slicing, and reshape are usually free — they just change the metadata, not the bytes. And they're why operations on non-contiguous tensors sometimes need a .contiguous() call: a kernel that expects row-major data won't work on transposed strides.
x = torch.arange(6).reshape(2, 3)
print(x.shape, x.stride()) # torch.Size([2, 3]) (3, 1)
print(x.T.shape, x.T.stride()) # torch.Size([3, 2]) (1, 3) same storage
print(x.T.is_contiguous()) # False
print(x.T.contiguous().stride()) # (2, 1) copied to a new buffer
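To make the stride arithmetic concrete, here is a small sketch that locates every element by hand and checks it against normal indexing, for both x and its transpose. The helper name flat_offset is ours for illustration, not a PyTorch API.

```python
import torch

def flat_offset(index, strides):
    """Flat-buffer position of `index`, given per-dimension strides (in elements)."""
    return sum(i * s for i, s in zip(index, strides))

x = torch.arange(6).reshape(2, 3)   # contiguous, strides (3, 1)
buffer = x.flatten()                # same element order as the underlying storage
xt = x.T                            # shape (3, 2), strides (1, 3)

# The transpose shares x's memory: only shape and strides changed.
shares_storage = xt.data_ptr() == x.data_ptr()

# Every element of both views can be found as sum(index[d] * stride[d]).
x_ok = all(buffer[flat_offset((i, j), x.stride())] == x[i, j]
           for i in range(2) for j in range(3))
xt_ok = all(buffer[flat_offset((i, j), xt.stride())] == xt[i, j]
            for i in range(3) for j in range(2))

print(shares_storage, x_ok, xt_ok)  # True True True
```

The same buffer serves both views; the transpose is nothing more than a different pair of strides over it.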
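Four operations come up whenever you change a tensor's shape or layout: .view(), .reshape(), .T / .transpose(), and .contiguous(). Only .contiguous() exists to copy, and even it returns the tensor unchanged when the data is already contiguous; .reshape() copies only when a view is impossible. The rest manipulate metadata: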
.view() and .T are free (no copy, just different strides). On a matrix, .T generally produces a non-contiguous tensor, and calling .view() to flatten it raises a RuntimeError. Use .reshape() or .contiguous().view() to recover.

Every tensor has a dtype (data type) that determines how many bytes each element occupies and what range of values it can represent. Before any formulas, here's the concrete picture: an fp32 number uses 32 bits (1 sign + 8 exponent + 23 mantissa); bf16 truncates the mantissa to 7 bits but keeps the same 8-bit exponent, preserving fp32's wide range while halving the bytes.
bf16 keeps fp32's 8-bit exponent (wide range) but truncates the mantissa to 7 bits. fp16 has only 5 exponent bits: narrower range, overflow risk in activations. Both cost 2 bytes. This choice drives the KV-cache and weight-matrix memory budget for every LLM you deploy.

| dtype | bits | bytes | exponent bits | mantissa bits | typical use |
|---|---|---|---|---|---|
| float64 | 64 | 8 | 11 | 52 | scientific computing; almost never in DL |
| float32 | 32 | 4 | 8 | 23 | training default, full precision |
| bfloat16 | 16 | 2 | 8 | 7 | training & inference; same range as fp32 |
| float16 | 16 | 2 | 5 | 10 | inference; narrower range (overflow risk) |
| int8 | 8 | 1 | n/a | n/a | quantized inference (Day 22) |
| int4 | 4 | 0.5 | n/a | n/a | aggressively quantized inference |
Memory formula. Total bytes for a tensor = numel() × bytes_per_element. For a model's weights: num_parameters × bytes_per_dtype. You can read this off in one line:
x = torch.randn(1024, 4096, dtype=torch.bfloat16)
print(x.numel() * x.element_size()) # 1024 * 4096 * 2 = 8,388,608 bytes = 8 MB
Concrete VRAM math. A 7B-parameter model in fp32 needs 7 × 10⁹ × 4 = 28 GB of VRAM just for the weights. In bf16: 14 GB. In int8: 7 GB. In int4: 3.5 GB. dtype is the single biggest lever for "does this model fit on this GPU." We'll see this lever pulled hard on Days 22 and 24.
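The VRAM arithmetic is easy to script. The sketch below is plain Python; the table BYTES_PER_PARAM and the function weight_gb are illustrative names of ours, and the result covers weights only (no activations, KV cache, or framework overhead).

```python
# Bytes per parameter for the dtypes in the table above.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(num_params, dtype):
    """Weight memory in GB (10^9 bytes) for a model with num_params parameters."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in ("fp32", "bf16", "int8", "int4"):
    print(f"7B in {dtype}: {weight_gb(7e9, dtype):.1f} GB")
```

Swap in any parameter count to check whether a given model fits on a given GPU before downloading it.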
The word "bfloat16" stands for Brain Float 16 — named after Google Brain, which developed it to enable training on TPUs with fp32 dynamic range. Its key advantage: you can simply truncate the lower 16 bits of an fp32 to get bf16. Conversely, you can widen bf16 to fp32 with a bit-shift and zero-fill — no rounding needed. fp16 requires a full format conversion.
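The truncation trick can be reproduced with nothing but the standard library. This is a sketch of the bit manipulation only: the helper names are ours, and it truncates where production converters typically round to nearest.

```python
import struct

def fp32_to_bits(x):
    """Raw 32-bit pattern of x stored as an IEEE-754 float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_fp32(bits):
    """Interpret a 32-bit pattern as a float32."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_bf16_bits(x):
    """Keep the top 16 bits: 1 sign, 8 exponent, 7 mantissa (truncation, no rounding)."""
    return fp32_to_bits(x) >> 16

def from_bf16_bits(b):
    """Widen bf16 back to fp32: shift left and zero-fill the low 16 bits."""
    return bits_to_fp32(b << 16)

value = 3.14159
roundtrip = from_bf16_bits(to_bf16_bits(value))
print(hex(to_bf16_bits(value)), roundtrip)   # 0x4049 3.140625
```

The exponent survives untouched, so a value near fp32's upper range still round-trips; only the low mantissa bits are lost.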
| Backend | device string | Where it lives |
|---|---|---|
| CPU | "cpu" | system RAM, accessible to any framework |
| NVIDIA GPU | "cuda", "cuda:0" | GPU's HBM (separate memory bus) |
| Apple GPU via PyTorch MPS | "mps" | unified memory, accessed through Metal |
| Apple GPU via MLX | (implicit) | unified memory, native |
Two tensors on different devices can't directly interact. PyTorch raises RuntimeError: Expected all tensors to be on the same device. Move first, op second.
The word "tensor" comes from physics, specifically the work of Gregorio Ricci-Curbastro and Tullio Levi-Civita in the 1890s. Einstein used tensors in general relativity (1915) decades before they appeared in machine learning. PyTorch tensors borrow the name but are looser objects: plain n-dimensional arrays with no coordinate-transformation rules, with autograd bolted on.
Three classes of op cover ~90% of what you'll do:
Elementwise. Same shape in, same shape out. a + b, a * b, torch.exp(a), torch.relu(a). Concrete: [1,2,3] + [10,20,30] = [11,22,33].
Reductions. Collapse one or more dimensions. a.sum(dim=-1), a.mean(dim=0), a.max(dim=1) (which returns both values and indices). Concrete: a (2, 3) tensor reduced along dim=1 (the last dimension) becomes shape (2,), one value per row.
a = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.float32)
print(a.sum(dim=-1)) # tensor([ 6., 15.]) shape (2,)
print(a.sum(dim=0)) # tensor([5., 7., 9.]) shape (3,)
print(a.sum()) # tensor(21.) shape () full reduction
Matmul. The workhorse from Day 1. (m, k) @ (k, n) → (m, n). Batched: (B, m, k) @ (B, k, n) → (B, m, n).
A = torch.randn(8, 16)
W = torch.randn(16, 4)
(A @ W).shape # torch.Size([8, 4])
Ab = torch.randn(2, 8, 16)
(Ab @ W).shape # torch.Size([2, 8, 4]) batch dim flows through
Broadcast — read: "stretch" — lets you add a (3,) bias vector to every row of a (5, 3) matrix without writing a loop or manually copying data. The framework stretches the smaller array virtually — no extra memory, no copy.
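Concrete example first. We want to add a bias b = [10, 20, 30], shape (3,), to a matrix A with shape (4, 3). To apply the rule, align the shapes from the right: the trailing 3s match, and b has no dimension to pair with A's 4, so that dimension is treated as 1 and stretched to 4. The result has shape (4, 3), with b added to every row.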
This is the workhorse pattern in transformers: x + bias where x is (B, T, D) and bias is (D,) — the bias broadcasts over both batch and time dimensions simultaneously.
When shapes don't match, frameworks try to align them from the right. Each pair of aligned dims must either be equal or have one of them be 1 (which is then "stretched" virtually — no copy, no extra memory). Otherwise: error.
a = torch.randn(8, 1, 4)
b = torch.randn(3, 4)
(a + b).shape # torch.Size([8, 3, 4])
# Bias added to every (b, t) — workhorse pattern in transformers
x = torch.randn(2, 5, 8) # (B, T, D)
bias = torch.randn(8) # (D,)
(x + bias).shape # (2, 5, 8)
Failure case. Shapes (3, 4) and (2, 4): align from right, 4 == 4 ok, but 3 vs 2 — neither is 1 — error.
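The alignment rule fits in a few lines of plain Python. The sketch below is our own helper (broadcast_shape is not a framework function) that returns the result shape or raises on the failure case just described.

```python
from itertools import zip_longest

def broadcast_shape(a, b):
    """Result shape of broadcasting shapes a and b; ValueError if incompatible."""
    out = []
    # Walk both shapes from the right; a missing dimension counts as 1.
    for da, db in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if da == db or db == 1:
            out.append(da)
        elif da == 1:
            out.append(db)
        else:
            raise ValueError(f"cannot broadcast {a} with {b}: {da} vs {db}")
    return tuple(reversed(out))

print(broadcast_shape((4, 3), (3,)))        # (4, 3)
print(broadcast_shape((8, 1, 4), (3, 4)))   # (8, 3, 4)
print(broadcast_shape((2, 5, 8), (8,)))     # (2, 5, 8)
```

Calling broadcast_shape((3, 4), (2, 4)) raises ValueError, because 3 and 2 differ and neither is 1.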
einsum lets you spell out indices and the framework figures out the loop. It is the clearest way to express tensor contractions; once it clicks you'll prefer it.
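import torch

# Plain matmul: sum over j
A, B = torch.randn(3, 4), torch.randn(4, 5)
C = torch.einsum('ij,jk->ik', A, B) # same as A @ B
# Batched matmul: keep batch b, sum over j
Ab, Bb = torch.randn(2, 3, 4), torch.randn(2, 4, 5)
Cb = torch.einsum('bij,bjk->bik', Ab, Bb)
# Attention scores: Q (B, T, D), K (B, S, D) → (B, T, S)
Q, K = torch.randn(2, 5, 8), torch.randn(2, 7, 8)
scores = torch.einsum('btd,bsd->bts', Q, K)
# Equivalent to: Q @ K.transpose(-1, -2)
# Outer product:
a, b = torch.randn(3), torch.randn(4)
outer = torch.einsum('i,j->ij', a, b)
# Trace (sum of diagonal):
M = torch.randn(4, 4)
tr = torch.einsum('ii->', M)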
Rule of thumb: letters that appear in the output (after the ->) are kept; letters that appear only in the inputs are summed over.
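If the notation still feels abstract, it helps to write out the loops that an einsum string stands for. A minimal sketch for plain matmul, with small shapes chosen only for illustration:

```python
import torch

A, B = torch.randn(3, 4), torch.randn(4, 5)

# 'ij,jk->ik' spelled out: i and k survive into the output, j is summed away.
C_loops = torch.zeros(3, 5)
for i in range(3):
    for k in range(5):
        for j in range(4):
            C_loops[i, k] += A[i, j] * B[j, k]

C_einsum = torch.einsum('ij,jk->ik', A, B)

matches_loops = torch.allclose(C_loops, C_einsum, atol=1e-5)
matches_matmul = torch.allclose(C_einsum, A @ B, atol=1e-5)
print(matches_loops, matches_matmul)   # True True
```

Every einsum string can be read this way: one loop per letter, with the summed letters innermost.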
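Four close cousins: .view(), .reshape(), .transpose(), and .permute(). All change shape or axis order by rewriting shape and strides; none copies data, except .reshape() when no valid view exists.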
x = torch.arange(24).reshape(2, 3, 4) # (2, 3, 4), strides (12, 4, 1)
x.view(6, 4) # (6, 4) — works because contiguous
x.transpose(0, 1) # (3, 2, 4), strides (4, 12, 1) non-contig
x.permute(2, 0, 1) # (4, 2, 3) — arbitrary axis reorder
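.view() is strict: it never copies, so the existing strides must be compatible with the new shape (always true for a contiguous tensor). .reshape() is friendly: it returns a view when it can and copies when it must. After .transpose(), the tensor is usually non-contiguous and .view() to a merged shape will fail; use .reshape() or .contiguous().view(...).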
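A short experiment shows all three behaviours at once: .view() refusing, .reshape() copying, and .reshape() returning a view. Comparing data_ptr() values is our way of checking whether memory is shared.

```python
import torch

x = torch.arange(12).reshape(3, 4)
y = x.T                                   # shape (4, 3), strides (1, 4), non-contiguous

try:
    y.view(12)                            # no stride pattern can flatten y in place
    view_error = None
except RuntimeError as err:
    view_error = err

flat = y.reshape(12)                      # falls back to a copy
copied = flat.data_ptr() != x.data_ptr()

z = x.reshape(4, 3)                       # x is contiguous, so this is a view
is_view = z.data_ptr() == x.data_ptr()

print(view_error is not None, copied, is_view)   # True True True
print(flat.tolist())   # [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]
```

Note that x.reshape(4, 3) and x.T have the same shape but different contents: reshape keeps the flat order, transpose swaps the axes.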
The single biggest engineering difference between NVIDIA and Apple Silicon is the memory model.
On NVIDIA, CPU and GPU each have their own memory. Moving a tensor across PCIe is slow (~32 GB/s) compared to HBM bandwidth (~3.3 TB/s on H100, ~8 TB/s on a Blackwell B200). The rule is: move data once, keep it on the GPU.
x_cpu = torch.randn(1000, 1000)
x_gpu = x_cpu.to("cuda") # PCIe copy
y = x_gpu @ x_gpu.T # runs on GPU
back = y.cpu().numpy() # PCIe copy back
On Apple Silicon, CPU and GPU share the same physical RAM. There is no .to("device") that moves data — the GPU just reads the same bytes. This is why MLX has no device field on tensors at all.
import mlx.core as mx
x = mx.random.normal((1000, 1000)) # lives in unified memory
y = x @ x.T # GPU touches the same RAM
PyTorch's MPS backend works on Mac too (device="mps") but is generally slower than MLX for LLM workloads — the MPS backend bridges Metal through PyTorch's dispatcher rather than running natively on Apple Silicon's compute graph.
The practical rule: move data once, then keep it resident. Every .to("cuda") crosses PCIe at ~32 GB/s. That sounds fast, but HBM runs at ~3.3 TB/s (≈8 TB/s on Blackwell), roughly 100× faster. Moving a 1 GB activation tensor from CPU to GPU costs ~31 ms; the matmul itself might cost 2 ms. The transfer is the bottleneck. Common mistake: loading a batch on CPU, running preprocessing, then moving it to the GPU inside the training loop, so you pay the transfer on every step.
# Good: move once, keep on device
x_gpu = dataset_tensor.to("cuda")         # one transfer at the start
for step in range(1000):
    loss = model(x_gpu[batch_idx])        # GPU → GPU, no transfer

# Bad: transfer every step
for step in range(1000):
    x_batch = load_cpu_batch()            # CPU
    loss = model(x_batch.to("cuda"))      # PCIe every step
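A back-of-envelope estimate shows what the per-step transfer costs. This is plain arithmetic using the approximate bandwidth figures quoted above; transfer_ms is an illustrative helper of ours, and real transfers add latency and rarely hit peak bandwidth.

```python
def transfer_ms(num_bytes, bandwidth_bytes_per_s):
    """Idealized time in milliseconds to move num_bytes at a given bandwidth."""
    return num_bytes / bandwidth_bytes_per_s * 1e3

PCIE = 32e9      # ~32 GB/s, approximate figure from the text
HBM = 3.3e12     # ~3.3 TB/s, approximate H100 figure from the text

one_gb = 1e9
per_step = transfer_ms(one_gb, PCIE)
on_device = transfer_ms(one_gb, HBM)

print(f"PCIe: {per_step:.2f} ms per GB, HBM: {on_device:.2f} ms per GB")
print(f"1000 steps with a 1 GB transfer each: {1000 * per_step / 1e3:.2f} s")
```

Over a thousand steps the transfers alone add up to about half a minute, which is why the data should move once and stay put.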
Why this matters for inference. The KV cache in an LLM inference server stores key and value tensors for every layer and every token in the context. For a LLaMA-2 7B-style model with a 4096-token context in fp16, the KV cache alone is about 2 GB per active sequence. It lives on the GPU throughout the request. If it spills to CPU, inference throughput collapses. This is why vLLM's PagedAttention (Day 24) manages KV cache memory like a virtual memory system — the dtype × context × layers math is unforgiving.
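The 2 GB figure can be checked by hand. The sketch below assumes the LLaMA-2 7B shape (32 layers, 32 key/value heads, head dimension 128) and 2 bytes per element; kv_cache_bytes is an illustrative helper of ours, not a library function.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem, batch=1):
    """Bytes needed to cache keys and values for every layer and every token."""
    # Factor 2: one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem * batch

# Assumed LLaMA-2 7B-style configuration, fp16 cache, 4096-token context.
per_sequence = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                              seq_len=4096, bytes_per_elem=2)
print(per_sequence, per_sequence / 2**30)   # 2147483648 2.0
```

Multiply by the number of concurrent sequences and the cache quickly rivals the weights themselves.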
Apple's MLX framework launched in December 2023, making it one of the youngest major ML frameworks. It was designed from scratch for unified memory architecture by Awni Hannun and team at Apple's machine learning research (MLR) group.
We've been calling derivatives "the chain rule applied to a chain of functions." A real neural network isn't a chain — it's a DAG (directed acyclic graph). Autograd is the engineering that makes the chain rule work on DAGs, automatically, for any program you write.
The mental model in three sentences:
- Every tensor produced by an op carries a grad_fn pointer that knows the local derivative of that op.
- You call .backward() on a scalar (the loss). Autograd walks the graph in reverse topological order, multiplying local derivatives via the chain rule.
- Leaf tensors are the ones created with requires_grad=True. They get the accumulated dloss/dx deposited on x.grad.

That's it. Everything below is detail.
Let's do a real walk on a graph with two inputs and a branch, so you see how gradients accumulate.
Forward. Define a = 2, b = 3. Compute: c = a·b = 6, d = a + b = 5, e = c·d = 30, L = e² = 900.
That's a DAG (not a chain) because a is used twice (in c and d), and so is b. Here it is, with values plugged in:
Backward. Goal: dL/da and dL/db. Walk the graph in reverse, multiplying local derivatives. Start at the output and seed dL/dL = 1. For every node, compute its local derivative w.r.t. each input, multiply by the gradient already arrived at the output side, and pass the result back along the edge. At any node where multiple paths meet, sum the contributions.
| Step | Node | Local derivative(s) | Incoming grad | Outgoing grad |
|---|---|---|---|---|
| 0 | L | seed | — | dL/dL = 1 |
| 1 | L = e² | dL/de = 2e | 1 | dL/de = 2·30 = 60 |
| 2 | e = c·d | de/dc = d, de/dd = c | 60 | dL/dc = 60·5 = 300, dL/dd = 60·6 = 360 |
| 3a | c = a·b | dc/da = b, dc/db = a | 300 | contribution to dL/da = 300·3 = 900; to dL/db = 300·2 = 600 |
| 3b | d = a+b | dd/da = 1, dd/db = 1 | 360 | contribution to dL/da = 360·1 = 360; to dL/db = 360·1 = 360 |
| 4 | accumulate at leaves | sum the two paths | — | dL/da = 900 + 360 = 1260, dL/db = 600 + 360 = 960 |
Same picture, drawn with the gradients flowing backward:
Verify in PyTorch — same numbers fall out:
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = a * b
d = a + b
e = c * d
L = e ** 2
L.backward()
print(a.grad) # tensor(1260.)
print(b.grad) # tensor(960.)
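To see that nothing magical is happening, here is a toy scalar autograd in pure Python that reproduces the same numbers. It is a teaching sketch in the spirit of small educational autograd engines, not how PyTorch is implemented; the class name Value and its fields are our own.

```python
class Value:
    """Scalar that remembers how it was computed (a toy stand-in for a tensor with grad_fn)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None      # leaf: nothing to push back

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad          # d(out)/d(self) = 1
            other.grad += out.grad         # d(out)/d(other) = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def __pow__(self, k):
        out = Value(self.data ** k, (self,))
        def _backward():
            self.grad += k * self.data ** (k - 1) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological order: every node appears after all of its parents.
        order, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p in v._parents:
                    visit(p)
                order.append(v)
        visit(self)
        self.grad = 1.0                    # seed dL/dL = 1
        for v in reversed(order):          # walk from the loss back to the leaves
            v._backward()

a, b = Value(2.0), Value(3.0)
c = a * b
d = a + b
e = c * d
L = e ** 2
L.backward()
print(a.grad, b.grad)   # 1260.0 960.0
```

The `+=` inside each `_backward` is the whole accumulation story: a receives 900 from the path through c and 360 from the path through d.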
The two takeaways:
- grad_fn is per-tensor. Each non-leaf tensor stores a pointer to the function that created it. c.grad_fn is MulBackward0, d.grad_fn is AddBackward0, and so on. backward() follows these pointers.
- Gradients sum at branch points. Because a is consumed by both c and d, each consumer pushes its own gradient back, and a.grad is the sum. This is just the multivariate chain rule: dL/da = (dL/dc)(dc/da) + (dL/dd)(dd/da).

Accumulation across calls (zero_grad)

Autograd accumulates into .grad across calls, too, not just at branch points within one backward pass. This is on purpose: it lets you do gradient accumulation across mini-batches when you can't fit a big batch on one GPU. But if you forget to clear .grad before each step, every new gradient is added on top of the stale ones from earlier steps.
opt.zero_grad() # or: for p in params: p.grad = None
loss.backward()
opt.step()
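The accumulation is easy to observe on a single scalar. A minimal sketch, using a one-parameter "model" so the expected gradient (2) is obvious:

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

(2 * w).backward()
after_first = w.grad.item()     # 2.0

(2 * w).backward()              # no zeroing in between
after_second = w.grad.item()    # 4.0: the new gradient was added to the old one

w.grad = None                   # same effect as zeroing before the next step
(2 * w).backward()
after_reset = w.grad.item()     # 2.0 again

print(after_first, after_second, after_reset)   # 2.0 4.0 2.0
```

Deliberate gradient accumulation uses exactly the middle behaviour: several backward calls, then one optimizer step, then a reset.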
For a single scalar loss, what we want is the gradient vector. But internally autograd doesn't materialize huge Jacobians; for each op it only knows how to compute a vector-Jacobian product (VJP, sometimes called "backward op"): given the upstream gradient as a vector, push it back through the local op without ever forming the full Jacobian matrix.
This is why a backward pass costs only a small constant multiple of the forward pass (commonly around 2×) instead of O(parameters × outputs). Jacobians can be enormous (billions × billions); VJPs sidestep that. We don't need the math here; just remember: every op has a forward kernel and a paired VJP kernel. When you write a custom op (Days 17+ for CUDA kernels), you write both.
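Here is what "a forward plus a paired VJP" looks like for the simplest possible op, written with torch.autograd.Function. The op (Square) is a toy chosen for illustration; the built-in x * x already does this for you.

```python
import torch

class Square(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)        # keep what the VJP will need
        return x * x

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        # VJP: upstream gradient times the local derivative 2x, elementwise.
        # The full (diagonal) Jacobian is never materialized.
        return grad_out * 2 * x

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = Square.apply(x)
y.sum().backward()
print(x.grad)   # tensor([2., 4., 6.])
```

For three elements the Jacobian would be a 3×3 matrix; the VJP gets the same answer with three multiplies.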
During inference, there are no gradients to compute. Every op in the forward pass would normally build a graph node, save input tensors for the backward pass, and bump version counters. All of that is wasted work when you only want a prediction. Skip it with two context managers:
# PyTorch
with torch.no_grad():
    out = model(x)    # no graph recorded; less memory; faster

# Even faster: disables version counters too
with torch.inference_mode():
    out = model(x)
What's the difference? no_grad stops graph recording but still tracks tensor versions and views, so tensors created inside the block can safely take part in autograd-recorded code later. inference_mode also skips version counters and view tracking, which makes it cheaper, but tensors created under it cannot later be used in computations that autograd records. Use inference_mode for production inference; use no_grad when the outputs must flow back into code that uses autograd or checks versions.
inference_mode removes autograd bookkeeping from the forward pass, one of the cheapest non-model changes you can make to speed up inference. Always pair it with model.eval().

This is one of the cheapest performance wins in the whole stack, and skipping autograd overhead is one of several reasons an "inference engine" can be 2-3× faster than a naïve forward call.
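You can inspect what each context does to the output tensor. A small sketch on a throwaway linear layer (the layer and input sizes are arbitrary):

```python
import torch

layer = torch.nn.Linear(4, 2)
x = torch.randn(3, 4)

y_tracked = layer(x)                     # normal forward: graph is recorded
with torch.no_grad():
    y_no_grad = layer(x)                 # no graph, but an ordinary tensor
with torch.inference_mode():
    y_inference = layer(x)               # no graph, flagged as an inference tensor

print(y_tracked.grad_fn is not None)                           # True
print(y_no_grad.requires_grad, y_no_grad.is_inference())       # False False
print(y_inference.requires_grad, y_inference.is_inference())   # False True
```

The is_inference() flag is what marks a tensor as off-limits for later autograd-recorded computation.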
Autograd as a concept dates to the 1960s under the name "reverse-mode automatic differentiation." It was popularized in deep learning by HIPS autograd (2014) — built by students at Harvard's Intelligent Probabilistic Systems group. They proved you could write Python that looked like NumPy and get gradients for free.
import torch
import torch.nn as nn
import torch.nn.functional as F
# --- Tensors ---
x = torch.tensor([1.0, 2.0, 3.0]) # from list
x = torch.zeros(3, 4) # zeros
x = torch.ones(3, 4) # ones
x = torch.randn(3, 4) # standard normal
x = torch.arange(10).reshape(2, 5) # 0..9 reshaped
x = torch.empty(3, 4) # uninitialized (faster, garbage values)
# --- Devices ---
device = ("cuda" if torch.cuda.is_available()
else "mps" if torch.backends.mps.is_available()
else "cpu")
x = x.to(device)
Every layer, sub-network, or model in PyTorch is an nn.Module. Modules own three things: parameters (learnable tensors), buffers (non-learnable state that should still be saved/loaded — e.g., running batch-norm stats), and submodules.
class TwoLayer(nn.Module):
    def __init__(self, d_in, d_hidden, d_out):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)           # submodule
        self.fc2 = nn.Linear(d_hidden, d_out)          # submodule
        self.register_buffer('step', torch.zeros(1))   # buffer (saved, not trained)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

m = TwoLayer(128, 256, 10).to(device)
for name, p in m.named_parameters():
    print(name, tuple(p.shape), p.requires_grad)
# fc1.weight (256, 128) True
# fc1.bias (256,) True
# fc2.weight (10, 256) True
# fc2.bias (10,) True
# Save/load
torch.save(m.state_dict(), 'model.pt')
m2 = TwoLayer(128, 256, 10)
m2.load_state_dict(torch.load('model.pt'))
state_dict() is just an OrderedDict of parameter and buffer tensors keyed by name. Hugging Face checkpoints hold the same name-to-tensor mapping (in .safetensors or .bin files) next to their JSON config files. When we load LLaMA weights on Day 27, we'll be doing exactly this lookup.
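.train() vs .eval()

Some layers behave differently in train vs eval: Dropout is active only in training, and BatchNorm updates its running stats in training but uses the stored ones in eval. (LayerNorm keeps no running stats and behaves the same in both modes.) Toggle with: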
m.train() # enable dropout, update batchnorm stats
m.eval() # disable dropout, freeze batchnorm stats
This is independent of no_grad. Real inference: m.eval() and torch.inference_mode().
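Dropout makes the mode switch visible in a few lines. A minimal sketch on a vector of ones, with p = 0.5 picked so the surviving values are scaled by exactly 2:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.train()
y_train = drop(x)      # each entry is zeroed with probability 0.5; survivors are scaled by 1/(1-p) = 2

drop.eval()
y_eval = drop(x)       # identity: dropout is switched off

train_values = set(y_train.unique().tolist())
print(train_values <= {0.0, 2.0}, torch.equal(y_eval, x))   # True True
```

Forgetting m.eval() at inference time therefore gives noisy, non-reproducible outputs even when no gradients are being computed.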
opt = torch.optim.AdamW(m.parameters(), lr=1e-3)
for step in range(100):
    x = torch.randn(8, 128, device=device)
    target = torch.randint(0, 10, (8,), device=device)
    logits = m(x)
    loss = F.cross_entropy(logits, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
The four steps forward → zero_grad → backward → step are the heartbeat of every PyTorch training loop ever written (zero_grad can go anywhere before backward). We'll see them again on Days 4, 9, 10.
PyTorch is eager and dynamic: every Python line runs immediately, and the graph is rebuilt every forward pass. The dispatcher routes each op (e.g., add) to a backend-specific kernel: CPU, CUDA, MPS, XLA. Most kernels live in ATen (a C++ tensor library); higher-level orchestration is in Python. Autograd behaves like a "tape": every op attaches a grad_fn node to its output, linked to the grad_fn nodes of its inputs; backward() walks those nodes in reverse topological order.
Two consequences worth knowing:
- Native Python control flow (if, for) just works.
- torch.compile (Day 11+) introduces a tracer that captures a static graph, fuses kernels, and can be 1.5-3× faster on some workloads. Worth knowing it exists; we'll use it later.

PyTorch was open-sourced by Facebook AI Research in October 2016. Before that, the dominant framework was Theano (2007–2017), then briefly TensorFlow 1.x (which made you build static graphs ahead of time). PyTorch's big innovation: dynamic graphs that are built as you run code.
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim
# --- Arrays (no .device, no .to()) ---
x = mx.array([1.0, 2.0, 3.0])
x = mx.zeros((3, 4))
x = mx.random.normal((3, 4))
# --- Lazy evaluation: ops build a graph, don't compute yet ---
y = x * 2 + 1 # nothing has been computed
mx.eval(y) # forces evaluation
print(y) # also forces evaluation
Lazy eval is the single biggest mental shift coming from PyTorch. In MLX, y = x * 2 + 1 does not run x * 2 and then + 1. It records "compute this expression," and only on mx.eval(y) (or anything that must read the value, like print or a conversion to NumPy) does the runtime walk the graph, fuse what it can into a single GPU kernel, and execute. This is how MLX gets a lot of its speed on Apple Silicon.
grad, value_and_grad, vmap, compile

MLX's autograd is functional: grad(f) returns a new function that computes derivatives. There is no .backward() on the array; there is no .grad attribute either. Gradients are returned as data structures (pytrees) that mirror your model.
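# Example inputs (shapes chosen only for illustration)
params = {'W': mx.random.normal((4, 3)), 'b': mx.zeros((4,))}
x = mx.random.normal((3,))
target = mx.zeros((4,))

def loss_fn(params, x, target):
    pred = params['W'] @ x + params['b']
    return mx.mean((pred - target) ** 2)

grad_fn = mx.grad(loss_fn)            # new function: returns dloss/dparams
grads = grad_fn(params, x, target)    # dict with the same keys as params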
For models, nn.value_and_grad returns both the loss and the grads in one shot:
linear = nn.Linear(128, 64)    # the model being trained in this example

def loss_fn(model, x, target):
    return mx.mean((model(x) - target) ** 2)

loss_and_grad_fn = nn.value_and_grad(linear, loss_fn)
opt = optim.AdamW(learning_rate=1e-3)
for _ in range(100):
    x = mx.random.normal((8, 128))
    target = mx.random.normal((8, 64))
    loss, grads = loss_and_grad_fn(linear, x, target)
    opt.update(linear, grads)
    mx.eval(linear.parameters(), opt.state)   # force the update
Other transformations (we'll touch these on Days 19 and 21):
- mx.vmap(f): auto-batch a function written for a single example.
- mx.compile(f): capture and fuse the graph; equivalent in spirit to torch.compile.
- mx.jvp(...) / mx.vjp(...): explicit forward/reverse-mode building blocks.

mx.eval

Lazy eval is great for fusion, but it can quietly grow the deferred graph if you never force evaluation. Two rules:
- End each training step with mx.eval(model.parameters(), opt.state). This caps the graph size.
- When timing, force evaluation before you stop the clock:

import time
x = mx.random.normal((4096, 4096))
t0 = time.time()
y = x @ x.T # graph built, not run
mx.eval(y) # actually run on GPU
print(time.time() - t0)
Because there's no PCIe to cross, MLX skips the entire .to(device) / .cuda() / .cpu() dance. The price is that MLX is Apple-Silicon-only — there is no CUDA backend.
The two frameworks share ~90% of their surface area. The 10% difference is architectural. Understanding it saves hours of debugging when you port code or read source.
| Dimension | PyTorch | MLX |
|---|---|---|
| Evaluation model | Eager — every op runs immediately as Python executes | Lazy — ops build a graph; execution deferred until mx.eval() or a value is needed |
| Graph | Rebuilt every forward pass (dynamic). Easy to debug. | Captured lazily. Runtime can fuse ops before executing. |
| Autograd style | Tape-based: each tensor stores a grad_fn; call loss.backward() | Functional: mx.grad(f) returns a new function; grads returned as pytree |
| Memory model | Discrete (NVIDIA): CPU RAM ↔ GPU HBM via PCIe | Unified (Apple Silicon): CPU and GPU share one address space |
| Device field | tensor.device — "cpu", "cuda", "mps" | No device field. All arrays are in unified memory. |
| Zero-grad needed | Yes — opt.zero_grad() before each backward | No — functional; each mx.grad call is fresh |
| Compile/fuse | torch.compile(model) — traces and fuses | mx.compile(f) — captures the lazy graph explicitly |
| Save weights | torch.save(m.state_dict(), 'w.pt') | m.save_weights('w.safetensors') |
| Target hardware | NVIDIA, AMD, Intel, CPU | Apple Silicon only |
When porting code, this is the cheat sheet:
| Concept | PyTorch | MLX |
|---|---|---|
| Tensor type | torch.Tensor | mx.array |
| Random normal | torch.randn(3, 4) | mx.random.normal((3, 4)) |
| Move to GPU | x.to('cuda') | (no-op — unified memory) |
| Matmul | A @ B | A @ B |
| Reshape | x.reshape(...) / .view(...) | x.reshape(...) |
| Reduction | x.sum(dim=-1) | x.sum(axis=-1) |
| einsum | torch.einsum(...) | mx.einsum(...) |
| Module base | nn.Module | nn.Module |
| Linear layer | nn.Linear(d_in, d_out) | nn.Linear(d_in, d_out) |
| Trainable params iter | m.parameters() | m.parameters() (returns dict) |
| Loss scalar | loss.item() | loss.item() |
| Gradients | loss.backward() then p.grad | grad_fn(...) returns grads; no .grad |
| Optimizer step | opt.step() after loss.backward() | opt.update(model, grads) |
| Zero grads | opt.zero_grad() | not needed (functional) |
| No-grad inference | with torch.no_grad(): ... | mx.stop_gradient(x) or just don't call grad |
| Force compute | always eager | mx.eval(...) |
| Compile / fuse | torch.compile(m) | mx.compile(f) |
| Save state | torch.save(m.state_dict(), 'p.pt') | m.save_weights('p.safetensors') |
Rule of thumb. If a PyTorch line uses device, drop it for MLX. If a PyTorch loop uses loss.backward() + opt.step(), replace with loss, grads = value_and_grad_fn(...); opt.update(model, grads); mx.eval(...).
GPU work is asynchronous. The CPU enqueues kernels; the GPU executes them later. If you time naively, you measure how long it took to launch the work, not to do it. In PyTorch on CUDA, synchronize before starting and before stopping the timer:
import time, torch
x = torch.randn(4096, 4096, device='cuda')
torch.cuda.synchronize()
t0 = time.time()
y = x @ x.T
torch.cuda.synchronize() # wait for GPU
print(time.time() - t0)
In MLX, force evaluation of the result before stopping the timer:
import time
import mlx.core as mx
x = mx.random.normal((4096, 4096))
t0 = time.time()
y = x @ x.T
mx.eval(y)
print(time.time() - t0)
Profilers: PyTorch Profiler, Nsight Systems (NVIDIA), and Xcode's Metal debugger fed by mx.metal.start_capture (MLX). We'll use these heavily in Week 3.
"80% of ML bugs are shape mismatches. Add print(x.shape) liberally."
- Create a (2, 3, 4) tensor of standard-normal values. Reshape to (6, 4) and (2, 12). Take the mean along the last axis. Try x.sum(dim=(0, 2)): predict the shape, then verify.
- x = torch.arange(12).reshape(3, 4). Print x.stride(). y = x.T; print y.stride() and y.is_contiguous(). Try y.view(12) and read the error. Then y.reshape(12): why does that work?
- a = torch.arange(12).reshape(3, 4). Add a row vector [10, 20, 30, 40] to every row. Add a column vector [[100], [200], [300]] to every column. Try (torch.zeros(3, 4) + torch.zeros(2, 4)) and read the error.
- Q, K = torch.randn(2, 5, 8), torch.randn(2, 5, 8). Compute attention scores (2, 5, 5) with einsum. Verify via Q @ K.transpose(-1, -2).
- Take the worked autograd example (a=2, b=3). Run the PyTorch snippet, confirm 1260, 960. Now change L = e² to L = e³; predict dL/da and dL/db on paper, then verify.
- x = torch.tensor(2.0, requires_grad=True); y = x * x + x; y.backward(); print(x.grad). Predict before running.
- Time nn.Linear(4096, 4096) on a 4096×4096 input with and without torch.inference_mode(). Use cuda.synchronize if on CUDA. Compare wall time and peak memory.
- Port an nn.Linear → ReLU → nn.Linear → MSE training step from PyTorch to MLX. Use the translation table above. Run for 100 steps; watch loss go down in both.

Hand-picked references for this lesson. Free where possible. Books and papers where the depth is irreplaceable.
- Builds autograd from scratch in pure Python. Watch this if anything in this lesson felt magical. (YouTube video)
- The whole concept of automatic differentiation in 150 lines you can read in one sitting. (Repo)
- Best one-hour read on how PyTorch is built. Tensors, autograd engine, dispatcher, ATen. (Blog post)
- Stevens, Antiga, Viehmann. Best PyTorch-focused book: covers tensors, autograd, training loop in depth. (Book page)
- Production-ready MLX implementations: LLM inference, Whisper, Stable Diffusion, and more. (Repo)
- The canonical survey paper on AD in machine learning. Forward mode, reverse mode, dual numbers, the lot. (arXiv)
- Readable tensor rearrangement notation. Works with PyTorch, NumPy, MLX, JAX, TensorFlow. (Docs)
- The canonical einsum tutorial. Worth committing to memory. (Blog post)
- December 2023 launch announcement explaining the design philosophy behind MLX. (Blog post)
- The official deep-dive on how PyTorch's autograd engine works: grad_fn, leaf tensors, accumulation, no_grad, inference_mode, and custom backward functions. (Docs)
- How mx.grad, mx.vmap, mx.vjp, mx.jvp, and mx.compile work. The functional autograd design that makes zero_grad unnecessary.
Close the page and answer from memory. If you can't, re-read the relevant section.
- A (3, 4) tensor x is transposed to get y = x.T. What are y.shape, y.stride(), and y.is_contiguous()? Why does y.view(12) raise a RuntimeError? How do you flatten it anyway?
- … bfloat16. How many bytes of VRAM does it occupy for weights alone? Show the arithmetic. What is the formula in general?
- Why does bf16 preserve fp32's dynamic range while fp16 does not? Give a one-sentence answer in terms of exponent bits.
- Tensors of shape (8, 1, 4) and (3, 4) are added. What is the output shape? Walk the broadcasting rules step by step. Now try (8, 1, 4) + (2, 4): does it work? Why or why not?
- In the worked autograd example (a=2, b=3), walk both paths that contribute to a.grad. What are the two numerical contributions, and why do they add rather than one overwriting the other?
- Why does tensor.grad accumulate across multiple .backward() calls, not just within one? Name a situation where this is a feature and one where it's a bug.
- What is grad_fn? Which tensors have one? Which do not? What does .backward() do with it?
- What does torch.inference_mode() disable that torch.no_grad() does not? When would you still prefer no_grad?
- Given a (B, T, D) tensor and want a (B, D) tensor of mean activations across the time dimension, write the call in PyTorch and in MLX (note the different keyword for axis).