Build an MLP by hand: forward, loss, manual backward pass. Then the same thing in PyTorch and MLX. Every gradient derived. Every shape labeled.
If the multi-layer perceptron isn't in your bones — if you can't write its forward pass on a whiteboard, derive its backward by hand, predict its shapes, and explain why ReLU rescued deep learning — then transformers will feel mystical for the rest of the curriculum. They aren't. They're MLPs with attention bolted on.
So today we kill the mystery. We do it the slow way first — pure NumPy, no autograd, every gradient derived from the chain rule on Day 1 — and then we re-do it in PyTorch and MLX. The slow way takes 30 lines and proves the libraries aren't magic. The fast way takes 8 lines and is exactly what every modern LLM training loop does.
Concrete reference points for what builds on this lesson:
Attention → MLP → Attention → MLP → ..., where each MLP is literally the network you build today, just with d_model = 4096 instead of 16.
state_dict we'll meet today.
If anything below feels fuzzy by the end, the commitment is: re-type the NumPy MLP, don't copy-paste it. Type until loss drops. Don't move on until you can re-derive the backward pass on paper.
(p − y)/B.
√(2/fan_in) constant, and explain why a poor init breaks training.
Every neural network, including a 70B Transformer, fits this picture:
A 70B-parameter LLM is the same picture with attention slotted in between linear blocks and D = 8192. Don't let the scale fool you — the algorithm is what you're about to write.
A neuron is the unit of computation. It does three things:
Takes an input vector x = (x₁, …, x_n).
Computes the pre-activation z = w·x + b = Σᵢ wᵢ xᵢ + b.
Passes z through a non-linear activation f, returning y = f(z).
That's it. That's the entire neuron. Geometrically, w · x = ‖w‖‖x‖cos θ, so the dot product measures alignment between w and x. The bias b shifts the activation threshold. The activation f introduces non-linearity.
f = sigmoid and a 0/1 target is logistic regression.
w · x (a weighted sum; Σ means sum(...)), plus a bias b, then a non-linear function f. Stack d_out of these side-by-side and pack the weights into a matrix: you get a linear layer.
A linear layer with d_in inputs and d_out outputs is d_out neurons sharing the same input vector. Pack all the weights into a matrix W (a 2D array of floats, shape (d_in, d_out)) and all the biases into a vector b (a 1D array of floats, length d_out). For a batch X (shape (B, d_in)) of B inputs at once:
Y = X W + b, with Y of shape (B, d_out).
The bias broadcasts across the batch. Same matmul rules as Day 1.
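A minimal NumPy sketch of the batched linear layer described above. The sizes (B = 5, d_in = 4, d_out = 3) are illustrative assumptions, not values from the lesson. It checks that the matmul-plus-broadcast form matches computing each neuron's w · x + b one at a time:

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_in, d_out = 5, 4, 3                 # illustrative sizes (assumed)

X = rng.standard_normal((B, d_in))       # batch of inputs, shape (B, d_in)
W = rng.standard_normal((d_in, d_out))   # one column of weights per neuron
b = rng.standard_normal(d_out)           # one bias per neuron

# Whole layer at once: b of shape (d_out,) broadcasts across the B rows.
Y = X @ W + b                            # shape (B, d_out)

# Same computation, one example and one neuron at a time.
Y_loop = np.empty((B, d_out))
for i in range(B):
    for j in range(d_out):
        Y_loop[i, j] = X[i] @ W[:, j] + b[j]

print(Y.shape)                 # (5, 3)
print(np.allclose(Y, Y_loop))  # True
```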
In modern LLMs (LLaMA, Mistral, Qwen), the linear layers in the FFN often have no bias. Two reasons:
d_model = 4096, an extra 4096-element vector per linear is rounding error in parameter count, but it's another fragile thing to learn. Empirically, removing biases doesn't hurt loss.
GPT-2 had biases. Most post-2022 architectures don't. We'll keep biases today because we're learning, but it's worth knowing the convention shifted.
The first neural network ever built — the Perceptron — was constructed in 1958 by Frank Rosenblatt, not in software but as a physical machine with 400 photocells and motorized weights. The New York Times reported it would soon "be conscious of its existence." Then it failed at XOR. AI entered its first winter.
Stack two linear layers with no activation between them:
Matrix multiplication is associative — so W₁ W₂ is just another matrix. Stacking 100 linear layers without activations collapses to one equivalent linear layer. The whole network can only learn linear functions of its inputs.
The proof is one line. Concrete demo:
import numpy as np
W1 = np.random.randn(4, 16)
W2 = np.random.randn(16, 3)
x = np.random.randn(4)
print((x @ W1) @ W2)   # two linear ops
print(x @ (W1 @ W2))   # one linear op, same result
print(np.allclose((x @ W1) @ W2, x @ (W1 @ W2)))  # True: the two agree up to tiny float drift
Adding any non-linearity between the layers (max(0, ·), tanh, anything) blocks that collapse: f(X W₁) W₂ can no longer be rewritten as X times a single matrix, so the network can suddenly approximate functions a single linear layer can't.
Universal Approximation Theorem (Hornik 1989, Cybenko 1989). A feed-forward network with at least one hidden layer and a non-polynomial activation can approximate any continuous function on a compact set, given enough neurons. The theorem doesn't say how many neurons — it might need a billion — but it justifies the architecture. Depth × non-linearity = expressivity.
Each activation is a scalar function f: ℝ → ℝ applied element-wise to the layer pre-activation z. Their derivatives matter, because backward will multiply by them.
| Activation | Formula | Derivative | Output range | Used in |
|---|---|---|---|---|
| ReLU | max(0, x) | 1 if x > 0 else 0 | [0, ∞) | The default since 2012. Vision, vanilla MLPs, the original Transformer FFN. |
| Leaky ReLU | max(αx, x), α≈0.01 | 1 if x > 0 else α | (−∞, ∞) | When ReLU "dies" (neurons stuck at zero). |
| GELU | x · Φ(x) (Φ = Gaussian CDF) | Φ(x) + x · φ(x) | ≈[−0.17, ∞) | BERT, GPT-2, GPT-3. Smooth ReLU. |
| SiLU / Swish | x · σ(x) | σ(x) + x·σ(x)(1−σ(x)) | ≈[−0.28, ∞) | LLaMA, Mistral, Qwen FFN. Often inside SwiGLU. |
| tanh | (eˣ − e⁻ˣ)/(eˣ + e⁻ˣ) | 1 − tanh²(x) | (−1, 1) | RNNs, GRU/LSTM gates, sometimes policy heads. |
| Sigmoid | 1 / (1 + e⁻ˣ) | σ(x)(1 − σ(x)) | (0, 1) | Final layer for binary classification; gates inside LSTMs. |
| Softmax | eˣⁱ / Σⱼ eˣʲ | vector-valued (see below) | row sums to 1 | The output of every classifier. The weights inside attention. |
When you stack tanh activations and the pre-activations get large, tanh'(z) = 1 − tanh²(z) ≈ 0 for |z| > 2. Multiplying small gradients through 50 layers of backward gives you something effectively zero. Early-layer parameters stop receiving useful updates. Training stalls.
ReLU's derivative is exactly 1 for x > 0 — no decay. Stacking 100 ReLU layers, the gradient still has full magnitude in the active half. This single change, plus better initialization (next section), is what made deep nets viable. Before 2012, "deep" meant 3-5 layers; after, hundreds.
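To put rough numbers on this, the sketch below multiplies the activation derivative through 50 layers. It assumes every layer sees the same pre-activation z = 2, which is a simplification: real networks also multiply by weight matrices, and pre-activations differ per unit.

```python
import math

def tanh_grad(z):
    # d/dz tanh(z) = 1 - tanh(z)^2
    return 1.0 - math.tanh(z) ** 2

def relu_grad(z):
    # d/dz max(0, z) = 1 on the active side, 0 otherwise
    return 1.0 if z > 0 else 0.0

depth = 50
z = 2.0  # assumed pre-activation at every layer (illustrative)

tanh_chain = tanh_grad(z) ** depth   # product of 50 identical tanh derivatives
relu_chain = relu_grad(z) ** depth   # product of 50 identical ReLU derivatives

print(f"tanh'(2)            = {tanh_grad(z):.4f}")
print(f"tanh chain, 50 deep = {tanh_chain:.3e}")
print(f"ReLU chain, 50 deep = {relu_chain:.1f}")
```

The tanh product underflows to practically nothing, while the ReLU product on the active side stays at 1.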
For multi-class classification with C classes, the network outputs C real numbers — logits — one per class. Softmax converts them into a probability distribution:
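Written out for a logit vector z with C entries (the standard definition, matching the Softmax row in the table above):

```latex
p_i = \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{C} e^{z_j}}, \qquad i = 1, \dots, C
```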
Two important properties:
softmax(z) = softmax(z − max(z)). We always subtract the max in practice for numerical stability; otherwise eᶻ overflows.
def softmax_stable(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical safety
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
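A small check of the shift-invariance property and of what goes wrong without the max subtraction. `softmax_naive` is a deliberately unsafe variant written here for contrast; the logits are made up.

```python
import numpy as np

def softmax_naive(z):
    e = np.exp(z)                       # overflows to inf for large z
    return e / e.sum(axis=-1, keepdims=True)

def softmax_stable(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

small = np.array([1.0, 2.0, 3.0])
large = small + 1000.0                  # same logits, shifted by a constant

with np.errstate(over="ignore", invalid="ignore"):
    naive_large = softmax_naive(large)  # exp(1001) = inf, and inf / inf = nan

print(softmax_stable(small))
print(softmax_stable(large))            # same distribution as the line above
print(naive_large)                      # every entry is nan
```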
Softmax has a vector-valued derivative (a Jacobian, not a scalar). We never compute it directly — we always pair it with cross-entropy loss, and the product simplifies to something gorgeous. Two sections from now.
Let's build a 2-layer MLP for classification:
Forward pass, with every intermediate shape labeled:
That's the whole forward pass: two matmuls, one ReLU, one softmax, one cross-entropy reduction. Every transformer FFN follows the same shape — they just have bigger numbers.
Parameter count for this toy net: (4·16 + 16) + (16·3 + 3) = 80 + 51 = 131 trainable scalars.
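The same count as a tiny helper, which is handy once the layer list grows. `linear_params` is a name made up for this sketch.

```python
def linear_params(d_in, d_out, bias=True):
    # weights: d_in * d_out; biases: d_out (if present)
    return d_in * d_out + (d_out if bias else 0)

layer_dims = [4, 16, 3]   # in_dim -> hidden_dim -> out_dim from this lesson
total = sum(linear_params(a, b) for a, b in zip(layer_dims, layer_dims[1:]))
print(total)  # 131
```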
def forward(X):
    z1 = X @ W1 + b1          # (B, hidden_dim)
    a1 = np.maximum(0, z1)    # ReLU
    z2 = a1 @ W2 + b2         # (B, out_dim), the logits
    return z1, a1, z2         # cache for backward
We return the intermediates z1 and a1, not just the output z2. The backward pass needs them. This is the same trick autograd does under the hood — store activations during forward so you can multiply by them on the way back.
Activations dominate VRAM during training. Quick math for a Transformer block at B=8, T=2048, D=4096:
For 32 layers stacked, that's ~17 GB of activations alone, just for the FFN intermediates. This is why gradient checkpointing (recomputing some activations on backward instead of storing them) is everywhere in LLM training. We'll meet it on Day 10.
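The ~17 GB figure can be reproduced under two assumptions the text does not state explicitly: one stored FFN intermediate of width 4·D per layer, and 2 bytes per element (fp16 or bf16). A sketch of that arithmetic:

```python
B, T, D = 8, 2048, 4096     # batch, sequence length, model width (from the text)
ffn_mult = 4                # assumed: FFN hidden width = 4 * D
bytes_per_elem = 2          # assumed: fp16 / bf16 activations
n_layers = 32

per_layer_bytes = B * T * ffn_mult * D * bytes_per_elem
total_bytes = per_layer_bytes * n_layers

print(f"per layer: {per_layer_bytes / 1e9:.2f} GB")
print(f"{n_layers} layers: {total_bytes / 1e9:.1f} GB")
```

Storing more than one intermediate per layer, or using 4-byte floats, scales the total proportionally.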
The transformer FFN sublayer, the one you'll actually be serving, is this same MLP pattern with bigger dimensions. In LLaMA-7B (d_model = 4096) the FFN has 11008 hidden units, roughly 2.7 × d_model, and uses SiLU (a smooth ReLU lookalike) inside a gated SwiGLU layer with three weight matrices instead of two. Every inference request runs:
Params vs FLOPs. One FFN layer in a 7B model has ~3 × 4096 × 11008 ≈ 135M parameters. At batch size 1 (common during inference), every parameter is loaded from memory but used for just a handful of multiplications, so FFN layers are memory-bandwidth bound, not compute bound. Shrinking FFN weights (quantization, pruning) is one of the biggest levers on serving cost. You'll need to understand this MLP before any of that makes sense.
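A back-of-the-envelope sketch of the bandwidth argument. For a single weight matrix it compares multiply-add work against the bytes of weights that must be read, assuming 2-byte weights and ignoring activation traffic (both simplifications). The helper name `arithmetic_intensity` is illustrative.

```python
def arithmetic_intensity(batch, d_in, d_out, bytes_per_weight=2):
    """FLOPs performed per byte of weights read, for Y = X @ W."""
    flops = 2 * batch * d_in * d_out              # one multiply + one add per weight per row
    weight_bytes = d_in * d_out * bytes_per_weight
    return flops / weight_bytes

d_model = 4096
d_ff = 4 * d_model   # assumed hidden width; the ratio below does not depend on it

for batch in (1, 8, 64, 512):
    print(batch, arithmetic_intensity(batch, d_model, d_ff))
```

With 2-byte weights the ratio equals the batch size: one FLOP per byte at batch 1. Accelerators generally need far more work per byte than that to keep their arithmetic units busy, which is why batching requests or shrinking the weights helps.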
For classification, cross-entropy is the universal loss. From information theory (Day 1, Shannon's entropy): the average number of bits you need to encode the true distribution if you assume the model's distribution. Lower is better; zero is perfection.
Formally, for one example with true class y and predicted distribution p:
That log p_y is the only term that survives because all other indicators are zero. Average over the batch:
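Spelled out in symbols: the per-example loss first, then its batch average. The index i over the B examples and the notation p_{i,c} are introduced here for illustration.

```latex
L_{\text{example}} = -\sum_{c=1}^{C} \mathbb{1}[c = y]\,\log p_c = -\log p_y
\qquad\qquad
L = -\frac{1}{B}\sum_{i=1}^{B} \log p_{i,\,y_i}
```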
If the model is perfectly confident on the correct class (p_y = 1), the loss is −log 1 = 0. If it's spreading mass evenly across C classes (p_y = 1/C), the loss is log C — a useful baseline. For C = 10 (MNIST), log 10 ≈ 2.30. A randomly initialized 10-class classifier should output a loss of ~2.3 on the first batch. If yours doesn't, you've already got a bug.
def cross_entropy_loss(logits, y):
    # logits: (B, C); y: (B,) integer labels
    z = logits - logits.max(axis=-1, keepdims=True)   # stability
    log_sum_exp = np.log(np.exp(z).sum(axis=-1, keepdims=True))
    log_p = z - log_sum_exp                            # (B, C) log-probabilities
    nll = -log_p[np.arange(len(y)), y]                 # (B,) loss per example
    return nll.mean(), np.exp(log_p)                   # mean loss, probabilities p
Practical note. Always compute cross-entropy on raw logits (F.cross_entropy(logits, y) in PyTorch). Computing softmax then log separately loses precision — log(softmax(z)) is mathematically the same but numerically worse. The fused log_softmax + nll_loss (which cross_entropy uses) is the right call.
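A NumPy sketch of the precision point, using deliberately extreme logits. Both helper functions are written here for the comparison; PyTorch's own fused implementation is not shown.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def log_softmax(z):
    # z - logsumexp(z), with the max subtracted for stability
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

logits = np.array([0.0, -800.0])     # extreme on purpose

with np.errstate(divide="ignore"):
    naive = np.log(softmax(logits))  # exp(-800) underflows to 0, then log(0) = -inf

fused = log_softmax(logits)

print(naive)   # second entry is -inf
print(fused)   # second entry is -800.0
```

An infinite log-probability turns the loss into inf or NaN as soon as that class is the label; the fused form keeps the finite value.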
Here's the magic. When you compose p = softmax(z) and L = − log p_y, the derivative simplifies dramatically:
In words: the gradient of the loss with respect to the logits is just the predicted distribution minus the one-hot true label. No softmax Jacobian. No log-derivative. Just (p − y).
For a batch, divide by B because we averaged the loss:
This is the gradient that starts the entire backward walk. Memorize it. It's the reason every classifier ever has trained as fast as it does.
The one-line derivation:
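One way to write it, substituting the softmax definition into L = −log p_y (the index k over logits is notation added here):

```latex
L = -\log p_y = -z_y + \log \sum_{j} e^{z_j}
\quad\Longrightarrow\quad
\frac{\partial L}{\partial z_k}
  = -\mathbb{1}[k = y] + \frac{e^{z_k}}{\sum_{j} e^{z_j}}
  = p_k - \mathbb{1}[k = y]
```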
This clean form is one reason cross-entropy is the universal classification loss. Squared error on probabilities (mean-squared error on softmax outputs) gives a sloppier gradient with (p − y) · p · (1 − p) factors that vanish at the extremes — bad for optimization.
Cross-entropy loss comes directly from Claude Shannon's 1948 information theory. Shannon was answering a different question (how to compress messages efficiently); the same math measures how surprised your model is by the correct answer. Low cross-entropy = the model wasn't surprised. Compression and intelligence are the same problem in disguise.
We just got dz₂ = (p − one_hot(y)) / B. From there, the chain rule goes layer by layer.
The cheat sheet — derivatives of every primitive in our forward pass:
| Forward op | Local derivative | Notes |
|---|---|---|
| z₂ = a₁ W₂ + b₂ | ∂L/∂W₂ = a₁ᵀ · dz₂, ∂L/∂b₂ = Σ dz₂, ∂L/∂a₁ = dz₂ · W₂ᵀ | Standard linear layer backward |
| a₁ = ReLU(z₁) | ∂a₁/∂z₁ = 1[z₁ > 0] | Element-wise mask |
| z₁ = X W₁ + b₁ | ∂L/∂W₁ = Xᵀ · dz₁, ∂L/∂b₁ = Σ dz₁, ∂L/∂X = dz₁ · W₁ᵀ | Same shape; upstream is dz₁ |
Apply them in reverse. With dz₂ already in hand:
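The six assignments, as they follow from the table above. Shapes on the right use B for batch size, H for hidden width, and C for the number of classes; those letters are shorthand chosen here.

```latex
\begin{aligned}
dW_2 &= a_1^{\top}\, dz_2                 && (H, C) \\
db_2 &= \textstyle\sum_{i} dz_2[i]        && (C,) \\
da_1 &= dz_2\, W_2^{\top}                 && (B, H) \\
dz_1 &= da_1 \odot \mathbb{1}[z_1 > 0]    && (B, H) \\
dW_1 &= X^{\top}\, dz_1                   && (d_{\text{in}}, H) \\
db_1 &= \textstyle\sum_{i} dz_1[i]        && (H,)
\end{aligned}
```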
Six lines. Every gradient autograd would compute, by hand. Match them to your forward pass shapes — there's exactly one transposition per matmul, and biases sum across the batch axis (because each example contributed independently to the bias).
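A quick way to gain confidence in hand-derived gradients is a finite-difference check: nudge each parameter by ±eps, re-run the forward pass, and compare the slope against the analytic gradient. The sketch below does that for a small network of the same shape. The sizes, the seed, and the helper names (`loss_and_cache`, `manual_grads`, `numeric_grads`) are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
B, d_in, H, C = 8, 4, 16, 3                      # small illustrative sizes
X = rng.standard_normal((B, d_in))
y = rng.integers(0, C, B)
params = {
    "W1": rng.standard_normal((d_in, H)) * np.sqrt(2.0 / d_in),
    "b1": rng.standard_normal(H) * 0.1,          # non-zero so the bias check is meaningful
    "W2": rng.standard_normal((H, C)) * np.sqrt(2.0 / H),
    "b2": rng.standard_normal(C) * 0.1,
}

def loss_and_cache(p):
    z1 = X @ p["W1"] + p["b1"]
    a1 = np.maximum(0, z1)
    z2 = a1 @ p["W2"] + p["b2"]
    z = z2 - z2.max(axis=-1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    loss = -log_p[np.arange(B), y].mean()
    return loss, (z1, a1, np.exp(log_p))

def manual_grads(p):
    _, (z1, a1, probs) = loss_and_cache(p)
    dz2 = probs.copy()
    dz2[np.arange(B), y] -= 1.0                  # p - one_hot(y)
    dz2 /= B
    dz1 = (dz2 @ p["W2"].T) * (z1 > 0)           # back through ReLU
    return {"W1": X.T @ dz1, "b1": dz1.sum(axis=0),
            "W2": a1.T @ dz2, "b2": dz2.sum(axis=0)}

def numeric_grads(p, eps=1e-5):
    grads = {}
    for name, arr in p.items():
        g = np.zeros_like(arr)
        for idx in np.ndindex(arr.shape):
            old = arr[idx]
            arr[idx] = old + eps
            plus, _ = loss_and_cache(p)
            arr[idx] = old - eps
            minus, _ = loss_and_cache(p)
            arr[idx] = old                        # restore the parameter
            g[idx] = (plus - minus) / (2 * eps)   # central difference
        grads[name] = g
    return grads

analytic = manual_grads(params)
numeric = numeric_grads(params)
max_err = max(np.abs(analytic[k] - numeric[k]).max() for k in params)
print(f"max abs difference: {max_err:.2e}")
```

A tiny difference means the derivation and the code agree. A large one usually points to a missing transpose, a forgotten 1/B, or a wrong sum axis. The check can be thrown off when a pre-activation sits almost exactly at the ReLU kink, which is rare with random data.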
The same bias vector b is added to every row i of the batch, so each row contributes its own term to the bias gradient. So ∂L/∂b = Σᵢ ∂L/∂zᵢ · ∂zᵢ/∂b = Σᵢ dzᵢ. The gradient w.r.t. the broadcasted dimension is the sum across that dimension. This is a general rule: the gradient w.r.t. a broadcasted operand is the reduction of the gradient over the broadcast axes. PyTorch and MLX autograd both apply it automatically.
Xᵀ · dz
For z = X · W, we have zₖₗ = Σⱼ Xₖⱼ · Wⱼₗ. So ∂zₖₗ/∂Wᵢⱼ = Xₖᵢ · 1[j = ℓ]. Chain rule:
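Carrying the indicator through the double sum, in the same index notation (dz stands for ∂L/∂z):

```latex
\frac{\partial L}{\partial W_{ij}}
  = \sum_{k}\sum_{\ell} \frac{\partial L}{\partial z_{k\ell}}\,
    \frac{\partial z_{k\ell}}{\partial W_{ij}}
  = \sum_{k}\sum_{\ell} dz_{k\ell}\, X_{ki}\, \mathbb{1}[j = \ell]
  = \sum_{k} X_{ki}\, dz_{kj}
  = \left(X^{\top} dz\right)_{ij}
```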
In one line: transpose the input, matmul with the upstream gradient, get the weight gradient. Same shape as W. Dimension check: (d_in, B) @ (B, d_out) = (d_in, d_out) ✓.
ReLU seems obvious now, but neural networks used tanh and sigmoid almost exclusively until Glorot et al. 2011 showed ReLU dramatically improved deep network training. The fix for the vanishing gradient problem — max(0, x) — was hiding in plain sight for decades.
You can have the right architecture, the right loss, the right optimizer, and still training won't converge — because the weights started at the wrong scale.
The intuition: a forward pass through L linear layers multiplies the input variance by roughly (σ²·n)ᴸ where σ is the weight scale and n is the fan-in. If σ²·n > 1, activations explode (NaNs by layer 30). If σ²·n < 1, activations vanish (every neuron outputs roughly zero by layer 30, no gradient flows back). You need σ²·n ≈ 1.
There's a similar story on the backward pass. With activation f, the variance of gradients flowing backward through the layer is multiplied by (σ²·n_out · 𝔼[f'(z)²]). So the "right" σ depends on the activation function:
| Activation | Best init (variance) | Common name | Used with |
|---|---|---|---|
| tanh, sigmoid | σ² = 1 / fan_in | Xavier / Glorot | RNNs, older models |
| ReLU, leaky ReLU | σ² = 2 / fan_in | He / Kaiming | CNNs, MLPs, transformer FFN |
| Generic linear / no activation | σ² = 1 / fan_in | Xavier-uniform | Final classifier head |
| Modern transformer linears | σ ≈ 0.02 (constant) | GPT-2 init | LLaMA, GPT-2/3, etc. |
Why 2 / fan_in for ReLU? Because ReLU zeroes out half the activations in expectation, so to keep the active half's variance unit, you need to double the initial weight variance. Xavier was derived for symmetric tanh; He generalized it to ReLU's asymmetry.
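The factor of 2 is easy to see empirically. The sketch below pushes a random batch through 30 bias-free ReLU layers and reports the mean squared activation at the end, once with σ² = 1/fan_in and once with σ² = 2/fan_in. Width, depth, batch size, and the helper name `final_mean_square` are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth, batch = 512, 30, 64     # illustrative sizes (assumed)

def final_mean_square(var_times_fan_in):
    """Push a random batch through `depth` ReLU layers; return mean(a**2) at the end."""
    a = rng.standard_normal((batch, width))          # input with mean square ~1
    sigma = np.sqrt(var_times_fan_in / width)        # weight std for this init
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * sigma
        a = np.maximum(0, a @ W)                     # linear layer (no bias) + ReLU
    return float((a ** 2).mean())

for scale, label in [(1.0, "sigma^2 = 1/fan_in"), (2.0, "sigma^2 = 2/fan_in (He)")]:
    print(f"{label:26s} mean square after {depth} layers: {final_mean_square(scale):.3e}")
```

With 1/fan_in the signal roughly halves at every ReLU layer, so after 30 layers almost nothing is left. With 2/fan_in it stays within a small factor of where it started.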
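# He init for ReLU networks
import numpy as np
rng = np.random.default_rng(42)
in_dim, hidden_dim = 4, 16            # sizes from the toy MLP above
W1 = rng.standard_normal((in_dim, hidden_dim)) * np.sqrt(2.0 / in_dim)
b1 = np.zeros(hidden_dim)             # bias init = 0 (always)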
Bias init is always zero. There's nothing asymmetric about a bias; starting it at zero gives the network a clean reference and lets the data move it.
Modern transformer trick. GPT-2 and most descendants use a constant σ = 0.02 for all linears, and additionally scale residual-branch outputs by 1/√(2L) (where L is layer count) to prevent output magnitudes from blowing up as you stack more blocks. We'll see this on Day 12 when we build a real GPT.
A botched init is among the most common training-failure modes — and one of the easiest to debug. If your loss starts at 50 instead of log(C), your init is wrong.
Given gradients, stochastic gradient descent moves each parameter in the negative-gradient direction:
η (the Greek letter eta) is the learning rate — how large a step to take. Concrete numbers (η = 0.05, dW₂[0,0] = 0.13):
Each step nudges every parameter by η · grad. Repeat thousands of times.
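In symbols, with θ standing for any one parameter (W₁, b₁, W₂ or b₂), followed by the concrete numbers quoted above. The starting value of W₂[0,0] is left symbolic because the text does not give one.

```latex
\theta \leftarrow \theta - \eta\,\frac{\partial L}{\partial \theta}
\qquad\qquad
W_2[0,0] \leftarrow W_2[0,0] - 0.05 \times 0.13 = W_2[0,0] - 0.0065
```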
Learning rate intuition. Too low: training crawls. Too high: loss bounces or NaNs out — you're stepping past the minimum. Empirical default for SGD on small MLPs: η ∈ [0.01, 0.1]. For Adam (Day 4): η ∈ [1e-4, 3e-4]. Loss curves diverging by step 2 = your η is way too high.
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
That's vanilla SGD. Day 4 replaces it with momentum, then Adam, then AdamW — all small variations on this single line.
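The effect of the learning rate is easiest to see on a one-parameter toy problem. The sketch below runs the same update on L(w) = 0.5·w², a loss chosen here purely for illustration. The exact thresholds it shows are specific to this quadratic and do not transfer to the MLP.

```python
def run_sgd(lr, steps=20, w0=1.0):
    """Minimize the toy loss L(w) = 0.5 * w**2, whose gradient is simply w."""
    w = w0
    for _ in range(steps):
        grad = w            # dL/dw
        w -= lr * grad      # the same update rule as above
    return w

for lr in (0.001, 0.1, 1.0, 2.5):
    print(f"lr={lr:<6} w after 20 steps = {run_sgd(lr): .4g}")
```

The smallest rate barely moves w, the middle rates head to the minimum at 0, and the largest overshoots further on every step until the value blows up.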
Three options for choosing how many examples per gradient step:
B examples per step (typically 32–512). The compromise everyone uses.
Mini-batches give you three benefits at once:
B = 256 fits.
Modern LLM training typically uses batch sizes of 1–4 million tokens (collected via gradient accumulation across many GPUs), so the gradient noise is small enough that AdamW behaves almost like full-batch.
This is the lesson. Type it. Run it. Watch loss drop.
import numpy as np
rng = np.random.default_rng(42)
# ---- Hyperparameters ----
in_dim, hidden_dim, out_dim = 4, 16, 3
batch_size = 32
lr = 0.05
n_steps = 2000
# ---- Data: 3 Gaussian blobs in 4D ----
N = 1024
class_centers = rng.standard_normal((out_dim, in_dim)) * 3.0
y = rng.integers(0, out_dim, N)
X = rng.standard_normal((N, in_dim)) + class_centers[y]
# ---- Init: He for ReLU, zero bias ----
W1 = rng.standard_normal((in_dim, hidden_dim)) * np.sqrt(2.0 / in_dim)
b1 = np.zeros(hidden_dim)
W2 = rng.standard_normal((hidden_dim, out_dim)) * np.sqrt(2.0 / hidden_dim)
b2 = np.zeros(out_dim)
def relu(z): return np.maximum(0, z)
def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
def forward(X):
    z1 = X @ W1 + b1
    a1 = relu(z1)
    z2 = a1 @ W2 + b2
    return z1, a1, z2
def cross_entropy(logits, y):
    p = softmax(logits)
    log_p_correct = np.log(p[np.arange(len(y)), y] + 1e-12)
    return -log_p_correct.mean(), p
def backward(X, y, z1, a1, p):
    B = X.shape[0]
    dz2 = p.copy()
    dz2[np.arange(B), y] -= 1.0
    dz2 /= B
    dW2 = a1.T @ dz2
    db2 = dz2.sum(axis=0)
    da1 = dz2 @ W2.T
    dz1 = da1 * (z1 > 0)
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)
    return dW1, db1, dW2, db2
# ---- Training loop ----
for step in range(n_steps):
    idx = rng.choice(N, batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    z1, a1, z2 = forward(Xb)
    loss, p = cross_entropy(z2, yb)
    dW1, db1, dW2, db2 = backward(Xb, yb, z1, a1, p)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    if step % 200 == 0:
        acc = (z2.argmax(-1) == yb).mean()
        print(f"step {step:4d} loss {loss:.4f} train-acc {acc:.3f}")
What you should see — three rules of thumb that work for almost any MLP:
log(out_dim). Here log(3) ≈ 1.10. If your first step prints something wildly different (50? NaN?) your init is broken.
If any of those three are off, stop and debug:
NaN: lower lr, double-check init.
lr too low or weights stuck at zero (no init or all-zero init).
lr way too high, you're bouncing past the minimum.
Identical math, autograd does the backward.
import torch
import torch.nn as nn
import torch.nn.functional as F
torch.manual_seed(42)
in_dim, hidden_dim, out_dim = 4, 16, 3
batch_size = 32
class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)
    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))
model = MLP()
opt = torch.optim.SGD(model.parameters(), lr=0.05)
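# Generate the same Gaussian-blob data as in the NumPy version, then convert to tensors.
import numpy as np
rng = np.random.default_rng(42)
class_centers = rng.standard_normal((out_dim, in_dim)) * 3.0
y_np = rng.integers(0, out_dim, 1024)
X_np = rng.standard_normal((1024, in_dim)) + class_centers[y_np]
X = torch.from_numpy(X_np).float()   # float64 -> float32 to match nn.Linear weights
y = torch.from_numpy(y_np)           # int64 class labels, as F.cross_entropy expects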
for step in range(2000):
    idx = torch.randperm(len(X))[:batch_size]
    Xb, yb = X[idx], y[idx]
    logits = model(Xb)
    loss = F.cross_entropy(logits, yb)   # softmax + NLL fused
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 200 == 0:
        acc = (logits.argmax(-1) == yb).float().mean()
        print(f"step {step:4d} loss {loss.item():.4f} train-acc {acc:.3f}")
Same hyperparameters, same data, same outcome. Side-by-side, the differences are mechanical:
| What | NumPy | PyTorch |
|---|---|---|
| Parameter container | W1, b1, W2, b2 globals | model.parameters() from nn.Module |
| Init | manual np.sqrt(2/fan_in) | automatic (Kaiming uniform default) |
| Forward | forward(X) returning intermediates | model(X) returning logits |
| Loss | hand-written cross-entropy | F.cross_entropy(logits, y) (fused) |
| Backward | backward(...) returning grads | loss.backward() |
| Update | W1 -= lr * dW1; … | opt.step() |
| Zero grads | n/a | opt.zero_grad() before backward |
nn.Module does three things for you: tracks parameters, exposes them through model.parameters() so you can hand them to the optimizer, and routes the forward pass through registered submodules. Everything else is the same.
The four-line heartbeat — opt.zero_grad() → loss.backward() → opt.step() (preceded by the forward) — is what every PyTorch training loop ever written looks like. Same on Day 9 (tiny GPT), Day 10 (mixed precision), Day 11 (distributed). Memorize it.
For Apple Silicon, MLX is the native equivalent. The structure is functional — gradients are returned from a function, not deposited on .grad.
import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim
mx.random.seed(42)
class MLP(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)
    def __call__(self, x):
        return self.fc2(nn.relu(self.fc1(x)))
model = MLP(4, 16, 3)
opt = optim.SGD(learning_rate=0.05)
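# 3 Gaussian blobs in 4D, mirroring the NumPy data: labels must correlate with X to be learnable.
centers = mx.random.normal((3, 4)) * 3.0
y = mx.random.randint(0, 3, (1024,))
X = mx.random.normal((1024, 4)) + centers[y]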
def loss_fn(model, x, y):
    return nn.losses.cross_entropy(model(x), y, reduction="mean")
loss_and_grad = nn.value_and_grad(model, loss_fn)
for step in range(2000):
    idx = mx.random.permutation(len(X))[:32]
    Xb, yb = X[idx], y[idx]
    loss, grads = loss_and_grad(model, Xb, yb)
    opt.update(model, grads)
    mx.eval(model.parameters(), opt.state)   # cap deferred graph
    if step % 200 == 0:
        print(f"step {step:4d} loss {loss.item():.4f}")
Three things to notice vs PyTorch:
No .grad on tensors. loss_and_grad returns the loss and a parameter pytree of gradients in one call.
No zero_grad. Functional API, every call returns fresh grads.
mx.eval after each step. MLX is lazy: without periodic eval, the deferred computation graph grows unboundedly.
Day 2 covered the lazy-eval mental model in depth; we're just applying it.
"PyTorch is just a giant chain-rule engine. MLX is a giant chain-rule engine that defers."
Train and validate. Plot both losses against step.
Underfitting. Both training and validation loss are high and flat-ish. The model can't capture the pattern — it's too small, the lr is too low, or the features are inadequate. Counters: bigger model, longer training, better features.
Overfitting. Training loss keeps dropping; validation loss bottoms out and climbs back up. The model has memorized the training set and is now noise-fitting. Counters: more data (the cleanest fix), regularization (weight decay, dropout), early stopping at the validation minimum, smaller model.
Just right. Both losses drop, validation tracks training, the gap is small. This is what you want.
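Early stopping, one of the overfitting counters listed above, can be sketched in a few lines. The function name `early_stopping_step`, the patience value, and the validation curve are all made up for illustration.

```python
def early_stopping_step(val_losses, patience=3):
    """Return (best_step, stop_step) for a list of validation losses.

    Stops once the loss has failed to improve for `patience` checks in a row.
    """
    best_step, best_loss = 0, float("inf")
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            best_step, best_loss = step, loss
        elif step - best_step >= patience:
            return best_step, step
    return best_step, len(val_losses) - 1

# Made-up validation curve: improves, bottoms out at index 4, then climbs.
val_curve = [1.10, 0.80, 0.62, 0.55, 0.53, 0.54, 0.57, 0.61, 0.66]
best, stopped = early_stopping_step(val_curve, patience=3)
print(best, stopped)  # 4 7
```

In a real loop you would save the weights whenever the best step updates and restore them when training stops.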
LLM pre-training is a curious case: the data set is so much larger than the model's capacity that overfitting is rare — the model never sees the same example twice. The opposite problem dominates: how to train more efficiently on a fixed compute budget. We'll get there on Day 8 when we cover scaling laws.
How much compute should you put where? Two extremes:
Empirically, depth wins for most real tasks — but only if you can train it. Modern transformers go ~30-100 layers deep with d_model = 4096. The reason this works: residual connections (Day 7), pre-LayerNorm (Day 7), and well-tuned init (this lesson). Without all three, training a 30-layer transformer fails.
Quick capacity rules of thumb:
d_model 768-8192, 4 · d_model FFN width.
MNIST, the dataset of 70,000 handwritten digits, was created in 1998 by Yann LeCun and team. It's been used in essentially every deep learning paper that needed a sanity-check baseline. People still use it. It's the "Hello World" of ML.
Run these in a Python REPL or notebook. Type the code; don't copy-paste. The point is to feel the shapes.
Run it. Watch loss drop. Then perturb:
lr = 5.0. What does the loss curve look like? Why?
lr = 1e-5. How many steps before any progress?
hidden_dim = 1. Why does accuracy plateau at ~33%?
randn). Does training still converge? How many steps does it take?
Extend the NumPy version to d_in → 16 → 16 → 3. Derive the new backward by hand on paper. Implement it, train it, verify it converges.
In both NumPy and PyTorch versions. The derivative of tanh(z) is 1 − tanh²(z). Update the NumPy backward; PyTorch handles it for you. Compare convergence speeds — tanh is typically slower in deep nets.
Predict on paper, then verify. (Hint: random init means roughly uniform predictions across 3 classes.)
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
train_ds = datasets.MNIST('./data', train=True, download=True,
transform=transforms.ToTensor())
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)
Flatten the 28×28 image to a 784-vector. Build a 784 → 256 → 10 MLP. Train for 5 epochs. Target test accuracy: ≥ 95%. Plot training and validation loss against step. Is your network overfitting? Add weight decay (SGD(..., weight_decay=1e-4)) and re-plot.
Translate your MNIST MLP from PyTorch to MLX. Compare wall-clock training time. (M-series Macs typically train a 256-hidden MLP on MNIST faster than integrated GPUs and roughly on par with mid-range discrete GPUs; treat that as a rough expectation and measure on your own hardware.)
Five common training-loop bugs. Read the symptoms. Predict which line caused each:
| Symptom | Likely bug |
|---|---|
| Loss is NaN after step 1 | (a) (b) (c) (d) (e) |
| Loss is constant for 1000 steps | (a) (b) (c) (d) (e) |
| Loss decreases on train but val accuracy is 10% (random) | (a) (b) (c) (d) (e) |
| Train accuracy is 99%, val is 60% | (a) (b) (c) (d) (e) |
| Loss starts at ~50 instead of ~2.30 | (a) (b) (c) (d) (e) |
(a) Forgot opt.zero_grad() (grads accumulate)
(b) Used lr = 100
(c) Used lr = 1e-9
(d) Forgot to scale init by √(2/n) → init too large
(e) Forgot to shuffle the dataset
Hand-picked references for this lesson. Free where possible. Books and papers where the depth is irreplaceable.
The single best resource for today's lesson. Builds an autograd-equipped neural network from scratch in pure Python from absolute zero.
Chapters 1-2. The clearest backprop introduction ever written.
Deep Feedforward Networks. The canonical textbook chapter.
Episode 4 of the neural networks series. Animations explain the chain rule better than equations.
Train a 2-layer net in your browser. Watch decision boundaries form. Pure intuition.
The Nature paper that popularized backprop. Read for historical context.
Where He init comes from. Section 2.2 has the variance derivation.
Where Xavier init comes from. The paper that started the init revolution.
Where SiLU/Swish comes from. Used in LLaMA, Mistral, Qwen FFN.
Christopher Olah's clean visual explanation of forward vs reverse mode autodiff.
MLX's equivalent linear layer. Same architecture, slightly different conventions.
Close the page and answer from memory. If you can't, re-read the relevant section.
Y = (X W₁) W₂ are equivalent to a single linear layer with weight matrix W = ?. Why does this break with a non-linearity in between?
√(2/fan_in). Where does the 2 come from?
db = dz.sum(axis=0)? What's special about a bias?
log(C) for many steps and never moves, what's the most likely bug?