Day 03 — Neural Networks From Scratch · LLM Inference Engineer Curriculum

Why This Lesson

A Transformer is, fundamentally, MLPs plus attention.

If the multi-layer perceptron isn't in your bones — if you can't write its forward pass on a whiteboard, derive its backward by hand, predict its shapes, and explain why ReLU rescued deep learning — then transformers will feel mystical for the rest of the curriculum. They aren't. They're MLPs with attention bolted on.

So today we kill the mystery. We do it the slow way first — pure NumPy, no autograd, every gradient derived from the chain rule on Day 1 — and then we re-do it in PyTorch and MLX. The slow way takes 30 lines and proves the libraries aren't magic. The fast way takes 8 lines and is exactly what every modern LLM training loop does.

Concrete reference points for what builds on this lesson:

Day 4 — backprop & optimizers — generalizes today's by-hand backward to arbitrary graphs and replaces SGD with Adam/AdamW.
Day 6 — attention is, in its inner loop, an MLP with one extra step (the softmax).
Day 7 — a Transformer block is Attention → MLP → Attention → MLP → ..., where each MLP is literally the network you build today, just with d_model = 4096 instead of 16.
Days 9–10 — when we train tiny GPT, the optimizer touches exactly the kind of state_dict we'll meet today.

If anything below feels fuzzy by the end, the commit is: re-type, don't copy-paste, the NumPy MLP. Type until loss drops. Don't move on until you can re-derive the backward pass on paper.

Learning objectives

Define a neuron in one line of math and a linear layer as a batched matrix multiplication.
Predict the output shape of every layer in a 2- or 3-layer MLP.
Explain why non-linear activations are essential — and back the explanation with a concrete demo that two stacked linears collapse to one.
Compare ReLU, Leaky ReLU, GELU, SiLU, tanh, sigmoid: shape, derivative, when each is preferred, and which one modern LLMs use.
Derive the gradients of an MLP w.r.t. every parameter by hand. Show that softmax + cross-entropy collapses to (p − y)/B.
Implement a 2-layer MLP from scratch in NumPy, train it on a synthetic dataset, watch loss drop.
Reproduce the same model in PyTorch and MLX, line for line, and verify all three converge.
State Xavier vs He initialization, derive the √(2/fan_in) constant, and explain why a poor init breaks training.
Diagnose underfitting vs overfitting from a training/validation loss curve.
Train a 2-layer MLP on MNIST to ≥ 95% test accuracy.

The Neuron

A neuron is a dot product, a bias, and an activation.

A neuron is the unit of computation. It does three things:

Take a vector of inputs x = (x₁, …, x_n).
Compute a weighted sum z = w·x + b = Σᵢ wᵢ xᵢ + b.
Pass z through a non-linear activation f, returning y = f(z).

y = activation(w · x + b)

That's it. That's the entire neuron. Geometrically, w · x = ‖w‖‖x‖cos θ, so the dot product measures alignment between w and x. The bias b shifts the activation threshold. The activation f introduces non-linearity.

A neuron with no activation is just a linear function of its inputs. Stacking gains nothing.
A neuron with f = sigmoid and a 0/1 target is logistic regression.
A whole layer is many neurons sharing the same input. Stack them as columns of a weight matrix and the dot products parallelize into one matmul.

One neuron: dot product w · x (a weighted sum — Σ means sum(...)), plus a bias b, then a non-linear function f. Stack d_out of these side-by-side and pack the weights into a matrix: you get a linear layer.

From neuron to linear layer

A linear layer with d_in inputs and d_out outputs is d_out neurons sharing the same input vector. Pack all the weights into a matrix W (a 2D array of floats, shape (d_in, d_out)) and all the biases into a vector b (a 1D array of floats, length d_out). For a batch X (shape (B, d_in)) of B inputs at once:

Y = X W + b (B, d_in) @ (d_in, d_out) + (d_out,) → (B, d_out)

The bias broadcasts across the batch. Same matmul rules as Day 1.

A 4096 × 4096 linear has ~16.8M trainable parameters. LLaMA-7B has ~96 such matrices in the FFN blocks alone.

What about bias terms?

In modern LLMs (LLaMA, Mistral, Qwen), the linear layers in the FFN often have no bias. Two reasons:

Scale. When d_model = 4096, an extra 4096-element vector per linear is rounding error in parameter count, but it's another fragile thing to learn. Empirically, removing biases doesn't hurt loss.
LayerNorm absorbs them. A LayerNorm right before the linear already injects a learned mean shift; another bias is redundant.

GPT-2 had biases. Most post-2022 architectures don't. We'll keep biases today because we're learning, but it's worth knowing the convention shifted.

Activations

The source of expressivity is the non-linearity.

Why we need non-linearity

Stack two linear layers with no activation between them:

y = (X W₁) W₂ = X (W₁ W₂)

Matrix multiplication is associative — so W₁ W₂ is just another matrix. Stacking 100 linear layers without activations collapses to one equivalent linear layer. The whole network can only learn linear functions of its inputs.

The proof is one line. Concrete demo:

import numpy as np
W1 = np.random.randn(4, 16)
W2 = np.random.randn(16, 3)
x  = np.random.randn(4)
print(((x @ W1) @ W2))         # two linear ops
print(x @ (W1 @ W2))           # one linear op — same result
# array agreement: tiny float drift only

Adding any non-linearity between the layers — max(0, ·), tanh, anything — breaks associativity and the network can suddenly approximate functions a single linear layer can't.

Universal Approximation Theorem (Hornik 1989, Cybenko 1989). A feed-forward network with at least one hidden layer and a non-polynomial activation can approximate any continuous function on a compact set, given enough neurons. The theorem doesn't say how many neurons — it might need a billion — but it justifies the architecture. Depth × non-linearity = expressivity.

The cast of activations

Each activation is a scalar function f: ℝ → ℝ applied element-wise to the layer pre-activation z. Their derivatives matter, because backward will multiply by them.

Six activations to know. The smoothness of GELU and SiLU is mostly aesthetic — empirically they each give a small but consistent improvement over ReLU on Transformer FFNs.

Activation	Formula	Derivative	Output range	Used in
ReLU	`max(0, x)`	`1 if x > 0 else 0`	`[0, ∞)`	The default since 2012. Vision, vanilla MLPs, GPT-1/2 FFN.
Leaky ReLU	`max(αx, x)`, α≈0.01	`1 if x > 0 else α`	`(−∞, ∞)`	When ReLU "dies" (neurons stuck at zero).
GELU	`x · Φ(x)` (Φ = Gaussian CDF)	`Φ(x) + x · φ(x)`	`≈[−0.17, ∞)`	BERT, GPT-2, GPT-3. Smooth ReLU.
SiLU / Swish	`x · σ(x)`	`σ(x) + x·σ(x)(1−σ(x))`	`≈[−0.28, ∞)`	LLaMA, Mistral, Qwen FFN. Often inside SwiGLU.
tanh	`(eˣ − e⁻ˣ)/(eˣ + e⁻ˣ)`	`1 − tanh²(x)`	`(−1, 1)`	RNNs, GRU/LSTM gates, sometimes policy heads.
Sigmoid	`1 / (1 + e⁻ˣ)`	`σ(x)(1 − σ(x))`	`(0, 1)`	Final layer for binary classification; gates inside LSTMs.
Softmax	`eˣⁱ / Σⱼ eˣʲ`	vector-valued (see below)	row sums to 1	The output of every classifier. The weights inside attention.

The vanishing gradient problem (and why ReLU saved deep learning)

When you stack tanh activations and the pre-activations get large, tanh'(z) = 1 − tanh²(z) ≈ 0 for |z| > 2. Multiplying small gradients through 50 layers of backward gives you something effectively zero. Early-layer parameters stop receiving useful updates. Training stalls.

ReLU's derivative is exactly 1 for x > 0 — no decay. Stacking 100 ReLU layers, the gradient still has full magnitude in the active half. This single change, plus better initialization (next section), is what made deep nets viable. Before 2012, "deep" meant 3-5 layers; after, hundreds.

Softmax is the universal output

For multi-class classification with C classes, the network outputs C real numbers — logits — one per class. Softmax converts them into a probability distribution:

softmax(z)ᵢ = eᶻⁱ / Σⱼ eᶻʲ

Two important properties:

Each output is in (0, 1) and the row sums to 1. So the row is a valid probability distribution.
Translation-invariant. softmax(z) = softmax(z − max(z)). We always subtract the max in practice for numerical stability — otherwise eᶻ overflows.

def softmax_stable(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical safety
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

Softmax has a vector-valued derivative (a Jacobian, not a scalar). We never compute it directly — we always pair it with cross-entropy loss, and the product simplifies to something gorgeous. Two sections from now.

The Forward Pass

Two matmuls, one ReLU, one softmax, one cross-entropy.

Let's build a 2-layer MLP for classification:

in_dim = 4 # features per example hidden_dim = 16 out_dim = 3 # classes batch_size = 32

Forward pass, with every intermediate shape labeled:

X (B, in_dim) = (32, 4) # batch of inputs z₁ = X W₁ + b₁ (B, hidden_dim) = (32, 16) a₁ = ReLU(z₁) (B, hidden_dim) = (32, 16) z₂ = a₁ W₂ + b₂ (B, out_dim) = (32, 3) # logits p = softmax(z₂) (B, out_dim) = (32, 3) # row sums to 1 L = − (1/B) Σᵢ log p_{i, yᵢ} # scalar

That's the whole forward pass: two matmuls, one ReLU, one softmax, one cross-entropy reduction. Every transformer FFN follows the same shape — they just have bigger numbers.

Parameter count for this toy net: (4·16 + 16) + (16·3 + 3) = 80 + 51 = 131 trainable scalars.

Pin this picture. Trace each tensor's shape with your finger. The transformer FFN is the same diagram with d_model = 4096 and a smooth activation in the middle.

def forward(X):
    z1 = X @ W1 + b1        # (B, hidden_dim)
    a1 = np.maximum(0, z1)  # ReLU
    z2 = a1 @ W2 + b2       # (B, out_dim) — logits
    return z1, a1, z2       # cache for backward

We return the intermediates z1 and a1, not just the output z2. The backward pass needs them. This is the same trick autograd does under the hood — store activations during forward so you can multiply by them on the way back.

Memory footprint of the forward pass

Activations dominate VRAM during training. Quick math for a Transformer block at B=8, T=2048, D=4096:

hidden activation a₁ shape: (B, T, 4D) = (8, 2048, 16384) size in fp16: ~537 MB per layer

For 32 layers stacked, that's ~17 GB of activations alone, just for the FFN intermediates. This is why gradient checkpointing (recomputing some activations on backward instead of storing them) is everywhere in LLM training. We'll meet it on Day 10.

Why this matters for inference

The transformer FFN sublayer — the one you'll actually be serving — is exactly this two-linear-layer MLP, just with bigger dimensions. In LLaMA-7B, the FFN expands to 4 × d_model = 16384 hidden units and uses SiLU (a smooth ReLU lookalike). Every inference request runs:

gate = SiLU(X W_gate) # (B, T, 16384) ← one big matmul up = X W_up # (B, T, 16384) ← second big matmul out = (gate ⊙ up) W_down # (B, T, 4096) ← third big matmul ← three matmuls per layer, ~two-thirds of all FLOPs in a forward pass

Params vs FLOPs. One FFN layer in a 7B model has ~3 × 4096 × 16384 ≈ 200M parameters. At batch size 1 (common during inference), every parameter is loaded from memory but used for just a handful of multiplications — so FFN layers are memory-bandwidth bound, not compute bound. Reducing FFN weights (quantization, pruning) is the single biggest lever on serving cost. You'll need to understand this MLP before any of that makes sense.

The Loss

Cross-entropy on raw logits, every time.

For classification, cross-entropy is the universal loss. From information theory (Day 1, Shannon's entropy): the average number of bits you need to encode the true distribution if you assume the model's distribution. Lower is better; zero is perfection.

Formally, for one example with true class y and predicted distribution p:

CE(p, y) = − Σ_c 1[c = y] · log p_c = − log p_y

That log p_y is the only term that survives because all other indicators are zero. Average over the batch:

L = − (1/B) Σᵢ log p_{i, yᵢ} # scalar

If the model is perfectly confident on the correct class (p_y = 1), the loss is −log 1 = 0. If it's spreading mass evenly across C classes (p_y = 1/C), the loss is log C — a useful baseline. For C = 10 (MNIST), log 10 ≈ 2.30. A randomly initialized 10-class classifier should output a loss of ~2.3 on the first batch. If yours doesn't, you've already got a bug.

def cross_entropy_loss(logits, y):
    # logits: (B, C); y: (B,) integer labels
    z = logits - logits.max(axis=-1, keepdims=True)   # stability
    log_sum_exp = np.log(np.exp(z).sum(axis=-1, keepdims=True))
    log_p = z - log_sum_exp                            # (B, C)
    nll   = -log_p[np.arange(len(y)), y]               # (B,)
    return nll.mean(), softmax(logits)

Practical note. Always compute cross-entropy on raw logits (F.cross_entropy(logits, y) in PyTorch). Computing softmax then log separately loses precision — log(softmax(z)) is mathematically the same but numerically worse. The fused log_softmax + nll_loss (which cross_entropy uses) is the right call.

Why softmax + cross-entropy gives the cleanest possible gradient

Here's the magic. When you compose p = softmax(z) and L = − log p_y, the derivative simplifies dramatically:

∂L/∂zᵢ = pᵢ − 1[i = y]

In words: the gradient of the loss with respect to the logits is just the predicted distribution minus the one-hot true label. No softmax Jacobian. No log-derivative. Just (p − y).

For a batch, divide by B because we averaged the loss:

dz₂ = (p − one_hot(y)) / B # (B, out_dim)

This is the gradient that starts the entire backward walk. Memorize it. It's the reason every classifier ever has trained as fast as it does.

The one-line derivation:

L = − log p_y where p_y = e^{z_y} / Σⱼ e^{z_j} ∂p_y / ∂z_i = p_y · (1[i = y] − p_i) # softmax Jacobian ∂L / ∂z_i = − (1/p_y) · ∂p_y/∂z_i # log-chain = − (1[i = y] − p_i) = p_i − 1[i = y] # ✓

This clean form is one reason cross-entropy is the universal classification loss. Squared error on probabilities (mean-squared error on softmax outputs) gives a sloppier gradient with (p − y) · p · (1 − p) factors that vanish at the extremes — bad for optimization.

The Backward Pass

Six lines of chain rule. Derive them once.

We just got dz₂ = (p − one_hot(y)) / B. From there, the chain rule goes layer by layer.

The cheat sheet — derivatives of every primitive in our forward pass:

Forward op	Local derivative	Notes
`z₂ = a₁ W₂ + b₂`	`∂L/∂W₂ = a₁ᵀ · dz₂`, `∂L/∂b₂ = Σ dz₂`, `∂L/∂a₁ = dz₂ · W₂ᵀ`	Standard linear layer backward
`a₁ = ReLU(z₁)`	`∂a₁/∂z₁ = 1[z₁ > 0]`	Element-wise mask
`z₁ = X W₁ + b₁`	`∂L/∂W₁ = Xᵀ · dz₁`, `∂L/∂b₁ = Σ dz₁`, `∂L/∂X = dz₁ · W₁ᵀ`	Same shape; upstream is dz₁

Apply them in reverse. With dz₂ already in hand:

dW₂ = a₁ᵀ @ dz₂ (hidden, out) db₂ = dz₂.sum(axis=0) (out,) da₁ = dz₂ @ W₂ᵀ (B, hidden) dz₁ = da₁ * (z₁ > 0) (B, hidden) # ReLU mask dW₁ = Xᵀ @ dz₁ (in, hidden) db₁ = dz₁.sum(axis=0) (hidden,)

Six lines. Every gradient autograd would compute, by hand. Match them to your forward pass shapes — there's exactly one transposition per matmul, and biases sum across the batch axis (because each example contributed independently to the bias).

The full backward pass on a 2-layer MLP. Six matmul-or-reduction lines. Verify yourself: every gradient PyTorch's autograd computes is identical.

Why the bias gradient is a sum

Each row i of the batch contributes bᵢ independently. So ∂L/∂b = Σᵢ ∂L/∂(z + b)ᵢ = Σᵢ dzᵢ. The gradient w.r.t. the broadcasted dimension is the sum across that dimension. This is a general rule — the gradient w.r.t. a broadcasted operand is the reduction of the gradient over the broadcast axes. PyTorch and MLX both implement this correctly.

Why the gradient is `Xᵀ · dz`

For z = X · W, we have zₖₗ = Σⱼ Xₖⱼ · Wⱼₗ. So ∂zₖₗ/∂Wᵢⱼ = Xₖᵢ · 1[j = ℓ]. Chain rule:

∂L/∂Wᵢⱼ = Σₖₗ (∂L/∂zₖₗ)·(∂zₖₗ/∂Wᵢⱼ) = Σₖ Xₖᵢ · dzₖⱼ = (Xᵀ · dz)ᵢⱼ

In one line: transpose the input, matmul with the upstream gradient, get the weight gradient. Same shape as W. Dimension check: (d_in, B) @ (B, d_out) = (d_in, d_out) ✓.

Initialization

A poor init kills training before step 1.

You can have the right architecture, the right loss, the right optimizer, and still training won't converge — because the weights started at the wrong scale.

The intuition: a forward pass through L linear layers multiplies the input variance by roughly (σ²·n)ᴸ where σ is the weight scale and n is the fan-in. If σ²·n > 1, activations explode (NaNs by layer 30). If σ²·n < 1, activations vanish (every neuron outputs roughly zero by layer 30, no gradient flows back). You need σ²·n ≈ 1.

There's a similar story on the backward pass. With activation f, the variance of gradients flowing backward through the layer is multiplied by (σ²·n_out · 𝔼[f'(z)²]). So the "right" σ depends on the activation function:

Activation	Best init (variance)	Common name	Used with
`tanh`, `sigmoid`	`σ² = 1 / fan_in`	Xavier / Glorot	RNNs, older models
`ReLU`, leaky ReLU	`σ² = 2 / fan_in`	He / Kaiming	CNNs, MLPs, transformer FFN
Generic linear / no activation	`σ² = 1 / fan_in`	Xavier-uniform	Final classifier head
Modern transformer linears	`σ ≈ 0.02` (constant)	GPT-2 init	LLaMA, GPT-2/3, etc.

Why 2 / fan_in for ReLU? Because ReLU zeroes out half the activations in expectation, so to keep the active half's variance unit, you need to double the initial weight variance. Xavier was derived for symmetric tanh; He generalized it to ReLU's asymmetry.

# He init for ReLU networks
W1 = rng.standard_normal((in_dim, hidden_dim)) * np.sqrt(2.0 / in_dim)
b1 = np.zeros(hidden_dim)               # bias init = 0 (always)

Bias init is always zero. There's nothing asymmetric about a bias; starting it at zero gives the network a clean reference and lets the data move it.

Modern transformer trick. GPT-2 and most descendants use a constant σ = 0.02 for all linears, and additionally scale residual-branch outputs by 1/√(2L) (where L is layer count) to prevent output magnitudes from blowing up as you stack more blocks. We'll see this on Day 12 when we build a real GPT.

A botched init is among the most common training-failure modes — and one of the easiest to debug. If your loss starts at 50 instead of log(C), your init is wrong.

The Update Rule

SGD: w ← w − η · ∇w. Repeat.

Given gradients, stochastic gradient descent moves each parameter in the negative-gradient direction:

W ← W − η · dW b ← b − η · db

η (the Greek letter eta) is the learning rate — how large a step to take. Concrete numbers (η = 0.05, dW₂[0,0] = 0.13):

W₂[0,0]_new = W₂[0,0]_old − 0.05 · 0.13 = W₂[0,0]_old − 0.0065

Each step nudges every parameter by η · grad. Repeat thousands of times.

The loss surface as a bowl. A too-small learning rate crawls. A too-large one bounces past the minimum and diverges. Just right: smooth descent. This is why the default learning rate for SGD on MLPs is 0.01–0.1, and why you always check the initial loss curve first.

Learning rate intuition. Too low: training crawls. Too high: loss bounces or NaNs out — you're stepping past the minimum. Empirical default for SGD on small MLPs: η ∈ [0.01, 0.1]. For Adam (Day 4): η ∈ [1e-4, 3e-4]. Loss curves diverging by step 2 = your η is way too high.

W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2

That's vanilla SGD. Day 4 replaces it with momentum, then Adam, then AdamW — all small variations on this single line.

Mini-batches: why we don't use the full dataset

Three options for choosing how many examples per gradient step:

Batch gradient descent — full dataset, one step per epoch. Slow per step, smooth gradient.
Stochastic gradient descent — one example per step. Fast per step, very noisy gradient.
Mini-batch SGD — B examples per step (typically 32–512). The compromise everyone uses.

Mini-batches give you three benefits at once:

Statistical: noise helps escape sharp local minima.
Computational: GPUs are massively parallel; doing one example wastes 99% of the silicon. Doing 256 in parallel keeps the kernels warm.
Memory: full-batch on a 1B-example dataset would need terabytes of RAM. B = 256 fits.

Modern LLM training typically uses batch sizes of 1–4 million tokens (collected via gradient accumulation across many GPUs) — the noise is so small that AdamW essentially behaves like full-batch.

NumPy MLP — End to End

Type it. Run it. Watch loss drop.

This is the lesson. Type it. Run it. Watch loss drop.

import numpy as np

rng = np.random.default_rng(42)

# ---- Hyperparameters ----
in_dim, hidden_dim, out_dim = 4, 16, 3
batch_size = 32
lr = 0.05
n_steps = 2000

# ---- Data: 3 Gaussian blobs in 4D ----
N = 1024
class_centers = rng.standard_normal((out_dim, in_dim)) * 3.0
y = rng.integers(0, out_dim, N)
X = rng.standard_normal((N, in_dim)) + class_centers[y]

# ---- Init: He for ReLU, zero bias ----
W1 = rng.standard_normal((in_dim,    hidden_dim)) * np.sqrt(2.0 / in_dim)
b1 = np.zeros(hidden_dim)
W2 = rng.standard_normal((hidden_dim, out_dim))   * np.sqrt(2.0 / hidden_dim)
b2 = np.zeros(out_dim)

def relu(z): return np.maximum(0, z)
def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z); return e / e.sum(axis=-1, keepdims=True)

def forward(X):
    z1 = X @ W1 + b1
    a1 = relu(z1)
    z2 = a1 @ W2 + b2
    return z1, a1, z2

def cross_entropy(logits, y):
    p = softmax(logits)
    log_p_correct = np.log(p[np.arange(len(y)), y] + 1e-12)
    return -log_p_correct.mean(), p

def backward(X, y, z1, a1, p):
    B = X.shape[0]
    dz2 = p.copy()
    dz2[np.arange(B), y] -= 1.0
    dz2 /= B
    dW2 = a1.T @ dz2
    db2 = dz2.sum(axis=0)
    da1 = dz2 @ W2.T
    dz1 = da1 * (z1 > 0)
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)
    return dW1, db1, dW2, db2

# ---- Training loop ----
for step in range(n_steps):
    idx = rng.choice(N, batch_size, replace=False)
    Xb, yb = X[idx], y[idx]

    z1, a1, z2 = forward(Xb)
    loss, p = cross_entropy(z2, yb)
    dW1, db1, dW2, db2 = backward(Xb, yb, z1, a1, p)

    W1 -= lr * dW1;  b1 -= lr * db1
    W2 -= lr * dW2;  b2 -= lr * db2

    if step % 200 == 0:
        acc = (z2.argmax(-1) == yb).mean()
        print(f"step {step:4d}   loss {loss:.4f}   train-acc {acc:.3f}")

What you should see — three rules of thumb that work for almost any MLP:

Initial loss ≈ log(out_dim). Here log(3) ≈ 1.10. If your first step prints something wildly different (50? NaN?) your init is broken.
Loss drops monotonically over the first 100 steps. Not strictly — there's batch noise — but the trend is clearly down.
Loss approaches 0 and accuracy approaches 1.0 by step ~1000 on this synthetic dataset. Linearly separable clusters; a 2-layer MLP solves it trivially.

If any of those three are off, stop and debug:

Loss NaN: lower lr, double-check init.
Loss flat: probably lr too low or weights stuck at zero (no init or all-zero init).
Loss climbing: lr way too high, you're bouncing past the minimum.

PyTorch — The Easy Way

Same network. Autograd does the backward.

Identical math, autograd does the backward.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)

in_dim, hidden_dim, out_dim = 4, 16, 3
batch_size = 32

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

model = MLP()
opt   = torch.optim.SGD(model.parameters(), lr=0.05)

# (Generate Gaussian-blob data as in the NumPy version.)
X = torch.randn(1024, in_dim) + class_centers[y]
y = torch.from_numpy(y)

for step in range(2000):
    idx = torch.randperm(len(X))[:batch_size]
    Xb, yb = X[idx], y[idx]

    logits = model(Xb)
    loss   = F.cross_entropy(logits, yb)            # softmax + NLL fused

    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % 200 == 0:
        acc = (logits.argmax(-1) == yb).float().mean()
        print(f"step {step:4d}   loss {loss.item():.4f}   train-acc {acc:.3f}")

Same hyperparameters, same data, same outcome. Side-by-side, the differences are mechanical:

What	NumPy	PyTorch
Parameter container	`W1, b1, W2, b2` globals	`model.parameters()` from `nn.Module`
Init	manual `np.sqrt(2/fan_in)`	automatic (Kaiming uniform default)
Forward	`forward(X)` returning intermediates	`model(X)` returning logits
Loss	hand-written cross-entropy	`F.cross_entropy(logits, y)` (fused)
Backward	`backward(...)` returning grads	`loss.backward()`
Update	`W1 -= lr * dW1; …`	`opt.step()`
Zero grads	n/a	`opt.zero_grad()` before `backward`

nn.Module does three things for you: tracks parameters, registers them with the optimizer, routes the forward pass through registered submodules. Everything else is the same.

The four-line heartbeat — opt.zero_grad() → loss.backward() → opt.step() (preceded by the forward) — is what every PyTorch training loop ever written looks like. Same on Day 9 (tiny GPT), Day 10 (mixed precision), Day 11 (distributed). Memorize it.

MLX — The Apple-Silicon Equivalent

Functional autograd on unified memory.

For Apple Silicon, MLX is the native equivalent. The structure is functional — gradients are returned from a function, not deposited on .grad.

import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

mx.random.seed(42)

class MLP(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def __call__(self, x):
        return self.fc2(nn.relu(self.fc1(x)))

model = MLP(4, 16, 3)
opt   = optim.SGD(learning_rate=0.05)

X = mx.random.normal((1024, 4))
y = mx.random.randint(0, 3, (1024,))

def loss_fn(model, x, y):
    return nn.losses.cross_entropy(model(x), y, reduction="mean")

loss_and_grad = nn.value_and_grad(model, loss_fn)

for step in range(2000):
    idx = mx.random.permutation(len(X))[:32]
    Xb, yb = X[idx], y[idx]

    loss, grads = loss_and_grad(model, Xb, yb)
    opt.update(model, grads)
    mx.eval(model.parameters(), opt.state)        # cap deferred graph

    if step % 200 == 0:
        print(f"step {step:4d}   loss {loss.item():.4f}")

Three things to notice vs PyTorch:

No .grad on tensors. loss_and_grad returns the loss and a parameter pytree of gradients in one call.
No zero_grad. Functional API, every call returns fresh grads.
mx.eval after each step. MLX is lazy — without periodic eval, the deferred computation graph grows unboundedly.

Day 2 covered the lazy-eval mental model in depth; we're just applying it.

Diagnostics

Underfitting vs overfitting, by curve shape.

Train and validate. Plot both losses against step.

Three curve shapes. Most training problems show up here within 100 steps — read the curve before you tune anything else.

Underfitting. Both training and validation loss are high and flat-ish. The model can't capture the pattern — it's too small, the lr is too low, or the features are inadequate. Counters: bigger model, longer training, better features.

Overfitting. Training loss keeps dropping; validation loss bottoms out and climbs back up. The model has memorized the training set and is now noise-fitting. Counters: more data (the cleanest fix), regularization (weight decay, dropout), early stopping at the validation minimum, smaller model.

Just right. Both losses drop, validation tracks training, the gap is small. This is what you want.

LLM pre-training is a curious case: the data set is so much larger than the model's capacity that overfitting is rare — the model never sees the same example twice. The opposite problem dominates: how to train more efficiently on a fixed compute budget. We'll get there on Day 8 when we cover scaling laws.

Capacity: depth vs width

How much compute should you put where? Two extremes:

Wider, shallower (e.g., a 1-layer net with 100k hidden neurons). Easy to optimize, every neuron sees the input. Limited expressivity for hierarchical features.
Deeper, narrower (e.g., a 50-layer net with 256 neurons). Harder to optimize (vanishing gradients, residual connections become essential), but each layer composes the previous one's features into higher-level abstractions.

Empirically, depth wins for most real tasks — but only if you can train it. Modern transformers go ~30-100 layers deep with d_model = 4096. The reason this works: residual connections (Day 7), pre-LayerNorm (Day 7), and well-tuned init (this lesson). Without all three, training a 30-layer transformer fails.

Quick capacity rules of thumb:

MLP (this lesson): 1-3 hidden layers, 16-512 hidden units. Plenty for tabular data.
CNN (vision, not our path): tens of layers, residual connections.
Transformer (our path): 12-100 layers, d_model 768-8192, 4 · d_model FFN width.

Exercise

Type, perturb, port, debug.

Run these in a Python REPL or notebook. Type the code; don't copy-paste. The point is to feel the shapes.

1. Type the NumPy MLP

Run it. Watch loss drop. Then perturb:

Set lr = 5.0. What does the loss curve look like? Why?
Set lr = 1e-5. How many steps before any progress?
Set hidden_dim = 1. Why does accuracy plateau at ~33%?
Skip the He init (use plain randn). Does training still converge? How many steps does it take?

2. Add a third hidden layer

Extend the NumPy version to d_in → 16 → 16 → 3. Derive the new backward by hand on paper. Implement it, train it, verify it converges.

3. Replace ReLU with tanh

In both NumPy and PyTorch versions. The derivative of tanh(z) is 1 − tanh²(z). Update the NumPy backward; PyTorch handles it for you. Compare convergence speeds — tanh is typically slower in deep nets.

4. Why does the loss start at ~1.10?

Predict on paper, then verify. (Hint: random init means roughly uniform predictions across 3 classes.)

5. Real data — MNIST

from torchvision import datasets, transforms
from torch.utils.data import DataLoader

train_ds = datasets.MNIST('./data', train=True, download=True,
                          transform=transforms.ToTensor())
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)

Flatten the 28×28 image to a 784-vector. Build a 784 → 256 → 10 MLP. Train for 5 epochs. Target test accuracy: ≥ 95%. Plot training and validation loss against step. Is your network overfitting? Add weight decay (SGD(..., weight_decay=1e-4)) and re-plot.

6. (Mac) Port to MLX

Translate your MNIST MLP from PyTorch to MLX. Compare wall-clock training time. (M-series Macs typically train a 256-hidden MLP on MNIST faster than CUDA on integrated GPUs and on par with mid-range discrete GPUs.)

7. Pick the bug

Five common training-loop bugs. Read the symptoms. Predict which line caused each:

Symptom	Likely bug
Loss is `NaN` after step 1	(a) (b) (c) (d) (e)
Loss is constant for 1000 steps	(a) (b) (c) (d) (e)
Loss decreases on train but val accuracy is 10% (random)	(a) (b) (c) (d) (e)
Train accuracy is 99%, val is 60%	(a) (b) (c) (d) (e)
Loss starts at ~50 instead of ~2.30	(a) (b) (c) (d) (e)

(a) Forgot opt.zero_grad() (grads accumulate)
(b) Used lr = 100
(c) Used lr = 1e-9
(d) Forgot to scale init by √(2/n) → init too large
(e) Forgot to shuffle the dataset

Go deeper.

Hand-picked references for this lesson. Free where possible. Books and papers where the depth is irreplaceable.

YouTube · Free · 2.5h

Karpathy — building micrograd

The single best resource for today's lesson. Builds an autograd-equipped neural network from scratch in pure Python from absolute zero.

Watch on YouTube

GitHub · 150 lines

karpathy/micrograd

Scalar-valued autograd in 150 lines. Read end to end after the video.

Open repo

Free book · Online

Nielsen — Neural Networks and Deep Learning

Chapters 1-2. The clearest backprop introduction ever written.

Read online

Free book · Reference

Goodfellow et al. — Deep Learning Ch. 6

Deep Feedforward Networks. The canonical textbook chapter.

Read online

Book · Manning

Raschka — Build an LLM From Scratch

Chapter 2. Closest book to this curriculum.

Open page

Video · Visual

3Blue1Brown — What is backpropagation really doing?

Episode 4 of the neural networks series. Animations explain the chain rule better than equations.

Watch on YouTube

Browser · Interactive

TensorFlow Playground

Train a 2-layer net in your browser. Watch decision boundaries form. Pure intuition.

Open playground

Browser · Karpathy

ConvNetJS demos

In-browser net visualizer. Watch a network learn live.

Open demo

Paper · 1986

Rumelhart, Hinton, Williams — Backprop

The Nature paper that popularized backprop. Read for historical context.

Open paper

Paper · 2015

He et al. — Delving Deep into Rectifiers

Where He init comes from. Section 2.2 has the variance derivation.

Read on arXiv

Paper · 2010

Glorot & Bengio — Difficulty of training deep nets

Where Xavier init comes from. The paper that started the init revolution.

Open paper

Paper · 2016

Hendrycks & Gimpel — GELU

Where the activation in GPT-2 / BERT comes from.

Read on arXiv

Paper · 2017

Ramachandran, Zoph, Le — Swish

Where SiLU/Swish comes from. Used in LLaMA, Mistral, Qwen FFN.

Read on arXiv

Blog · Visual

Olah — Calculus on Computational Graphs

Christopher Olah's clean visual explanation of forward vs reverse mode autodiff.

Read post

Source · PyTorch

torch.nn.Linear source

The actual nn.Linear you use. About 100 lines, including init.

Open file

Source · MLX

mlx.nn.Linear source

MLX's equivalent linear layer. Same architecture, slightly different conventions.

Open file

Self-Check

Ten questions before moving on.

Close the page and answer from memory. If you can't, re-read the relevant section.

Two stacked linear layers Y = (X W₁) W₂ are equivalent to a single linear layer with weight matrix W = ?. Why does this break with a non-linearity in between?
Write out the derivative of softmax + cross-entropy w.r.t. the logits. Why is this expression so much simpler than computing each piece separately?
He initialization scales weights by √(2/fan_in). Where does the 2 come from?
In our manual backward, why does db = dz.sum(axis=0)? What's special about a bias?
If your training loss is exactly log(C) for many steps and never moves, what's the most likely bug?
Why is bias init always zero? Why isn't weight init also zero?
Predict the initial loss for a 10-class classifier with no training, randomly initialized.
Your network's final layer outputs logits. Why do we never apply a softmax before the loss in PyTorch?
You scale your input by 1000× without changing weights. What happens to the first layer's pre-activation? To the gradient flowing back? Which init constant should you adjust to compensate?
The Universal Approximation Theorem says a 1-hidden-layer network can approximate any function. Why do we use deep networks anyway?

Neural Networks From Scratch

A Transformer is, fundamentally, MLPs plus attention.

Learning objectives

Two ops, repeated. Forward builds. Backward attributes.

A neuron is a dot product, a bias, and an activation.

From neuron to linear layer

What about bias terms?

The source of expressivity is the non-linearity.

Why we need non-linearity

The cast of activations

The vanishing gradient problem (and why ReLU saved deep learning)

Softmax is the universal output

Two matmuls, one ReLU, one softmax, one cross-entropy.

Memory footprint of the forward pass

Why this matters for inference

Cross-entropy on raw logits, every time.

Why softmax + cross-entropy gives the cleanest possible gradient

Six lines of chain rule. Derive them once.

Why the bias gradient is a sum

Why the gradient is Xᵀ · dz

A poor init kills training before step 1.

SGD: w ← w − η · ∇w. Repeat.

Mini-batches: why we don't use the full dataset

Type it. Run it. Watch loss drop.

Same network. Autograd does the backward.

Functional autograd on unified memory.

Underfitting vs overfitting, by curve shape.

Capacity: depth vs width

Type, perturb, port, debug.

1. Type the NumPy MLP

2. Add a third hidden layer

3. Replace ReLU with tanh

4. Why does the loss start at ~1.10?

5. Real data — MNIST

6. (Mac) Port to MLX

7. Pick the bug

Go deeper.

Ten questions before moving on.

Why the gradient is `Xᵀ · dz`