LLM Inference Engineer · Day 03
Day 03 · Week 1 · Foundations
🧠

Neural Networks From Scratch

Build an MLP by hand: forward, loss, manual backward pass. Then the same thing in PyTorch and MLX. Every gradient derived. Every shape labeled.

Time~150 min
DifficultyMedium
PrerequisiteDay 1 + 2
Why This Lesson

A Transformer is, fundamentally, MLPs plus attention.

If the multi-layer perceptron isn't in your bones — if you can't write its forward pass on a whiteboard, derive its backward by hand, predict its shapes, and explain why ReLU rescued deep learning — then transformers will feel mystical for the rest of the curriculum. They aren't. They're MLPs with attention bolted on.

So today we kill the mystery. We do it the slow way first — pure NumPy, no autograd, every gradient derived from the chain rule on Day 1 — and then we re-do it in PyTorch and MLX. The slow way takes 30 lines and proves the libraries aren't magic. The fast way takes 8 lines and is exactly what every modern LLM training loop does.

Concrete reference points for what builds on this lesson:

  • Day 4 — backprop & optimizers — generalizes today's by-hand backward to arbitrary graphs and replaces SGD with Adam/AdamW.
  • Day 6 — attention is, in its inner loop, an MLP with one extra step (the softmax).
  • Day 7 — a Transformer block is Attention → MLP → Attention → MLP → ..., where each MLP is literally the network you build today, just with d_model = 4096 instead of 16.
  • Days 9–10 — when we train tiny GPT, the optimizer touches exactly the kind of state_dict we'll meet today.

If anything below feels fuzzy by the end, the commit is: re-type, don't copy-paste, the NumPy MLP. Type until loss drops. Don't move on until you can re-derive the backward pass on paper.

Learning objectives

  1. Define a neuron in one line of math and a linear layer as a batched matrix multiplication.
  2. Predict the output shape of every layer in a 2- or 3-layer MLP.
  3. Explain why non-linear activations are essential — and back the explanation with a concrete demo that two stacked linears collapse to one.
  4. Compare ReLU, Leaky ReLU, GELU, SiLU, tanh, sigmoid: shape, derivative, when each is preferred, and which one modern LLMs use.
  5. Derive the gradients of an MLP w.r.t. every parameter by hand. Show that softmax + cross-entropy collapses to (p − y)/B.
  6. Implement a 2-layer MLP from scratch in NumPy, train it on a synthetic dataset, watch loss drop.
  7. Reproduce the same model in PyTorch and MLX, line for line, and verify all three converge.
  8. State Xavier vs He initialization, derive the √(2/fan_in) constant, and explain why a poor init breaks training.
  9. Diagnose underfitting vs overfitting from a training/validation loss curve.
  10. Train a 2-layer MLP on MNIST to ≥ 95% test accuracy.
The One-Picture Mental Model

Two ops, repeated. Forward builds. Backward attributes.

Every neural network — including a 70B Transformer — fits this picture:

input x (B, d_in) linear XW₁ + b₁ → (B, h) activation ReLU(z) → (B, h) linear a₁W₂ + b₂ → logits loss cross_entropy → scalar backward: chain rule deposits dL/dW₁, dL/dW₂, dL/db₁, dL/db₂ into .grad batch of inputs non-linearity scalar to differentiate SGD: w ← w − η · ∇w. Repeat.
The whole story: linear → non-linearity → linear → loss. Forward computes values; backward computes gradients via the chain rule; SGD takes one step. Repeat thousands of times.

A 70B-parameter LLM is the same picture with attention slotted in between linear blocks and D = 8192. Don't let the scale fool you — the algorithm is what you're about to write.

The Neuron

A neuron is a dot product, a bias, and an activation.

A neuron is the unit of computation. It does three things:

  1. Take a vector of inputs x = (x₁, …, x_n).
  2. Compute a weighted sum z = w·x + b = Σᵢ wᵢ xᵢ + b.
  3. Pass z through a non-linear activation f, returning y = f(z).
y = activation(w · x + b)

That's it. That's the entire neuron. Geometrically, w · x = ‖w‖‖x‖cos θ, so the dot product measures alignment between w and x. The bias b shifts the activation threshold. The activation f introduces non-linearity.

  • A neuron with no activation is just a linear function of its inputs. Stacking gains nothing.
  • A neuron with f = sigmoid and a 0/1 target is logistic regression.
  • A whole layer is many neurons sharing the same input. Stack them as columns of a weight matrix and the dot products parallelize into one matmul.
x₁ x₂ x₃ w₁ w₂ w₃ b Σ z = w·x + b z f(z) activation y output y = f(z) y = f(w₁x₁ + w₂x₂ + w₃x₃ + b) = f(w · x + b)
One neuron: dot product w · x (a weighted sum — Σ means sum(...)), plus a bias b, then a non-linear function f. Stack d_out of these side-by-side and pack the weights into a matrix: you get a linear layer.

From neuron to linear layer

A linear layer with d_in inputs and d_out outputs is d_out neurons sharing the same input vector. Pack all the weights into a matrix W (a 2D array of floats, shape (d_in, d_out)) and all the biases into a vector b (a 1D array of floats, length d_out). For a batch X (shape (B, d_in)) of B inputs at once:

Y = X W + b (B, d_in) @ (d_in, d_out) + (d_out,) → (B, d_out)

The bias broadcasts across the batch. Same matmul rules as Day 1.

Y = X W + b X (B, d_in) B rows @ W (d_in, d_out) trainable + b (d_out,) broadcasts = Y (B, d_out) B rows Param count: d_in · d_out + d_out trainable scalars per layer. For d_in = 4096, d_out = 4096: ~16.8M params. Stack 32 → 540M. That's a small LLM.
A 4096 × 4096 linear has ~16.8M trainable parameters. LLaMA-7B has ~96 such matrices in the FFN blocks alone.

What about bias terms?

In modern LLMs (LLaMA, Mistral, Qwen), the linear layers in the FFN often have no bias. Two reasons:

  1. Scale. When d_model = 4096, an extra 4096-element vector per linear is rounding error in parameter count, but it's another fragile thing to learn. Empirically, removing biases doesn't hurt loss.
  2. LayerNorm absorbs them. A LayerNorm right before the linear already injects a learned mean shift; another bias is redundant.

GPT-2 had biases. Most post-2022 architectures don't. We'll keep biases today because we're learning, but it's worth knowing the convention shifted.

The first neural network ever built — the Perceptron — was constructed in 1958 by Frank Rosenblatt, not in software but as a physical machine with 400 photocells and motorized weights. The New York Times reported it would soon "be conscious of its existence." Then it failed at XOR. AI entered its first winter.

Activations

The source of expressivity is the non-linearity.

Why we need non-linearity

Stack two linear layers with no activation between them:

y = (X W₁) W₂ = X (W₁ W₂)

Matrix multiplication is associative — so W₁ W₂ is just another matrix. Stacking 100 linear layers without activations collapses to one equivalent linear layer. The whole network can only learn linear functions of its inputs.

The proof is one line. Concrete demo:

import numpy as np
W1 = np.random.randn(4, 16)
W2 = np.random.randn(16, 3)
x  = np.random.randn(4)
print(((x @ W1) @ W2))         # two linear ops
print(x @ (W1 @ W2))           # one linear op — same result
# array agreement: tiny float drift only

Adding any non-linearity between the layers — max(0, ·), tanh, anything — breaks associativity and the network can suddenly approximate functions a single linear layer can't.

Universal Approximation Theorem (Hornik 1989, Cybenko 1989). A feed-forward network with at least one hidden layer and a non-polynomial activation can approximate any continuous function on a compact set, given enough neurons. The theorem doesn't say how many neurons — it might need a billion — but it justifies the architecture. Depth × non-linearity = expressivity.

The cast of activations

Each activation is a scalar function f: ℝ → ℝ applied element-wise to the layer pre-activation z. Their derivatives matter, because backward will multiply by them.

ReLU max(0, x) Leaky ReLU max(0.01x, x) GELU x · Φ(x) SiLU / Swish x · σ(x) tanh (eˣ−e⁻ˣ)/(eˣ+e⁻ˣ) sigmoid 1/(1+e⁻ˣ) All plotted on x ∈ [−4, 4]. Output range varies. Notice: ReLU has a hard kink. GELU/SiLU are smooth ReLU look-alikes. Sigmoid/tanh saturate at ±∞ — the source of the vanishing-gradient problem.
Six activations to know. The smoothness of GELU and SiLU is mostly aesthetic — empirically they each give a small but consistent improvement over ReLU on Transformer FFNs.
ActivationFormulaDerivativeOutput rangeUsed in
ReLUmax(0, x)1 if x > 0 else 0[0, ∞)The default since 2012. Vision, vanilla MLPs, GPT-1/2 FFN.
Leaky ReLUmax(αx, x), α≈0.011 if x > 0 else α(−∞, ∞)When ReLU "dies" (neurons stuck at zero).
GELUx · Φ(x) (Φ = Gaussian CDF)Φ(x) + x · φ(x)≈[−0.17, ∞)BERT, GPT-2, GPT-3. Smooth ReLU.
SiLU / Swishx · σ(x)σ(x) + x·σ(x)(1−σ(x))≈[−0.28, ∞)LLaMA, Mistral, Qwen FFN. Often inside SwiGLU.
tanh(eˣ − e⁻ˣ)/(eˣ + e⁻ˣ)1 − tanh²(x)(−1, 1)RNNs, GRU/LSTM gates, sometimes policy heads.
Sigmoid1 / (1 + e⁻ˣ)σ(x)(1 − σ(x))(0, 1)Final layer for binary classification; gates inside LSTMs.
Softmaxeˣⁱ / Σⱼ eˣʲvector-valued (see below)row sums to 1The output of every classifier. The weights inside attention.

The vanishing gradient problem (and why ReLU saved deep learning)

When you stack tanh activations and the pre-activations get large, tanh'(z) = 1 − tanh²(z) ≈ 0 for |z| > 2. Multiplying small gradients through 50 layers of backward gives you something effectively zero. Early-layer parameters stop receiving useful updates. Training stalls.

ReLU's derivative is exactly 1 for x > 0 — no decay. Stacking 100 ReLU layers, the gradient still has full magnitude in the active half. This single change, plus better initialization (next section), is what made deep nets viable. Before 2012, "deep" meant 3-5 layers; after, hundreds.

Softmax is the universal output

For multi-class classification with C classes, the network outputs C real numbers — logits — one per class. Softmax converts them into a probability distribution:

softmax(z)ᵢ = eᶻⁱ / Σⱼ eᶻʲ

Two important properties:

  • Each output is in (0, 1) and the row sums to 1. So the row is a valid probability distribution.
  • Translation-invariant. softmax(z) = softmax(z − max(z)). We always subtract the max in practice for numerical stability — otherwise eᶻ overflows.
def softmax_stable(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical safety
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

Softmax has a vector-valued derivative (a Jacobian, not a scalar). We never compute it directly — we always pair it with cross-entropy loss, and the product simplifies to something gorgeous. Two sections from now.

The Forward Pass

Two matmuls, one ReLU, one softmax, one cross-entropy.

Let's build a 2-layer MLP for classification:

in_dim = 4 # features per example hidden_dim = 16 out_dim = 3 # classes batch_size = 32

Forward pass, with every intermediate shape labeled:

X (B, in_dim) = (32, 4) # batch of inputs z₁ = X W₁ + b₁ (B, hidden_dim) = (32, 16) a₁ = ReLU(z₁) (B, hidden_dim) = (32, 16) z₂ = a₁ W₂ + b₂ (B, out_dim) = (32, 3) # logits p = softmax(z₂) (B, out_dim) = (32, 3) # row sums to 1 L = − (1/B) Σᵢ log p_{i, yᵢ} # scalar

That's the whole forward pass: two matmuls, one ReLU, one softmax, one cross-entropy reduction. Every transformer FFN follows the same shape — they just have bigger numbers.

Parameter count for this toy net: (4·16 + 16) + (16·3 + 3) = 80 + 51 = 131 trainable scalars.

Forward pass shapes (B = 32): X (32, 4) XW₁ + b₁ = z₁ (32, 16) ReLU(z₁) a₁ : (32, 16) a₁W₂ + b₂ = z₂ (32, 3) logits softmax p : (32, 3) L scalar Per-layer params: W₁ (4, 16) + b₁ (16,) = 80 W₂ (16, 3) + b₂ (3,) = 51 total = 131 trainable scalars
Pin this picture. Trace each tensor's shape with your finger. The transformer FFN is the same diagram with d_model = 4096 and a smooth activation in the middle.
def forward(X):
    z1 = X @ W1 + b1        # (B, hidden_dim)
    a1 = np.maximum(0, z1)  # ReLU
    z2 = a1 @ W2 + b2       # (B, out_dim) — logits
    return z1, a1, z2       # cache for backward

We return the intermediates z1 and a1, not just the output z2. The backward pass needs them. This is the same trick autograd does under the hood — store activations during forward so you can multiply by them on the way back.

Memory footprint of the forward pass

Activations dominate VRAM during training. Quick math for a Transformer block at B=8, T=2048, D=4096:

hidden activation a₁ shape: (B, T, 4D) = (8, 2048, 16384) size in fp16: ~537 MB per layer

For 32 layers stacked, that's ~17 GB of activations alone, just for the FFN intermediates. This is why gradient checkpointing (recomputing some activations on backward instead of storing them) is everywhere in LLM training. We'll meet it on Day 10.

Why this matters for inference

The transformer FFN sublayer — the one you'll actually be serving — is exactly this two-linear-layer MLP, just with bigger dimensions. In LLaMA-7B, the FFN expands to 4 × d_model = 16384 hidden units and uses SiLU (a smooth ReLU lookalike). Every inference request runs:

gate = SiLU(X W_gate) # (B, T, 16384) ← one big matmul up = X W_up # (B, T, 16384) ← second big matmul out = (gate ⊙ up) W_down # (B, T, 4096) ← third big matmul ← three matmuls per layer, ~two-thirds of all FLOPs in a forward pass

Params vs FLOPs. One FFN layer in a 7B model has ~3 × 4096 × 16384 ≈ 200M parameters. At batch size 1 (common during inference), every parameter is loaded from memory but used for just a handful of multiplications — so FFN layers are memory-bandwidth bound, not compute bound. Reducing FFN weights (quantization, pruning) is the single biggest lever on serving cost. You'll need to understand this MLP before any of that makes sense.

The Loss

Cross-entropy on raw logits, every time.

For classification, cross-entropy is the universal loss. From information theory (Day 1, Shannon's entropy): the average number of bits you need to encode the true distribution if you assume the model's distribution. Lower is better; zero is perfection.

Formally, for one example with true class y and predicted distribution p:

CE(p, y) = − Σ_c 1[c = y] · log p_c = − log p_y

That log p_y is the only term that survives because all other indicators are zero. Average over the batch:

L = − (1/B) Σᵢ log p_{i, yᵢ} # scalar

If the model is perfectly confident on the correct class (p_y = 1), the loss is −log 1 = 0. If it's spreading mass evenly across C classes (p_y = 1/C), the loss is log C — a useful baseline. For C = 10 (MNIST), log 10 ≈ 2.30. A randomly initialized 10-class classifier should output a loss of ~2.3 on the first batch. If yours doesn't, you've already got a bug.

def cross_entropy_loss(logits, y):
    # logits: (B, C); y: (B,) integer labels
    z = logits - logits.max(axis=-1, keepdims=True)   # stability
    log_sum_exp = np.log(np.exp(z).sum(axis=-1, keepdims=True))
    log_p = z - log_sum_exp                            # (B, C)
    nll   = -log_p[np.arange(len(y)), y]               # (B,)
    return nll.mean(), softmax(logits)

Practical note. Always compute cross-entropy on raw logits (F.cross_entropy(logits, y) in PyTorch). Computing softmax then log separately loses precision — log(softmax(z)) is mathematically the same but numerically worse. The fused log_softmax + nll_loss (which cross_entropy uses) is the right call.

Why softmax + cross-entropy gives the cleanest possible gradient

Here's the magic. When you compose p = softmax(z) and L = − log p_y, the derivative simplifies dramatically:

∂L/∂zᵢ = pᵢ − 1[i = y]

In words: the gradient of the loss with respect to the logits is just the predicted distribution minus the one-hot true label. No softmax Jacobian. No log-derivative. Just (p − y).

For a batch, divide by B because we averaged the loss:

dz₂ = (p − one_hot(y)) / B # (B, out_dim)

This is the gradient that starts the entire backward walk. Memorize it. It's the reason every classifier ever has trained as fast as it does.

The one-line derivation:

L = − log p_y where p_y = e^{z_y} / Σⱼ e^{z_j} ∂p_y / ∂z_i = p_y · (1[i = y] − p_i) # softmax Jacobian ∂L / ∂z_i = − (1/p_y) · ∂p_y/∂z_i # log-chain = − (1[i = y] − p_i) = p_i − 1[i = y] # ✓

This clean form is one reason cross-entropy is the universal classification loss. Squared error on probabilities (mean-squared error on softmax outputs) gives a sloppier gradient with (p − y) · p · (1 − p) factors that vanish at the extremes — bad for optimization.

Cross-entropy loss comes directly from Claude Shannon's 1948 information theory. Shannon was answering a different question (how to compress messages efficiently); the same math measures how surprised your model is by the correct answer. Low cross-entropy = the model wasn't surprised. Compression and intelligence are the same problem in disguise.

The Backward Pass

Six lines of chain rule. Derive them once.

We just got dz₂ = (p − one_hot(y)) / B. From there, the chain rule goes layer by layer.

The cheat sheet — derivatives of every primitive in our forward pass:

Forward opLocal derivativeNotes
z₂ = a₁ W₂ + b₂∂L/∂W₂ = a₁ᵀ · dz₂, ∂L/∂b₂ = Σ dz₂, ∂L/∂a₁ = dz₂ · W₂ᵀStandard linear layer backward
a₁ = ReLU(z₁)∂a₁/∂z₁ = 1[z₁ > 0]Element-wise mask
z₁ = X W₁ + b₁∂L/∂W₁ = Xᵀ · dz₁, ∂L/∂b₁ = Σ dz₁, ∂L/∂X = dz₁ · W₁ᵀSame shape; upstream is dz₁

Apply them in reverse. With dz₂ already in hand:

dW₂ = a₁ᵀ @ dz₂ (hidden, out) db₂ = dz₂.sum(axis=0) (out,) da₁ = dz₂ @ W₂ᵀ (B, hidden) dz₁ = da₁ * (z₁ > 0) (B, hidden) # ReLU mask dW₁ = Xᵀ @ dz₁ (in, hidden) db₁ = dz₁.sum(axis=0) (hidden,)

Six lines. Every gradient autograd would compute, by hand. Match them to your forward pass shapes — there's exactly one transposition per matmul, and biases sum across the batch axis (because each example contributed independently to the bias).

Backward pass — gradients flow right → left X z₁ a₁ z₂ p L seed = 1 dz₂ = (p − y)/B da₁ = dz₂·W₂ᵀ dz₁ = da₁⊙1[z₁>0] dX = dz₁·W₁ᵀ Parameter gradients (deposited into .grad): dW₂ = a₁ᵀ · dz₂ db₂ = dz₂.sum(0) dW₁ = Xᵀ · dz₁ db₁ = dz₁.sum(0) Each backward op multiplies an upstream gradient by a local Jacobian. Cost ≈ forward.
The full backward pass on a 2-layer MLP. Six matmul-or-reduction lines. Verify yourself: every gradient PyTorch's autograd computes is identical.

Why the bias gradient is a sum

Each row i of the batch contributes bᵢ independently. So ∂L/∂b = Σᵢ ∂L/∂(z + b)ᵢ = Σᵢ dzᵢ. The gradient w.r.t. the broadcasted dimension is the sum across that dimension. This is a general rule — the gradient w.r.t. a broadcasted operand is the reduction of the gradient over the broadcast axes. PyTorch and MLX both implement this correctly.

Why the gradient is Xᵀ · dz

For z = X · W, we have zₖₗ = Σⱼ Xₖⱼ · Wⱼₗ. So ∂zₖₗ/∂Wᵢⱼ = Xₖᵢ · 1[j = ℓ]. Chain rule:

∂L/∂Wᵢⱼ = Σₖₗ (∂L/∂zₖₗ)·(∂zₖₗ/∂Wᵢⱼ) = Σₖ Xₖᵢ · dzₖⱼ = (Xᵀ · dz)ᵢⱼ

In one line: transpose the input, matmul with the upstream gradient, get the weight gradient. Same shape as W. Dimension check: (d_in, B) @ (B, d_out) = (d_in, d_out) ✓.

ReLU seems obvious now, but neural networks used tanh and sigmoid almost exclusively until Glorot et al. 2011 showed ReLU dramatically improved deep network training. The fix for the vanishing gradient problemmax(0, x) — was hiding in plain sight for decades.

Initialization

A poor init kills training before step 1.

You can have the right architecture, the right loss, the right optimizer, and still training won't converge — because the weights started at the wrong scale.

The intuition: a forward pass through L linear layers multiplies the input variance by roughly (σ²·n)ᴸ where σ is the weight scale and n is the fan-in. If σ²·n > 1, activations explode (NaNs by layer 30). If σ²·n < 1, activations vanish (every neuron outputs roughly zero by layer 30, no gradient flows back). You need σ²·n ≈ 1.

There's a similar story on the backward pass. With activation f, the variance of gradients flowing backward through the layer is multiplied by (σ²·n_out · 𝔼[f'(z)²]). So the "right" σ depends on the activation function:

ActivationBest init (variance)Common nameUsed with
tanh, sigmoidσ² = 1 / fan_inXavier / GlorotRNNs, older models
ReLU, leaky ReLUσ² = 2 / fan_inHe / KaimingCNNs, MLPs, transformer FFN
Generic linear / no activationσ² = 1 / fan_inXavier-uniformFinal classifier head
Modern transformer linearsσ ≈ 0.02 (constant)GPT-2 initLLaMA, GPT-2/3, etc.

Why 2 / fan_in for ReLU? Because ReLU zeroes out half the activations in expectation, so to keep the active half's variance unit, you need to double the initial weight variance. Xavier was derived for symmetric tanh; He generalized it to ReLU's asymmetry.

# He init for ReLU networks
W1 = rng.standard_normal((in_dim, hidden_dim)) * np.sqrt(2.0 / in_dim)
b1 = np.zeros(hidden_dim)               # bias init = 0 (always)

Bias init is always zero. There's nothing asymmetric about a bias; starting it at zero gives the network a clean reference and lets the data move it.

Modern transformer trick. GPT-2 and most descendants use a constant σ = 0.02 for all linears, and additionally scale residual-branch outputs by 1/√(2L) (where L is layer count) to prevent output magnitudes from blowing up as you stack more blocks. We'll see this on Day 12 when we build a real GPT.

A botched init is among the most common training-failure modes — and one of the easiest to debug. If your loss starts at 50 instead of log(C), your init is wrong.

The Update Rule

SGD: w ← w − η · ∇w. Repeat.

Given gradients, stochastic gradient descent moves each parameter in the negative-gradient direction:

W ← W − η · dW b ← b − η · db

η (the Greek letter eta) is the learning rate — how large a step to take. Concrete numbers (η = 0.05, dW₂[0,0] = 0.13):

W₂[0,0]_new = W₂[0,0]_old − 0.05 · 0.13 = W₂[0,0]_old − 0.0065

Each step nudges every parameter by η · grad. Repeat thousands of times.

η too small η just right η too large min tiny steps loss crawls, takes ×10 steps min start smooth descent, converges min bounces past minimum loss diverges or NaNs Loss surface cross-section. The bowl is loss as a function of one weight. SGD takes steps down the slope.
The loss surface as a bowl. A too-small learning rate crawls. A too-large one bounces past the minimum and diverges. Just right: smooth descent. This is why the default learning rate for SGD on MLPs is 0.01–0.1, and why you always check the initial loss curve first.

Learning rate intuition. Too low: training crawls. Too high: loss bounces or NaNs out — you're stepping past the minimum. Empirical default for SGD on small MLPs: η ∈ [0.01, 0.1]. For Adam (Day 4): η ∈ [1e-4, 3e-4]. Loss curves diverging by step 2 = your η is way too high.

W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2

That's vanilla SGD. Day 4 replaces it with momentum, then Adam, then AdamW — all small variations on this single line.

Mini-batches: why we don't use the full dataset

Three options for choosing how many examples per gradient step:

  • Batch gradient descent — full dataset, one step per epoch. Slow per step, smooth gradient.
  • Stochastic gradient descent — one example per step. Fast per step, very noisy gradient.
  • Mini-batch SGDB examples per step (typically 32–512). The compromise everyone uses.

Mini-batches give you three benefits at once:

  1. Statistical: noise helps escape sharp local minima.
  2. Computational: GPUs are massively parallel; doing one example wastes 99% of the silicon. Doing 256 in parallel keeps the kernels warm.
  3. Memory: full-batch on a 1B-example dataset would need terabytes of RAM. B = 256 fits.

Modern LLM training typically uses batch sizes of 1–4 million tokens (collected via gradient accumulation across many GPUs) — the noise is so small that AdamW essentially behaves like full-batch.

NumPy MLP — End to End

Type it. Run it. Watch loss drop.

This is the lesson. Type it. Run it. Watch loss drop.

import numpy as np

rng = np.random.default_rng(42)

# ---- Hyperparameters ----
in_dim, hidden_dim, out_dim = 4, 16, 3
batch_size = 32
lr = 0.05
n_steps = 2000

# ---- Data: 3 Gaussian blobs in 4D ----
N = 1024
class_centers = rng.standard_normal((out_dim, in_dim)) * 3.0
y = rng.integers(0, out_dim, N)
X = rng.standard_normal((N, in_dim)) + class_centers[y]

# ---- Init: He for ReLU, zero bias ----
W1 = rng.standard_normal((in_dim,    hidden_dim)) * np.sqrt(2.0 / in_dim)
b1 = np.zeros(hidden_dim)
W2 = rng.standard_normal((hidden_dim, out_dim))   * np.sqrt(2.0 / hidden_dim)
b2 = np.zeros(out_dim)

def relu(z): return np.maximum(0, z)
def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z); return e / e.sum(axis=-1, keepdims=True)

def forward(X):
    z1 = X @ W1 + b1
    a1 = relu(z1)
    z2 = a1 @ W2 + b2
    return z1, a1, z2

def cross_entropy(logits, y):
    p = softmax(logits)
    log_p_correct = np.log(p[np.arange(len(y)), y] + 1e-12)
    return -log_p_correct.mean(), p

def backward(X, y, z1, a1, p):
    B = X.shape[0]
    dz2 = p.copy()
    dz2[np.arange(B), y] -= 1.0
    dz2 /= B
    dW2 = a1.T @ dz2
    db2 = dz2.sum(axis=0)
    da1 = dz2 @ W2.T
    dz1 = da1 * (z1 > 0)
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)
    return dW1, db1, dW2, db2

# ---- Training loop ----
for step in range(n_steps):
    idx = rng.choice(N, batch_size, replace=False)
    Xb, yb = X[idx], y[idx]

    z1, a1, z2 = forward(Xb)
    loss, p = cross_entropy(z2, yb)
    dW1, db1, dW2, db2 = backward(Xb, yb, z1, a1, p)

    W1 -= lr * dW1;  b1 -= lr * db1
    W2 -= lr * dW2;  b2 -= lr * db2

    if step % 200 == 0:
        acc = (z2.argmax(-1) == yb).mean()
        print(f"step {step:4d}   loss {loss:.4f}   train-acc {acc:.3f}")

What you should see — three rules of thumb that work for almost any MLP:

  1. Initial loss ≈ log(out_dim). Here log(3) ≈ 1.10. If your first step prints something wildly different (50? NaN?) your init is broken.
  2. Loss drops monotonically over the first 100 steps. Not strictly — there's batch noise — but the trend is clearly down.
  3. Loss approaches 0 and accuracy approaches 1.0 by step ~1000 on this synthetic dataset. Linearly separable clusters; a 2-layer MLP solves it trivially.

If any of those three are off, stop and debug:

  • Loss NaN: lower lr, double-check init.
  • Loss flat: probably lr too low or weights stuck at zero (no init or all-zero init).
  • Loss climbing: lr way too high, you're bouncing past the minimum.
PyTorch — The Easy Way

Same network. Autograd does the backward.

Identical math, autograd does the backward.

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)

in_dim, hidden_dim, out_dim = 4, 16, 3
batch_size = 32

class MLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        return self.fc2(F.relu(self.fc1(x)))

model = MLP()
opt   = torch.optim.SGD(model.parameters(), lr=0.05)

# (Generate Gaussian-blob data as in the NumPy version.)
X = torch.randn(1024, in_dim) + class_centers[y]
y = torch.from_numpy(y)

for step in range(2000):
    idx = torch.randperm(len(X))[:batch_size]
    Xb, yb = X[idx], y[idx]

    logits = model(Xb)
    loss   = F.cross_entropy(logits, yb)            # softmax + NLL fused

    opt.zero_grad()
    loss.backward()
    opt.step()

    if step % 200 == 0:
        acc = (logits.argmax(-1) == yb).float().mean()
        print(f"step {step:4d}   loss {loss.item():.4f}   train-acc {acc:.3f}")

Same hyperparameters, same data, same outcome. Side-by-side, the differences are mechanical:

WhatNumPyPyTorch
Parameter containerW1, b1, W2, b2 globalsmodel.parameters() from nn.Module
Initmanual np.sqrt(2/fan_in)automatic (Kaiming uniform default)
Forwardforward(X) returning intermediatesmodel(X) returning logits
Losshand-written cross-entropyF.cross_entropy(logits, y) (fused)
Backwardbackward(...) returning gradsloss.backward()
UpdateW1 -= lr * dW1; …opt.step()
Zero gradsn/aopt.zero_grad() before backward

nn.Module does three things for you: tracks parameters, registers them with the optimizer, routes the forward pass through registered submodules. Everything else is the same.

The four-line heartbeat — opt.zero_grad()loss.backward()opt.step() (preceded by the forward) — is what every PyTorch training loop ever written looks like. Same on Day 9 (tiny GPT), Day 10 (mixed precision), Day 11 (distributed). Memorize it.

MLX — The Apple-Silicon Equivalent

Functional autograd on unified memory.

For Apple Silicon, MLX is the native equivalent. The structure is functional — gradients are returned from a function, not deposited on .grad.

import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

mx.random.seed(42)

class MLP(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def __call__(self, x):
        return self.fc2(nn.relu(self.fc1(x)))

model = MLP(4, 16, 3)
opt   = optim.SGD(learning_rate=0.05)

X = mx.random.normal((1024, 4))
y = mx.random.randint(0, 3, (1024,))

def loss_fn(model, x, y):
    return nn.losses.cross_entropy(model(x), y, reduction="mean")

loss_and_grad = nn.value_and_grad(model, loss_fn)

for step in range(2000):
    idx = mx.random.permutation(len(X))[:32]
    Xb, yb = X[idx], y[idx]

    loss, grads = loss_and_grad(model, Xb, yb)
    opt.update(model, grads)
    mx.eval(model.parameters(), opt.state)        # cap deferred graph

    if step % 200 == 0:
        print(f"step {step:4d}   loss {loss.item():.4f}")

Three things to notice vs PyTorch:

  1. No .grad on tensors. loss_and_grad returns the loss and a parameter pytree of gradients in one call.
  2. No zero_grad. Functional API, every call returns fresh grads.
  3. mx.eval after each step. MLX is lazy — without periodic eval, the deferred computation graph grows unboundedly.

Day 2 covered the lazy-eval mental model in depth; we're just applying it.

"PyTorch is just a giant chain-rule engine. MLX is a giant chain-rule engine that defers."

The mental model · Day 3
Diagnostics

Underfitting vs overfitting, by curve shape.

Train and validate. Plot both losses against step.

Underfit both losses high & flat → bigger model, more compute Just right both drop, val tracks train → stop here Overfit train ↓, val ↑ after some step → regularize, more data, early-stop train val
Three curve shapes. Most training problems show up here within 100 steps — read the curve before you tune anything else.

Underfitting. Both training and validation loss are high and flat-ish. The model can't capture the pattern — it's too small, the lr is too low, or the features are inadequate. Counters: bigger model, longer training, better features.

Overfitting. Training loss keeps dropping; validation loss bottoms out and climbs back up. The model has memorized the training set and is now noise-fitting. Counters: more data (the cleanest fix), regularization (weight decay, dropout), early stopping at the validation minimum, smaller model.

Just right. Both losses drop, validation tracks training, the gap is small. This is what you want.

LLM pre-training is a curious case: the data set is so much larger than the model's capacity that overfitting is rare — the model never sees the same example twice. The opposite problem dominates: how to train more efficiently on a fixed compute budget. We'll get there on Day 8 when we cover scaling laws.

Capacity: depth vs width

How much compute should you put where? Two extremes:

  • Wider, shallower (e.g., a 1-layer net with 100k hidden neurons). Easy to optimize, every neuron sees the input. Limited expressivity for hierarchical features.
  • Deeper, narrower (e.g., a 50-layer net with 256 neurons). Harder to optimize (vanishing gradients, residual connections become essential), but each layer composes the previous one's features into higher-level abstractions.

Empirically, depth wins for most real tasks — but only if you can train it. Modern transformers go ~30-100 layers deep with d_model = 4096. The reason this works: residual connections (Day 7), pre-LayerNorm (Day 7), and well-tuned init (this lesson). Without all three, training a 30-layer transformer fails.

Quick capacity rules of thumb:

  • MLP (this lesson): 1-3 hidden layers, 16-512 hidden units. Plenty for tabular data.
  • CNN (vision, not our path): tens of layers, residual connections.
  • Transformer (our path): 12-100 layers, d_model 768-8192, 4 · d_model FFN width.

MNIST — the dataset of 70,000 handwritten digits — was created in 1998 by Yann LeCun and team. It's been used in essentially every deep learning paper that needed a sanity-check baseline. People still use it. It's the "Hello World" of ML.

Exercise

Type, perturb, port, debug.

Run these in a Python REPL or notebook. Type the code; don't copy-paste. The point is to feel the shapes.

1. Type the NumPy MLP

Run it. Watch loss drop. Then perturb:

  • Set lr = 5.0. What does the loss curve look like? Why?
  • Set lr = 1e-5. How many steps before any progress?
  • Set hidden_dim = 1. Why does accuracy plateau at ~33%?
  • Skip the He init (use plain randn). Does training still converge? How many steps does it take?

2. Add a third hidden layer

Extend the NumPy version to d_in → 16 → 16 → 3. Derive the new backward by hand on paper. Implement it, train it, verify it converges.

3. Replace ReLU with tanh

In both NumPy and PyTorch versions. The derivative of tanh(z) is 1 − tanh²(z). Update the NumPy backward; PyTorch handles it for you. Compare convergence speeds — tanh is typically slower in deep nets.

4. Why does the loss start at ~1.10?

Predict on paper, then verify. (Hint: random init means roughly uniform predictions across 3 classes.)

5. Real data — MNIST

from torchvision import datasets, transforms
from torch.utils.data import DataLoader

train_ds = datasets.MNIST('./data', train=True, download=True,
                          transform=transforms.ToTensor())
train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)

Flatten the 28×28 image to a 784-vector. Build a 784 → 256 → 10 MLP. Train for 5 epochs. Target test accuracy: ≥ 95%. Plot training and validation loss against step. Is your network overfitting? Add weight decay (SGD(..., weight_decay=1e-4)) and re-plot.

6. (Mac) Port to MLX

Translate your MNIST MLP from PyTorch to MLX. Compare wall-clock training time. (M-series Macs typically train a 256-hidden MLP on MNIST faster than CUDA on integrated GPUs and on par with mid-range discrete GPUs.)

7. Pick the bug

Five common training-loop bugs. Read the symptoms. Predict which line caused each:

SymptomLikely bug
Loss is NaN after step 1(a) (b) (c) (d) (e)
Loss is constant for 1000 steps(a) (b) (c) (d) (e)
Loss decreases on train but val accuracy is 10% (random)(a) (b) (c) (d) (e)
Train accuracy is 99%, val is 60%(a) (b) (c) (d) (e)
Loss starts at ~50 instead of ~2.30(a) (b) (c) (d) (e)

(a) Forgot opt.zero_grad() (grads accumulate)
(b) Used lr = 100
(c) Used lr = 1e-9
(d) Forgot to scale init by √(2/n) → init too large
(e) Forgot to shuffle the dataset

Further Reading

Go deeper.

Hand-picked references for this lesson. Free where possible. Books and papers where the depth is irreplaceable.

YouTube · Free · 2.5h

Karpathy — building micrograd

The single best resource for today's lesson. Builds an autograd-equipped neural network from scratch in pure Python from absolute zero.

Watch on YouTube
GitHub · 150 lines

karpathy/micrograd

Scalar-valued autograd in 150 lines. Read end to end after the video.

Open repo
Free book · Online

Nielsen — Neural Networks and Deep Learning

Chapters 1-2. The clearest backprop introduction ever written.

Read online
Free book · Reference

Goodfellow et al. — Deep Learning Ch. 6

Deep Feedforward Networks. The canonical textbook chapter.

Read online
Book · Manning

Raschka — Build an LLM From Scratch

Chapter 2. Closest book to this curriculum.

Open page
Video · Visual

3Blue1Brown — What is backpropagation really doing?

Episode 4 of the neural networks series. Animations explain the chain rule better than equations.

Watch on YouTube
Browser · Interactive

TensorFlow Playground

Train a 2-layer net in your browser. Watch decision boundaries form. Pure intuition.

Open playground
Browser · Karpathy

ConvNetJS demos

In-browser net visualizer. Watch a network learn live.

Open demo
Paper · 1986

Rumelhart, Hinton, Williams — Backprop

The Nature paper that popularized backprop. Read for historical context.

Open paper
Paper · 2015

He et al. — Delving Deep into Rectifiers

Where He init comes from. Section 2.2 has the variance derivation.

Read on arXiv
Paper · 2010

Glorot & Bengio — Difficulty of training deep nets

Where Xavier init comes from. The paper that started the init revolution.

Open paper
Paper · 2016

Hendrycks & Gimpel — GELU

Where the activation in GPT-2 / BERT comes from.

Read on arXiv
Paper · 2017

Ramachandran, Zoph, Le — Swish

Where SiLU/Swish comes from. Used in LLaMA, Mistral, Qwen FFN.

Read on arXiv
Blog · Visual

Olah — Calculus on Computational Graphs

Christopher Olah's clean visual explanation of forward vs reverse mode autodiff.

Read post
Source · PyTorch

torch.nn.Linear source

The actual nn.Linear you use. About 100 lines, including init.

Open file
Source · MLX

mlx.nn.Linear source

MLX's equivalent linear layer. Same architecture, slightly different conventions.

Open file
Self-Check

Ten questions before moving on.

Close the page and answer from memory. If you can't, re-read the relevant section.

  1. Two stacked linear layers Y = (X W₁) W₂ are equivalent to a single linear layer with weight matrix W = ?. Why does this break with a non-linearity in between?
  2. Write out the derivative of softmax + cross-entropy w.r.t. the logits. Why is this expression so much simpler than computing each piece separately?
  3. He initialization scales weights by √(2/fan_in). Where does the 2 come from?
  4. In our manual backward, why does db = dz.sum(axis=0)? What's special about a bias?
  5. If your training loss is exactly log(C) for many steps and never moves, what's the most likely bug?
  6. Why is bias init always zero? Why isn't weight init also zero?
  7. Predict the initial loss for a 10-class classifier with no training, randomly initialized.
  8. Your network's final layer outputs logits. Why do we never apply a softmax before the loss in PyTorch?
  9. You scale your input by 1000× without changing weights. What happens to the first layer's pre-activation? To the gradient flowing back? Which init constant should you adjust to compensate?
  10. The Universal Approximation Theorem says a 1-hidden-layer network can approximate any function. Why do we use deep networks anyway?