LLM Inference Engineer · Day 04
Day 04 · Week 1 · Foundations

Backprop & Optimizers

Yesterday you wrote a backward pass by hand. Today we make it general. Backpropagation turns out to be a single idea — the chain rule, applied to a graph of operations, walked in reverse — and once you see that idea, every framework's autograd stops feeling like magic.

Time: ~120 min
Difficulty: Medium-Hard
Prerequisite: Day 1–3
Why This Lesson

From a hand-derived backward pass to autograd that scales to 70 billion parameters.

On Day 3 we wrote the backward pass for a 2-layer MLP by hand. It was six lines of NumPy, and you understood every term. That works wonderfully for a small network, but it does not scale. A real Transformer has hundreds of operations per layer, and writing every backward pass by hand would take days and produce more bugs than working code.

So today we generalize. The good news is that there is only one idea to learn. Backpropagation is the chain rule from calculus, applied to a computational graph, walked in reverse from the loss back to the parameters. Every framework's automatic differentiation system — PyTorch's loss.backward(), MLX's mx.grad, JAX's grad — is a careful mechanization of that one idea. Once you have seen the shape of the algorithm clearly, the libraries stop feeling magical and start feeling like patient bookkeepers.

After backprop we turn to optimizers. An optimizer is the thing that decides how to use a gradient once you have one. Plain SGD does the obvious thing: subtract some fraction of the gradient from each parameter. Real LLM training does something more elaborate — it uses AdamW with warmup and cosine decay. That recipe took the field about a decade to arrive at. Knowing why each ingredient is in there — momentum, second-moment scaling, decoupled weight decay, warmup, cosine — is what separates "I copied the recipe" from "I can debug a training run when something goes wrong."

Three concrete reference points for what builds on this lesson:

  • Day 9 — when we train a tiny GPT, every line of the optimizer call uses ideas from today.
  • Day 10 — gradient clipping, mixed precision, and a proper training loop. Today is the prerequisite.
  • Day 13 — LoRA and QLoRA fine-tuning. Adam's internal state is what dominates fine-tuning memory, so you need to feel Adam's m and v to estimate fine-tune cost.

Learning objectives

  1. Draw the forward computational graph for an arithmetic expression and walk it backward to compute every gradient by hand.
  2. State the local Jacobian for the operations you will see most often: addition, multiplication, matrix multiplication, ReLU, sigmoid, and the fused softmax-plus-cross-entropy.
  3. Trace the backward pass through a linear layer and an activation, naming the upstream gradient, local Jacobian, and the output gradient at each step.
  4. State and explain the AdamW update from memory, decoded one term at a time — what β₁, β₂, ε, and λ each control.
  5. Pick a learning-rate schedule that is appropriate for an LLM training run, and write linear warmup followed by cosine decay in five lines of Python.
  6. Diagnose three classic training pathologies — vanishing gradients, exploding gradients, dead ReLUs — by recognizing their signatures in the loss curve and the gradient-norm trace.
  7. Choose between Xavier and He initialization, and explain where the 2/fan_in constant in He's formula comes from.
  8. Explain why inference uses far less memory than training in terms of what storage is omitted.
Math Notation Cheatsheet

Every symbol used in this lesson, decoded once before use.

This is a math-heavy lesson. Every symbol is defined here in plain English and with a Python analogy. When you see a symbol later, come back to this table. You don't need to memorize it now — just know it exists.

Symbol | Reads as | Python analogy | Example in this lesson
∂L/∂w | "partial derivative of L with respect to w" | (L(w+h) - L(w-h)) / (2*h) for tiny h | How much does the loss change if we nudge weight w?
∇L | "gradient of L", pronounced "nabla L" or "del L" | [dL/dw for w in params], a list of partials | The full gradient vector over all parameters
∂z/∂x · ∂L/∂z | chain rule: multiply local derivative by upstream gradient | local_grad * upstream_grad | Every backward pass node does exactly this
Wᵀ | "W transposed", rows and columns swapped | W.T | ∂L/∂X = ∂L/∂Y · Wᵀ
β₁, β₂ | "beta one, beta two", exponential decay factors | scalars between 0 and 1; e.g. 0.9, 0.999 | Adam's momentum and second-moment decay rates
m̂, v̂ | "m-hat, v-hat", bias-corrected estimates | m / (1 - beta1**t) | Adam corrects early underestimation of moments
ε | "epsilon", a tiny constant added for stability | 1e-8 | Prevents divide-by-zero in Adam: m̂ / (√v̂ + ε)
λ | "lambda", weight decay strength | weight_decay = 0.1 | AdamW subtracts lr · λ · w directly
√x | "square root of x" | x ** 0.5 or math.sqrt(x) | Adam: √v̂ is the per-parameter RMS gradient
1[x > 0] | "indicator function: 1 if x > 0, else 0" | float(x > 0) | ReLU local derivative
O(n) | "order n", grows proportionally to n | like saying "scales linearly with n" | Reverse-mode autodiff is O(graph size)

The one thing to hold onto before reading further: ∂L/∂w is just a number that answers the question "if I add a tiny bit to w, how much does L change?" The chain rule lets you compute that number for every weight in a million-parameter network in a single backward pass.

Backprop, Concretely

Walk the graph backward. At every node, multiply by the local Jacobian.

Every computation a neural network performs can be drawn as a directed acyclic graph. The leaves are the inputs (data) and the parameters (weights and biases). The interior nodes are arithmetic operations. The root is the loss. Forward computation goes from the leaves to the root, computing one node value at a time. Backward computation goes from the root back to the leaves, computing one gradient at a time. That is the entire algorithm. Everything else is bookkeeping.

A worked example will make this concrete. Take the expression L = (a · b + c)² and plug in a=2, b=3, c=4. The graph has three intermediate nodes: d = a · b, then e = d + c, then L = e². The forward pass walks left to right, computing values: d = 6, then e = 10, then L = 100. The backward pass walks right to left, computing gradients of L with respect to every other node.

[Figure: computational graph for L = (a·b + c)². Forward values (black): a=2, b=3, c=4, d = a·b = 6, e = d + c = 10, L = e² = 100. Backward gradients (red): ∂L/∂e = 20, ∂L/∂d = 20, ∂L/∂c = 20, ∂L/∂a = 60, ∂L/∂b = 40.]
The computational graph for L = (a·b + c)². The black arrows are the forward pass: each node's value is computed from its inputs. The red labels are the backward pass: at each node, multiply the gradient that arrived from above by the local derivative of that node, and pass the result back to the inputs. The whole tree is walked once in each direction.

Now let's actually do the backward pass, one node at a time. We start at the root and assume ∂L/∂L = 1 (the loss has gradient 1 with respect to itself by definition).

The first node on the way back is L = e². The derivative of e² with respect to e is 2e. We already know e = 10 from the forward pass, so the gradient at this step is ∂L/∂e = 2 · 10 = 20.

The next node is e = d + c. Addition is the easiest case: the derivative of d + c is 1 with respect to either input. So the gradient simply copies through to both branches: ∂L/∂d = 20 and ∂L/∂c = 20.

Finally, d = a · b is multiplication, where the derivative with respect to one input is the other input. So ∂L/∂a = (∂L/∂d) · b = 20 · 3 = 60, and similarly ∂L/∂b = (∂L/∂d) · a = 20 · 2 = 40. That's the entire backward pass for this graph, summarized in three lines of arithmetic:

L = e²      →  ∂L/∂e = 2e = 20
e = d + c   →  ∂L/∂d = (∂L/∂e)·1 = 20,   ∂L/∂c = (∂L/∂e)·1 = 20
d = a · b   →  ∂L/∂a = (∂L/∂d)·b = 20·3 = 60,   ∂L/∂b = (∂L/∂d)·a = 20·2 = 40

Notice the recurring shape of every step. The gradient that arrives from above (call it the upstream gradient) is multiplied by something that depends on the current node — its local Jacobian — and the result is passed back to the inputs. That is the chain rule. It does not matter how complicated the network gets: every node only ever does this one thing.
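As a sanity check on the worked example, here is a minimal sketch in plain Python that repeats the three backward steps and compares one of the results against a central finite difference. The helper names (forward, backward, numeric_grad) are made up for illustration and are not part of any library.

```python
def forward(a, b, c):
    """Forward pass for L = (a*b + c)**2, returning every intermediate value."""
    d = a * b
    e = d + c
    L = e ** 2
    return d, e, L


def backward(a, b, c):
    """Backward pass: at each node, upstream gradient times local derivative."""
    d, e, L = forward(a, b, c)
    dL_dL = 1.0
    dL_de = dL_dL * 2 * e   # node L = e**2, local derivative 2e
    dL_dd = dL_de * 1.0     # node e = d + c, local derivative 1 for each input
    dL_dc = dL_de * 1.0
    dL_da = dL_dd * b       # node d = a * b, local derivative is the other input
    dL_db = dL_dd * a
    return {"a": dL_da, "b": dL_db, "c": dL_dc, "d": dL_dd, "e": dL_de}


def numeric_grad(f, x, h=1e-5):
    """Central finite difference, the 'Python analogy' for a partial derivative."""
    return (f(x + h) - f(x - h)) / (2 * h)


grads = backward(2.0, 3.0, 4.0)
print(grads)  # {'a': 60.0, 'b': 40.0, 'c': 20.0, 'd': 20.0, 'e': 20.0}

# Nudge a alone and watch L: this should agree with grads["a"] up to rounding.
check_a = numeric_grad(lambda a: forward(a, 3.0, 4.0)[2], 2.0)
print(check_a)
```

The finite-difference check needs two forward passes per input, which is exactly the cost that the single backward walk avoids.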

Why "reverse" mode?

You may have heard of "automatic differentiation" coming in two flavors: forward mode and reverse mode. They compute the same gradients but they walk the graph in different directions, and the choice matters enormously.

The key fact is that a neural network has many parameters and one scalar loss. Reverse mode walks the graph backward from the loss, so each edge in the graph is traversed exactly once. The total cost of computing all gradients is the same order as the forward pass: call it O(graph size). Forward mode would have to do one walk per parameter, for O(graph × num_params). For a 7B-parameter LLM, that is a seven-billion-fold difference. Reverse mode is the entire reason deep learning is feasible.

What gets saved during the forward pass

To compute local Jacobians on the way back, the backward pass needs values that were computed during the forward pass. For example, to evaluate ∂(a·b)/∂a = b, the algorithm needs the value of b at the time of the multiplication. So during the forward pass, the framework caches these intermediates — known as activations — for use later.
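To make the caching concrete, here is a toy scalar autograd in the spirit of what frameworks do, written as a sketch rather than a description of any real library. The Value class and its method names are hypothetical. Each operation stores a small closure that remembers the input values it will need on the way back.

```python
class Value:
    """A scalar node in a computational graph (hypothetical teaching class)."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # replaced by the op that creates the node

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # Local derivative of addition is 1 for both inputs.
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # The closure keeps self.data and other.data alive until backward
            # runs: these are the saved forward values.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Build a topological order, then walk it in reverse from the loss.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # dL/dL = 1
        for node in reversed(order):
            node._backward()


a, b, c = Value(2.0), Value(3.0), Value(4.0)
e = a * b + c
L = e * e  # e**2 written as a product so only two op types are needed
L.backward()
print(a.grad, b.grad, c.grad)  # 60.0 40.0 20.0
```

Gradients are accumulated with += because a node can feed more than one consumer, as e does here when it is multiplied by itself.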

This is also why training uses much more memory than inference. Training keeps every activation alive until the backward pass uses it. Inference can throw activations away as soon as they have been consumed by the next layer. There is a standard memory-saving trick called activation checkpointing, where you save only a subset of the activations and recompute the rest during the backward pass, trading compute for memory. We will see it again later in the curriculum.

Backpropagation was independently invented at least four times between 1960 and 1986. Henry Kelley derived it for control theory in 1960. Stuart Dreyfus rediscovered it in 1962. Paul Werbos wrote it up for neural networks in his 1974 PhD thesis, where it was politely ignored. The version that finally caught on came from Rumelhart, Hinton, and Williams in 1986. The idea wasn't new in 1986; what changed was the willingness to apply it to multi-layer networks.

Local Jacobians

A short cheat-sheet that covers about 95% of the operations in an LLM.

For any node z = f(x, y, …) with multiple inputs, the local Jacobian is a vector or matrix of partial derivatives — one entry per input. The backward pass multiplies the upstream gradient by this Jacobian to produce gradients for the inputs. Internalize the table below and you will be able to derive the backward pass for almost any expression by hand.

Op | Forward | Local Jacobian
Add | z = x + y | ∂z/∂x = 1, ∂z/∂y = 1
Multiply | z = x · y | ∂z/∂x = y, ∂z/∂y = x
Matmul | Y = X W | ∂L/∂X = ∂L/∂Y · Wᵀ; ∂L/∂W = Xᵀ · ∂L/∂Y
ReLU | z = max(0, x) | ∂z/∂x = 1[x > 0]
Sigmoid | z = σ(x) | ∂z/∂x = z(1 − z)
Tanh | z = tanh(x) | ∂z/∂x = 1 − z²
Exp | z = exp(x) | ∂z/∂x = z
Log | z = log(x) | ∂z/∂x = 1/x
Softmax + CE | L = CE(softmax(z), y) | ∂L/∂z = (p − one_hot(y)) / B
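One way to build trust in the scalar rows of this table is to compare each claimed derivative with a central finite difference. The sketch below does that in plain Python. The dictionary layout and the helper name max_error are illustrative choices, not an established API.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# name -> (forward function, claimed local derivative as a function of x)
ops = {
    "relu": (lambda x: max(0.0, x), lambda x: 1.0 if x > 0 else 0.0),
    "sigmoid": (sigmoid, lambda x: sigmoid(x) * (1 - sigmoid(x))),
    "tanh": (math.tanh, lambda x: 1 - math.tanh(x) ** 2),
    "exp": (math.exp, lambda x: math.exp(x)),
    "log": (math.log, lambda x: 1.0 / x),
}


def max_error(f, df, points, h=1e-6):
    """Largest gap between the finite-difference slope and the claimed derivative."""
    return max(abs((f(x + h) - f(x - h)) / (2 * h) - df(x)) for x in points)


# Positive test points so log is defined, and away from ReLU's kink at 0.
points = [0.3, 0.7, 1.5, 2.0]
errors = {name: max_error(f, df, points) for name, (f, df) in ops.items()}

# ReLU's other branch: on negative inputs the derivative should be 0.
relu_neg_error = max_error(*ops["relu"], points=[-2.0, -0.5])

for name, err in errors.items():
    print(f"{name:8s} max error {err:.2e}")
```

The same pattern, applied to whole arrays, is what a gradient check for a custom layer usually looks like.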

Matmul backward, decoded

The matmul Jacobian is the one most newcomers stumble on, because it looks more complicated than it actually is. Let's take it slowly.

Suppose we have a linear layer Y = X W, where X is a batch of inputs with shape (B, n) and W is a weight matrix with shape (n, m). Then Y has shape (B, m). During the backward pass, somebody upstream hands us the gradient of the loss with respect to Y — call it ∂L/∂Y, also with shape (B, m). Our job is to produce the gradient with respect to the input X and the gradient with respect to the weights W.

The two backward operations look like this:

∂L/∂X = ∂L/∂Y · Wᵀ        shape (B, n)
∂L/∂W = Xᵀ · ∂L/∂Y        shape (n, m)

Both are themselves matrix multiplications: the same operation, just with appropriate transposes. A trick that almost always works: if you cannot remember which transpose goes where, look at the shapes. The result has to come out the right size, and there is usually only one combination of inputs and transposes that produces the correct dimensions. For example, to produce something with shape (B, n) from ∂L/∂Y with shape (B, m) and W with shape (n, m), the only sensible matmul is ∂L/∂Y · Wᵀ.
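The two matmul formulas and the shape trick can be checked numerically. The NumPy sketch below uses a toy loss L = sum(G * (X @ W)), chosen only because it makes ∂L/∂Y equal to a known matrix G. The loss, the sizes, and the numeric_grad helper are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
B, n, m = 4, 3, 2
X = rng.normal(size=(B, n))
W = rng.normal(size=(n, m))
G = rng.normal(size=(B, m))  # fixed coefficients of the toy loss


def loss(X, W):
    # Toy scalar loss: L = sum(G * Y) with Y = X @ W, so dL/dY is exactly G.
    return np.sum(G * (X @ W))


dY = G
dX = dY @ W.T  # (B, m) @ (m, n) -> (B, n), same shape as X
dW = X.T @ dY  # (n, B) @ (B, m) -> (n, m), same shape as W


def numeric_grad(f, A, h=1e-6):
    """Central finite differences of the scalar f() with respect to each entry of A."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        original = A[idx]
        A[idx] = original + h
        up = f()
        A[idx] = original - h
        down = f()
        A[idx] = original  # restore the entry before moving on
        grad[idx] = (up - down) / (2 * h)
    return grad


dX_num = numeric_grad(lambda: loss(X, W), X)
dW_num = numeric_grad(lambda: loss(X, W), W)
print(np.allclose(dX, dX_num, atol=1e-6), np.allclose(dW, dW_num, atol=1e-6))
```

Swapping either transpose raises a shape error immediately, which is the shape trick doing its job.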

The bigger lesson: every linear layer's backward pass costs two more matrix multiplications. A forward pass through a Transformer does roughly one matmul per linear layer. A backward pass does two. So a training step costs about three times as many FLOPs as an inference step. That ratio is fundamental to every memory and timing estimate later in the curriculum.

Backprop through a linear layer + activation — per step

By the end of this section you'll be able to name every gradient that flows through a single transformer sub-layer. Here's a concrete one-layer example: X (shape 2×3) enters a linear layer W (3×2) producing Z = XW, then a ReLU gives A = ReLU(Z), and some loss L is computed from A further along. We'll walk each step.

[Figure: forward pass (black) then backward pass (red), one step at a time. Forward: Input X, shape (2, 3) → Linear Z = X W with W of shape (3, 2), Z of shape (2, 2) → ReLU A = max(0, Z), shape (2, 2) → Loss L (scalar). Backward: ∂L/∂A arrives. Step 1 (ReLU backward): ∂L/∂Z = ∂L/∂A ⊙ 1[Z>0]. Step 2 (matmul backward): ∂L/∂X = ∂L/∂Z · Wᵀ and ∂L/∂W = Xᵀ · ∂L/∂Z (used to update W).]
Every backward pass is two steps per operation: (1) receive the upstream gradient from above; (2) multiply by the local Jacobian and send the result to the inputs. For the ReLU, the local Jacobian is a mask: zeros where the forward value was negative, ones where it was positive (1[Z>0]). For the matmul, the local Jacobians are transposes of the other operand. The weight gradient ∂L/∂W drops out as a side product of step 2 — that is what gets accumulated and used by the optimizer.

Concrete numbers. Say X = [[1, 2, 3], [4, 5, 6]] (2×3), W = [[1, 0], [0, 1], [1, 1]] (3×2). Then Z = XW = [[4, 5], [10, 11]]. All four entries are positive, so 1[Z>0] is all-ones and the ReLU backward just passes the upstream gradient through unchanged. Then ∂L/∂W = Xᵀ · ∂L/∂Z has shape (3, 2), and ∂L/∂X = ∂L/∂Z · Wᵀ has shape (2, 3). Shapes always check out when you keep transposes consistent.
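The same numbers in NumPy, as a short sketch. The text does not fix a value for the upstream gradient ∂L/∂A, so the example assumes it is all ones purely to have something to push backward.

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])    # shape (2, 3)
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # shape (3, 2)

# Forward pass.
Z = X @ W               # [[4, 5], [10, 11]]
A = np.maximum(0.0, Z)  # ReLU; identical to Z here because every entry is positive

# Backward pass, with an assumed upstream gradient of all ones.
dA = np.ones_like(A)
dZ = dA * (Z > 0)       # step 1: ReLU backward, multiply by the mask 1[Z > 0]
dW = X.T @ dZ           # step 2a: weight gradient, shape (3, 2)
dX = dZ @ W.T           # step 2b: input gradient, shape (2, 3)

print(Z)
print(dW)  # [[5, 5], [7, 7], [9, 9]]
print(dX)  # [[1, 1, 2], [1, 1, 2]]
```

With an all-ones upstream gradient, each row of dW is just the column sum of X, which makes the result easy to verify by eye.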

The magic gradient: softmax + cross-entropy

Most loss-and-final-layer combinations are messy on paper. The combination of softmax with cross-entropy is the spectacular exception. If you derive it by hand, the algebra involves an outer product Jacobian for the softmax that almost entirely cancels with the structure of the cross-entropy. The residue is breathtakingly clean:

∂L/∂z = (p − one_hot(y)) / B # "predicted minus correct, averaged over the batch"

That clean form is the reason every framework's cross_entropy function takes logits directly, not pre-computed probabilities. The function computes the softmax and the loss together, in one fused operation, so the magic gradient stays magical and numerically stable. If you are tempted to apply softmax yourself before passing the result into the cross-entropy function, resist. You will break the elegance and introduce a numerical issue (log of a small number) that is hard to debug.
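The clean gradient is easy to verify numerically. The sketch below is a NumPy illustration, not the implementation used by any framework. It computes the fused loss from logits with the max-subtraction trick, forms (p − one_hot(y)) / B, and compares it with finite differences. The function name softmax_cross_entropy and the toy sizes are assumptions.

```python
import numpy as np


def softmax_cross_entropy(z, y):
    """Mean cross-entropy from logits z of shape (B, C) and integer labels y of shape (B,)."""
    z = z - z.max(axis=1, keepdims=True)  # max-subtraction for numerical stability
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    loss = -log_p[np.arange(len(y)), y].mean()
    return loss, np.exp(log_p)  # the loss and the softmax probabilities p


rng = np.random.default_rng(0)
B, C = 4, 5
z = rng.normal(size=(B, C))
y = np.array([0, 2, 4, 1])

loss, p = softmax_cross_entropy(z, y)
dz = (p - np.eye(C)[y]) / B  # "predicted minus correct, averaged over the batch"

# Finite-difference check, one logit at a time.
h = 1e-6
dz_num = np.zeros_like(z)
for idx in np.ndindex(*z.shape):
    z_up, z_down = z.copy(), z.copy()
    z_up[idx] += h
    z_down[idx] -= h
    up = softmax_cross_entropy(z_up, y)[0]
    down = softmax_cross_entropy(z_down, y)[0]
    dz_num[idx] = (up - down) / (2 * h)

print(np.allclose(dz, dz_num, atol=1e-6))
```

Each row of dz sums to zero, because both p and the one-hot target sum to one.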

Optimizers

SGD, then Momentum, then Adam, then AdamW. Each fixes a real problem with the previous one.

Once you have a gradient, the optimizer's job is to decide how to use it. There are four variants worth understanding deeply, because each one was introduced to solve a real problem with the one before it. Everything else you'll see in the literature is a derivative of these four.

Plain SGD

The simplest possible optimizer: take the parameter, subtract a scaled gradient, repeat.

w ← w − lr · ∇w

This works, more or less, on simple problems. Its main weakness is that the same learning rate is applied to every parameter. In a deep network, different layers and different positions in the weight matrix can have wildly different gradient scales — embedding rows tend to have very different gradient magnitudes from FFN weights, for instance. With a single global learning rate, you have to pick something that doesn't blow up the most-volatile parameters and doesn't waste time on the least-volatile ones. That compromise rarely works well at scale, which is why almost no LLM is trained with plain SGD.

SGD + Momentum

The first refinement is to keep an exponentially-weighted moving average of past gradients, and step in that direction instead of the raw gradient.

v ← β v + ∇w        (β ≈ 0.9)
w ← w − lr · v

v is the running average. β controls how quickly old gradients are forgotten — at the typical value of 0.9, recent gradients dominate but earlier ones still influence the average for a few dozen steps.

What does momentum buy you? Two things, both of which fall out of the same idea. The first is noise smoothing: if the gradient bounces around a bit due to mini-batch sampling, the running average is steadier than the raw gradient. The second is escape from narrow valleys. Imagine a long, narrow ravine in the loss landscape, where the gradient mostly points down the ravine but with small transverse oscillations. Plain SGD oscillates wall-to-wall. Momentum builds up speed along the ravine axis (because consecutive gradients keep pushing in roughly the same direction there) while the transverse oscillations cancel out (because they alternate sign). The optimizer ends up gliding along the ravine.
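The ravine picture can be reproduced on a two-parameter toy problem. The sketch below runs plain SGD and the momentum update from this section on f(x, y) = 0.5·(x² + 100·y²), a valley that is shallow along x and steep along y. The function, the learning rate 0.019, the starting point, and the step count are all choices made for the illustration. Note that with this form of momentum, the same lr also means a larger effective step along the valley.

```python
def grad(w):
    """Gradient of f(x, y) = 0.5 * (x**2 + 100 * y**2)."""
    x, y = w
    return (x, 100.0 * y)


def loss(w):
    x, y = w
    return 0.5 * (x ** 2 + 100.0 * y ** 2)


def run_sgd(w, lr, steps):
    for _ in range(steps):
        g = grad(w)
        w = (w[0] - lr * g[0], w[1] - lr * g[1])  # w <- w - lr * grad
    return w


def run_momentum(w, lr, steps, beta=0.9):
    v = (0.0, 0.0)
    for _ in range(steps):
        g = grad(w)
        v = (beta * v[0] + g[0], beta * v[1] + g[1])  # v <- beta * v + grad
        w = (w[0] - lr * v[0], w[1] - lr * v[1])      # w <- w - lr * v
    return w


start = (10.0, 1.0)
loss_sgd = loss(run_sgd(start, lr=0.019, steps=100))
loss_mom = loss(run_momentum(start, lr=0.019, steps=100))
print(f"start {loss(start):.3f}  sgd {loss_sgd:.3f}  momentum {loss_mom:.5f}")
```

At this learning rate SGD flips sign in the steep direction on every step while crawling along the shallow one. Momentum makes far more progress along the valley in the same number of steps.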

Momentum is standard in computer vision. For LLMs you almost never see it on its own, but it is one of the two ingredients in Adam.

Adam — Adaptive Moment Estimation

Adam goes one step further. Instead of just smoothing the gradient, it tracks two running averages per parameter:

m ← β₁ m + (1 − β₁) · ∇w        # mean of grad (β₁ = 0.9)
v ← β₂ v + (1 − β₂) · (∇w)²     # mean of squared grad (β₂ = 0.999, or 0.95 for LLMs)
m̂ = m / (1 − β₁ᵗ)               # bias correction
v̂ = v / (1 − β₂ᵗ)
w ← w − lr · m̂ / (√v̂ + ε)       # ε = 1e-8 typically

The first running average, m, is just the momentum term you saw above — a smoothed estimate of the gradient's mean. Adam calls this the "first moment." The second running average, v, is the smoothed mean of the gradient squared. Adam calls this the "second moment." Why track both?

Because together they tell you, for each parameter individually, how big a step to take. If a parameter's gradient has been consistently large, then v is large, √v̂ is large, and the update m̂ / √v̂ is shrunk down. If a parameter's gradient has been consistently small, v is small, and the update is amplified. The effect is that every parameter ends up with its own adaptive learning rate, automatically. That is what the "adaptive" in Adam stands for.

Why bias correction? Both m and v start at zero, so early in training the running averages underestimate the true mean and second moment. The second moment is hit hardest: with β₂ = 0.999, v takes on the order of a thousand steps to warm up, while m with β₁ = 0.9 needs only a few dozen. Without correction, the ratio m / √v would be inflated (about 3× too large at step 1 with the default betas) and early training would be less stable. The bias-correction terms 1 / (1 − βᵗ) divide out exactly this initialization bias. As t grows, both bias-correction terms approach 1 and have no effect.

The practical consequence of all this: Adam works at lr = 1e-3 on a startling variety of problems, because the per-parameter √v̂ rescaling has already normalized away most of the per-parameter scale variation. You don't have to tune the learning rate as carefully as you do with SGD.
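A direct transcription of the five update lines into NumPy shows the per-parameter normalization at work. This is a sketch of the update rule as written in this lesson, not the code of any particular framework. The function name adam_step and the example gradient values are assumptions.

```python
import numpy as np


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * g       # running mean of the gradient
    v = beta2 * v + (1 - beta2) * g ** 2  # running mean of the squared gradient
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v


# Three parameters whose gradients differ by six orders of magnitude.
w = np.zeros(3)
g = np.array([1e-3, 1.0, -1e3])
w1, m1, v1 = adam_step(w, g, m=np.zeros(3), v=np.zeros(3), t=1)
print(w1)  # every entry moves by roughly lr, in the direction opposite its gradient
```

On the first step the bias-corrected moments are simply g and g², so the update reduces to about lr · sign(g) regardless of the gradient's scale.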

AdamW — Decoupled Weight Decay

Before AdamW, the usual way to regularize with Adam was to fold L2 regularization into the gradient, by adding λw to ∇w before computing m and v. On paper this looks like ordinary weight decay: the gradient of the L2 penalty (λ/2)·‖w‖² is exactly λw. In practice it interacts badly with Adam.

The problem is that the regularization strength gets divided by √v̂ along with the gradient. So a parameter with a small v sees its weight decay amplified, while a parameter with a large v sees its weight decay shrunk. The result is that the actual amount of regularization depends on which parameters happen to have noisy gradients — a confusing, hard-to-tune coupling.

Loshchilov and Hutter fixed this in 2017 by decoupling weight decay from the moment-based update. AdamW applies weight decay directly to the parameter, untouched by the second-moment scaling:

w ← w − lr · m̂ / (√v̂ + ε) − lr · λ · w

The result is predictable, stable, and transferable. AdamW is the default for nearly every modern LLM. GPT-3, LLaMA, Mistral: all use it. A typical weight decay setting is λ = 0.1.
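The coupling problem can be seen in a small simulation. The sketch below trains two parameters that both start at 1.0 and both receive zero-mean noise as their gradient, one with a tiny gradient scale and one with a huge one, so that on average only weight decay should move them. The setup (noise gradients, 2000 steps, the scales 0.01 and 100) is an assumption made for the demonstration, and the names adam_update and train are hypothetical.

```python
import numpy as np


def adam_update(g, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the Adam direction m_hat / (sqrt(v_hat) + eps) and the new moments."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return m_hat / (np.sqrt(v_hat) + eps), m, v


def train(decoupled, steps=2000, lr=1e-3, lam=0.1, seed=0):
    rng = np.random.default_rng(seed)
    scales = np.array([0.01, 100.0])  # two parameters with very different gradient scales
    w = np.ones(2)
    m = np.zeros(2)
    v = np.zeros(2)
    for t in range(1, steps + 1):
        g = scales * rng.normal()     # zero-mean noise shared by both parameters
        if decoupled:
            # AdamW: decay is applied directly and bypasses the sqrt(v_hat) scaling.
            step, m, v = adam_update(g, m, v, t)
            w = w - lr * step - lr * lam * w
        else:
            # Coupled form: lam * w is added to the gradient before the moments.
            step, m, v = adam_update(g + lam * w, m, v, t)
            w = w - lr * step
    return w


w_l2 = train(decoupled=False)
w_adamw = train(decoupled=True)
print("Adam + L2:", w_l2)
print("AdamW:    ", w_adamw)
```

In the coupled run the small-gradient parameter is decayed far more aggressively than the large-gradient one. In the decoupled run both end up at essentially the same value.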

Optimizer comparison — all in one table

Here is the complete family, so you can see the evolution at a glance. "States per param" determines the optimizer's memory footprint — critical at billion-parameter scale.

Optimizer | Update rule (core) | States / param | Key hyperparams | When to use
SGD | w ← w − lr · ∇w | 0 | lr | Computer vision with carefully tuned LR; rarely for LLMs
SGD + Momentum | v ← βv + ∇w; w ← w − lr · v | 1 (v) | lr, β ≈ 0.9 | Vision pretraining; faster than plain SGD on narrow valleys
RMSProp | v ← β₂v + (1−β₂)g²; w ← w − lr·g/√(v+ε) | 1 (v) | lr, β₂ ≈ 0.9 to 0.99 | RNNs and non-stationary objectives; predecessor to Adam
Adam | w ← w − lr · m̂/(√v̂+ε) | 2 (m, v) | lr, β₁=0.9, β₂=0.999, ε=1e-8 | Most DL tasks; strong default for fine-tuning
AdamW | w ← w − lr · m̂/(√v̂+ε) − lr·λ·w | 2 (m, v) | lr, β₁, β₂, ε, λ=0.1 | Default for all modern LLM pretraining and fine-tuning
Lion | w ← w − lr · sign(β₁m + (1−β₁)g) | 1 (m) | lr, β₁=0.9, β₂=0.99, λ | Memory-constrained training; some Google runs; needs 3–10× lower lr than Adam
AdaFactor | Factored second moment (row + col) | ≈ (n + m) / (n·m) for an n×m weight matrix (one row vector + one column vector) | Often no lr needed (relative step size) | Very large models where optimizer state dominates memory (T5, PaLM)

Memory math for a 7B model. 7B parameters × 4 bytes each = 28 GB for weights. Adam (2 states × 4 bytes × 7B) = another 56 GB: twice the weight cost, just for optimizer state. Training also holds one gradient per parameter, which is another 28 GB. That is why inference needs only the weights (28 GB in FP32, 14 GB in FP16) while training the same model needs at minimum ~112 GB in FP32 (weights + gradients + optimizer state) before counting activations. Lion halves the optimizer state to 28 GB by using only one running state.
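The arithmetic is simple enough to keep as a helper. The sketch below reports weights, gradients, and optimizer state as separate entries so any subtotal can be read off. It assumes every value is stored in FP32, uses 1 GB = 10⁹ bytes, and ignores activations. The function name training_memory_gb is made up for this example.

```python
def training_memory_gb(n_params, bytes_per_value=4, optimizer_states=2):
    """Rough memory in GB for weights, gradients, and optimizer state (no activations)."""
    gb = 1e9
    weights = n_params * bytes_per_value / gb
    grads = n_params * bytes_per_value / gb  # one gradient per parameter
    optimizer = n_params * bytes_per_value * optimizer_states / gb
    return {
        "weights": weights,
        "grads": grads,
        "optimizer": optimizer,
        "total": weights + grads + optimizer,
    }


adamw = training_memory_gb(7e9)                     # two states per parameter: m and v
lion = training_memory_gb(7e9, optimizer_states=1)  # one state per parameter

print(adamw)
print(lion)
```

Mixed-precision training changes the byte counts per component, so the helper is only a starting point for the estimates that come later in the curriculum.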

Other optimizers worth recognizing

  • Sophia (2023) — uses second-order curvature information. Faster on some LLM benchmarks, but less battle-tested than AdamW.
  • Shampoo — full-matrix preconditioner. Powerful but expensive to compute per step.

For this curriculum, the default is AdamW unless we explicitly say otherwise.

[Figure: trajectories on a long, narrow valley (loss surface, top view) for SGD, SGD + Momentum, and Adam / AdamW, each running from the start point to the minimum. Axes: steep direction and shallow direction.]
Top-down view of a loss surface. The contour ellipses show an ill-conditioned valley — one direction is 10× steeper than the other. SGD overshoots in the steep direction and oscillates. Momentum damps the oscillations but still curves. Adam's per-parameter √v̂ divides the steep direction's gradient by its own RMS, effectively re-rounding the ellipses into near-circles — so Adam tracks the valley axis directly.

Adam stands for ADaptive Moment estimation. Kingma and Ba published it in 2014, and within months it had become the default optimizer in deep learning. Three years later, Loshchilov and Hutter (2017) noticed that Adam was applying weight decay incorrectly. Their fix is AdamW, now the standard for every LLM you've heard of. The lesson is humbling: a subtle bug sat inside the most popular optimizer in machine learning for three years before anyone wrote a paper about it.

Learning-Rate Schedules

Linear warmup, then cosine decay. Why this combination became the default.

A constant learning rate is rarely optimal. The schedule that essentially every modern LLM uses is the same five-line function: linearly ramp the learning rate from 0 to the target peak over the first few hundred to few thousand steps, and then cosine-decay it down to nearly zero over the rest of training.

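import math

def lr_at(step, peak_lr=3e-4, warmup=500, total=10000):
    if step < warmup:
        return peak_lr * step / warmup  # linear ramp from 0 up to peak_lr
    # fraction of the decay phase completed, clamped so lr stays at 0 past `total`
    progress = min(1.0, (step - warmup) / (total - warmup))
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine from peak_lr to 0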
[Figure: learning rate vs. step. Linear warmup from 0 to the peak lr over steps 0 to 500, then cosine decay to 0 at step 10000.]
A linear ramp from zero to the peak learning rate, followed by a cosine curve all the way back down. The shape may look arbitrary but each phase has a job to do. The ramp avoids early instability while Adam's running statistics warm up. The cosine leaves the peak slowly, falls fastest through the middle of the run, and flattens out again as it approaches zero.

Why warmup?

At the very beginning of training, Adam's second-moment estimate v̂ is unreliable: there simply haven't been enough gradient samples yet to get a stable variance estimate. Combine a high learning rate with a noisy v̂ and the update m̂ / √v̂ can spike to enormous values. The result is one bad step that destabilizes the model and possibly NaN's it out. Warmup avoids this by starting the learning rate at zero and ramping it up linearly, so by the time you reach the peak lr, the running statistics have stabilized.

Why cosine decay?

Cosine has two practical virtues. It is smooth, so there are no manual step decisions to tune. And it is gentle at both ends: it lingers near the peak rate early on, does most of its decaying in the middle of the run, and eases into a low learning rate at the end. Empirically, cosine tends to produce lower final loss than a constant learning rate or step decay on LLM workloads. It has become the default for that reason, not because of any deep theoretical justification.

Picking the peak learning rate

The standard procedure is to run a learning-rate sweep. Pick five values an order of magnitude apart — say 1e-1, 1e-2, 1e-3, 1e-4, 1e-5 — and train each for a few hundred steps. Plot the final loss against the learning rate on a log scale. You will see a U-shape: too small, the model crawls; too large, it diverges; in between, it is fine. Pick the minimum.
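The sweep procedure can be sketched on a toy problem where the outcome is easy to reason about. The example below uses plain gradient descent on the one-dimensional quadratic f(w) = 0.5 · 30 · w² instead of a real model with AdamW. The curvature of 30, the 200 steps, and the helper name final_loss are assumptions chosen so that the largest candidate diverges.

```python
def final_loss(lr, steps=200, curvature=30.0, w0=1.0):
    """Run gradient descent on f(w) = 0.5 * curvature * w**2 and return the final loss."""
    w = w0
    for _ in range(steps):
        w -= lr * curvature * w  # the gradient of f is curvature * w
    return 0.5 * curvature * w ** 2


candidates = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
results = {lr: final_loss(lr) for lr in candidates}
best_lr = min(results, key=results.get)

for lr, loss in results.items():
    print(f"lr={lr:g}  final loss={loss:.3e}")
print("best:", best_lr)
```

The printed losses trace the U-shape: the largest rate blows up, the smallest barely moves from the initial loss of 15, and the minimum sits in between. On a real model the same loop wraps a short training run instead of a closed-form update.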

Rough magnitudes to expect:

  • 1e-3: tiny models, under about 10M parameters.
  • 3e-4 to 6e-4: around 100M parameters.
  • 1e-4 to 3e-4: billion-scale models. Yes, smaller for bigger models. Adam's normalization still works, but the absolute scale of useful updates tends to be smaller in higher-dimensional spaces.

Other schedules you'll encounter

  • Constant. Just keep the learning rate fixed. Fine for short fine-tunes; rarely the best choice for full pretraining.
  • Step decay. Drop the learning rate by 10× every N epochs. Used historically for ResNets.
  • Inverse-sqrt. The original Transformer paper used lr ∝ 1/√t after warmup. Older recipe; cosine has largely replaced it.
  • One-cycle (Smith). Warmup then anneal, popular in fast.ai. Similar in shape to warmup-plus-cosine.
Initialization

A bad init can doom training before the first step. Two principles, two formulas.

You can have a perfect optimizer, a perfect loss, and a perfect dataset, and still completely fail to train a network if you initialize its weights poorly. Two principles drive any sensible initialization scheme.

The first principle is to break symmetry. If you initialize every weight to zero (or to any single value), every neuron in a given layer computes the same thing on the same input. Their gradients then come out identical. The next update changes every neuron in the same way. The neurons stay in lockstep forever. The network never learns. The fix is to initialize with random values, so that each neuron starts in a slightly different place and gradients quickly diverge.
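The lockstep argument can be checked directly. The sketch below computes the hidden-layer weight gradient of a small tanh MLP twice, once with every weight set to the same constant and once with random weights, and tests whether all hidden units receive the same gradient. The network sizes, the constant 0.5, the squared-error loss, and the function name hidden_weight_grad are assumptions for the illustration.

```python
import numpy as np


def hidden_weight_grad(W1, W2, X, y):
    """Gradient of the mean squared error with respect to W1 for a bias-free tanh MLP."""
    H = np.tanh(X @ W1)              # hidden activations, shape (B, hidden)
    pred = H @ W2                    # predictions, shape (B, 1)
    dpred = 2 * (pred - y) / len(X)  # gradient of the loss at the output
    dH = dpred @ W2.T                # matmul backward
    dZ = dH * (1 - H ** 2)           # tanh local derivative
    return X.T @ dZ                  # shape (n_in, hidden): one column per hidden unit


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=(8, 1))

grad_const = hidden_weight_grad(np.full((3, 4), 0.5), np.full((4, 1), 0.5), X, y)
grad_rand = hidden_weight_grad(rng.normal(size=(3, 4)), rng.normal(size=(4, 1)), X, y)

# Compare every column (hidden unit) against the first one.
same_cols_const = np.allclose(grad_const, grad_const[:, :1])
same_cols_rand = np.allclose(grad_rand, grad_rand[:, :1])
print(same_cols_const, same_cols_rand)  # True False
```

With constant weights every hidden unit gets an identical update, so the units remain copies of each other after the step.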

The second principle is to keep activations and gradients well-scaled. If the initial scale is too small, activations shrink as they pass through layers and gradients vanish to zero. If the initial scale is too large, activations grow and saturate, and the gradient again vanishes (because activation functions like sigmoid have nearly-zero derivatives in their saturated regions). The right scale depends on the activation function and the layer width.

Xavier / Glorot — for tanh and sigmoid

W ~ N(0, σ²), with variance σ² = 2 / (fan_in + fan_out)

Xavier init was designed to keep activation variance approximately constant across layers, under the assumption that the activation function is roughly linear near the origin. That assumption holds for tanh and sigmoid in their useful range, so Xavier works well for those.

He / Kaiming — for ReLU

W ~ N(0, σ²), with variance σ² = 2 / fan_in

For ReLU, Xavier is too small. The reason is that ReLU zeros out roughly half of its inputs (the negative half), so after ReLU the mean squared activation is roughly half the pre-activation variance. Without compensation, the signal keeps halving with every layer, and gradients vanish exponentially in depth. He init compensates by doubling the weight variance, from 1/fan_in to 2/fan_in, and that is where the factor of 2 comes from. With He init for ReLU layers, activation variance stays roughly constant across the depth of the network.
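The halving argument is easy to test empirically. The sketch below pushes random inputs through 20 square ReLU layers and measures the mean squared activation at the end, once with weight variance 2/fan_in (He) and once with 1/fan_in, which is what Xavier's formula gives when fan_in equals fan_out. The depth, width, batch size, and function name are assumptions for the experiment.

```python
import numpy as np


def final_second_moment(numerator, depth=20, width=512, batch=256, seed=0):
    """Mean squared activation after `depth` ReLU layers with W ~ N(0, numerator / fan_in)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(batch, width))  # unit-variance inputs
    for _ in range(depth):
        std = np.sqrt(numerator / width)  # fan_in equals width for square layers
        W = rng.normal(scale=std, size=(width, width))
        x = np.maximum(0.0, x @ W)        # linear layer followed by ReLU
    return float(np.mean(x ** 2))


he = final_second_moment(2.0)             # variance 2 / fan_in
xavier_square = final_second_moment(1.0)  # variance 1 / fan_in

print(f"He:            {he:.3f}")
print(f"1/fan_in init: {xavier_square:.3e}")
print(f"theory for 1/fan_in: 0.5**20 = {0.5 ** 20:.3e}")
```

With the factor of 2, the signal stays on the order of 1 after 20 layers. Without it, the result lands near 0.5²⁰, about one millionth of the input scale.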

Transformer practice

Most modern LLMs don't strictly follow He or Xavier — they use a fixed small normal distribution instead. The reasons are partly historical (it's what GPT-2 used and worked well) and partly that AdamW's adaptive scaling absorbs much of the input-scale variation that Xavier and He are trying to compensate for.

  • Linear weights: N(0, 0.02²), that is, a standard deviation of 0.02, the GPT-2 convention.
  • Attention output projection and the output projection of each MLP block: scaled down by an additional 1/√(2L), where L is the number of layers. This extra factor keeps deeper networks stable at initialization.
  • LayerNorm scale γ: 1. LayerNorm bias β: 0.
  • Embedding: N(0, 0.02²), sometimes with a different scale.

The residual-path scaling is described in Section 2.3 of the GPT-2 paper (the 0.02 standard deviation comes from the released GPT-2 code), and LLaMA-2 uses similar constants.

Why bias initialized to zero is fine

You might wonder: if symmetry must be broken, why is it OK to initialize every bias to zero? The answer goes back to Day 3. The bias gradient is db = dz.sum(axis=0), where dz is the upstream gradient. Even if every bias starts at zero, dz differs across neurons (because the weights are random), so different biases receive different gradients on the very first backward pass. Symmetry breaks immediately. Weights, by contrast, would stay locked together if initialized equal.

Pathologies

Six failure modes you will see in practice, and what to do about each.

Training a neural network goes wrong in fairly predictable ways. Recognizing each pattern by its signature in the loss curve and gradient-norm trace will save you days of confusion.

Symptom | Likely cause | Cure
Loss → NaN | Learning rate too high; exploding gradients; numerical issues like dividing by something close to zero in softmax or RMSNorm | Lower the learning rate; add gradient clipping with max_norm = 1.0; make sure your softmax uses the max-subtraction trick
Loss flat at log(C) | Symmetry not broken (zero init); or you applied softmax before passing into F.cross_entropy, ruining the magic gradient | Use random init; pass logits (not probabilities) to the cross-entropy function
Loss decreases then plateaus high | Model under-capacity; or the learning-rate scheduler ended too early | Bigger model; longer schedule
Train loss far below val loss | Overfitting | Increase weight decay (AdamW λ); add dropout; collect more data
Gradient norm spikes | Outlier batch; numerical instability | Gradient clipping by total norm
Activations all zero in some neurons | Dead ReLUs (initial bias too negative, or learning rate too high) | He init; switch to LeakyReLU or GELU; lower learning rate

Vanishing gradients — why depth was hard before residuals

By the end of this section you'll understand why residual connections are not a nice-to-have but a hard requirement for deep networks. You don't need to know the full Transformer architecture yet — just file this away and it will click on Day 7.

Vanishing gradients happen when gradients shrink toward zero as they propagate back through layers. The intuition: if each layer's local Jacobian has values slightly less than 1, those values multiply together as we walk backward. With 50 layers, a factor of 0.9 per layer gives 0.9⁵⁰ ≈ 0.005 — the gradient at layer 1 is 200× smaller than at layer 50. Early layers receive almost no learning signal and fail to train.

Sigmoid and tanh are the classic culprits. Sigmoid's derivative is σ(x)(1−σ(x)), which peaks at 0.25 and falls toward zero for large positive or negative inputs. If activations drift outside the linear region, the gradient effectively stops.
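The compounding is worth seeing as numbers. The short sketch below evaluates the 0.9-per-layer example and the best case for a stack of sigmoids, where each layer contributes at most 0.25. The depth of 10 for the sigmoid case is an arbitrary choice for illustration, and the calculation ignores the weight matrices, which also scale the gradient.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)


# A per-layer factor of 0.9 compounded over 50 layers.
shrink = 0.9 ** 50
print(f"0.9**50 = {shrink:.5f}  (about {1 / shrink:.0f}x smaller)")

# Sigmoid's derivative peaks at x = 0 and falls off on both sides.
peak = sigmoid_grad(0.0)
saturated = sigmoid_grad(6.0)
print(f"sigmoid'(0) = {peak}, sigmoid'(6) = {saturated:.5f}")

# Best case through 10 sigmoid layers: every unit sits exactly at its peak.
best_case = peak ** 10
print(f"0.25**10 = {best_case:.2e}")
```

Even in the best case, ten sigmoid layers cut the gradient by about a factor of a million, and saturated units make it far worse.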

[Figure: Gradient norm per layer, vanishing vs. residual network. Horizontal axis: layer (10 = output, 1 = input). Vertical axis: gradient norm. Curves: sigmoid / deep (gradient vanishes); residual network (gradient stays healthy).]
Gradient norm plotted per layer (reading right-to-left = backward pass direction). In a deep sigmoid network, each layer multiplies the gradient by its local Jacobian (max 0.25), compounding exponentially until early layers see near-zero signal. A residual network adds an identity shortcut that gives gradients a direct path back, keeping the norm roughly constant regardless of depth. You don't need to know how residuals are implemented yet — Day 7 covers that in full.

The modern Transformer recipe is engineered from the ground up to defeat vanishing gradients: ReLU and GELU have derivative ≈ 1 over half their domain; He init keeps activation variance constant across depth; residual connections (Day 7) provide gradient highways; LayerNorm (Day 7) re-centers activations every layer. Together these make 100-layer networks routine. You don't need to know the mechanism yet — just hold onto the motivation.
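The "gradient highway" claim can be made concrete with a scalar toy model. This is a sketch under a strong simplifying assumption: every layer is a scalar function with the same weak local derivative. The helper backward_scale is hypothetical and exists only for this illustration.

```python
def backward_scale(local_derivs, residual=False):
    """Product of per-layer derivatives for a scalar chain of layers.

    Plain layer:    y = f(x)      -> dy/dx = f'(x)
    Residual layer: y = x + f(x)  -> dy/dx = 1 + f'(x)
    """
    scale = 1.0
    for d in local_derivs:
        scale *= (1.0 + d) if residual else d
    return scale

derivs = [0.01] * 50                                 # each f contributes a weak local derivative
plain = backward_scale(derivs)                       # 0.01 ** 50: effectively zero
with_skip = backward_scale(derivs, residual=True)    # 1.01 ** 50: stays of order 1

print(f"plain: {plain:.1e}   residual: {with_skip:.3f}")
```

The identity term is what matters: even when f contributes almost nothing, the "1 +" keeps each factor close to 1, so the product does not collapse with depth.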

Exploding gradients

The mirror image. If local Jacobians are consistently > 1, gradients grow exponentially and weights blow up to NaN. The fix is gradient clipping:

total_norm = √(Σ_i ||g_i||²)    # L2 norm over ALL parameter gradients combined
if total_norm > max_norm:
    scale = max_norm / total_norm
    g_i ← g_i · scale    for all i

In PyTorch: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0). One line, one number to tune, prevents most LLM training runs from blowing up. The typical value max_norm = 1.0 is a near-universal default for LLMs. Gradient norm itself is worth logging — a spike in the norm trace is often the first visible sign of a data problem or numerical instability, appearing several steps before the loss diverges.
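To see what that one line is doing, here is a minimal NumPy sketch of clipping by total norm. The function clip_grad_norm below is a hypothetical stand-in written for illustration, not PyTorch's implementation; the small eps in the denominator is an assumption added to guard against division issues.

```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale a list of gradient arrays in place so their combined L2 norm is at most max_norm.

    Returns the norm measured before clipping, which is the value worth logging.
    """
    total_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + eps)
        for g in grads:
            g *= scale      # same factor for every tensor, so the direction is unchanged
    return total_norm

grads = [np.full((2, 2), 3.0), np.full(4, 4.0)]   # combined norm = sqrt(4*9 + 4*16) = 10
norm_before = clip_grad_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g * g)) for g in grads))

print(f"before: {norm_before:.3f}   after: {norm_after:.3f}")
```

Because every gradient is multiplied by the same scalar, clipping shortens the update without rotating it; the ratio between any two gradient entries is preserved.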

Dead ReLUs

A ReLU neuron is "dead" when its pre-activation is negative on every input it sees. The output is then always zero, the local derivative is zero, no gradient flows to the incoming weights, and the neuron stays dead forever. This usually happens when the learning rate is too high (one bad step pushes the bias deeply negative) or initialization is poor.

Cures in order of effectiveness: He init (places activations in the right magnitude to start), lower learning rate (smaller steps don't overshoot), and switch to LeakyReLU or GELU (which have small but nonzero derivatives for negative inputs, so a "dead" neuron can revive itself).
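Dead units are easy to count if you can capture a layer's pre-activations. The sketch below uses NumPy with synthetic data; dead_relu_fraction is a hypothetical helper, and the two "dead" units are simulated by forcing their biases deeply negative.

```python
import numpy as np

def dead_relu_fraction(pre_activations):
    """Fraction of units whose pre-activation is <= 0 for every example.

    pre_activations: array of shape (num_examples, num_units), taken before the ReLU.
    """
    never_fires = (pre_activations <= 0).all(axis=0)
    return float(never_fires.mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 32))                     # a batch of inputs
W = rng.normal(size=(32, 8)) * np.sqrt(2.0 / 32)   # He-style scale: std = sqrt(2 / fan_in)
b = np.zeros(8)
b[:2] = -100.0                                     # simulate two units pushed deeply negative

z = x @ W + b
frac_dead = dead_relu_fraction(z)                  # 2 of the 8 units never fire

print(f"dead fraction: {frac_dead:.2f}")
```

In a real run you would measure this over a large, representative sample of inputs; a unit that is silent on one small batch may still fire on other data.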

The "can you overfit a single batch?" sanity check

Before you debug anything else, run this test. Take a single batch, repeat it for a hundred steps, and see if the model can memorize it perfectly. If it cannot, you have a real bug — not a hyperparameter problem. The bug is usually the wrong loss function, a frozen parameter (you forgot to enable gradients), or a miscoded forward pass that silently produces wrong outputs. Fix the bug first; then tune.
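A minimal version of the test, sketched in NumPy with a softmax classifier standing in for your model. Everything here is an assumption made for illustration: the helper overfit_one_batch, the random data, the step count, and the learning rate.

```python
import numpy as np

def overfit_one_batch(x, y, num_classes, steps=500, lr=0.1):
    """Train a softmax classifier on one fixed batch; return (first_loss, last_loss, accuracy)."""
    n, d = x.shape
    W = np.zeros((d, num_classes))   # zero init is fine here: one linear layer, no hidden units
    b = np.zeros(num_classes)
    losses = []
    for _ in range(steps):
        logits = x @ W + b
        logits -= logits.max(axis=1, keepdims=True)       # max-subtraction for stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        losses.append(-np.log(probs[np.arange(n), y]).mean())
        dlogits = probs.copy()
        dlogits[np.arange(n), y] -= 1.0                   # gradient of softmax + cross-entropy
        dlogits /= n
        W -= lr * (x.T @ dlogits)
        b -= lr * dlogits.sum(axis=0)
    accuracy = float((np.argmax(x @ W + b, axis=1) == y).mean())
    return losses[0], losses[-1], accuracy

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))         # one small batch, reused at every step
y = rng.integers(0, 4, size=8)
first, last, acc = overfit_one_batch(x, y, num_classes=4)

print(f"first loss: {first:.3f}   last loss: {last:.3f}   batch accuracy: {acc:.2f}")
```

The first loss equals log(4) because an untrained classifier is uniform over four classes. With a healthy implementation the loss should head toward zero and the batch accuracy toward 1.0 as steps increase; a loss that stays near its starting value points to a bug.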

Why This Matters for Inference

The backward pass is training-only. Inference is forward-only — and that changes everything.

Every concept in this lesson — gradients, optimizer states, activation caching — exists solely to support training. At inference time, none of it runs. That single fact is the origin of the enormous memory gap between training a model and serving it.

What inference omits

During a forward pass at inference time you need: the model weights, and the activations of the current layer (for the next layer's matmul). As soon as a layer's output has been used, its input activations can be freed. At any moment you hold at most two layers' activations in memory — not all of them at once.

During training the forward pass must keep every activation alive until the backward pass has used it. For a deep Transformer with long sequences, this can be gigabytes of intermediate tensors. Activation checkpointing (recomputing some activations on the fly during backward) trades compute for memory, but still requires re-running a portion of the forward pass. Inference never has this problem at all.

Optimizer state: the silent training-only cost

Adam's running states (m and v) exist only during training. For a model with P parameters in FP32, the full training memory bill is roughly:

Training memory  ≈ 4P (weights) + 4P (gradients) + 4P (Adam m) + 4P (Adam v) = 16P bytes in FP32
Inference memory ≈ 2P bytes (weights in FP16/BF16, no grads, no optimizer state)

For a 7B-parameter model: training ≈ 112 GB, inference ≈ 14 GB. That is an 8× difference: 4× from dropping the gradients and the two Adam states, and another 2× from storing the weights in 16 bits instead of 32. This is why a model that needed 8 A100s to train can be served on a single one (or even a consumer GPU after quantization).
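The arithmetic is simple enough to keep as a helper. This sketch covers only the terms in the formulas above; the function names are illustrative, and activations and any inference-time caches are deliberately left out.

```python
def training_bytes(num_params, bytes_per_value=4):
    """Weights + gradients + Adam m + Adam v, all stored at the same precision."""
    return 4 * num_params * bytes_per_value

def inference_bytes(num_params, bytes_per_value=2):
    """Weights only; activations and any caches are ignored in this estimate."""
    return num_params * bytes_per_value

P = 7_000_000_000
GB = 1e9                                    # decimal gigabytes, matching the text
train_gb = training_bytes(P) / GB           # 112.0
infer_gb = inference_bytes(P) / GB          # 14.0
ratio = train_gb / infer_gb                 # 8.0
fp32_weights_gb = inference_bytes(P, bytes_per_value=4) / GB   # 28.0

print(f"training: {train_gb:.0f} GB   inference: {infer_gb:.0f} GB   ratio: {ratio:.0f}x")
```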

Forward reference: what's coming

Day 10 will build the full training loop, connecting backprop + AdamW + gradient clipping + mixed precision into a working training harness. Week 3 covers serving — where the absence of backward pass is what makes large-batch, low-latency inference tractable. You don't need to internalize those details now; just remember that everything in this lesson costs zero at inference time.

[Figure: Memory needed for a 7B model, training vs inference (rough FP32 / FP16 estimates). Training: weights 28 GB (FP32), gradients 28 GB, Adam m 28 GB, Adam v 28 GB, ≈ 112 GB total. Inference: weights 14 GB (FP16/BF16), no gradients, no optimizer state, activations for 1-2 layers only, ≈ 14 GB total. An 8× difference.]
Training a 7B model in FP32 requires roughly 112 GB — weights, gradients, and two Adam running states, each 28 GB. Inference of the same model in FP16 requires only the weights at 14 GB. Everything else is training-only state that disappears at serving time. This is the single most important memory fact in LLM deployment.

Lion, a more recent optimizer used by Google for some training runs, was discovered by an automated search. The update rule was not hand-designed; it was found by symbolic search over 10⁹ candidate update rules. The winning rule keeps only one running state per parameter (a momentum buffer, with no second-moment estimate), so its optimizer state takes half the memory of Adam's. That an automated search produced a competitive update rule is striking: it suggests the design space of optimizers may be much richer than human researchers have explored.

Mixed Precision Preview

FP32 master weights, BF16 forward and backward, FP32 optimizer step.

Day 10 will cover this in depth, but here is the headline. Modern training keeps the master copy of the weights in FP32 for accumulation accuracy, but runs the forward and backward passes in BF16 (or FP16) for memory and speed. The optimizer step then happens back in FP32. The result is roughly a 2× memory reduction and a 2–4× throughput improvement on Tensor Core hardware, with no measurable quality loss.

BF16 has the same exponent range as FP32 (just with less mantissa precision), so no special tricks are needed. FP16 has a narrower exponent range and can underflow small gradients to zero, so it requires an additional helper, torch.amp.GradScaler('cuda'), which scales the loss up before backward and unscales the gradients before the optimizer step, keeping the gradients within FP16's representable range during the backward pass.
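The underflow problem, and why scaling fixes it, can be shown with a single number. This NumPy sketch uses np.float16 as a stand-in for GPU FP16; the gradient value and the scale factor of 65536 are illustrative choices, and the real GradScaler adjusts its scale dynamically.

```python
import numpy as np

tiny_grad = 1e-8                         # smaller than anything FP16 can represent (~6e-8)
scale = 65536.0                          # 2**16, an illustrative loss-scale factor

lost = np.float16(tiny_grad)             # rounds to 0.0: the learning signal is gone
kept = np.float16(tiny_grad * scale)     # about 6.55e-4, comfortably representable

# Unscale in FP32 before the optimizer step to recover the true magnitude.
recovered = np.float32(kept) / np.float32(scale)

print(f"unscaled in FP16: {float(lost)}   recovered after scaling: {float(recovered):.3e}")
```

Scaling the loss multiplies every gradient by the same factor during backward (by linearity of the chain rule), which is why one scalar is enough to lift the whole gradient distribution into FP16's range.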

Floating-point formats: sign · exponent (range) · mantissa (precision) sign exponent → range mantissa → precision FP32 8 exp 23 mantissa 32 bits FP16 5 exp 10 mantissa 16 bits · narrow range, can underflow BF16 8 exp 7 mant 16 bits · FP32's range, less precision same exponent width as FP32 → same dynamic range, no GradScaler needed
All three formats spend one bit on sign. The exponent sets the range of representable magnitudes; the mantissa sets the precision. BF16 keeps FP32's full 8-bit exponent — so it covers the same range and rarely underflows — and pays for it in mantissa bits. FP16 splits its 16 bits differently, gaining precision but losing range, which is why it needs loss scaling.
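The range and precision of each format follow directly from the bit counts. The sketch below applies the standard formulas for IEEE-style binary formats in plain Python; float_format_stats is a hypothetical helper, and subnormals are ignored.

```python
def float_format_stats(exp_bits, mantissa_bits):
    """Largest finite value, smallest normal value, and spacing just above 1.0."""
    bias = 2 ** (exp_bits - 1) - 1
    largest = (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    smallest_normal = 2.0 ** (1 - bias)
    epsilon = 2.0 ** -mantissa_bits
    return largest, smallest_normal, epsilon

formats = {"FP32": (8, 23), "FP16": (5, 10), "BF16": (8, 7)}
stats = {name: float_format_stats(e, m) for name, (e, m) in formats.items()}

for name, (largest, smallest_normal, epsilon) in stats.items():
    print(f"{name}: max={largest:.3e}  min_normal={smallest_normal:.3e}  eps={epsilon:.3e}")
```

FP16 tops out at 65504 and its smallest normal value is about 6e-5, while BF16 shares FP32's smallest normal value of about 1e-38. The price shows up in the last column: BF16's spacing near 1.0 is eight times coarser than FP16's.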

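The PyTorch idiom is a single context manager around the forward pass and the loss (model, x, y, and optimizer are assumed to be defined already):

import torch
import torch.nn.functional as F

optimizer.zero_grad()                                    # clear gradients from the previous step
with torch.amp.autocast('cuda', dtype=torch.bfloat16):   # forward pass and loss run in BF16
    logits = model(x)
    loss = F.cross_entropy(logits, y)
loss.backward()                                          # called outside the autocast block
optimizer.step()                                         # updates the FP32 parameters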

For now: when you see autocast or bfloat16 in modern training code, this is the technique it's invoking.

Exercise

Six exercises, from whiteboard to notebook.

The companion notebook day-4-backprop-optimizers.ipynb walks through each one. Highlights:

1. Pen-and-paper backprop

Compute ∂L/∂W and ∂L/∂b for L = (sigmoid(W·x + b) − y)², where x is a vector, b is a scalar bias, and y is a scalar target. Show every chain-rule step. Then verify your analytic gradients by finite difference: perturb W and b by a small h, recompute L, and compare (L_plus − L_minus) / (2h) to the analytic derivative.
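For the verification step, a generic central-difference checker is enough. The sketch below is one possible shape for it, demonstrated on a stand-in function with a known gradient rather than on the exercise's loss; numerical_grad is a hypothetical helper name.

```python
import numpy as np

def numerical_grad(f, w, h=1e-5):
    """Central-difference estimate of df/dw for a scalar-valued f and a float array w."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        old = w.flat[i]
        w.flat[i] = old + h
        f_plus = f(w)
        w.flat[i] = old - h
        f_minus = f(w)
        w.flat[i] = old                  # restore before moving to the next entry
        grad.flat[i] = (f_plus - f_minus) / (2.0 * h)
    return grad

# Stand-in function (not the exercise's loss): f(w) = sum(w**2), so df/dw = 2w.
w = np.array([0.5, -1.0, 2.0])
analytic = 2.0 * w
numeric = numerical_grad(lambda v: float(np.sum(v ** 2)), w)
max_abs_error = float(np.max(np.abs(analytic - numeric)))

print(f"max abs error: {max_abs_error:.2e}")
```

To use it on the exercise, pass a function that recomputes L from W (or b) and compare the result against your hand-derived expression.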

2. Adam from scratch in NumPy

Implement the Adam update with bias correction. Test it on a quadratic problem f(W) = Wᵀ A W for a fixed positive-definite A. Compare the loss-vs-step curves for SGD, SGD+Momentum, and Adam at the same learning rate. With an ill-conditioned A, Adam should crush both of the others.

3. AdamW vs Adam on the Day 3 MLP

Train your Day 3 MNIST MLP with both optimizers, weight decay 0.01, three epochs each. Compare validation accuracy. AdamW tends to generalize slightly better; you should be able to see the difference.

4. Learning-rate sweep, find the U-curve

Train at lr ∈ {1e-1, 1e-2, 1e-3, 1e-4, 1e-5}. Plot final training loss vs learning rate on a log-log plot. The minimum of the U is your peak lr for this model and dataset.

5. Linear warmup + cosine schedule

Implement lr_at(step) from the schedule section. Plot the curve. Train your MLP with this schedule and compare its final loss to a constant-LR baseline.

6. Gradient clipping rescue

Crank the learning rate until your MLP NaN's out. Then add clip_grad_norm_(params, 1.0) and re-run with the same too-high LR. With clipping, training should survive — slowly — instead of diverging.

Self-Check

Nine questions before moving on.

Close the page and answer from memory. If you can't, re-read the relevant section.

  1. Why does backprop run backward? What's the cost difference compared to forward-mode automatic differentiation, for a network with many parameters and one scalar loss?
  2. Walk the backward pass through L = (a·b + c)² with a=2, b=3, c=4 from memory. State ∂L/∂a, ∂L/∂b, and ∂L/∂c.
  3. You have a linear layer Z = XW (X is 4×8, W is 8×16, so Z is 4×16). The upstream gradient ∂L/∂Z has shape 4×16. What are the shapes of ∂L/∂X and ∂L/∂W? Write the formulas.
  4. In Adam, what does the √v̂ term effectively do? Why is the algorithm called adaptive?
  5. Why is AdamW preferred over Adam for LLM training? What specifically changed between the two algorithms?
  6. State three distinct symptoms of training going wrong, and one cure for each.
  7. Your loss is exactly log(vocab_size) on the first step. Is that good or bad? Why?
  8. Why does a residual connection (x + f(x)) help training depth? (You don't need to know how Transformers implement it — just the gradient argument.)
  9. A 7B-parameter model costs roughly 14 GB to serve in FP16. Why does training the same model require roughly 112 GB? Name every component of the difference.

"The chain rule is the most important rule in machine learning."

Day 4
Further Reading

Go deeper.

Hand-picked references for this lesson. Free where possible.

Repo · Karpathy

karpathy/micrograd

Autograd in 150 lines of Python. Reading this end to end will demystify every framework's backward pass.

Blog · Visual

Olah — Calculus on Computational Graphs

Christopher Olah's clean visual explanation of forward vs reverse mode autodiff. Exceptional clarity.

Blog · Karpathy

Yes, you should understand backprop

Practical reasons why you should be able to derive backward passes by hand — and the bugs that hide otherwise.

Paper · 2014

Kingma & Ba — Adam Optimizer

The original Adam paper. Read Section 2 (algorithm) and Section 4 (convergence analysis).

Paper · 2017

Loshchilov & Hutter — AdamW

Decoupled weight decay regularization. The fix that made AdamW the LLM default.

Blog · Survey

Ruder — Gradient Descent Algorithms

Sebastian Ruder's comprehensive survey of optimization algorithms with intuition for each.

Paper · 2015

He et al. — Delving Deep into Rectifiers

Where He / Kaiming initialization comes from. Section 2 derives the 2/fan_in constant.

Paper · 2016

SGDR — Cosine Annealing

Loshchilov & Hutter on cosine schedules with warm restarts.

YouTube · 4 hr

Karpathy — Reproduce GPT-2 (124M)

Live coding. He walks through optimizer and scheduler choices for a real LLM training run.

Paper · 2023

Lion — Symbolic Discovery of Optimization Algorithms

Google's automated search for an optimizer. Found a single-state update rule (no v) that's competitive with AdamW.

Course · Stanford CS336

Stanford CS336 — Language Models from Scratch

Graduate-level Stanford course building a real LLM from scratch, including optimizer and training-loop lectures. Lecture notes are publicly available.

Paper · 2016

Ba et al. — Layer Normalization

The LayerNorm paper, which introduced the normalization technique used in the original Transformer and most of its descendants. Read alongside the vanishing-gradient section to see why normalization is necessary at depth.
