Yesterday you wrote a backward pass by hand. Today we make it general. Backpropagation turns out to be a single idea — the chain rule, applied to a graph of operations, walked in reverse — and once you see that idea, every framework's autograd stops feeling like magic.
On Day 3 we wrote the backward pass for a 2-layer MLP by hand. It was six lines of NumPy, and you understood every term. That works wonderfully for a small network, but it does not scale. A real Transformer has hundreds of operations per layer, and writing every backward pass by hand would take days and produce more bugs than working code.
So today we generalize. The good news is that there is only one idea to learn. Backpropagation is the chain rule from calculus, applied to a computational graph, walked in reverse from the loss back to the parameters. Every framework's automatic differentiation system — PyTorch's loss.backward(), MLX's mx.grad, JAX's grad — is a careful mechanization of that one idea. Once you have seen the shape of the algorithm clearly, the libraries stop feeling magical and start feeling like patient bookkeepers.
After backprop we turn to optimizers. An optimizer is the thing that decides how to use a gradient once you have one. Plain SGD does the obvious thing: subtract some fraction of the gradient from each parameter. Real LLM training does something more elaborate — it uses AdamW with warmup and cosine decay. That recipe took the field about a decade to arrive at. Knowing why each ingredient is in there — momentum, second-moment scaling, decoupled weight decay, warmup, cosine — is what separates "I copied the recipe" from "I can debug a training run when something goes wrong."
Three concrete reference points for what builds on this lesson:
- m and v to estimate fine-tune cost.
- 2/fan_in constant in He's formula comes from.

This is a math-heavy lesson. Every symbol is defined here in plain English and with a Python analogy. When you see a symbol later, come back to this table. You don't need to memorize it now; just know it exists.
| Symbol | Reads as | Python analogy | Example in this lesson |
|---|---|---|---|
| ∂L/∂w | "partial derivative of L with respect to w" | (L(w+h) - L(w-h)) / (2*h) for tiny h | How much does the loss change if we nudge weight w? |
| ∇L | "gradient of L", pronounced "nabla L" or "del L" | [dL/dw for w in params], a list of partials | The full gradient vector over all parameters |
| ∂z/∂x · ∂L/∂z | chain rule: multiply local derivative by upstream gradient | local_grad * upstream_grad | Every backward pass node does exactly this |
| Wᵀ | "W transposed": rows and columns swapped | W.T | ∂L/∂X = ∂L/∂Y · Wᵀ |
| β₁, β₂ | "beta one, beta two": exponential decay factors | scalars between 0 and 1; e.g. 0.9, 0.999 | Adam's momentum and second-moment decay rates |
| m̂, v̂ | "m-hat, v-hat": bias-corrected estimates | m / (1 - beta1**t) | Adam corrects early underestimation of moments |
| ε | "epsilon": a tiny constant added for stability | 1e-8 | Prevents divide-by-zero in Adam: m̂ / (√v̂ + ε) |
| λ | "lambda": weight decay strength | weight_decay = 0.1 | AdamW subtracts lr · λ · w directly |
| √x | "square root of x" | x ** 0.5 or math.sqrt(x) | Adam: √v̂ is the per-parameter RMS gradient |
| 1[x > 0] | "indicator function: 1 if x > 0, else 0" | float(x > 0) | ReLU local derivative |
| O(n) | "order n": grows proportionally to n | like saying "scales linearly with n" | Reverse-mode autodiff is O(graph size) |
The one thing to hold onto before reading further: ∂L/∂w is just a number that answers the question "if I add a tiny bit to w, how much does L change?" The chain rule lets you compute that number for every weight in a million-parameter network in a single backward pass.
Every computation a neural network performs can be drawn as a directed acyclic graph. The leaves are the inputs (data) and the parameters (weights and biases). The interior nodes are arithmetic operations. The root is the loss. Forward computation goes from the leaves to the root, computing one node value at a time. Backward computation goes from the root back to the leaves, computing one gradient at a time. That is the entire algorithm. Everything else is bookkeeping.
A worked example will make this concrete. Take the expression L = (a · b + c)² and plug in a=2, b=3, c=4. The graph has three intermediate nodes: d = a · b, then e = d + c, then L = e². The forward pass walks left to right, computing values: d = 6, then e = 10, then L = 100. The backward pass walks right to left, computing gradients of L with respect to every other node.
Figure: the computational graph for L = (a·b + c)². The black arrows are the forward pass: each node's value is computed from its inputs. The red labels are the backward pass: at each node, multiply the gradient that arrived from above by the local derivative of that node, and pass the result back to the inputs. The whole tree is walked once in each direction.

Now let's actually do the backward pass, one node at a time. We start at the root and assume ∂L/∂L = 1 (the loss has gradient 1 with respect to itself by definition).
The first node on the way back is L = e². The derivative of e² with respect to e is 2e. We already know e = 10 from the forward pass, so the gradient at this step is ∂L/∂e = 2 · 10 = 20.
The next node is e = d + c. Addition is the easiest case: the derivative of d + c is 1 with respect to either input. So the gradient simply copies through to both branches: ∂L/∂d = 20 and ∂L/∂c = 20.
Finally, d = a · b is multiplication, where the derivative with respect to one input is the other input. So ∂L/∂a = (∂L/∂d) · b = 20 · 3 = 60, and similarly ∂L/∂b = (∂L/∂d) · a = 20 · 2 = 40. That's the entire backward pass for this graph, summarized in three lines of arithmetic:

∂L/∂e = 2e = 2 · 10 = 20

∂L/∂d = ∂L/∂e · 1 = 20, ∂L/∂c = ∂L/∂e · 1 = 20

∂L/∂a = ∂L/∂d · b = 20 · 3 = 60, ∂L/∂b = ∂L/∂d · a = 20 · 2 = 40
Notice the recurring shape of every step. The gradient that arrives from above (call it the upstream gradient) is multiplied by something that depends on the current node — its local Jacobian — and the result is passed back to the inputs. That is the chain rule. It does not matter how complicated the network gets: every node only ever does this one thing.
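To see the "upstream gradient times local derivative" pattern as code, here is a minimal sketch of the same graph in plain Python. The function names `forward_backward` and `numeric_grad` are illustrative choices for this example, not part of any library. The finite-difference check at the end is one simple way to confirm a hand-written backward pass.

```python
def forward_backward(a, b, c):
    """Forward and reverse pass for L = (a*b + c)**2."""
    # Forward pass: compute and keep every intermediate value.
    d = a * b
    e = d + c
    L = e ** 2

    # Backward pass: upstream gradient times local derivative, root to leaves.
    dL_dL = 1.0
    dL_de = dL_dL * 2 * e      # L = e^2   -> local derivative 2e
    dL_dd = dL_de * 1.0        # e = d + c -> local derivative 1
    dL_dc = dL_de * 1.0
    dL_da = dL_dd * b          # d = a*b   -> local derivative b
    dL_db = dL_dd * a          # d = a*b   -> local derivative a
    return L, {"a": dL_da, "b": dL_db, "c": dL_dc}


def numeric_grad(f, args, i, h=1e-5):
    """Central finite difference of f with respect to args[i]."""
    plus = list(args)
    plus[i] += h
    minus = list(args)
    minus[i] -= h
    return (f(*plus) - f(*minus)) / (2 * h)


loss, grads = forward_backward(2.0, 3.0, 4.0)
loss_only = lambda a, b, c: forward_backward(a, b, c)[0]
checks = [numeric_grad(loss_only, (2.0, 3.0, 4.0), i) for i in range(3)]
print(loss, grads)   # 100.0 {'a': 60.0, 'b': 40.0, 'c': 20.0}
print(checks)        # finite-difference estimates, close to 60, 40, 20
```

Note that the backward pass reuses `e`, `a`, and `b` from the forward pass; those are exactly the cached values a framework would keep.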
You may have heard of "automatic differentiation" coming in two flavors: forward mode and reverse mode. They compute the same gradients but they walk the graph in different directions, and the choice matters enormously.
The key fact is that a neural network has many parameters and one scalar loss. Reverse mode walks the graph backward from the loss, so each edge in the graph is traversed exactly once. The total cost of computing all gradients is the same order as the forward pass: call it O(graph size). Forward mode would have to do one walk per parameter, which is O(graph × num_params). For a 7B-parameter LLM, that is a roughly seven-billion-fold difference. Reverse mode is the entire reason deep learning is feasible.
To compute local Jacobians on the way back, the backward pass needs values that were computed during the forward pass. For example, to evaluate ∂(a·b)/∂a = b, the algorithm needs the value of b at the time of the multiplication. So during the forward pass, the framework caches these intermediates — known as activations — for use later.
This is also why training uses much more memory than inference. Training keeps every activation alive until the backward pass uses it. Inference can throw activations away as soon as they have been consumed by the next layer. There is a standard memory-saving trick called activation checkpointing, where you save only a subset of the activations and recompute the rest during the backward pass, trading compute for memory. We will see it again later in the curriculum.
Backpropagation was independently invented at least four times between 1960 and 1986. Henry Kelley derived it for control theory in 1960. Stuart Dreyfus rediscovered it in 1962. Paul Werbos wrote it up for neural networks in his 1974 PhD thesis, where it was politely ignored. The version that finally caught on came from Rumelhart, Hinton, and Williams in 1986. The idea wasn't new in 1986; what changed was the willingness to apply it to multi-layer networks.
For any node z = f(x, y, …) with multiple inputs, the local Jacobian is a vector or matrix of partial derivatives — one entry per input. The backward pass multiplies the upstream gradient by this Jacobian to produce gradients for the inputs. Internalize the table below and you will be able to derive the backward pass for almost any expression by hand.
| Op | Forward | Local Jacobian |
|---|---|---|
| Add | z = x + y | ∂z/∂x = 1, ∂z/∂y = 1 |
| Multiply | z = x · y | ∂z/∂x = y, ∂z/∂y = x |
| Matmul | Y = X W | ∂L/∂X = ∂L/∂Y · Wᵀ; ∂L/∂W = Xᵀ · ∂L/∂Y |
| ReLU | z = max(0, x) | ∂z/∂x = 1[x > 0] |
| Sigmoid | z = σ(x) | ∂z/∂x = z(1 − z) |
| Tanh | z = tanh(x) | ∂z/∂x = 1 − z² |
| Exp | z = exp(x) | ∂z/∂x = z |
| Log | z = log(x) | ∂z/∂x = 1/x |
| Softmax + CE | L = CE(softmax(z), y) | ∂L/∂z = (p − one_hot(y)) / B |
The matmul Jacobian is the one most newcomers stumble on, because it looks more complicated than it actually is. Let's take it slowly.
Suppose we have a linear layer Y = X W, where X is a batch of inputs with shape (B, n) and W is a weight matrix with shape (n, m). Then Y has shape (B, m). During the backward pass, somebody upstream hands us the gradient of the loss with respect to Y — call it ∂L/∂Y, also with shape (B, m). Our job is to produce the gradient with respect to the input X and the gradient with respect to the weights W.
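The two backward operations look like this:

∂L/∂X = ∂L/∂Y · Wᵀ, with shapes (B, m) · (m, n) = (B, n)

∂L/∂W = Xᵀ · ∂L/∂Y, with shapes (n, B) · (B, m) = (n, m)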
Both are themselves matrix multiplications — the same operation, just with appropriate transposes. A trick that almost always works: if you cannot remember which transpose goes where, look at the shapes. The result has to come out the right size, and there is usually only one combination of inputs and transposes that produces the correct dimensions. For example, to produce something with shape (B, n) from ∂L/∂Y (shape B, m) and W (shape n, m), the only sensible matmul is ∂L/∂Y · Wᵀ.
The bigger lesson: every linear layer's backward pass costs two more matrix multiplications. A forward pass through a Transformer does roughly one matmul per linear layer. A backward pass does two. So a training step costs about three times as many FLOPs as an inference step. That ratio is fundamental to every memory and timing estimate later in the curriculum.
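A quick way to convince yourself of the two matmul gradients is to check them numerically. The sketch below uses NumPy and a made-up scalar loss, L = sum(Y * G), chosen only because its gradient with respect to Y is exactly G; the names `G`, `loss`, and `numeric_grad` are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
B, n, m = 4, 3, 2
X = rng.normal(size=(B, n))
W = rng.normal(size=(n, m))
G = rng.normal(size=(B, m))     # stand-in for the upstream gradient dL/dY


def loss(X, W):
    # Toy scalar loss chosen so that dL/dY equals G exactly.
    return float(np.sum((X @ W) * G))


# Analytic backward pass: two matmuls, transposes fixed by the shapes.
dX = G @ W.T        # (B, m) @ (m, n) -> (B, n)
dW = X.T @ G        # (n, B) @ (B, m) -> (n, m)


def numeric_grad(f, A, h=1e-6):
    """Central finite difference of scalar f() with respect to every entry of A."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        old = A[idx]
        A[idx] = old + h
        plus = f()
        A[idx] = old - h
        minus = f()
        A[idx] = old            # restore the original entry
        grad[idx] = (plus - minus) / (2 * h)
    return grad


num_dX = numeric_grad(lambda: loss(X, W), X)
num_dW = numeric_grad(lambda: loss(X, W), W)
print(dX.shape, dW.shape)
print(np.allclose(dX, num_dX, atol=1e-6), np.allclose(dW, num_dW, atol=1e-6))
```

The analytic version costs two matmuls; the numerical check costs two forward passes per entry, which is the forward-mode-style blow-up described earlier.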
By the end of this section you'll be able to name every gradient that flows through a single transformer sub-layer. Here's a concrete one-layer example: X (shape 2×3) enters a linear layer W (3×2) producing Z = XW, then a ReLU gives A = ReLU(Z), and some loss L is computed downstream from A. We'll walk each step.
1[Z>0]). For the matmul, the local Jacobians are transposes of the other operand. The weight gradient ∂L/∂W drops out as a side product of step 2; that is what gets accumulated and used by the optimizer.

Concrete numbers. Say X = [[1, 2, 3], [4, 5, 6]] (2×3), W = [[1, 0], [0, 1], [1, 1]] (3×2). Then Z = XW = [[4, 5], [10, 11]]. All four entries are positive, so 1[Z>0] is all-ones and the ReLU backward just passes the upstream gradient through unchanged: ∂L/∂Z = ∂L/∂A. Then ∂L/∂W = Xᵀ · ∂L/∂Z has shape (3, 2), and ∂L/∂X = ∂L/∂Z · Wᵀ has shape (2, 3). Shapes always check out when you keep transposes consistent.
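The same numbers can be pushed through NumPy. The lesson does not specify the loss, so this sketch assumes L = A.sum() purely for illustration, which makes the upstream gradient ∂L/∂A a matrix of ones.

```python
import numpy as np

X = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])    # (2, 3)
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # (3, 2)

# Forward pass, keeping Z because the ReLU backward needs it.
Z = X @ W                 # [[4, 5], [10, 11]]
A = np.maximum(0.0, Z)

# Assumed loss for illustration: L = A.sum(), so dL/dA is all ones.
dA = np.ones_like(A)

# Backward pass.
dZ = dA * (Z > 0)         # ReLU: upstream gradient times the indicator 1[Z > 0]
dW = X.T @ dZ             # (3, 2): [[5, 5], [7, 7], [9, 9]]
dX = dZ @ W.T             # (2, 3): [[1, 1, 2], [1, 1, 2]]

print(Z)
print(dW)
print(dX)
```

With a different loss only `dA` changes; the ReLU and matmul steps stay exactly the same.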
Most loss-and-final-layer combinations are messy on paper. The combination of softmax with cross-entropy is the spectacular exception. If you derive it by hand, the algebra involves an outer product Jacobian for the softmax that almost entirely cancels with the structure of the cross-entropy. The residue is breathtakingly clean:

∂L/∂z = (p − one_hot(y)) / B, where p = softmax(z) and B is the batch size (the division by B comes from averaging the loss over the batch)
That clean form is the reason every framework's cross_entropy function takes logits directly, not pre-computed probabilities. The function computes the softmax and the loss together, in one fused operation, so the magic gradient stays magical and numerically stable. If you are tempted to apply softmax yourself before passing the result into the cross-entropy function, resist. You will break the elegance and introduce a numerical issue (log of a small number) that is hard to debug.
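Here is a minimal NumPy sketch of a fused softmax plus cross-entropy that takes logits directly and returns the gradient (p − one_hot(y)) / B. The function name `softmax_cross_entropy` is an assumption for this example and is not any framework's implementation; the last few lines compare one gradient entry against a finite difference.

```python
import numpy as np


def softmax_cross_entropy(logits, targets):
    """Mean cross-entropy over a batch, computed from raw logits.

    Returns (loss, dlogits) where dlogits = (p - one_hot(targets)) / B.
    """
    B = logits.shape[0]
    shifted = logits - logits.max(axis=1, keepdims=True)   # max-subtraction for stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    loss = -log_probs[np.arange(B), targets].mean()

    grad = np.exp(log_probs)             # p = softmax(logits)
    grad[np.arange(B), targets] -= 1.0   # p - one_hot(y)
    return loss, grad / B


rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 5))
targets = np.array([0, 2, 1, 4])
loss, dlogits = softmax_cross_entropy(logits, targets)

# Finite-difference check of a single entry.
h = 1e-5
bumped_up, bumped_down = logits.copy(), logits.copy()
bumped_up[1, 2] += h
bumped_down[1, 2] -= h
numeric = (softmax_cross_entropy(bumped_up, targets)[0]
           - softmax_cross_entropy(bumped_down, targets)[0]) / (2 * h)
print(dlogits[1, 2], numeric)
```

Working in log space (`log_probs`) is what avoids taking the log of a tiny probability, which is the numerical issue mentioned above.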
Once you have a gradient, the optimizer's job is to decide how to use it. There are four variants worth understanding deeply, because each one was introduced to solve a real problem with the one before it. Everything else you'll see in the literature is a derivative of these four.
The simplest possible optimizer: take the parameter, subtract a scaled gradient, repeat.

w ← w − lr · ∇w
This works, more or less, on simple problems. Its main weakness is that the same learning rate is applied to every parameter. In a deep network, different layers and different positions in the weight matrix can have wildly different gradient scales — embedding rows tend to have very different gradient magnitudes from FFN weights, for instance. With a single global learning rate, you have to pick something that doesn't blow up the most-volatile parameters and doesn't waste time on the least-volatile ones. That compromise rarely works well at scale, which is why almost no LLM is trained with plain SGD.
The first refinement is to keep an exponentially-weighted moving average of past gradients, and step in that direction instead of the raw gradient.

v ← β · v + ∇w

w ← w − lr · v
v is the running average. β controls how quickly old gradients are forgotten — at the typical value of 0.9, recent gradients dominate but earlier ones still influence the average for a few dozen steps.
What does momentum buy you? Two things, both of which fall out of the same idea. The first is noise smoothing: if the gradient bounces around a bit due to mini-batch sampling, the running average is steadier than the raw gradient. The second is escape from narrow valleys. Imagine a long, narrow ravine in the loss landscape, where the gradient mostly points down the ravine but with small transverse oscillations. Plain SGD oscillates wall-to-wall. Momentum builds up speed along the ravine axis (because consecutive gradients keep pushing in roughly the same direction there) while the transverse oscillations cancel out (because they alternate sign). The optimizer ends up gliding along the ravine.
Momentum is standard in computer vision. For LLMs you almost never see it on its own, but it is one of the two ingredients in Adam.
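The ravine picture is easy to reproduce on a toy problem. The sketch below assumes a made-up quadratic, f(x, y) = 0.5 · (x² + 100 · y²), which is shallow along x and steep along y, and runs plain SGD and SGD with momentum at the same learning rate. The loss function, starting point, and learning rate are all illustrative choices, not values from a real training run.

```python
def grad(x, y):
    # Gradient of the toy ravine f(x, y) = 0.5 * (x**2 + 100 * y**2).
    return x, 100.0 * y


def loss(x, y):
    return 0.5 * (x ** 2 + 100.0 * y ** 2)


def run_sgd(lr, steps, start=(10.0, 1.0)):
    x, y = start
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy           # w <- w - lr * grad
    return loss(x, y)


def run_momentum(lr, beta, steps, start=(10.0, 1.0)):
    x, y = start
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx, vy = beta * vx + gx, beta * vy + gy   # v <- beta * v + grad
        x, y = x - lr * vx, y - lr * vy           # w <- w - lr * v
    return loss(x, y)


sgd_loss = run_sgd(lr=0.019, steps=100)
momentum_loss = run_momentum(lr=0.019, beta=0.9, steps=100)
print(f"SGD: {sgd_loss:.4f}   SGD+momentum: {momentum_loss:.4f}")
```

On this problem the learning rate is capped by the steep direction, so plain SGD crawls along the shallow one, while the accumulated velocity lets momentum cover much more ground along the ravine in the same number of steps.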
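Adam goes one step further. Instead of just smoothing the gradient, it tracks two running averages per parameter, corrects each for its zero initialization, and divides one by the square root of the other (g is the current gradient ∇w and t is the step count, starting at 1):

m ← β₁ · m + (1 − β₁) · g

v ← β₂ · v + (1 − β₂) · g²

m̂ = m / (1 − β₁ᵗ), v̂ = v / (1 − β₂ᵗ)

w ← w − lr · m̂ / (√v̂ + ε)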
The first running average, m, is just the momentum term you saw above — a smoothed estimate of the gradient's mean. Adam calls this the "first moment." The second running average, v, is the smoothed mean of the gradient squared. Adam calls this the "second moment." Why track both?
Because together they tell you, for each parameter individually, how big a step to take. If a parameter's gradient has been consistently large, then v is large, √v̂ is large, and the update m̂ / √v̂ is shrunk down. If a parameter's gradient has been consistently small, v is small, and the update is amplified. The effect is that every parameter ends up with its own adaptive learning rate, automatically. That is what the "adaptive" in Adam stands for.
Why bias correction? Both m and v start at zero, so early in training the running averages have not seen enough gradient samples and underestimate the true mean and second moment: after one step, m = (1 − β₁) · g and v = (1 − β₂) · g². The two biases do not cancel. With the defaults β₁ = 0.9 and β₂ = 0.999, the uncorrected first update m / √v would be about 0.1 / √0.001 ≈ 3 times larger than the corrected one, which makes the earliest steps needlessly aggressive. The bias-correction terms 1 / (1 − βᵗ) divide out exactly this initialization bias. As t grows, both bias-correction terms approach 1 and have no effect.
The practical consequence of all this: Adam works at lr = 1e-3 on a startling variety of problems, because the per-parameter √v̂ rescaling has already normalized away most of the per-parameter scale variation. You don't have to tune the learning rate as carefully as you do with SGD.
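To make the per-parameter rescaling concrete, here is a minimal sketch of a single Adam update for one scalar parameter, written in plain Python. The function name `adam_step` and the two example gradients are assumptions for illustration. The point to notice is that gradients six orders of magnitude apart produce almost the same step size.

```python
import math


def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter. t counts from 1."""
    m = beta1 * m + (1 - beta1) * g          # first moment (smoothed gradient)
    v = beta2 * v + (1 - beta2) * g * g      # second moment (smoothed squared gradient)
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Two parameters whose gradients differ by six orders of magnitude.
results = {}
for name, g in [("large_grad", 1000.0), ("small_grad", 0.001)]:
    w, m, v = 0.0, 0.0, 0.0
    w, m, v = adam_step(w, g, m, v, t=1)
    results[name] = w
print(results)   # both weights move by roughly lr = 1e-3
```

Plain SGD at the same learning rate would move the first parameter by 1.0 and the second by 1e-6.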
Before AdamW, the usual way to regularize Adam was to fold L2 regularization into the gradient, by adding λw to ∇w before computing m and v. On paper this looks like ordinary weight decay, since the gradient of the L2 penalty (λ/2)‖w‖² is exactly λw. In practice it interacts badly with Adam.
The problem is that the regularization strength gets divided by √v̂ along with the gradient. So a parameter with a small v sees its weight decay amplified, while a parameter with a large v sees its weight decay shrunk. The result is that the actual amount of regularization depends on which parameters happen to have noisy gradients — a confusing, hard-to-tune coupling.
Loshchilov and Hutter fixed this in 2017 by decoupling weight decay from the moment-based update. AdamW applies weight decay directly to the parameter, untouched by the second-moment scaling:

w ← w − lr · m̂ / (√v̂ + ε) − lr · λ · w
The result is predictable, stable, and transferable. AdamW is the default for every modern LLM. GPT-3, LLaMA, Mistral — all use it. A typical weight decay setting is λ = 0.1.
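The coupling problem can be seen in a small experiment. This sketch assumes a single weight whose loss gradient alternates between +s and −s, so the gradient carries no signal on average and any systematic drift in the weight comes from regularization. The function `train` and the alternating-gradient setup are illustrative assumptions, not a real training run.

```python
import math


def train(s, decoupled, steps=1000, lr=1e-3, lam=0.1,
          beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on one weight whose loss gradient alternates +s, -s (zero mean).

    decoupled=False adds lam * w to the gradient (L2 folded into Adam).
    decoupled=True subtracts lr * lam * w from the weight directly (AdamW).
    """
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = s if t % 2 else -s
        if not decoupled:
            g = g + lam * w                  # L2 penalty enters the moments
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        decay = lr * lam * w if decoupled else 0.0
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps) - decay
    return w


for s in (10.0, 0.1):
    print(f"gradient scale {s:>5}: "
          f"Adam+L2 w = {train(s, False):.3f}, AdamW w = {train(s, True):.3f}")
```

In this setup the Adam+L2 weight should shrink far more when the gradient scale is small than when it is large, while the AdamW weight ends up essentially the same for both scales, because its decay term never passes through √v̂.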
Here is the complete family, so you can see the evolution at a glance. "States per param" determines the optimizer's memory footprint — critical at billion-parameter scale.
| Optimizer | Update rule (core) | States / param | Key hyperparams | When to use |
|---|---|---|---|---|
| SGD | w ← w − lr · ∇w | 0 | lr | Computer vision with carefully tuned LR; rarely for LLMs |
| SGD + Momentum | v ← βv + ∇w; w ← w − lr · v | 1 (v) | lr, β ≈ 0.9 | Vision pretraining; faster than plain SGD on narrow valleys |
| RMSProp | v ← β₂v + (1−β₂)g²; w ← w − lr·g/√(v+ε) | 1 (v) | lr, β₂ ≈ 0.999 | RNNs and non-stationary objectives; predecessor to Adam |
| Adam | w ← w − lr · m̂/(√v̂+ε) | 2 (m, v) | lr, β₁=0.9, β₂=0.999, ε=1e-8 | Most DL tasks; strong default for fine-tuning |
| AdamW | w ← w − lr · m̂/(√v̂+ε) − lr·λ·w | 2 (m, v) | lr, β₁, β₂, ε, λ=0.1 | Default for all modern LLM pretraining and fine-tuning |
| Lion | w ← w − lr · sign(β₁m + (1−β₁)g) | 1 (m) | lr, β₁=0.9, β₂=0.99, λ | Memory-constrained training; some Google runs; needs 3–10× lower lr than Adam |
| AdaFactor | Factored second moment (row + col) | ~2/rank of W | Often no lr needed (relative step size) | Very large models where optimizer state dominates memory (T5, PaLM) |
Memory math for a 7B model. 7B parameters × 4 bytes each = 28 GB for weights. Adam (2 states × 4 bytes × 7B) = another 56 GB: twice the weight cost, just for optimizer state. That is why inference needs only the weights (28 GB in FP32, 14 GB in FP16) while training the same model needs ~84 GB in FP32 for weights plus optimizer state alone, and ~112 GB once the 28 GB of gradients is counted, before any activations. Lion halves the optimizer state to 28 GB by using only one running state.
For this curriculum, the default is AdamW unless we explicitly say otherwise.
√v̂ divides the steep direction's gradient by its own RMS, effectively re-rounding the ellipses into near-circles, so Adam tracks the valley axis directly.

Adam stands for ADaptive Moment estimation. Kingma and Ba published it in 2014, and within months it had become the default optimizer in deep learning. Three years later, Loshchilov and Hutter (2017) noticed that Adam was applying weight decay incorrectly. Their fix is AdamW, now the standard for every LLM you've heard of. The lesson is humbling: a subtle bug sat inside the most popular optimizer in machine learning for three years before anyone wrote a paper about it.
A constant learning rate is rarely optimal. The schedule that essentially every modern LLM uses is the same five-line function: linearly ramp the learning rate from 0 to the target peak over the first few hundred to few thousand steps, and then cosine-decay it down to nearly zero over the rest of training.
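import math

def lr_at(step, peak_lr=3e-4, warmup=500, total=10000):
    # Linear warmup from 0 to peak_lr over the first `warmup` steps.
    if step < warmup:
        return peak_lr * step / warmup
    # Cosine decay from peak_lr toward 0 over the remaining steps (clamped past `total`).
    progress = min(1.0, (step - warmup) / (total - warmup))
    return peak_lr * 0.5 * (1 + math.cos(math.pi * progress))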
At the very beginning of training, Adam's second-moment estimate v̂ is unreliable — there simply haven't been enough gradient samples yet to get a stable variance estimate. Combine a high learning rate with a noisy v̂ and the update m̂ / √v̂ can spike to enormous values. The result is one bad step that destabilizes the model and possibly NaN's it out. Warmup avoids this by starting the learning rate at zero and ramping it up linearly, so by the time you reach the peak lr, the running statistics have stabilized.
Cosine has two practical virtues. It is smooth, so there are no manual step decisions to tune. And it is gentle at both ends: the rate stays close to the peak right after warmup, falls fastest around the middle of the run, and flattens out again as it approaches zero. Empirically, cosine consistently produces lower final loss than constant learning rate or step decay on LLM workloads. It has become the default for that reason, not because of any deep theoretical justification.
The standard procedure is to run a learning-rate sweep. Pick five values an order of magnitude apart — say 1e-1, 1e-2, 1e-3, 1e-4, 1e-5 — and train each for a few hundred steps. Plot the final loss against the learning rate on a log scale. You will see a U-shape: too small, the model crawls; too large, it diverges; in between, it is fine. Pick the minimum.
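The shape of a sweep can be previewed on a toy problem before spending GPU time. This sketch assumes plain gradient descent on a made-up quadratic, f(x, y) = 0.5 · (x² + 100 · y²), instead of a real model, so the specific numbers mean nothing beyond illustrating the U-shape; `final_loss` is a hypothetical helper.

```python
import math


def final_loss(lr, steps=200):
    """Plain gradient descent on the toy loss f(x, y) = 0.5 * (x*x + 100 * y*y)."""
    x, y = 1.0, 1.0
    for _ in range(steps):
        x, y = x - lr * x, y - lr * 100.0 * y
        if abs(x) > 1e6 or abs(y) > 1e6:
            return math.inf               # treat a blow-up as divergence
    return 0.5 * (x * x + 100.0 * y * y)


sweep = {lr: final_loss(lr) for lr in (1e-1, 1e-2, 1e-3, 1e-4, 1e-5)}
for lr, loss in sweep.items():
    print(f"lr = {lr:.0e}   final loss = {loss:.4g}")
best_lr = min(sweep, key=sweep.get)
print("best:", best_lr)
```

The largest rate diverges, the smallest barely moves, and the minimum sits in between. For a real model you would replace `final_loss` with a short training run and keep everything else the same.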
Rough magnitudes to expect:
lr ∝ 1/√t after warmup. Older recipe; cosine has largely replaced it.

You can have a perfect optimizer, a perfect loss, and a perfect dataset, and still completely fail to train a network if you initialize its weights poorly. Two principles drive any sensible initialization scheme.
The first principle is to break symmetry. If you initialize every weight to zero (or to any single value), every neuron in a given layer computes the same thing on the same input. Their gradients then come out identical. The next update changes every neuron in the same way. The neurons stay in lockstep forever. The network never learns. The fix is to initialize with random values, so that each neuron starts in a slightly different place and gradients quickly diverge.
The second principle is to keep activations and gradients well-scaled. If the initial scale is too small, activations shrink as they pass through layers and gradients vanish to zero. If the initial scale is too large, activations grow and saturate, and the gradient again vanishes (because activation functions like sigmoid have nearly-zero derivatives in their saturated regions). The right scale depends on the activation function and the layer width.
Xavier init was designed to keep activation variance approximately constant across layers, under the assumption that the activation function is roughly linear near the origin. That assumption holds for tanh and sigmoid in their useful range, so Xavier works well for those.
For ReLU, Xavier is too small. The reason is that ReLU zeros out roughly half of its inputs (the negative half), so the mean squared activation after ReLU is roughly half of what it was before. Without compensation, that halving repeats with every layer, and activations and gradients shrink exponentially in depth. He init compensates by doubling the variance of the weights, from the 1/fan_in that preserves scale in a linear layer to 2/fan_in; that is where the factor of 2 comes from. With He init for ReLU layers, activation variance stays roughly constant across the depth of the network.
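The halving argument can be checked empirically. This NumPy sketch pushes random inputs through a stack of ReLU layers and reports the mean squared activation at the end, once with weight variance 1/fan_in and once with 2/fan_in. The helper `mean_square_after`, the width, the depth, and the batch size are all assumptions picked for a quick experiment.

```python
import numpy as np


def mean_square_after(depth, weight_std_fn, width=512, batch=256, seed=0):
    """Push random inputs through `depth` ReLU layers and return mean(a**2) at the end."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(batch, width))
    for _ in range(depth):
        W = rng.normal(scale=weight_std_fn(width), size=(width, width))
        a = np.maximum(0.0, a @ W)
    return float(np.mean(a ** 2))


small = mean_square_after(20, lambda fan_in: np.sqrt(1.0 / fan_in))   # variance 1/fan_in
he = mean_square_after(20, lambda fan_in: np.sqrt(2.0 / fan_in))      # variance 2/fan_in (He)
print(f"1/fan_in: {small:.2e}   2/fan_in: {he:.2e}")
```

With 1/fan_in the signal should shrink by roughly a factor of two per layer, so after 20 layers it is around a millionth of its starting size; with 2/fan_in it should stay within a small factor of 1.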
Most modern LLMs don't strictly follow He or Xavier — they use a fixed small normal distribution instead. The reasons are partly historical (it's what GPT-2 used and worked well) and partly that AdamW's adaptive scaling absorbs much of the input-scale variation that Xavier and He are trying to compensate for.
- N(0, 0.02), the GPT-2 convention.
- 1/√(2L), where L is the number of layers. This extra factor keeps deeper networks stable at initialization.
- γ: 1. LayerNorm bias β: 0.
- N(0, 0.02), sometimes with a different scale.

The exact recipe is documented in Section 2.3 of the GPT-2 paper, and LLaMA-2 uses similar constants.
You might wonder: if symmetry must be broken, why is it OK to initialize every bias to zero? The answer goes back to Day 3. The bias gradient is db = dz.sum(axis=0), where dz is the upstream gradient. Even if every bias starts at zero, dz differs across neurons (because the weights are random), so different biases receive different gradients on the very first backward pass. Symmetry breaks immediately. Weights, by contrast, would stay locked together if initialized equal.
Training a neural network goes wrong in fairly predictable ways. Recognizing each pattern by its signature in the loss curve and gradient-norm trace will save you days of confusion.
| Symptom | Likely cause | Cure |
|---|---|---|
| Loss → NaN | Learning rate too high; exploding gradients; numerical issues like dividing by something close to zero in softmax or RMSNorm | Lower the learning rate; add gradient clipping with max_norm = 1.0; make sure your softmax uses the max-subtraction trick |
| Loss flat at log(C) | Symmetry not broken (zero init); or you applied softmax before passing into F.cross_entropy, ruining the magic gradient | Use random init; pass logits (not probabilities) to the cross-entropy function |
| Loss decreases then plateaus high | Model under-capacity; or the learning-rate scheduler ended too early | Bigger model; longer schedule |
| Train loss far below val loss | Overfitting | Increase weight decay (AdamW λ); add dropout; collect more data |
| Gradient norm spikes | Outlier batch; numerical instability | Gradient clipping by total norm |
| Activations all zero in some neurons | Dead ReLUs (initial bias too negative, or learning rate too high) | He init; switch to LeakyReLU or GELU; lower learning rate |
By the end of this section you'll understand why residual connections are not a nice-to-have but a hard requirement for deep networks. You don't need to know the full Transformer architecture yet — just file this away and it will click on Day 7.
Vanishing gradients happen when gradients shrink toward zero as they propagate back through layers. The intuition: if each layer's local Jacobian has values slightly less than 1, those values multiply together as we walk backward. With 50 layers, a factor of 0.9 per layer gives 0.9⁵⁰ ≈ 0.005 — the gradient at layer 1 is 200× smaller than at layer 50. Early layers receive almost no learning signal and fail to train.
Sigmoid and tanh are the classic culprits. Sigmoid's derivative is σ(x)(1−σ(x)), which peaks at 0.25 and falls toward zero for large positive or negative inputs. If activations drift outside the linear region, the gradient effectively stops.
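The arithmetic behind both claims fits in a few lines of plain Python. This sketch only multiplies activation derivatives together and ignores the weight matrices, so treat it as an upper-bound illustration under that simplifying assumption and not as a model of a real network.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)


# Product of identical per-layer factors, as in the 50-layer example.
print(f"0.9 ** 50 = {0.9 ** 50:.4f}")

# Best case for sigmoid: every pre-activation sits at 0, where the derivative peaks.
print(f"sigmoid'(0) = {sigmoid_grad(0.0)}")
for depth in (5, 10, 20):
    print(f"depth {depth:>2}: gradient factor <= {sigmoid_grad(0.0) ** depth:.2e}")

# Saturated case: a pre-activation of 5 gives a far smaller local derivative.
print(f"sigmoid'(5) = {sigmoid_grad(5.0):.4f}")
```

Even in the best case, each sigmoid layer costs a factor of four, which is why deep sigmoid stacks were so hard to train.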
The modern Transformer recipe is engineered from the ground up to defeat vanishing gradients: ReLU and GELU have derivative ≈ 1 over half their domain; He init keeps activation variance constant across depth; residual connections (Day 7) provide gradient highways; LayerNorm (Day 7) re-centers activations every layer. Together these make 100-layer networks routine. You don't need to know the mechanism yet — just hold onto the motivation.
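The mirror image. If local Jacobians are consistently > 1, gradients grow exponentially and weights blow up to NaN. The fix is gradient clipping by total norm: compute the norm of all gradients taken together, and rescale them if it exceeds a threshold.

if ‖g‖ > max_norm: g ← g · max_norm / ‖g‖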
In PyTorch: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0). One line, one number to tune, prevents most LLM training runs from blowing up. The typical value max_norm = 1.0 is a near-universal default for LLMs. Gradient norm itself is worth logging — a spike in the norm trace is often the first visible sign of a data problem or numerical instability, appearing several steps before the loss diverges.
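For intuition about what that one line does, here is a NumPy sketch of clipping by global norm. It is a simplified illustration and not PyTorch's implementation; the function name `clip_by_global_norm` and the small `eps` guard against division by zero are assumptions of this example.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale a list of gradient arrays so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


grads = [np.array([3.0, 4.0]), np.array([[12.0]])]     # global norm = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
print(norm_before, norm_after)
```

Every gradient is scaled by the same factor, so the direction of the overall update is preserved and only its length changes. Returning the pre-clip norm makes it easy to log.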
A ReLU neuron is "dead" when its pre-activation is always negative on every input it sees. The output is then always zero, the local derivative is zero, no gradient flows to the incoming weights, and the neuron stays dead forever. This usually happens when the learning rate is too high (one bad step pushes the bias deeply negative) or initialization is poor.
Cures in order of effectiveness: He init (places activations in the right magnitude to start), lower learning rate (smaller steps don't overshoot), and switch to LeakyReLU or GELU (which have small but nonzero derivatives for negative inputs, so a "dead" neuron can revive itself).
Before you debug anything else, run this test. Take a single batch, repeat it for a hundred steps, and see if the model can memorize it perfectly. If it cannot, you have a real bug — not a hyperparameter problem. The bug is usually the wrong loss function, a frozen parameter (you forgot to enable gradients), or a miscoded forward pass that silently produces wrong outputs. Fix the bug first; then tune.
Every concept in this lesson — gradients, optimizer states, activation caching — exists solely to support training. At inference time, none of it runs. That single fact is the origin of the enormous memory gap between training a model and serving it.
During a forward pass at inference time you need: the model weights, and the activations of the current layer (for the next layer's matmul). As soon as a layer's output has been used, its input activations can be freed. At any moment you hold at most two layers' activations in memory — not all of them at once.
During training the forward pass must keep every activation alive until the backward pass has used it. For a deep Transformer with long sequences, this can be gigabytes of intermediate tensors. Activation checkpointing (recomputing some activations on the fly during backward) trades compute for memory, but still requires re-running a portion of the forward pass. Inference never has this problem at all.
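Adam's running states (m and v) exist only during training. For a model with P parameters in FP32, the full training memory bill is roughly:

training ≈ 4P (weights) + 4P (gradients) + 8P (Adam m and v) = 16P bytes, plus activations

inference ≈ 4P bytes in FP32, or 2P bytes in FP16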
For a 7B-parameter model: training ≈ 112 GB in FP32 before activations, inference ≈ 14 GB in FP16. An 8× difference: 4× from dropping gradients and optimizer state, and another 2× from serving in half precision. This is why a model that needed 8 A100s to train can be served on a single one (or even a consumer GPU after quantization).
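These estimates are simple enough to wrap in a helper. The sketch below is a back-of-the-envelope calculator that ignores activations and framework overhead; the function name `memory_gb` and its parameters are assumptions for this example, and 1 GB is taken as 10⁹ bytes to match the figures in the text.

```python
def memory_gb(num_params, bytes_per_value=4, optimizer_states=2, training=True):
    """Rough memory estimate in GB (1 GB = 1e9 bytes), ignoring activations.

    optimizer_states is the number of running states per parameter:
    0 for plain SGD, 1 for momentum or Lion, 2 for Adam and AdamW.
    """
    weights = num_params * bytes_per_value
    if not training:
        return weights / 1e9
    gradients = num_params * bytes_per_value
    states = num_params * bytes_per_value * optimizer_states
    return (weights + gradients + states) / 1e9


P = 7e9
print("AdamW training, FP32:", memory_gb(P))                                        # 112.0
print("Lion training, FP32: ", memory_gb(P, optimizer_states=1))                    # 84.0
print("Inference, FP32:     ", memory_gb(P, training=False))                        # 28.0
print("Inference, FP16:     ", memory_gb(P, bytes_per_value=2, training=False))     # 14.0
```

Real training runs need more than this because of activations, which depend on batch size and sequence length, so treat the output as a floor.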
Day 10 will build the full training loop, connecting backprop + AdamW + gradient clipping + mixed precision into a working training harness. Week 3 covers serving — where the absence of backward pass is what makes large-batch, low-latency inference tractable. You don't need to internalize those details now; just remember that everything in this lesson costs zero at inference time.
Lion, a more recent optimizer used by Google for some training runs, was discovered by an automated search. The algorithm wasn't designed by a human; it was found by symbolic search over 10⁹ candidate update rules. The winning rule keeps only one running state per parameter, so it uses half the memory of Adam. The fact that a brute-force search produced a competitive update rule is striking — it suggests the design space of optimizers may be much richer than human researchers have explored.
Day 10 will cover this in depth, but here is the headline. Modern training keeps the master copy of the weights in FP32 for accumulation accuracy, but runs the forward and backward passes in BF16 (or FP16) for memory and speed. The optimizer step then happens back in FP32. The result is roughly a 2× memory reduction and a 2–4× throughput improvement on Tensor Core hardware, with no measurable quality loss.
BF16 has the same exponent range as FP32 (just with less mantissa precision), so no special tricks are needed. FP16 has a narrower exponent range and can underflow gradients to zero, so it requires an additional helper, torch.amp.GradScaler('cuda'), that scales the loss up before backward and then unscales the gradients before the optimizer step, keeping the gradients in FP16's representable range during the backward pass.
The PyTorch idiom is a one-liner:
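import torch
import torch.nn.functional as F

# Assumes model, optimizer, x, and y already exist and live on the GPU.
with torch.amp.autocast('cuda', dtype=torch.bfloat16):
    logits = model(x)                  # forward pass runs in BF16 where safe
    loss = F.cross_entropy(logits, y)  # pass logits, not probabilities
loss.backward()                        # backward is called outside the autocast block
optimizer.step()
optimizer.zero_grad()                  # clear gradients before the next step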
For now: when you see autocast or bfloat16 in modern training code, this is the technique it's invoking.
The companion notebook day-4-backprop-optimizers.ipynb walks through each one. Highlights:
Compute ∂L/∂W and ∂L/∂b for L = (sigmoid(W·x + b) − y)², where x is a vector, b is a scalar bias, and y is a scalar target. Show every chain-rule step. Then verify your analytic gradients by finite difference: perturb W and b by a small h, recompute L, and compare (L_plus − L_minus) / (2h) to the analytic derivative.
Implement the Adam update with bias correction. Test it on a quadratic problem f(W) = Wᵀ A W for a fixed positive-definite A. Compare the loss-vs-step curves for SGD, SGD+Momentum, and Adam at the same learning rate. With an ill-conditioned A, Adam should crush both of the others.
Train your Day 3 MNIST MLP with both Adam and AdamW, weight decay 0.01, three epochs each. Compare validation accuracy. AdamW tends to generalize slightly better; you should be able to see the difference.
Train at lr ∈ {1e-1, 1e-2, 1e-3, 1e-4, 1e-5}. Plot final training loss vs learning rate on a log-log plot. The minimum of the U is your peak lr for this model and dataset.
Implement lr_at(step) from the schedule section. Plot the curve. Train your MLP with this schedule and compare its final loss to a constant-LR baseline.
Crank the learning rate until your MLP NaN's out. Then add clip_grad_norm_(params, 1.0) and re-run with the same too-high LR. With clipping, training should survive — slowly — instead of diverging.
Close the page and answer from memory. If you can't, re-read the relevant section.
- L = (a·b + c)² with a=2, b=3, c=4 from memory. State ∂L/∂a, ∂L/∂b, and ∂L/∂c.
- Z = XW (X is 4×8, W is 8×16, so Z is 4×16). The upstream gradient ∂L/∂Z has shape 4×16. What are the shapes of ∂L/∂X and ∂L/∂W? Write the formulas.
- √v̂ term effectively do? Why is the algorithm called adaptive?
- log(vocab_size) on the first step. Is that good or bad? Why?
- x + f(x)) help training depth? (You don't need to know how Transformers implement it; just the gradient argument.)

"The chain rule is the most important rule in machine learning."
Hand-picked references for this lesson. Free where possible.
- Autograd in 150 lines of Python. Reading this end to end will demystify every framework's backward pass.
- Christopher Olah's clean visual explanation of forward vs reverse mode autodiff. Exceptional clarity.
- Practical reasons why you should be able to derive backward passes by hand, and the bugs that hide otherwise.
- The original Adam paper. Read Section 2 (algorithm) and Section 3 (convergence analysis).
- Decoupled weight decay regularization. The fix that made AdamW the LLM default.
- Sebastian Ruder's comprehensive survey of optimization algorithms with intuition for each.
- Where He / Kaiming initialization comes from. Section 2 derives the 2/fan_in constant.
- Live coding. He walks through optimizer and scheduler choices for a real LLM training run.
- Google's automated search for an optimizer. Found a single-state update rule (no v) that's competitive with AdamW.
- MIT-level course building a real LLM from scratch, including optimizer and training-loop lectures. Lecture notes are publicly available.
- The LayerNorm paper, the normalization technique used in every Transformer. Read alongside the vanishing-gradient section to see why normalization is necessary at depth.