Day 07 — The Transformer Block & Full Architecture · LLM Inference Engineer Curriculum

Why This Lesson

Yesterday: attention. Today: everything around attention that makes a Transformer actually work.

On Day 6 we built scaled dot-product attention from scratch. Attention is the most important idea in a Transformer, but it isn't the entire model. A real Transformer block wraps attention in several supporting structures: layer normalization to keep activations well-behaved, residual connections to let gradients flow through deep stacks, and a feed-forward sublayer to do per-token computation. Today we will assemble all of those pieces, and then stack the result into a complete decoder-only language model. This is the same architecture family as GPT, LLaMA, and Mistral; the newer models swap in variants such as RMSNorm and SwiGLU, which we cover along the way.

This is the Week 1 capstone. By the end of today you will have a complete forward pass for a tiny LLM, written in PyTorch, with every intermediate shape verified. On Day 9 we will train it.

Learning objectives

Implement layer normalization from scratch and explain how it differs from batch normalization and from RMSNorm.
Build a feed-forward sublayer with GELU. Understand SwiGLU as the LLaMA-style replacement and know why it exists.
Combine attention, FFN, residual connections, and norms into a single Transformer block using the pre-norm pattern.
Explain the "residual stream" view: why residuals make deep networks trainable and what they mean for the gradient highway.
Stack blocks into a full decoder-only language model with token embeddings, positional embeddings, an output head, and weight tying.
Compute parameter counts for any GPT-style configuration on the back of an envelope, including the 12D² per-block rule.
Distinguish encoder-only, decoder-only, and encoder-decoder architectures, and explain why decoder-only became the dominant family.
Trace tensor shapes [B,T] → [B,T,D] → … → [B,T,V] through every sublayer from embedding to logits.

The Big Picture

Six stages, one picture, tensor shapes at every arrow.

Before diving into each component, look at the full architecture in one diagram — with every tensor shape labeled. If you stare at this for two minutes before reading the rest of the lesson, the details will fall into place naturally.

The full decoder-only stack, with every tensor shape labeled. Token IDs [B, T] enter at the top; vocabulary logits [B, T, V] exit at the bottom. The shape [B, T, D] is preserved through all N blocks. Weight tying (dashed arrow) means the LM head shares storage with the token embedding matrix, saving V × D parameters.

Six stages. Six shapes to internalize. The whole model is a chain of transformations that keeps the shape [B, T, D] constant until the very last linear projection widens it to [B, T, V]. If you internalize nothing else today, internalize that shape story.

Layer Normalization

Normalize each token across its features. Re-scale with learned parameters.

Before we look at the formula, it is worth being clear about what problem normalization is trying to solve. As activations flow through the layers of a deep network, their magnitudes drift. Some layers' outputs grow large. Other layers produce activations that shrink toward zero. Either drift causes problems: large activations push activation functions into saturation regions where the gradient is tiny, and small activations shrink the signal until learning effectively stops. Without some form of normalization, very deep networks become unstable and slow to train.

Layer normalization is one solution. For each token independently, it computes the mean and variance across the token's feature dimensions, normalizes the token to zero mean and unit variance, and then re-scales using two learnable parameters:

given x ∈ ℝ^D (a single token's embedding vector, length D):
  μ     = mean(x)                        # Python: x.mean(), average over D floats
  σ²    = var(x)                         # Python: x.var(unbiased=False), variance over D floats
  LN(x) = γ · (x − μ) / √(σ² + ε) + β    # γ, β ∈ ℝ^D (learned), ε = 1e-5

Symbol guide: ∈ means "is in" (like Python in); ℝ^D means "a vector of D real-valued floats". The learned parameters γ (gamma) and β (beta) let the layer recover any scaling and shifting it might want — the normalization reduces representational power, but the learned affine transform restores it. The small constant ε is inside the square root to prevent division by zero.

One detail is critical. Layer norm normalizes across the feature dimension of each token, not across the batch dimension. This is the key difference from batch normalization, which was popular in computer vision but has problems for sequence data. Because layer norm doesn't mix information across batch elements, it doesn't care about batch size, doesn't behave differently in training versus evaluation mode, and parallelizes cleanly across hardware. These practical advantages are why layer norm is the default for transformers.
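The batch-independence claim can be checked directly. The sketch below is an illustration rather than part of the lesson's model code: the tensor sizes are arbitrary, and `nn.BatchNorm1d` in training mode stands in for "batch normalization" as the contrast case.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T, D = 4, 3, 8
x = torch.randn(B, T, D)

ln = nn.LayerNorm(D)

# LayerNorm: statistics are computed over the last (feature) dimension only,
# so each token's output is independent of the rest of the batch.
full = ln(x)           # normalize all 4 sequences together
alone = ln(x[:1])      # normalize the first sequence by itself
ln_batch_independent = torch.allclose(full[:1], alone, atol=1e-6)

# BatchNorm1d in training mode: statistics are computed across the batch,
# so the same sequence normalizes differently depending on its batch-mates.
bn = nn.BatchNorm1d(D)
bn.train()
flat = x.reshape(B * T, D)            # BatchNorm1d expects (N, C)
bn_full = bn(flat)[:T]                # first sequence, normalized with the batch
bn_alone = bn(x[0])                   # first sequence, normalized alone
bn_batch_independent = torch.allclose(bn_full, bn_alone, atol=1e-6)

# Per-token statistics after LayerNorm (gamma=1, beta=0 at init).
token_means = full.mean(-1)
token_vars = full.var(-1, unbiased=False)
```

Here `ln_batch_independent` comes out true and `bn_batch_independent` false, which is the practical difference the paragraph describes.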

RMSNorm — drop the mean centering

A more recent variant called RMSNorm simplifies layer norm by removing the mean subtraction and the bias β. The remaining operation divides by the root-mean-square of the activations and multiplies by the learned scale:

RMS(x)     = √( mean(x²) + ε )    # root-mean-square of x
RMSNorm(x) = γ · x / RMS(x)       # scale only: no centering, no β

Why drop the mean? Empirically, the mean centering turns out to make essentially no difference for transformer training. Removing it makes the operation about 5% faster, and you save the parameter β (a modest reduction in optimizer state). RMSNorm is now used by LLaMA, Mistral, T5, Qwen, and most other modern open LLMs. If you are designing a new model today, prefer RMSNorm.
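A minimal sketch of RMSNorm as a PyTorch module, following the formula above with ε inside the square root. The class name, the `gamma` attribute, and the default `eps` are choices made for this illustration, not taken from any particular library.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d_model))  # learned scale only, no beta

    def forward(self, x):
        # Root-mean-square over the feature dimension; eps sits inside the sqrt.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gamma * x / rms

torch.manual_seed(0)
x = torch.randn(2, 5, 16) + 3.0      # deliberately non-zero mean
y = RMSNorm(16)(x)

out_rms = y.pow(2).mean(-1).sqrt()   # ~1 for every token
out_mean = y.mean(-1)                # clearly non-zero: RMSNorm does not center
```

The shifted input makes the missing centering visible: every token ends up with unit RMS, but its mean stays well away from zero.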

Comparison table

Property	LayerNorm (GPT-2)	RMSNorm (LLaMA)
Subtracts mean	Yes	No
Divides by std	Yes (σ)	By RMS instead
Learned scale γ	Yes (D params)	Yes (D params)
Learned shift β	Yes (D params)	No
Total params per layer	2D	D
Speed difference	baseline	~5% faster
Empirical quality	same	same
Modern default	No (legacy)	Yes

LayerNorm versus RMSNorm. RMSNorm drops mean centering and the bias β. Dividing by the root-mean-square is almost free on a GPU — it's one fused kernel.

Pre-norm versus post-norm

There is one more architectural choice to make: where the normalization goes. The original 2017 Transformer used post-norm, which applies normalization after the residual addition: x ← LayerNorm(x + Sublayer(x)). This works for shallow networks but is hard to train deep without elaborate warmup tricks.

Modern transformers use pre-norm instead, applying normalization before the sublayer and adding the residual to the original input: x ← x + Sublayer(LayerNorm(x)). The advantage is that the residual stream x is never normalized inside the block, so gradients can flow back through the addition without being modified by the norm layer. This is a large part of what makes 100-layer networks trainable. Most modern open-LLM recipes pair pre-norm with RMSNorm.

Post-norm (left) puts the norm after the residual add: gradients must pass through the norm on the way back. Pre-norm (right) puts the norm before the sublayer: the raw residual stream bypasses normalization entirely, giving gradients a clean highway. Pre-norm is standard in every modern LLM.
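The two wirings differ by one line of code. In this sketch a single `nn.Linear` stands in for the attention or FFN sublayer (an assumption made to keep the example short), and the input is scaled up so the effect on the stream is easy to see.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
D = 16
norm = nn.LayerNorm(D)
sublayer = nn.Linear(D, D)   # stand-in for attention or the FFN (an assumption)

def post_norm_step(x):
    # Original 2017 pattern: normalize after the residual add.
    return norm(x + sublayer(x))

def pre_norm_step(x):
    # Modern pattern: normalize only the branch input; x itself is untouched.
    return x + sublayer(norm(x))

x = 10.0 * torch.randn(2, 4, D)      # a stream with a large scale
post = post_norm_step(x)
pre = pre_norm_step(x)

# Post-norm resets every token to ~zero mean / unit variance.
post_std = post.std(-1, unbiased=False)
# Pre-norm leaves the stream at its original scale plus a small delta.
pre_std = pre.std(-1, unbiased=False)
```

After the post-norm step every token has a standard deviation near 1; after the pre-norm step the stream keeps roughly its original scale, because only the branch input was normalized.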

Residual Connections

y = x + f(x). The single most important architectural trick after attention.

A residual connection replaces y = f(x) with y = x + f(x). The output of the layer is the input plus a learned modification, rather than a fresh transformation. This sounds like a small change but it does two important things at once.

Gradient highway — why deep nets train

The first benefit is making the network trainable at depth. Without residuals, gradients during backprop have to flow through every layer's Jacobian (recall Day 4: the chain rule multiplies Jacobians together). If even a few of those Jacobians shrink the gradient — because activations are saturated, or weights are small — the gradient vanishes by the time it reaches early layers. This is the vanishing gradient problem we discussed on Day 4.

With a residual y = x + f(x), the backward pass through that layer becomes:

∂L/∂x = ∂L/∂y · (I + ∂f/∂x) = ∂L/∂y + ∂L/∂y · ∂f/∂x ← two paths

There are now two gradient paths: one through ∂f/∂x (the normal path, which might shrink), and one through the identity I (the residual path, which passes the gradient unchanged). Even if ∂f/∂x is nearly zero, the second term keeps the gradient alive. Networks can be 100 layers deep and early layers still receive a healthy gradient signal.
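The two-path equation can be checked with autograd. This sketch uses the extreme case where f is a linear layer with all-zero weights, so ∂f/∂x is exactly zero; the helper name `grad_wrt_input` is made up for the illustration.

```python
import torch
import torch.nn as nn

D = 8
f = nn.Linear(D, D, bias=False)
nn.init.zeros_(f.weight)             # worst case: df/dx is exactly zero

def grad_wrt_input(use_residual: bool):
    x = torch.ones(D, requires_grad=True)
    y = x + f(x) if use_residual else f(x)
    loss = y.sum()                   # dL/dy is a vector of ones
    loss.backward()
    return x.grad

plain_grad = grad_wrt_input(use_residual=False)     # all zeros: signal is lost
residual_grad = grad_wrt_input(use_residual=True)   # all ones: dL/dy passes through
```

Without the residual the input receives no gradient at all; with it, the upstream gradient arrives unchanged through the identity term.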

The second benefit is the identity prior. At initialization, the function f typically produces outputs close to zero (with standard weight initialization). The whole layer therefore behaves like the identity: y ≈ x. The model adjusts each layer incrementally during training, rather than replacing its representation outright. This shapes the loss landscape favorably.

The residual stream view

Anthropic has a useful framing known as the residual stream. Picture the running activation as a stream of vectors flowing from layer 1 to the output. Each block doesn't transform the stream wholesale — it reads from the stream (via the norm), computes something, and writes its contribution back into the stream by addition. Different layers are decoupled communicators on a shared bus, rather than a strict pipeline.

This view is both intuitively useful and theoretically illuminating. It explains why interpretability research can reason about a Transformer's circuits as compositions of "reads" and "writes" along a single high-dimensional channel (see the Transformer Circuits reference). And it explains why the embedding dimension D is so precious: it is the bandwidth of the shared bus. Every head and every FFN layer has to fit its information into this fixed-width channel.

The residual stream view. Each attention and FFN sublayer reads from the stream (via a norm branch), computes a delta, and writes that delta back by addition. The stream x is never fully replaced — it accumulates contributions from all layers. This is why gradients have a clean path back through 100 blocks.

Feed-Forward Sublayer

Each token, processed independently, by a 4× expand-then-contract MLP.

After attention has mixed information across tokens, the feed-forward sublayer (often called the FFN or MLP) processes each token independently. The standard pattern is "expand and contract": map the D-dimensional input up to a much higher dimension, apply a non-linearity, and then project back down to D.

GPT-style FFN with GELU

FFN(x) = W₂ · GELU( W₁ · x + b₁ ) + b₂
  W₁ shape: [4D, D]    expands D → 4D
  W₂ shape: [D, 4D]    contracts 4D → D
  params:   4D·D + D·4D = 8D²   (two-thirds of the block!)

The first weight matrix W₁ expands from D to 4D. The second matrix W₂ contracts back from 4D to D. The 4× expansion is the canonical choice; it traces back to the original Transformer paper and is consistent across the GPT family. The non-linearity in between is GELU (Gaussian Error Linear Unit), a smooth approximation to ReLU that performs slightly better empirically.

The feed-forward sublayer expands each token from D to 4D, applies GELU element-wise, then contracts back to D. The two projection matrices hold about 8D² parameters — more than the attention sublayer (4D²). Attention mixes across tokens; the FFN computes within each token.
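A small sketch of the GELU FFN exactly as written in the formula, biases included, to confirm two claims: each token is processed independently, and the weights total 8D². The class name `GeluFFN` and the sizes are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeluFFN(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, 4 * d_model)   # W1, b1: expand D -> 4D
        self.w2 = nn.Linear(4 * d_model, d_model)   # W2, b2: contract 4D -> D

    def forward(self, x):
        return self.w2(F.gelu(self.w1(x)))

torch.manual_seed(0)
D = 32
ffn = GeluFFN(D)
x = torch.randn(2, 5, D)
y = ffn(x)

# Per-token independence: running one token alone gives the same answer.
one_token = ffn(x[0, 3])
n_params = sum(p.numel() for p in ffn.parameters())   # 8*D*D weights + 5*D biases
```

The biases add only 5D parameters, which is why the 8D² figure ignores them.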

SwiGLU — the LLaMA replacement

LLaMA, Mistral, and most modern open LLMs use a different FFN design called SwiGLU. SwiGLU uses three weight matrices instead of two, with one of them acting as a multiplicative gate.

FFN_SwiGLU(x) = W_down · ( silu(W_gate · x) ⊙ (W_up · x) )
  silu(x) = x · sigmoid(x)          # smooth activation, similar to GELU
  ⊙       = elementwise multiply    # Python: * on same-shape tensors
  hidden  = (8/3) × D               # ~2.67× instead of 4×, keeps param count ≈ same

The output of one projection (after silu) acts as a gate that scales each element of another projection. The gating mechanism gives the layer more expressive power per parameter — SwiGLU consistently outperforms plain GELU FFNs on LLM benchmarks. Because SwiGLU has three matrices instead of two, it needs a smaller hidden dimension (8/3 × D ≈ 2.67× instead of 4×) to keep the parameter count comparable.
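A minimal SwiGLU sketch following the formula above. The helper `swiglu_hidden` and its round-up-to-a-multiple-of-256 rule are assumptions for this example; the rule was picked because it turns (8/3) × 4096 into the 11,008 hidden size quoted for LLaMA-7B later in this lesson.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def swiglu_hidden(d_model: int, multiple_of: int = 256) -> int:
    # (8/3) * D, rounded up to a hardware-friendly multiple.
    hidden = int(2 * (4 * d_model) / 3)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)

class SwiGLU(nn.Module):
    def __init__(self, d_model: int, multiple_of: int = 256):
        super().__init__()
        hidden = swiglu_hidden(d_model, multiple_of)
        self.w_gate = nn.Linear(d_model, hidden, bias=False)
        self.w_up = nn.Linear(d_model, hidden, bias=False)
        self.w_down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x):
        # silu(gate) scales the "up" projection elementwise, then project down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

D = 512
ffn = SwiGLU(D)
y = ffn(torch.randn(2, 7, D))
n_params = sum(p.numel() for p in ffn.parameters())
ratio_to_gelu = n_params / (8 * D * D)    # near 1; rounding pushes it slightly above
```

At small D the rounding inflates the count noticeably (about 12% here); at D = 4096 the effect is under 1%.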

Parameter count breakdown

FFN type	Hidden dim	Matrix count	Params (per block)
GELU (GPT-2)	4D	2	8D²
SwiGLU (LLaMA)	≈8D/3	3	≈8D²

Why is the FFN so big?

It is worth pausing to notice that the FFN, not attention, is the largest parameter component of a Transformer block. For LLaMA-7B, the FFN accounts for roughly two-thirds of all model parameters. A useful intuition is that attention does mixing (it decides which tokens influence which other tokens), while the FFN does computation (it transforms each token's representation in isolation). Most of the model's "knowledge" is stored in the FFN weights — the FFN functions somewhat like a key-value store that looks up factual associations based on the token's current hidden state.

The Block

Norm, attention, residual. Norm, FFN, residual. That's a Transformer block.

We can now combine attention and the FFN into a single Transformer block, using pre-norm with residuals. Each of the two sublayers (attention and FFN) gets its own normalization and its own residual connection, in this pattern:

x ← x + attn( LN(x) )    # sublayer 1: attention
x ← x + mlp ( LN(x) )    # sublayer 2: feed-forward

Here is the full PyTorch implementation. We use F.scaled_dot_product_attention for the inner attention computation — modern PyTorch automatically dispatches that to FlashAttention on supported hardware, which we will study in detail on Day 21.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, 4 * d_model, bias=False)
        self.fc2 = nn.Linear(4 * d_model, d_model, bias=False)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))

class MultiHeadCausalAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.W_QKV = nn.Linear(d_model, 3 * d_model, bias=False)
        self.W_O   = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        B, T, D = x.shape
        qkv = self.W_QKV(x)                       # (B, T, 3D)
        q, k, v = qkv.chunk(3, dim=-1)            # each (B, T, D)
        # Split D into heads and move the head axis forward: (B, n_heads, T, d_head).
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # PyTorch 2.0+: can dispatch to a fused FlashAttention kernel when available.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Merge heads back: (B, n_heads, T, d_head) -> (B, T, D).
        out = out.transpose(1, 2).contiguous().view(B, T, D)
        return self.W_O(out)

class Block(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn  = MultiHeadCausalAttention(d_model, n_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp   = MLP(d_model)

    def forward(self, x):
        # Pre-norm + residual: read from the stream, write back to it.
        x = x + self.attn(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x

A single pre-norm Transformer block. The residual stream runs vertically; each sublayer reads from it via a layer norm, computes something, and adds the result back. The shape [B, T, D] is preserved at every arrow. Stack N of these and you have the transformer body.

The Full Decoder-Only LLM

Embedding, N blocks, final norm, language-model head. About 50 lines of model code.

With the Transformer block in hand, the full model is straightforward. We add a token embedding (covered on Day 5), a learned positional embedding, a stack of n_layers blocks, a final layer norm, and an output projection that maps the final hidden state back to vocabulary logits.

class GPT(nn.Module):
    def __init__(self, vocab_size: int, d_model: int, n_heads: int,
                 n_layers: int, max_seq_len: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_seq_len, d_model)
        self.blocks  = nn.ModuleList([
            Block(d_model, n_heads) for _ in range(n_layers)
        ])
        self.norm_f  = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

        # Weight tying — share storage with the input embedding.
        self.lm_head.weight = self.tok_emb.weight

    def forward(self, idx):
        # idx: (B, T) long tensor of token IDs
        B, T = idx.shape
        positions = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(positions)        # (B, T, D)
        for block in self.blocks:
            x = block(x)
        x = self.norm_f(x)
        logits = self.lm_head(x)                                # (B, T, V)
        return logits

Weight tying

Weight tying means the LM head matrix and the token embedding matrix are the same tensor in memory: lm_head.weight = tok_emb.weight. Both layers map between the vocabulary index space and the hidden dimension space, and both weights have shape [V, D], so sharing them is conceptually natural. In a V=50,257, D=768 model (GPT-2 small), this saves 50,257 × 768 ≈ 38.6M parameters, which is about 150 MB in float32 (4 bytes each). You get it for free in PyTorch by assigning the weight attribute.
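To see that tying really shares storage rather than copying, here is a small standalone check. The sizes are arbitrary and the variable names mirror the ones in the GPT class above.

```python
import torch
import torch.nn as nn

V, D = 1000, 64
tok_emb = nn.Embedding(V, D)
lm_head = nn.Linear(D, V, bias=False)

untied = sum(p.numel() for p in nn.ModuleList([tok_emb, lm_head]).parameters())

lm_head.weight = tok_emb.weight     # tie: both modules now hold the same Parameter
tied = sum(p.numel() for p in nn.ModuleList([tok_emb, lm_head]).parameters())

same_storage = lm_head.weight.data_ptr() == tok_emb.weight.data_ptr()

# An in-place update through one module is visible through the other.
with torch.no_grad():
    tok_emb.weight[0].fill_(1.0)
row_is_shared = bool((lm_head.weight[0] == 1.0).all())
```

Note that `parameters()` reports a shared tensor once, so the tied count is V × D rather than 2 × V × D.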

That is roughly 50 lines of model code, and it is a real, working language model. On Day 9 we will train this on real text and watch loss come down. The architecture has not fundamentally changed since 2017; what has changed is the scale at which we run it.

Parameter Count Math

12 D² per block. Memorize this — it sizes everything.

Knowing parameter counts cold is one of the most useful skills for an LLM engineer. You will use it constantly to figure out what fits in your hardware, how much VRAM training will need, and how long an inference run will take. The arithmetic is simple if you remember what each component contributes.

Per block, with GELU FFN

Letting D = d_model:

Component	Params	Explanation
Attn QKV projection	3D²	W_QKV maps D → 3D
Attn output projection	D²	W_O maps D → D
FFN up projection	4D²	W₁ maps D → 4D
FFN down projection	4D²	W₂ maps 4D → D
Two LayerNorms	4D	γ + β each, negligible
Block total	≈ 12D²	the rule of thumb to memorize

Embeddings and head

Token embedding: V × D
Position embedding (learned): T_max × D (small, often <1% of total)
LM head: V × D if untied; effectively zero if tied to the input embedding

Worked example — GPT-2 small

GPT-2 small uses D = 768, n_layers = 12, n_heads = 12, V = 50,257, T_max = 1024, with weight tying:

Blocks: 12 × 12 × 768² ≈ 85M
Token plus position embedding: (50,257 + 1024) × 768 ≈ 39M
LM head: tied to the input embedding, so it costs nothing extra
Total: ~124M ✅ — matches the published number
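The GPT-2 small arithmetic above fits in a few lines of plain Python. The function name `gpt_param_estimate` is made up for this sketch, and it deliberately ignores biases and LayerNorm parameters, as the 12D² rule does.

```python
def gpt_param_estimate(d_model, n_layers, vocab_size, max_seq_len, tied=True):
    """Back-of-envelope count for a GPT-style model with a 4x GELU FFN."""
    per_block = 12 * d_model ** 2                      # 4D^2 attention + 8D^2 FFN
    blocks = n_layers * per_block
    embeddings = (vocab_size + max_seq_len) * d_model  # token + learned position
    lm_head = 0 if tied else vocab_size * d_model
    return blocks + embeddings + lm_head

gpt2_small = gpt_param_estimate(768, 12, 50_257, 1024)
print(f"{gpt2_small / 1e6:.1f}M")   # 124.3M
```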

Worked example — LLaMA-7B

LLaMA-7B is more complex because it uses SwiGLU and untied embeddings. Configuration: D = 4096, n_layers = 32, n_heads = 32, V = 32,000, SwiGLU hidden dimension around 11,008, RMSNorm:

Attention per block: 4 × 4096² ≈ 67M
MLP per block (three SwiGLU matrices at hidden 11008): 3 × 4096 × 11008 ≈ 135M
Per block total: ~202M. With 32 layers: ~6.5B
Embedding plus LM head (untied): 2 × 32,000 × 4096 ≈ 262M
Total: ~6.7B ✅
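The same bookkeeping for the LLaMA-style layout, as a hedged sketch. The function `llama_param_count` is hypothetical; it assumes no biases, full multi-head attention, two RMSNorm scales per block plus one final norm, and no parameters for positional encoding.

```python
def llama_param_count(d_model, n_layers, vocab_size, ffn_hidden):
    """Count for a LLaMA-style model: SwiGLU FFN, RMSNorm, untied embeddings,
    no biases, and full multi-head attention (no grouped-query sharing)."""
    attn = 4 * d_model ** 2                 # W_Q, W_K, W_V, W_O
    mlp = 3 * d_model * ffn_hidden          # gate, up, down
    norms = 2 * d_model                     # two RMSNorm scales per block
    blocks = n_layers * (attn + mlp + norms)
    embed_and_head = 2 * vocab_size * d_model
    final_norm = d_model
    return blocks + embed_and_head + final_norm

llama_7b = llama_param_count(4096, 32, 32_000, 11_008)
print(f"{llama_7b / 1e9:.2f}B")   # 6.74B
```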

This kind of arithmetic comes up constantly when you are deciding what fits in memory, what training budget you need, or how many GPUs to provision. Practice it until it's automatic.

Architecture Family Tree

Encoder-only, decoder-only, encoder-decoder — and why decoder-only won.

Transformers come in three architectural families, distinguished mainly by their attention mask and training objective.

Family	Examples	Attention	Trained on	Used for
Encoder-only	BERT, RoBERTa, DeBERTa	Bidirectional	Masked LM	Classification, NER, retrieval embeddings
Decoder-only	GPT, LLaMA, Mistral, Qwen, DeepSeek, Gemma	Causal self-attention	Next-token prediction	Generation (and everything-via-prompting)
Encoder-decoder	T5, BART, original Transformer	Encoder bidir + decoder causal + cross-attn	Span corruption, seq2seq	Translation, summarization
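The "Attention" column comes down to which mask is applied before the softmax. A tiny illustration with uniform scores, so the resulting weights are easy to read (the sequence length of 4 is arbitrary):

```python
import torch

T = 4
# Decoder-only: position i may attend to positions 0..i (lower triangle).
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
# Encoder-only: every position may attend to every other position.
bidirectional_mask = torch.ones(T, T, dtype=torch.bool)

scores = torch.zeros(T, T)                          # uniform scores for clarity
causal_weights = scores.masked_fill(~causal_mask, float("-inf")).softmax(-1)
bidir_weights = scores.masked_fill(~bidirectional_mask, float("-inf")).softmax(-1)
```

Under the causal mask the first position can only see itself, so its row is [1, 0, 0, 0]; under the bidirectional mask every row is [0.25, 0.25, 0.25, 0.25].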

Why decoder-only won

For modern LLMs, decoder-only architectures have decisively won the design competition. Several reasons combined to produce that outcome:

Simpler architecture. A single stack instead of two. Less code, fewer moving parts, fewer hyperparameters to tune.
Trains directly on raw text. No need to split inputs from outputs ahead of time, the way a translation model does.
Generation and "understanding" use the same machinery. A decoder-only model can do classification, summarization, translation, code completion, and creative writing all by prompting alone — there is only one model to train and serve.
Empirical scaling. Decoder-only models have proven to scale particularly well as you increase parameters and data. The Chinchilla and GPT-4 lineage lives here.
KV-cache friendly. Because generation is purely autoregressive, the KV cache works cleanly: you only ever need to append one token per step. Encoder-decoder models require maintaining two KV caches and have more complex memory management.

For the rest of this curriculum, when we say "LLM" we mean decoder-only Transformer. Every modern open LLM falls in this family.

Forward-Pass Shape Walkthrough

Trace shapes through every sublayer. If anything looks mysterious, read it again.

The single best test of whether you understand the architecture is to trace shapes through a forward pass. We will use a small configuration: B = 4 sequences in the batch, T = 128 tokens per sequence, vocabulary size V = 32000, hidden dimension D = 512, with n_heads = 8 and n_layers = 6.

idx:                 (4, 128)             int64
tok_emb(idx):        (4, 128, 512)
pos_emb(positions):     (128, 512)        broadcasts to (4, 128, 512)
x = tok + pos:       (4, 128, 512)

for each block:
  norm1(x):          (4, 128, 512)        ← no shape change
  attn:
    W_QKV(x):        (4, 128, 1536)       ← 3 × D = 3 × 512
    chunk -> q,k,v:  3 × (4, 128, 512)
    reshape, perm:   3 × (4, 8, 128, 64)  ← (B, n_heads, T, d_head)
    SDPA(q,k,v):       (4, 8, 128, 64)
    perm, reshape:     (4, 128, 512)
    W_O:               (4, 128, 512)
  x = x + attn:      (4, 128, 512)        ← residual add
  norm2(x):          (4, 128, 512)
  mlp:
    fc1:               (4, 128, 2048)     ← 4 × D = 4 × 512
    GELU:              (4, 128, 2048)
    fc2:               (4, 128, 512)
  x = x + mlp:       (4, 128, 512)        ← residual add

norm_f(x):           (4, 128, 512)
lm_head(x):          (4, 128, 32000)      logits — (B, T, V)

If any shape feels mysterious, read it again. Internalize it. Every modern LLM is some variation of this same shape walk, just with bigger numbers.
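The step that most often trips people up is the head split and merge inside attention. This sketch isolates just that reshape with the walkthrough's numbers; it skips the projections and the attention computation itself.

```python
import torch

B, T, D, n_heads = 4, 128, 512, 8
d_head = D // n_heads

x = torch.randn(B, T, D)
heads = x.view(B, T, n_heads, d_head).transpose(1, 2)       # (4, 8, 128, 64)
merged = heads.transpose(1, 2).contiguous().view(B, T, D)   # (4, 128, 512)

# Head h holds features [h*d_head, (h+1)*d_head) of every token.
h = 3
same_slice = torch.equal(heads[:, h], x[:, :, h * d_head:(h + 1) * d_head])
```

The split followed by the merge is an exact round trip, and each head is simply a contiguous 64-feature slice of every token.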

Why This Matters for Inference

This block is the unit you will optimize for the rest of the curriculum.

Everything we covered today is not just architecture trivia — it is the precise unit that inference engines spend their lives running. Here is why each piece matters from a systems perspective.

Per-token FLOPs ≈ 2N

A rough but famous rule: running a single new token through a model with N parameters requires approximately 2N floating-point operations. The factor of 2 comes from the multiply-add structure of matrix multiplications: each weight contributes one multiply and one add. For LLaMA-7B (N ≈ 7 billion), that is roughly 14 GFLOP per token. On a GPU that sustains 80 TFLOP/s, the upper bound is about 5,700 tokens per second per GPU, assuming you are compute-bound. In practice, for small batch sizes you are usually memory-bandwidth bound instead, and actual throughput is lower. We will derive this precisely on Days 16–17.
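The estimate above as a few lines of plain Python. The function name is made up for this sketch, and the result is a compute-bound ceiling only; it says nothing about memory bandwidth.

```python
def decode_tokens_per_second(n_params, peak_flops_per_s):
    """Compute-bound upper limit: ~2 FLOPs per parameter per generated token."""
    flops_per_token = 2 * n_params
    return peak_flops_per_s / flops_per_token

flops_per_token = 2 * 7e9                       # 1.4e10, i.e. 14 GFLOP
upper_bound = decode_tokens_per_second(7e9, 80e12)
print(round(upper_bound))                       # 5714
```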

Where activations and the KV cache live

During a forward pass, every block materializes its intermediate activations: the QKV projections, the attention scores, the FFN hidden states. For a single token batch of LLaMA-7B in float16, a single block's activations are on the order of megabytes. For large batches or long contexts, activation memory becomes significant, which is why gradient checkpointing is essential during training.

For inference, the critical memory structure is the KV cache: for every block, the keys and values from all past tokens are stored so they don't have to be recomputed on each new token. Per block, the KV cache size is 2 × T × D_kv × sizeof(dtype), where D_kv is the total width of the keys (or values) per token; with standard multi-head attention, D_kv = D. For LLaMA-7B with a 4096-token context in float16, the KV cache is roughly 32 blocks × 2 × 4096 × 4096 × 2 bytes ≈ 2 GB per sequence. Serving hundreds of concurrent sequences requires careful memory management, which is what PagedAttention (Day 24) addresses.
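The KV-cache formula is worth keeping as a helper. The function name and argument names below are illustrative, and the second configuration is an assumed 70B-class setup with grouped-query attention, included only to show how a smaller D_kv changes the result.

```python
def kv_cache_bytes(n_layers, seq_len, d_kv, bytes_per_elem=2, batch=1):
    """Keys and values for every layer: 2 * T * D_kv * sizeof(dtype) per block."""
    return batch * n_layers * 2 * seq_len * d_kv * bytes_per_elem

GIB = 1024 ** 3

# LLaMA-7B-style: 32 layers, D_kv = D = 4096, float16, 4096-token context.
llama_7b = kv_cache_bytes(n_layers=32, seq_len=4096, d_kv=4096)
print(llama_7b / GIB)    # 2.0

# Assumed 70B-class config with grouped-query attention:
# 80 layers, 8 KV heads of size 128 (D_kv = 1024), 128K-token context.
llama_70b_128k = kv_cache_bytes(n_layers=80, seq_len=128 * 1024, d_kv=8 * 128)
print(llama_70b_128k / GIB)   # 40.0
```

Multiplying by `batch` gives the total for concurrent sequences, which is the number that decides how many requests fit on one GPU.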

The block is the optimization unit

When you read about FlashAttention (Day 21), quantization (Day 22), speculative decoding (Day 23), or continuous batching and PagedAttention (Day 24), the unit being optimized is always the Transformer block. FlashAttention rewrites the attention sublayer to avoid materializing the full attention matrix. Quantization changes the dtype of the weight matrices inside the block. Tensor parallelism splits the block's matrix multiplications across GPUs. Everything in Weeks 3–4 is the block anatomy you built today, subjected to increasingly sophisticated systems engineering.

The KV cache for a single 128K-token context window on LLaMA-3 70B occupies roughly 40 GB (80 layers × 2 × 131,072 tokens × 1,024 key/value dimensions × 2 bytes in float16), which is more than the entire memory of many GPUs. This is why long-context serving is such an active research area, and why Week 4 spends an entire day on memory management.

Exercise

Eight exercises, all in the notebook.

Companion notebook: day-7-transformer-block.ipynb.

Implement layer norm from scratch.

Verify against PyTorch's nn.LayerNorm. The reference implementation:

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(-1, keepdim=True)
    var = x.var(-1, keepdim=True, unbiased=False)
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta

Implement RMSNorm. Compare its output to LayerNorm on random input. Notice that RMSNorm doesn't center the activations — verify this numerically.
Build the full GPT class in your own file. Instantiate with config (V=1000, D=128, n_heads=4, n_layers=4, T_max=64). Forward a random input batch of shape (2, 32). Verify the output shape is (2, 32, 1000).
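Parameter count assertion. For your config, compute the parameter count by hand using the 12D² rule plus the embedding terms. Then compute it with sum(p.numel() for p in model.parameters()), which counts the tied embedding matrix once. Write an assert that checks they agree within 5%. If they don't, find the discrepancy.
Loss at initialization. Compute cross-entropy loss on random labels after one forward pass. Compare it to ln(V), the theoretical value for a model that predicts uniformly over the vocabulary. The match is only close when the initial logits are near zero, for example with a small-std embedding init such as 0.02. With PyTorch's default N(0, 1) embedding init and tied weights, expect a noticeably higher loss, and explain why.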
Causal sanity check. Modify a future token in the input and confirm that earlier output positions don't change. This catches nearly every attention masking bug.
Generate one token. Add a method model.generate(idx, max_new_tokens). For each step: forward pass; take logits at the last position; argmax (greedy) or sample; append to the input; repeat. We don't add a KV cache yet — that's Day 20.
Pre-norm residual behavior. Compare the mean and std of the residual stream x before and after a block. Because it is never directly normalized, the scale of x can drift. Is this a problem? What happens if you remove the final norm_f before the LM head?
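As a warm-up for the causal sanity check, here is the same idea applied to the attention primitive in isolation, assuming PyTorch 2.0+ for `F.scaled_dot_product_attention`. It perturbs q, k, and v at the last position instead of editing a token ID, so it is a sketch of the test's logic rather than the full-model check the exercise asks for.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, H, T, d_head = 1, 2, 6, 8
q, k, v = (torch.randn(B, H, T, d_head) for _ in range(3))

def causal_attn(q, k, v):
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

baseline = causal_attn(q, k, v)

# Perturb only the last position, as if a future token had changed.
q2, k2, v2 = q.clone(), k.clone(), v.clone()
for t in (q2, k2, v2):
    t[:, :, -1] += 1.0
perturbed = causal_attn(q2, k2, v2)

earlier_unchanged = torch.allclose(baseline[:, :, :-1], perturbed[:, :, :-1], atol=1e-6)
last_changed = not torch.allclose(baseline[:, :, -1], perturbed[:, :, -1], atol=1e-6)
```

If `earlier_unchanged` were false, information would be leaking backward from the future. The full-model version replaces the perturbation with a changed token ID and compares logits.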

Self-Check

Ten questions before moving on.

Close the page and answer from memory. If you can't, re-read the relevant section.

Why pre-norm instead of post-norm in modern transformers? What gradient property does pre-norm preserve?
Why are residual connections critical for deep networks? Write the backward-pass equation that shows the gradient highway.
RMSNorm versus LayerNorm: name two differences and state which is used in modern open LLMs.
The FFN expands by 4× in GPT-2 but only ~2.7× (8/3) in LLaMA's SwiGLU. Why the difference?
Estimate the parameter count of a transformer with D=2048, n_layers=24, V=50000, plain GELU MLP, tied embeddings.
Why does weight tying (input embedding equals LM head) work? What does it save in a GPT-2 small model?
Trace input shape (2, 64) through a forward pass and write down the shape after every single sublayer.
Why don't modern LLMs use encoder-decoder architectures?
What is the "residual stream" view? Describe it in one sentence using the word "bus."
A model with 7B parameters processes one new token. Approximately how many FLOPs does that require, and why?

Week 1 Wrap-Up

You now own every line of a Transformer.

What you've covered

Math fluency for ML — vectors, matrices, derivatives, softmax, cross-entropy.
Tensors, autograd, and frameworks — both PyTorch and MLX.
Neural networks from first principles, with manual backprop.
Optimizers, learning-rate schedules, and initialization.
Tokenization (BPE) and embeddings.
Self-attention — single-head, multi-head, and causal masking.
The full Transformer block and the decoder-only LLM stack.

What you can now do

Read transformer code without flinching.
Reason about parameter counts and tensor shapes.
Build any layer of a Transformer from memory.
Hand-derive backprop for any expression you're likely to encounter.
Estimate memory, FLOPs, and KV-cache size for any model config.

What's next — Week 2

We make this thing learn. The Week 2 lessons cover pre-training data and objectives, building and training a tiny GPT on real text, distributed training, modern architectural variations (LLaMA, Mistral, MoE), fine-tuning techniques (SFT, LoRA, QLoRA), and alignment methods (RLHF, DPO).

If you want to consolidate before moving on, the best thing you can do is re-implement everything from scratch in a single notebook. Go end to end: tokenizer, model, forward pass on random tokens. Internalize the shapes. The Day 7 companion notebook walks through exactly this build, plus a small TinyShakespeare training run as an optional bonus.

Go deeper.

Hand-picked references for this lesson and the Week 1 capstone.

YouTube · 2 hr

Karpathy — Let's build GPT from scratch

Builds exactly what we built today, on Shakespeare. Spelled out, in code.

Repo · Karpathy

karpathy/nanoGPT

Roughly 300 lines of training code plus roughly 300 lines of model code. Read it like a poem.

YouTube · 4 hr

Karpathy — Reproduce GPT-2 (124M)

Production-grade version of the same code. Optimizer, scheduler, and DDP details.

Book · 2024

Raschka — Build an LLM From Scratch

Chapters 3–4 mirror today's lesson very closely.

Paper · 2022

Phuong & Hutter — Formal Algorithms for Transformers

Sixteen pages of pure pseudocode. Great reference card.

Paper · 2016

Ba, Kiros, Hinton — Layer Normalization

Original LayerNorm paper.

Paper · 2019

Zhang & Sennrich — RMSNorm

Root Mean Square Layer Normalization.

Paper · 2020

Xiong et al. — Pre-norm vs Post-norm

Why pre-norm trains stably to many layers. Theoretical and empirical evidence.

Paper · 2023

Touvron et al. — LLaMA

The architecture most open LLMs derive from. RMSNorm, RoPE, SwiGLU; GQA arrives with the larger LLaMA 2 models.

Blog · Anthropic

A Mathematical Framework for Transformer Circuits

The "residual stream" view of a Transformer. Beautifully clarifying.

Repo · HF

transformers — modeling_llama.py

Production reference: RMSNorm + RoPE + SwiGLU. Read after nanoGPT.

Repo · MLX

mlx-examples — LLMs

MLX-native LLM implementations. Same architecture, Apple Silicon idioms.

Paper · 2015

He et al. — Deep Residual Learning (ResNet)

The original residual connection paper from computer vision. The same trick made 100-layer CV models trainable, then was adopted wholesale by Transformers.

Blog · Eleuther AI · 2022

Su et al. — RoPE: Rotary Position Embedding

The positional encoding used in LLaMA and most modern LLMs. Replaces the learned absolute position embeddings we use today. Day 12 covers it in detail.
