Yesterday we built attention. Today we put everything around it that makes a Transformer actually work — layer normalization, residual connections, the feed-forward sublayer, and finally the full decoder-only stack. This is Week 1's capstone: by the end you will have written a complete, working language model in ~50 lines of PyTorch.
On Day 6 we built scaled dot-product attention from scratch. Attention is the most important idea in a Transformer, but it isn't the entire model. A real Transformer block wraps attention in several supporting structures: layer normalization to keep activations well-behaved, residual connections to let gradients flow through deep stacks, and a feed-forward sublayer to do per-token computation. Today we will assemble all of those pieces, and then stack the result into a complete decoder-only language model — the same architecture as GPT, LLaMA, Mistral, and every other modern open LLM.
This is the Week 1 capstone. By the end of today you will have a complete forward pass for a tiny LLM, written in PyTorch, with every intermediate shape verified. On Day 9 we will train it.
- … 12D² per-block rule.
- … [B,T] → [B,T,D] → … → [B,T,V] through every sublayer from embedding to logits.

Before diving into each component, look at the full architecture in one diagram, with every tensor shape labeled. If you stare at this for two minutes before reading the rest of the lesson, the details will fall into place naturally.
Figure: token IDs [B, T] enter at the top; vocabulary logits [B, T, V] exit at the bottom. The shape [B, T, D] is preserved through all N blocks. Weight tying (dashed arrow) means the LM head shares storage with the token embedding matrix, saving V × D parameters.
Six stages. Six shapes to internalize. The whole model is a chain of transformations that keeps the shape [B, T, D] constant until the very last linear projection widens it to [B, T, V]. If you internalize nothing else today, internalize that shape story.
Before we look at the formula, it is worth being clear about what problem normalization is trying to solve. As activations flow through the layers of a deep network, their magnitudes drift. Some layers' outputs grow large. Other layers produce activations that shrink toward zero. Either drift causes problems: large activations push activation functions into saturation regions where the gradient is tiny, and small activations shrink the signal until learning effectively stops. Without some form of normalization, very deep networks become unstable and slow to train.
Layer normalization is one solution. For each token independently, it computes the mean μ and variance σ² across the token's feature dimensions, normalizes the token to zero mean and unit variance, and then re-scales using two learnable parameters: LayerNorm(x) = γ ⊙ (x − μ) / √(σ² + ε) + β, with x, γ, β ∈ ℝ^D.
Symbol guide: ∈ means "is in" (like Python in); ℝ^D means "a vector of D real-valued floats". The learned parameters γ (gamma) and β (beta) let the layer recover any scaling and shifting it might want — the normalization reduces representational power, but the learned affine transform restores it. The small constant ε is inside the square root to prevent division by zero.
One detail is critical. Layer norm normalizes across the feature dimension of each token, not across the batch dimension. This is the key difference from batch normalization, which was popular in computer vision but has problems for sequence data. Because layer norm doesn't mix information across batch elements, it doesn't care about batch size, doesn't behave differently in training versus evaluation mode, and parallelizes cleanly across hardware. These practical advantages are why layer norm is the default for transformers.
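The per-token behavior described above can be checked directly. The following is a minimal sketch, assuming PyTorch is available; the sizes and variable names are illustrative and not taken from the lesson. It normalizes a batch with `F.layer_norm` and confirms that each token ends up with roughly zero mean and unit variance, and that a sequence gets the same result whether or not other sequences share its batch.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, D = 2, 3, 8                      # illustrative sizes (assumption)
x = torch.randn(B, T, D) * 5.0 + 2.0   # deliberately off-center, large-scale activations

# Normalize over the last (feature) dimension only; no learned scale or shift here.
y = F.layer_norm(x, normalized_shape=(D,))

per_token_mean = y.mean(dim=-1)                 # shape (B, T), one value per token
per_token_var = y.var(dim=-1, unbiased=False)   # shape (B, T), one value per token

# Normalizing the first sequence on its own matches normalizing it inside the batch.
y_alone = F.layer_norm(x[:1], normalized_shape=(D,))
batch_independent = torch.allclose(y[:1], y_alone, atol=1e-6)
```

Because the statistics are computed per token, nothing here depends on `B`, which is the property that separates layer norm from batch norm.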
A more recent variant called RMSNorm simplifies layer norm by removing the mean subtraction and the bias β. The remaining operation divides by the root-mean-square of the activations and multiplies by the learned scale: RMSNorm(x) = γ ⊙ x / √(mean(x²) + ε), with x, γ ∈ ℝ^D.
Why drop the mean? Empirically, the mean centering turns out to make essentially no difference for transformer training. Removing it makes the operation about 5% faster, and you save the parameter β (a modest reduction in optimizer state). RMSNorm is now used by LLaMA, Mistral, T5, Qwen, and most other modern open LLMs. If you are designing a new model today, prefer RMSNorm.
| Property | LayerNorm (GPT-2) | RMSNorm (LLaMA) |
|---|---|---|
| Subtracts mean | Yes | No |
| Divides by std | Yes (σ) | By RMS instead |
| Learned scale γ | Yes (D params) | Yes (D params) |
| Learned shift β | Yes (D params) | No |
| Total params per layer | 2D | D |
| Speed difference | baseline | ~5% faster |
| Empirical quality | same | same |
| Modern default | No (legacy) | Yes |
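To make the comparison concrete, here is a minimal RMSNorm sketch, assuming PyTorch. The class name, the `eps` default, and the test sizes are assumptions for illustration, not a reference implementation from any particular model. It checks that the output has a root-mean-square of about 1 per token and that the layer holds D parameters where `nn.LayerNorm` holds 2D.

```python
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    """Sketch of RMSNorm: scale by 1/RMS over features, then by a learned gamma."""

    def __init__(self, d_model: int, eps: float = 1e-6):  # eps value is an assumption
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))   # gamma; there is no beta

    def forward(self, x):
        # One root-mean-square value per token, computed over the feature dimension.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)


torch.manual_seed(0)
d_model = 16
x = torch.randn(2, 5, d_model) + 3.0     # non-zero mean on purpose
rms_norm = RMSNorm(d_model)
layer_norm = nn.LayerNorm(d_model)

y = rms_norm(x)
out_rms = y.pow(2).mean(dim=-1).sqrt()   # RMS of each normalized token

n_rms = sum(p.numel() for p in rms_norm.parameters())    # gamma only
n_ln = sum(p.numel() for p in layer_norm.parameters())   # gamma and beta
```

Note that the mean of `y` is not forced to zero; only the scale is normalized.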
… β. Dividing by the root-mean-square is almost free on a GPU: it's one fused kernel.
There is one more architectural choice to make: where the normalization goes. The original 2017 Transformer used post-norm, which applies normalization after the residual addition: x ← LayerNorm(x + Sublayer(x)). This works for shallow networks but is hard to train at depth without elaborate warmup tricks.
Modern transformers use pre-norm instead, applying normalization before the sublayer and adding the residual to the original input: x ← x + Sublayer(LayerNorm(x)). The advantage is that the residual stream x is never normalized, so gradients can flow back through the addition without being modified by the norm layer. This is what makes 100-layer networks trainable. Most modern open LLM recipes pair pre-norm with RMSNorm.
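The two orderings differ by one line of code. The sketch below, assuming PyTorch, uses a single `nn.Linear` as a stand-in for the sublayer (an assumption; in the real block it is attention or the FFN) and feeds in a residual stream with a deliberately large scale. Post-norm rescales the stream on every step, while pre-norm leaves the incoming stream untouched and only normalizes what the sublayer reads.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 8
norm = nn.LayerNorm(d_model)
sublayer = nn.Linear(d_model, d_model)   # hypothetical stand-in for attention or the FFN


def post_norm_step(x):
    # Original 2017 ordering: the norm sits on the residual path itself.
    return norm(x + sublayer(x))


def pre_norm_step(x):
    # Modern ordering: only the sublayer's input is normalized.
    return x + sublayer(norm(x))


x = torch.randn(2, 4, d_model) * 10.0    # a residual stream with a large scale
post_out = post_norm_step(x)
pre_out = pre_norm_step(x)

# Average per-token standard deviation of each output.
post_std = post_out.std(dim=-1).mean().item()
pre_std = pre_out.std(dim=-1).mean().item()
```

In this sketch `post_std` stays near 1 because the norm is the last operation, whereas `pre_std` stays close to the scale of the input stream.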
A residual connection replaces y = f(x) with y = x + f(x). The output of the layer is the input plus a learned modification, rather than a fresh transformation. This sounds like a small change but it does two important things at once.
The first benefit is making the network trainable at depth. Without residuals, gradients during backprop have to flow through every layer's Jacobian (recall Day 4: the chain rule multiplies Jacobians together). If even a few of those Jacobians shrink the gradient — because activations are saturated, or weights are small — the gradient vanishes by the time it reaches early layers. This is the vanishing gradient problem we discussed on Day 4.
With a residual y = x + f(x), the backward pass through that layer becomes: ∂y/∂x = I + ∂f/∂x.
There are now two gradient paths: one through ∂f/∂x (the normal path, which might shrink), and one through the identity I (the residual path, which passes the gradient unchanged). Even if ∂f/∂x is nearly zero, the second term keeps the gradient alive. Networks can be 100 layers deep and early layers still receive a healthy gradient signal.
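A scalar toy model makes the two gradient paths tangible. This is a sketch under a strong simplifying assumption: every layer is f(x) = s · x with the same small slope s, so each Jacobian is just the number s. The function name and the values of s and depth are illustrative.

```python
def end_to_end_gradient(n_layers: int, local_slope: float, residual: bool) -> float:
    """Derivative of the output w.r.t. the input for a chain of scalar layers.

    Each layer is f(x) = local_slope * x. With residual=True the layer is
    y = x + f(x), so its local derivative is 1 + local_slope.
    """
    grad = 1.0
    for _ in range(n_layers):
        # Chain rule: multiply by this layer's local derivative.
        grad *= (1.0 + local_slope) if residual else local_slope
    return grad


plain = end_to_end_gradient(100, 0.01, residual=False)          # 0.01 ** 100
with_residual = end_to_end_gradient(100, 0.01, residual=True)   # 1.01 ** 100
```

With 100 layers and a slope of 0.01, the plain chain's gradient is around 1e-200, effectively zero, while the residual chain's gradient stays of order 1 (about 2.7).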
The second benefit is the identity prior. At initialization, the function f typically produces outputs close to zero (with standard weight initialization). The whole layer therefore behaves like the identity: y ≈ x. The model adjusts each layer incrementally during training, rather than replacing its representation outright. This shapes the loss landscape favorably.
Anthropic has a useful framing known as the residual stream. Picture the running activation as a stream of vectors flowing from layer 1 to the output. Each block doesn't transform the stream wholesale — it reads from the stream (via the norm), computes something, and writes its contribution back into the stream by addition. Different layers are decoupled communicators on a shared bus, rather than a strict pipeline.
This view is both intuitively useful and theoretically illuminating. It explains why interpretability research can reason about a Transformer's circuits as compositions of "reads" and "writes" along a single high-dimensional channel (see the Transformer Circuits reference). And it explains why the embedding dimension D is so precious: it is the bandwidth of the shared bus. Every head and every FFN layer has to fit its information into this fixed-width channel.
Figure: x is never fully replaced; it accumulates contributions from all layers. This is why gradients have a clean path back through 100 blocks.
After attention has mixed information across tokens, the feed-forward sublayer (often called the FFN or MLP) processes each token independently. The standard pattern is "expand and contract": map the D-dimensional input up to a much higher dimension, apply a non-linearity, and then project back down to D.
In symbols, FFN(x) = W₂ · GELU(W₁ · x). The first weight matrix W₁ expands from D to 4D. The second matrix W₂ contracts back from 4D to D. The 4× expansion is the canonical choice; it traces back to the original Transformer paper and is consistent across the GPT family. The non-linearity in between is GELU (Gaussian Error Linear Unit), a smooth approximation to ReLU that performs slightly better empirically.
Figure: the FFN expands from D to 4D, applies GELU element-wise, then contracts back to D. The two projection matrices hold about 8D² parameters, more than the attention sublayer (4D²). Attention mixes across tokens; the FFN computes within each token.
LLaMA, Mistral, and most modern open LLMs use a different FFN design called SwiGLU. SwiGLU uses three weight matrices instead of two, with one of them acting as a multiplicative gate.
The output of one projection (after silu) acts as a gate that scales each element of another projection. The gating mechanism gives the layer more expressive power per parameter — SwiGLU consistently outperforms plain GELU FFNs on LLM benchmarks. Because SwiGLU has three matrices instead of two, it needs a smaller hidden dimension (8/3 × D ≈ 2.67× instead of 4×) to keep the parameter count comparable.
| FFN type | Hidden dim | Matrix count | Params (per block) |
|---|---|---|---|
| GELU (GPT-2) | 4D | 2 | 8D² |
| SwiGLU (LLaMA) | ≈8D/3 | 3 | ≈8D² |
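The gating pattern and the parameter parity in the table can be sketched in a few lines, assuming PyTorch. The names `w_gate`, `w_up`, and `w_down` are hypothetical labels chosen for this example; the lesson does not name the three matrices. D = 768 is used because it is divisible by 3, so the 8D/3 hidden size is exact.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    """Sketch of a SwiGLU feed-forward layer with three bias-free projections."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)  # produces the gate
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)    # produces the values
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)  # projects back to D

    def forward(self, x):
        # silu(gate) scales each element of the up projection, then contract to D.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


d_model = 768
d_hidden = 8 * d_model // 3              # 2048, i.e. about 2.67 x D
swiglu = SwiGLU(d_model, d_hidden)
gelu_ffn = nn.Sequential(                # the GPT-2 style FFN, for comparison
    nn.Linear(d_model, 4 * d_model, bias=False),
    nn.GELU(),
    nn.Linear(4 * d_model, d_model, bias=False),
)

n_swiglu = sum(p.numel() for p in swiglu.parameters())
n_gelu = sum(p.numel() for p in gelu_ffn.parameters())
out = swiglu(torch.randn(2, 5, d_model))
```

Both layers come out at 8D² parameters here. Real models often round the hidden size to a hardware-friendly value, which is why the LLaMA-7B hidden dimension quoted in this lesson (11,008) is slightly above 8/3 × 4096.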
It is worth pausing to notice that the FFN, not attention, is the largest parameter component of a Transformer block. For LLaMA-7B, the FFN accounts for roughly two-thirds of all model parameters. A useful intuition is that attention does mixing (it decides which tokens influence which other tokens), while the FFN does computation (it transforms each token's representation in isolation). Most of the model's "knowledge" is stored in the FFN weights — the FFN functions somewhat like a key-value store that looks up factual associations based on the token's current hidden state.
We can now combine attention and the FFN into a single Transformer block, using pre-norm with residuals. Each of the two sublayers (attention and FFN) gets its own normalization and its own residual connection, in this pattern: x ← x + Attention(Norm₁(x)), then x ← x + FFN(Norm₂(x)).
Here is the full PyTorch implementation. We use F.scaled_dot_product_attention for the inner attention computation — modern PyTorch automatically dispatches that to FlashAttention on supported hardware, which we will study in detail on Day 21.
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


class MLP(nn.Module):
    """Feed-forward sublayer: expand D -> 4D, apply GELU, contract 4D -> D."""

    def __init__(self, d_model: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, 4 * d_model, bias=False)
        self.fc2 = nn.Linear(4 * d_model, d_model, bias=False)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))


class MultiHeadCausalAttention(nn.Module):
    """Multi-head self-attention with a causal mask."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.W_QKV = nn.Linear(d_model, 3 * d_model, bias=False)
        self.W_O = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        B, T, D = x.shape
        qkv = self.W_QKV(x)                # (B, T, 3D)
        q, k, v = qkv.chunk(3, dim=-1)     # each (B, T, D)
        # Split D into heads: (B, T, D) -> (B, n_heads, T, d_head)
        q = q.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        # PyTorch 2.0+: dispatches to FlashAttention when available.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Merge heads back: (B, n_heads, T, d_head) -> (B, T, D)
        out = out.transpose(1, 2).contiguous().view(B, T, D)
        return self.W_O(out)


class Block(nn.Module):
    """One pre-norm Transformer block: attention and MLP, each with a residual."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = MultiHeadCausalAttention(d_model, n_heads)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = MLP(d_model)

    def forward(self, x):
        # Pre-norm + residual: read from the stream, write back to it.
        x = x + self.attn(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x
Figure: [B, T, D] is preserved at every arrow. Stack N of these and you have the transformer body.
With the Transformer block in hand, the full model is straightforward. We add a token embedding (covered on Day 5), a learned positional embedding, a stack of n_layers blocks, a final layer norm, and an output projection that maps the final hidden state back to vocabulary logits.
class GPT(nn.Module):
    """Decoder-only language model: embeddings, N blocks, final norm, LM head."""

    def __init__(self, vocab_size: int, d_model: int, n_heads: int,
                 n_layers: int, max_seq_len: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_seq_len, d_model)
        self.blocks = nn.ModuleList([
            Block(d_model, n_heads) for _ in range(n_layers)
        ])
        self.norm_f = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying: share storage with the input embedding.
        self.lm_head.weight = self.tok_emb.weight

    def forward(self, idx):
        # idx: (B, T) long tensor of token IDs
        B, T = idx.shape
        positions = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(positions)   # (B, T, D)
        for block in self.blocks:
            x = block(x)
        x = self.norm_f(x)
        logits = self.lm_head(x)                          # (B, T, V)
        return logits
Weight tying means the LM head matrix and the token embedding matrix are the same tensor in memory: lm_head.weight = tok_emb.weight. Both layers map between the vocabulary index space and the hidden dimension space, so sharing weights is conceptually natural. In a V=50,257, D=768 model (GPT-2 small), this saves 50,257 × 768 × 4 bytes ≈ 150 MB. You get it for free in PyTorch by assigning the weight attribute.
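Tying can be verified in isolation. The sketch below, assuming PyTorch, ties a standalone embedding and linear layer (the sizes and the `nn.ModuleDict` wrapper are assumptions for illustration) and checks three things: both modules point at the same storage, an in-place edit through one is visible through the other, and `parameters()` counts the shared matrix once.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64                          # illustrative sizes (assumption)
tok_emb = nn.Embedding(vocab_size, d_model)             # weight shape (V, D)
lm_head = nn.Linear(d_model, vocab_size, bias=False)    # weight shape (V, D)
lm_head.weight = tok_emb.weight                         # tie: both names refer to one Parameter

# Wrap both modules so we can count parameters the way a full model would.
model = nn.ModuleDict({"tok_emb": tok_emb, "lm_head": lm_head})

same_object = lm_head.weight is tok_emb.weight
same_storage = lm_head.weight.data_ptr() == tok_emb.weight.data_ptr()

# An in-place update through the embedding is visible through the LM head.
with torch.no_grad():
    tok_emb.weight[0].fill_(1.0)
edit_is_shared = bool((lm_head.weight[0] == 1.0).all())

# parameters() yields each shared tensor once, so the tied matrix is not double-counted.
n_params = sum(p.numel() for p in model.parameters())
```

The shapes line up without a transpose because `nn.Linear` stores its weight as (out_features, in_features), which is (V, D) for the LM head.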
That is roughly 50 lines of model code, and it is a real, working language model. On Day 9 we will train this on real text and watch loss come down. The architecture has not fundamentally changed since 2017; what has changed is the scale at which we run it.
A complete GPT-style model fits in roughly 150 lines of clean PyTorch — see nanoGPT/model.py. The same architecture, trained at frontier scale, becomes LLaMA-3 405B. The skeleton is essentially the same 150 lines, just with bigger numbers everywhere. The architectural innovation since 2017 has been remarkably modest. What has changed is scale, data, and systems engineering.
Knowing parameter counts cold is one of the most useful skills for an LLM engineer. You will use it constantly to figure out what fits in your hardware, how much VRAM training will need, and how long an inference run will take. The arithmetic is simple if you remember what each component contributes.
Letting D = d_model:
| Component | Params | Explanation |
|---|---|---|
| Attn QKV projection | 3D² | W_QKV maps D → 3D |
| Attn output projection | D² | W_O maps D → D |
| FFN up projection | 4D² | W₁ maps D → 4D |
| FFN down projection | 4D² | W₂ maps 4D → D |
| Two LayerNorms | 4D | γ + β each, negligible |
| Block total | ≈ 12D² | the rule of thumb to memorize |
- Token embedding: V × D
- Positional embedding: T_max × D (small, often <1% of total)
- LM head: V × D if untied; effectively zero if tied to the input embedding

GPT-2 small uses D = 768, n_layers = 12, n_heads = 12, V = 50,257, T_max = 1024, with weight tying:
- Blocks: 12 × 12 × 768² ≈ 85M
- Embeddings: (50,257 + 1024) × 768 ≈ 39M
- Total: ≈ 124M

LLaMA-7B is more complex because it uses SwiGLU and untied embeddings. Configuration: D = 4096, n_layers = 32, n_heads = 32, V = 32,000, SwiGLU hidden dimension around 11,008, RMSNorm:
- Attention per block: 4 × 4096² ≈ 67M
- SwiGLU FFN per block: 3 × 4096 × 11008 ≈ 135M
- Per block: ~202M. With 32 layers: ~6.5B
- Embeddings (input and output, untied): 2 × 32,000 × 4096 ≈ 262M
- Total: ≈ 6.7B

This kind of arithmetic comes up constantly when you are deciding what fits in memory, what training budget you need, or how many GPUs to provision. Practice it until it's automatic.
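The two worked counts above reduce to a pair of small functions. This is a sketch in plain Python; the function names are hypothetical, norm parameters are ignored as negligible, and the configurations are the ones quoted in this lesson.

```python
def gpt2_style_params(d: int, n_layers: int, vocab: int, t_max: int) -> int:
    """Estimate for a GELU-FFN model with learned positions and a tied LM head."""
    block = 12 * d * d                  # 4D^2 attention + 8D^2 FFN
    embeddings = (vocab + t_max) * d    # token + positional; tied head adds nothing
    return n_layers * block + embeddings


def llama_style_params(d: int, n_layers: int, vocab: int, d_ffn: int) -> int:
    """Estimate for a SwiGLU model with untied input and output embeddings."""
    attention = 4 * d * d               # Q, K, V, and output projections
    ffn = 3 * d * d_ffn                 # three SwiGLU matrices
    embeddings = 2 * vocab * d          # input embedding + separate LM head
    return n_layers * (attention + ffn) + embeddings


gpt2_small = gpt2_style_params(d=768, n_layers=12, vocab=50257, t_max=1024)
llama_7b = llama_style_params(d=4096, n_layers=32, vocab=32000, d_ffn=11008)

print(f"GPT-2 small: {gpt2_small / 1e6:.0f}M")
print(f"LLaMA-7B:    {llama_7b / 1e9:.2f}B")
```

The estimates land at roughly 124M and 6.74B, in line with the totals worked out by hand.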
Transformers come in three architectural families, distinguished mainly by their attention mask and training objective.
| Family | Examples | Attention | Trained on | Used for |
|---|---|---|---|---|
| Encoder-only | BERT, RoBERTa, DeBERTa | Bidirectional | Masked LM | Classification, NER, retrieval embeddings |
| Decoder-only | GPT, LLaMA, Mistral, Qwen, DeepSeek, Gemma | Causal self-attention | Next-token prediction | Generation (and everything-via-prompting) |
| Encoder-decoder | T5, BART, original Transformer | Encoder bidir + decoder causal + cross-attn | Span corruption, seq2seq | Translation, summarization |
For modern LLMs, decoder-only architectures have decisively won the design competition. Several reasons combined to produce that outcome:
For the rest of this curriculum, when we say "LLM" we mean decoder-only Transformer. Every modern open LLM falls in this family.
The single best test of whether you understand the architecture is to trace shapes through a forward pass. We will use a small configuration: B = 4 sequences in the batch, T = 128 tokens per sequence, vocabulary size V = 32000, hidden dimension D = 512, with n_heads = 8 and n_layers = 6.
idx:                      (4, 128)            int64
tok_emb(idx):             (4, 128, 512)
pos_emb(positions):       (128, 512)          broadcasts to (4, 128, 512)
x = tok + pos:            (4, 128, 512)
for each block:
    norm1(x):             (4, 128, 512)       ← no shape change
    attn:
        W_QKV(x):         (4, 128, 1536)      ← 3 × D = 3 × 512
        chunk -> q,k,v:   3 × (4, 128, 512)
        reshape, perm:    3 × (4, 8, 128, 64) ← (B, n_heads, T, d_head)
        SDPA(q,k,v):      (4, 8, 128, 64)
        perm, reshape:    (4, 128, 512)
        W_O:              (4, 128, 512)
    x = x + attn:         (4, 128, 512)       ← residual add
    norm2(x):             (4, 128, 512)
    mlp:
        fc1:              (4, 128, 2048)      ← 4 × D = 4 × 512
        GELU:             (4, 128, 2048)
        fc2:              (4, 128, 512)
    x = x + mlp:          (4, 128, 512)       ← residual add
norm_f(x):                (4, 128, 512)
lm_head(x):               (4, 128, 32000)     logits, shape (B, T, V)
If any shape feels mysterious, read it again. Internalize it. Every modern LLM is some variation of this same shape walk, just with bigger numbers.
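The attention rows of the shape walk are the ones most worth running by hand. The sketch below, assuming PyTorch 2.0 or later, reproduces just those steps with a random stand-in weight matrix (`w_qkv` is a hypothetical name, not the trained `W_QKV`) and records the shape after each one.

```python
import torch
import torch.nn.functional as F

B, T, D, n_heads = 4, 128, 512, 8
d_head = D // n_heads                      # 64

x = torch.randn(B, T, D)
w_qkv = torch.randn(D, 3 * D) * 0.02       # random stand-in for the fused QKV weights

qkv = x @ w_qkv                            # fused projection
q, k, v = qkv.chunk(3, dim=-1)             # three tensors of width D

# Split the feature dimension into heads and move heads ahead of time.
q, k, v = (t.view(B, T, n_heads, d_head).transpose(1, 2) for t in (q, k, v))

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Move heads back next to the feature dimension and merge them.
merged = out.transpose(1, 2).contiguous().view(B, T, D)

shapes = {
    "qkv": tuple(qkv.shape),
    "q": tuple(q.shape),
    "sdpa": tuple(out.shape),
    "merged": tuple(merged.shape),
}
print(shapes)
```

The `.contiguous()` call matters: after `transpose` the tensor's memory layout no longer matches its logical shape, and `view` requires a compatible layout.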
Everything we covered today is not just architecture trivia — it is the precise unit that inference engines spend their lives running. Here is why each piece matters from a systems perspective.
A rough but famous rule: running a single new token through a model with N parameters requires approximately 2N floating-point operations. The factor of 2 comes from the multiply-add structure of matrix multiplications. For LLaMA-7B (N ≈ 7 billion), that is roughly 14 GFLOPs per token. On a GPU that sustains 80 TFLOP/s, you can generate about 5,700 tokens per second per GPU, assuming you are compute-bound. In practice, for small batch sizes you are usually memory-bandwidth bound instead, and actual throughput is lower. We will derive this precisely on Days 16–17.
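The 2N rule is simple enough to encode as a one-liner. This sketch is plain Python; the function name is hypothetical, and the result is an upper bound that assumes the GPU is fully compute-bound, which small-batch decoding usually is not.

```python
def compute_bound_tokens_per_second(n_params: float, peak_flops_per_second: float) -> float:
    """Upper bound on decode throughput from the 2N FLOPs-per-token rule."""
    flops_per_token = 2 * n_params       # one multiply and one add per parameter
    return peak_flops_per_second / flops_per_token


flops_per_token = 2 * 7e9                # about 14 GFLOPs for a 7B-parameter model
tokens_per_second = compute_bound_tokens_per_second(n_params=7e9,
                                                    peak_flops_per_second=80e12)
print(f"{flops_per_token / 1e9:.0f} GFLOPs/token, "
      f"{tokens_per_second:.0f} tokens/s upper bound")
```

Plugging in other parameter counts or peak rates gives a quick ceiling to compare measured throughput against.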
During a forward pass, every block materializes its intermediate activations: the QKV projections, the attention scores, the FFN hidden states. For a single token batch of LLaMA-7B in float16, a single block's activations are on the order of megabytes. For large batches or long contexts, activation memory becomes significant, which is why gradient checkpointing is essential during training.
For inference, the critical memory structure is the KV cache: for every block, the keys and values from all past tokens are stored so they don't have to be recomputed on each new token. Per block, the KV cache size is 2 × T × D_kv × sizeof(dtype). For LLaMA-7B with a 4096-token context in float16, the KV cache is roughly 32 blocks × 2 × 4096 × 4096 × 2 bytes ≈ 2 GB per sequence. Serving hundreds of concurrent sequences requires careful memory management — this is what PagedAttention (Day 24) addresses.
When you read about FlashAttention (Day 21), quantization (Day 22), speculative decoding (Day 23), or continuous batching and PagedAttention (Day 24), the unit being optimized is always the Transformer block. FlashAttention rewrites the attention sublayer to avoid materializing the full attention matrix. Quantization changes the dtype of the weight matrices inside the block. Tensor parallelism splits the block's matrix multiplications across GPUs. Everything in Weeks 3–4 is the block anatomy you built today, subjected to increasingly sophisticated systems engineering.
The KV cache for a single 128K-token context window on LLaMA-3 70B occupies roughly 40 GB, more than the entire memory of some GPU cards. This is why long-context serving is such an active research area, and why Week 4 spends an entire day on memory management.
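Both KV cache figures above follow from the per-block formula 2 × T × D_kv × sizeof(dtype). The sketch below is plain Python with a hypothetical function name. The LLaMA-3 70B configuration used here (80 layers, 8 key-value heads of dimension 128, so D_kv = 1024) is an assumption not stated in this lesson, as is reading "128K" as 128 × 1024 tokens.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers: int, seq_len: int, d_kv: int, bytes_per_value: int = 2) -> int:
    """KV cache size for one sequence: keys and values for every layer and position."""
    return n_layers * 2 * seq_len * d_kv * bytes_per_value


# LLaMA-7B as described in the lesson: 32 blocks, D_kv = D = 4096, float16.
llama_7b = kv_cache_bytes(n_layers=32, seq_len=4096, d_kv=4096)

# LLaMA-3 70B, assumed config: 80 layers, 8 KV heads x 128 dims, float16.
llama3_70b = kv_cache_bytes(n_layers=80, seq_len=128 * 1024, d_kv=8 * 128)

print(f"LLaMA-7B, 4K context:      {llama_7b / GIB:.0f} GiB")
print(f"LLaMA-3 70B, 128K context: {llama3_70b / GIB:.0f} GiB")
```

The second number would be eight times larger without grouped-query attention, since D_kv would then equal the full hidden dimension.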
Companion notebook: day-7-transformer-block.ipynb.
Verify against PyTorch's nn.LayerNorm. The reference implementation:
def layer_norm(x, gamma, beta, eps=1e-5):
    # Per-token statistics over the last (feature) dimension.
    mu = x.mean(-1, keepdim=True)
    var = x.var(-1, keepdim=True, unbiased=False)
    return gamma * (x - mu) / torch.sqrt(var + eps) + beta
- … (V=1000, D=128, n_heads=4, n_layers=4, T_max=64). Forward a random input batch of shape (2, 32). Verify the output shape is (2, 32, 1000).
- … 12D² rule. Then compute it with sum(p.numel() for p in model.parameters()). Write an assert that checks they agree within 5%. If they don't, find the discrepancy.
- … ln(V); that is the theoretical value for a randomly initialized model predicting uniformly over the vocabulary.
- … model.generate(idx, max_new_tokens). For each step: forward pass; take logits at the last position; argmax (greedy) or sample; append to the input; repeat. We don't add a KV cache yet; that's Day 20.
- … x before and after a block. Because it is never directly normalized, the scale of x can drift. Is this a problem? What happens if you remove the final norm_f before the LM head?

Close the page and answer from memory. If you can't, re-read the relevant section.
- … D=2048, n_layers=24, V=50000, plain GELU MLP, tied embeddings.
- … (2, 64) through a forward pass and write down the shape after every single sublayer.

We make this thing learn. The Week 2 lessons cover pre-training data and objectives, building and training a tiny GPT on real text, distributed training, modern architectural variations (LLaMA, Mistral, MoE), fine-tuning techniques (SFT, LoRA, QLoRA), and alignment methods (RLHF, DPO).
If you want to consolidate before moving on, the best thing you can do is re-implement everything from scratch in a single notebook. Go end to end: tokenizer, model, forward pass on random tokens. Internalize the shapes. The Day 7 companion notebook walks through exactly this build, plus a small TinyShakespeare training run as an optional bonus.
"A 70B-parameter LLM is the same picture you wrote in Week 1 — just with bigger numbers everywhere."
Hand-picked references for this lesson and the Week 1 capstone.
- Builds exactly what we built today, on Shakespeare. Spelled out, in code. (YouTube video)
- Roughly 300 lines of training plus 150 lines of model. Read it like a poem. (repository)
- Production-grade version of the same code. Optimizer, scheduler, and DDP details. (YouTube video)
- Sixteen pages of pure pseudocode. Great reference card. (paper)
- Why pre-norm trains stably to many layers. Theoretical and empirical evidence. (paper)
- The architecture every open LLM derives from. RMSNorm, RoPE, SwiGLU, GQA. (paper)
- The "residual stream" view of a Transformer. Beautifully clarifying. (post)
- Production reference: RMSNorm + RoPE + SwiGLU. Read after nanoGPT. (source code)
- MLX-native LLM implementations. Same architecture, Apple Silicon idioms. (repository)
- The original residual connection paper from computer vision. The same trick made 100-layer CV models trainable, then was adopted wholesale by Transformers. (paper)
- The positional encoding used in LLaMA and most modern LLMs. Replaces the learned absolute position embeddings we use today. Day 12 covers it in detail. (post)