The GPT-2 you built in Week 1 is a 2019 design. Today you upgrade it to 2024: rotary position embeddings, grouped-query attention, RMSNorm, SwiGLU, mixture-of-experts, and a first look at state-space alternatives. Five changes separate your tiny GPT from LLaMA 3, Mistral, Mixtral, and DeepSeek — and most of them exist specifically to make inference faster, cheaper, and longer-context.
If you opened LLaMA's source code today, the skeleton would be instantly familiar: embeddings, a stack of pre-norm decoder blocks, a final norm, a language-model head. The Day 7 picture holds. But every sub-component has been quietly upgraded since 2019, and the upgrades are not arbitrary: almost every one targets inference efficiency, whether through smaller KV caches, longer context at the same memory budget, cheaper FLOPs per useful parameter, or better quality at a fixed serving cost. For an inference engineer, these design choices determine how much memory a served model needs, how fast it decodes, and how many concurrent requests it can handle.
This lesson traces the evolution of each component in the order you encounter it walking through a modern block: position encoding first (RoPE), then attention variants (MHA → GQA → MQA, MLA), then normalization and the FFN (RMSNorm, SwiGLU), then conditional computation (MoE), and finally a preview of architectures that try to escape quadratic attention entirely (Mamba/SSM). Throughout, we frame each change by its inference consequence.
8/3·D hidden-dimension convention from first principles.
Tokens in a sequence have an order that raw attention ignores: two sentences with the same words in different orders should mean different things. The original Transformer paper patched this by adding a sinusoidal position vector to each token embedding before passing it through the blocks. GPT-2 replaced that with a learned absolute embedding: a lookup table of size T_max × D, one vector per position slot, trained end-to-end. Simple, and it works, but only inside the training-context window.
Two problems bite you at inference time:
ALiBi (Attention with Linear Biases) takes a different approach: instead of encoding position in the embeddings at all, it subtracts a linear penalty from every attention score proportional to the distance between query and key positions.
ALiBi has zero position parameters, extrapolates cleanly to longer sequences than those trained on (the penalty just grows), and is dead simple to implement. MPT and BLOOM use it. Its weakness: it doesn't encode relative position with as much expressivity as RoPE, and in practice it underperforms RoPE on tasks that require attending to content far away in the sequence.
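The linear penalty described above can be sketched in a few lines. This is a minimal illustration in plain Python lists rather than tensors, not any library's implementation. The function names are hypothetical, the slope schedule assumes a power-of-two head count, and folding the causal mask into the same matrix (as `-inf`) is a convenience chosen here.

```python
def alibi_slopes(n_heads: int) -> list[float]:
    """Per-head slopes 2^(-8/n), 2^(-16/n), ... (assumes n_heads is a power of two)."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]


def alibi_bias(n_heads: int, seq_len: int) -> list[list[list[float]]]:
    """bias[h][i][j] = -slope_h * (i - j) for causal pairs (j <= i), else -inf."""
    return [
        [[-m * (i - j) if j <= i else float("-inf") for j in range(seq_len)]
         for i in range(seq_len)]
        for m in alibi_slopes(n_heads)
    ]


# The bias is added to the raw q.k scores before the softmax.
bias = alibi_bias(n_heads=8, seq_len=4)
print(bias[0][3])  # penalties seen by the query at position 3, head 0
```

Because the bias depends only on the distance i − j, the same function works for any `seq_len`, which is where the extrapolation behaviour comes from.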
Almost every 2023-onward model uses Rotary Position Embedding (RoPE). The idea: instead of adding a position vector to the residual stream, rotate the query and key vectors by an angle proportional to their position, applied in 2D coordinate pairs. The key insight is algebraic: after rotation, the dot product q_m · k_n depends only on the content and the relative offset m − n, never on the absolute values of m and n separately. Attention is natively relative — which is what we wanted all along.
Why does the math work out? When you expand the dot product of two rotated vectors, cross terms that involve the absolute positions m and n individually cancel out — you're left with terms that only involve m − n. This is a consequence of the rotation group's structure: rotating both vectors by the same angle and then taking the dot product gives the same result as not rotating either.
RoPE also has a natural frequency interpretation: high-frequency components (small i) encode fine-grained local position; low-frequency components (large i, near the end of the head dimension) encode coarse global position. This is analogous to Fourier series — you're using a bank of sinusoids at different frequencies to embed position in the complex plane.
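To make the relative-position property concrete, here is a small sketch that rotates a query and a key and checks that shifting both positions by the same amount leaves the score unchanged. It uses plain Python lists so it runs without dependencies. The function names are illustrative, and pairing adjacent coordinates (2i, 2i+1) is an assumption; some implementations pair the first and second halves of the head dimension instead.

```python
import math


def rope_angles(pos: int, d_head: int, base: float = 10000.0) -> list[float]:
    """Angle pos * theta_i for each 2D pair i, with theta_i = base^(-2i/d_head)."""
    return [pos * base ** (-2.0 * i / d_head) for i in range(d_head // 2)]


def apply_rope(x: list[float], pos: int, base: float = 10000.0) -> list[float]:
    """Rotate each pair (x[2i], x[2i+1]) by the angle pos * theta_i."""
    out = []
    for i, ang in enumerate(rope_angles(pos, len(x), base)):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(ang), math.sin(ang)
        out += [a * c - b * s, a * s + b * c]
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


q = [0.3, -1.2, 0.7, 0.5, -0.4, 0.9, 1.1, -0.6]
k = [1.0, 0.2, -0.8, 0.4, 0.6, -0.3, 0.1, 0.75]

score_a = dot(apply_rope(q, 5), apply_rope(k, 2))      # offset m - n = 3
score_b = dot(apply_rope(q, 105), apply_rope(k, 102))  # same offset, shifted by 100
print(abs(score_a - score_b) < 1e-9)
```

Only q and k are rotated; the value vectors pass through untouched, since position only needs to influence the attention scores.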
m − n survives. No learned parameters; works at any context length.
RoPE's base hyperparameter sets the range of rotation frequencies. The fastest pair always completes a full rotation within a few tokens; the base controls how slowly the slowest pair turns. With a small base (10000), the slowest pair has a wavelength in the tens of thousands of tokens, which is ample for short contexts but leaves little headroom for very long ones. To serve a model at 4× its training context length without full retraining, you can interpolate: compress the position indices so they fall inside the trained range (position interpolation), or enlarge the base so the low frequencies rotate more slowly (NTK-aware scaling). YaRN combines per-frequency interpolation with a temperature adjustment on the attention logits. These tricks let an 8K-trained model serve at 128K with fine-tuning on only a small fraction of long-context examples.
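The two simplest context-extension tricks can be written down directly. The sketch below is illustrative only: the helper names are hypothetical, and the NTK-aware rule shown (multiplying the base by scale^(d/(d−2))) is the commonly cited form, assumed here rather than taken from any specific codebase.

```python
def rope_freqs(d_head: int, base: float = 10000.0) -> list[float]:
    """theta_i = base^(-2i/d_head) for each rotated pair i."""
    return [base ** (-2.0 * i / d_head) for i in range(d_head // 2)]


def interpolate_position(pos: float, scale: float) -> float:
    """Position interpolation: map positions back into the trained range."""
    return pos / scale


def ntk_scaled_base(base: float, scale: float, d_head: int) -> float:
    """NTK-aware scaling: enlarge the base so the slowest pair slows by `scale`."""
    return base * scale ** (d_head / (d_head - 2))


d_head, scale = 128, 4.0
orig = rope_freqs(d_head)
ntk = rope_freqs(d_head, ntk_scaled_base(10000.0, scale, d_head))

print(interpolate_position(32768, scale))  # a 32K position is treated like an 8K one
print(orig[0] / ntk[0])                    # fastest pair: unchanged
print(orig[-1] / ntk[-1])                  # slowest pair: slowed by about `scale`
```

Position interpolation rescales every frequency equally, while the NTK-aware variant leaves the high frequencies (local detail) almost untouched and stretches mainly the low ones.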
| Method | Mechanism | Extrapolation | Parameters | Used by |
|---|---|---|---|---|
| Learned absolute | Lookup table, add to residual | None (hard cutoff) | T_max × D learned | GPT-2, early BERT |
| Sinusoidal | Fixed sin/cos, add to residual | Poor | 0 | Original Transformer |
| ALiBi | Linear distance penalty on attention score | Good | 0 | MPT, BLOOM |
| RoPE | Rotate q/k by position angle | Good (+ NTK/YaRN) | 0 | LLaMA, Mistral, Qwen, Gemma, DeepSeek |
This section is the most important one for a serving engineer. Slow down here. In standard multi-head attention (MHA), each of the H heads has its own query, key, and value projection. During autoregressive generation we cache the keys and values for every past token — the KV cache. That cache stores 2 × H × d_head floats per token, per layer. At inference time, the KV cache is often the dominant memory consumer, easily exceeding the model weights for long contexts and large batches.
The fix is obvious once you see it: share the key and value heads. Queries can remain diverse (you want expressive query-side representation), but keys and values can be shared across groups of query heads without sacrificing much quality. You're compressing the part of attention that drives memory, not the part that drives expressiveness.
Multi-head attention (MHA): H query heads, H key heads, H value heads. The original design from "Attention is All You Need". Maximum expressiveness. Maximum KV-cache size.
Multi-query attention (MQA): H query heads, 1 key head, 1 value head, shared by all queries. KV cache is H× smaller than MHA. Measurable quality loss at large scale; fine at smaller scales. Used in some fast decoders (PaLM, Falcon).
Grouped-query attention (GQA): H query heads grouped into G groups, each group sharing 1 KV head. Cache is H/G× smaller than MHA. Quality nearly matches MHA. This is the current standard: LLaMA 2 70B, LLaMA 3, Mistral, Gemma 2, Qwen 2.5 all use GQA.
The formula is simple. For a model with L layers, n_kv KV heads each of dimension d_head, serving a batch of B sequences of length T in fp16 (2 bytes):
KV cache bytes = 2 (keys and values) × L × n_kv × d_head × T × B × 2 bytes
That 4× difference, roughly 17 GB vs 69 GB, is the difference between a deployable serving system and an impossible one. GQA is not a minor optimization; it's a prerequisite for serving large models at scale.
| Variant | n_kv_heads | KV cache (LLaMA-3 8B: L=32, d_head=128, T=8K, B=16, fp16) | Quality vs MHA | Notes |
|---|---|---|---|---|
| MHA | 32 | ~69 GB | Baseline | Original; infeasible at scale |
| GQA-8 | 8 | ~17 GB | ~Same | LLaMA-3 standard; sweet spot |
| GQA-4 | 4 | ~8.6 GB | Slight loss | More aggressive sharing |
| MQA | 1 | ~2.1 GB | Noticeable loss | Fast decoders, small models |
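The cache arithmetic is easy to script, which is useful when sizing a deployment. The sketch below is a plain calculation under stated assumptions: a LLaMA-3 8B-style shape with 32 layers and d_head = 4096 / 32 = 128, fp16 storage, and no allowance for paging overhead or fragmentation. The function name mirrors the one used in the companion exercises, but the body is an assumption.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, dtype_bytes=2):
    """Bytes needed to cache keys and values; the leading 2 counts K and V."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * dtype_bytes


# Assumed shape: 32 layers, 128-dim heads, T = 8192, B = 16, fp16.
for name, n_kv in [("MHA", 32), ("GQA-8", 8), ("GQA-4", 4), ("MQA", 1)]:
    size = kv_cache_bytes(32, n_kv, 128, seq_len=8192, batch=16)
    print(f"{name:6s} n_kv={n_kv:2d}  {size / 2**30:6.1f} GiB  ({size / 1e9:6.1f} GB)")
```

The cache scales linearly in every argument, so doubling either the context length or the batch size doubles the memory bill.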
Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, pushes the idea further. Instead of storing full KV tensors in the cache, MLA compresses them into a low-rank latent vector that is much smaller than the full KV pair. The keys and values are reconstructed at attention time from these latent vectors via learned up-projection matrices. The cached latent is a fraction of the size of a GQA KV cache while recovering most of MHA's expressiveness. This is how DeepSeek-V3 achieves strong quality with very small cache footprints.
RMSNorm and SwiGLU were both adopted in the original LLaMA paper and have since become universal. They're relatively small increments over Day 7's baseline, but every modern model uses them, so let's understand each one precisely.
LayerNorm (Day 7) normalizes each token vector to zero mean and unit variance, then re-scales with learned γ and β. RMSNorm drops the mean-subtraction step and the bias β entirely. You only divide by the root-mean-square of the activations, then scale:
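The description above translates into very little code. This is a minimal sketch on a single token vector using plain Python lists; the function names and the placement of `eps` inside the square root are assumptions for illustration, and a LayerNorm is included only for side-by-side comparison.

```python
import math


def rms_norm(x, gamma, eps=1e-5):
    """Divide by the root-mean-square of x, then apply the learned scale gamma."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gamma, x)]


def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm for comparison: subtract the mean, divide by std, scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for g, v, b in zip(gamma, x, beta)]


x = [2.0, -1.0, 0.5, 3.5]
y = rms_norm(x, gamma=[1.0] * 4)
print(sum(v * v for v in y) / len(y))  # mean square of the output, close to 1
```

RMSNorm needs one reduction (the sum of squares) where LayerNorm needs two (mean, then variance), and it carries no β parameter.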
Why is dropping the mean safe? The intuition is that the mean-centering in LayerNorm is largely redundant with the bias terms elsewhere in the network. Empirically, RMSNorm matches LayerNorm quality while running roughly 7% faster. At large scale (thousands of norm calls per forward pass), those savings compound.
Inference implication: norm calls are memory-bandwidth bound, not compute-bound. Removing the mean-computation reduces the number of passes over the activation tensor. At long sequence lengths (where activations are large) and when running on memory-bandwidth-limited hardware (consumer GPUs), this matters more than it looks.
The GPT-2 FFN was a two-matrix sandwich: project up to 4D, apply GELU, project back to D:
SwiGLU adds a third matrix and uses it as a multiplicative gate on the main path:
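Both FFN variants can be sketched side by side. The code below is a dependency-free illustration on a single token vector, not a production layer: weights are lists of rows, biases are omitted (GPT-2's FFN does have them), and the names `W_gate`, `W_up`, and `W_down` are labels assumed here for the three SwiGLU matrices.

```python
import math


def gelu(v):
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def silu(v):
    return v / (1.0 + math.exp(-v))


def matvec(W, x):
    """W is a list of rows; returns W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gelu_ffn(x, W_up, W_down):
    """GPT-2 style: down(GELU(up(x)))."""
    return matvec(W_down, [gelu(h) for h in matvec(W_up, x)])


def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU: down(SiLU(gate(x)) * up(x)), elementwise product on the hidden dim."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    return matvec(W_down, [g * u for g, u in zip(gate, up)])


def ffn_params(d_model, hidden, n_matrices):
    """Weight count, ignoring biases."""
    return n_matrices * d_model * hidden


D = 4096
print(ffn_params(D, 4 * D, 2))             # GELU FFN: two D x 4D matrices
print(ffn_params(D, round(8 * D / 3), 3))  # SwiGLU: three D x (8/3)D matrices
```

The two parameter counts come out nearly equal, which is the reason for shrinking the hidden width to about 8/3·D when the third matrix is added.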
Why does gating help? The gate (SiLU output) acts as a soft switch that can suppress features that are irrelevant for the current token. This gives the FFN more selectivity: it can represent sharper, more context-dependent feature activations than a plain GELU FFN of equal parameter count. SwiGLU consistently improves perplexity at matched parameter budgets in Shazeer's ablations — not by a huge amount, but reliably.
Inference implication of SwiGLU: the FFN is often the largest compute cost in a block (at batch size 1, it is the dominant memory-bandwidth consumer). SwiGLU doesn't change the asymptotic cost, but it does require a third matrix multiply, which changes how FFN layers tile onto hardware. Fused SwiGLU kernels (available in xformers, flash-attention, and vLLM's fused MLP) keep it efficient.
So far every change kept the model dense — every parameter participates in every token. Mixture of Experts (MoE) breaks that. It replaces the single FFN in a block with E parallel FFN "experts" plus a small router network. For each token, the router picks the top k experts (typically k=2 out of E=8 or more), and only those run. The token's output is a weighted combination of its chosen experts.
Think of it this way: a dense 47B-parameter model has one very large FFN per block. A 47B MoE model has eight smaller expert FFNs per block (each expert adds up to roughly 6B parameters across all layers), but only two of them run per token, giving you the knowledge capacity of a much larger model at roughly the compute cost of a smaller one.
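The routing step for a single token can be sketched as follows. This is a toy illustration with hypothetical names: the experts are stand-in functions rather than real FFNs, and normalising the weights with a softmax over only the selected logits is one common convention assumed here (other routers apply the softmax over all experts first).

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def route_top_k(router_logits, k=2):
    """Pick the k highest-scoring experts and normalise their weights to sum to 1."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the chosen experts' outputs; the other experts never run."""
    out = [0.0] * len(x)
    for idx, w in route_top_k(router_logits, k):
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


# Toy experts: expert e simply scales its input by (e + 1).
experts = [lambda x, s=e + 1: [s * v for v in x] for e in range(8)]
logits = [0.1, 2.0, -1.0, 0.3, 2.0, 0.0, -0.5, 1.0]
print(route_top_k(logits))
print(moe_forward([1.0, -2.0], logits, experts))
```

Per token, only k of the E expert functions are ever called, which is where the gap between total and active parameters comes from.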
Without any constraint, the router quickly learns to route most tokens to a small subset of experts. This is catastrophically bad: the other experts don't get gradients, don't specialize, and become wasted parameters. The standard fix is an auxiliary load-balancing loss (in the simplified form popularized by Switch Transformer):
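A Switch-style auxiliary loss multiplies, for each expert, the fraction of tokens routed to it by the mean router probability it receives, then sums over experts. The sketch below assumes top-1 assignments and a coefficient α = 0.01; the function name and the toy inputs are illustrative only.

```python
def load_balancing_loss(router_probs, assignments, n_experts, alpha=0.01):
    """alpha * E * sum_i f_i * P_i.

    router_probs: per-token list of softmax probabilities over experts.
    assignments:  per-token index of the expert the token was sent to (top-1).
    """
    n_tokens = len(assignments)
    # f_i: fraction of tokens dispatched to expert i.
    f = [assignments.count(e) / n_tokens for e in range(n_experts)]
    # P_i: mean router probability assigned to expert i.
    p = [sum(tok[e] for tok in router_probs) / n_tokens for e in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, p))


balanced = load_balancing_loss([[0.25] * 4] * 4, [0, 1, 2, 3], n_experts=4)
collapsed = load_balancing_loss([[0.97, 0.01, 0.01, 0.01]] * 4, [0, 0, 0, 0], n_experts=4)
print(balanced, collapsed)  # the collapsed router pays a larger penalty
```

The loss is smallest when routing is uniform, so its gradient nudges the router away from piling tokens onto a few experts.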
The load-balancing loss does not dominate training (the coefficient α is small), but without it the router collapses within a few hundred steps. With it, experts specialize — different experts end up handling different token types (syntax, factual recall, reasoning patterns).
GPT-4 is widely believed to be a mixture-of-experts model — unconfirmed estimates suggest ~1.8T total parameters across ~16 experts with ~220B active per token. Whether or not the exact numbers are right, the pattern is now standard at the frontier: Mixtral, DeepSeek-V3, Qwen-MoE, and Grok are all MoE. Conditional computation is the mechanism by which you build a "trillion-parameter" model you can actually afford to run. DeepSeek-V3's fine-grained MoE uses 256 experts per layer with only top-8 active — taking the idea much further than Mixtral's 8×7B.
Attention's KV cache is one inference tax; its quadratic compute cost is another. At sequence length T, computing full attention requires O(T²·d) FLOPs and, if the attention matrix is materialized, O(T²) memory. At T=128K tokens, this is expensive: 16× the memory and FLOPs of T=32K. The KV cache removes the quadratic compute cost during token-by-token generation (each step is O(T·d)), but prefill (processing the prompt) still runs quadratic attention over the whole sequence.
State-space models model a sequence as a linear dynamical system: a hidden state h_t is updated recurrently as each new input x_t arrives. At inference time this is O(1) per step (just update the state); at training time SSMs can be parallelized via parallel scan (O(T log T) or O(T) depending on implementation).
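The recurrence is short enough to write out. The sketch below assumes the simplest possible setting: a discrete-time SSM with a diagonal state matrix, a scalar input per step, and fixed (not input-dependent) parameters. The names `A`, `B`, `C` follow the usual state-space notation but are otherwise hypothetical here.

```python
def ssm_step(h, x_t, A, B, C):
    """One recurrent step: h <- A * h + B * x_t (elementwise), y = C . h."""
    h = [a * hi + b * x_t for a, hi, b in zip(A, h, B)]
    y = sum(c * hi for c, hi in zip(C, h))
    return h, y


def ssm_scan(xs, A, B, C):
    """Run the recurrence over a sequence; the state size never grows with T."""
    h = [0.0] * len(A)
    ys = []
    for x_t in xs:
        h, y = ssm_step(h, x_t, A, B, C)
        ys.append(y)
    return ys, h


A, B, C = [0.5, 0.9], [1.0, 1.0], [1.0, 0.5]
ys, h = ssm_scan([1.0, 0.0, 0.0], A, B, C)
print(ys)      # response to a single impulse at t = 0
print(len(h))  # state size, independent of sequence length
```

Each step touches only the fixed-size state, in contrast to attention, where every new token must read the keys and values of all previous tokens.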
Mamba (Gu & Dao, 2023) makes SSM parameters input-dependent (selective) and adds hardware-efficient parallel scan kernels. It achieves competitive quality with Transformers at sub-quadratic cost on long sequences. Mamba-2 refines the state-space structure further.
Despite the theoretical appeal, SSMs have not displaced attention in production. The reasons are practical:
For inference engineers: if you serve a Mamba or hybrid model, the KV cache is replaced by or supplemented with a recurrent state cache, a fixed-size state per layer. Its size does not grow with the number of generated tokens, whereas a KV cache grows linearly with sequence length, which is attractive for very long generation runs.
Here is the shape of a modern model config (LLaMA-3 8B style). You now know what every line means.
dim: 4096 # d_model (D)
n_layers: 32 # decoder blocks
n_heads: 32 # query heads (H)
n_kv_heads: 8 # GQA: 8 KV groups → 4× smaller KV cache
vocab_size: 128256 # bigger tokenizer than GPT-2's 50k
ffn_dim: 14336 # SwiGLU hidden = 3.5·D here, wider than the (8/3)·D ≈ 10923 baseline
norm: RMSNorm # not LayerNorm
norm_eps: 1e-5
position: RoPE # rotary, not learned absolute
rope_theta: 500000 # large base → better long-context extrapolation
activation: SiLU # the gate nonlinearity in SwiGLU
tied_embeddings: false # separate LM head weights
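As a sanity check on the config above, the parameter count can be reproduced from these fields alone. The sketch assumes no bias terms anywhere, two RMSNorm scale vectors per block plus one final norm, and three FFN matrices per block; the dictionary keys simply mirror the config lines, and `count_params` is a hypothetical helper.

```python
cfg = dict(dim=4096, n_layers=32, n_heads=32, n_kv_heads=8,
           vocab_size=128256, ffn_dim=14336, tied_embeddings=False)


def count_params(c):
    d_head = c["dim"] // c["n_heads"]
    kv_dim = c["n_kv_heads"] * d_head
    attn = 2 * c["dim"] * c["dim"] + 2 * c["dim"] * kv_dim  # Wq and Wo, plus Wk and Wv
    ffn = 3 * c["dim"] * c["ffn_dim"]                       # gate, up, down
    norms = 2 * c["dim"]                                    # two RMSNorm scales per block
    per_layer = attn + ffn + norms
    embed = c["vocab_size"] * c["dim"]
    head = 0 if c["tied_embeddings"] else embed             # separate LM head if untied
    return c["n_layers"] * per_layer + embed + head + c["dim"]  # + final norm


print(f"{count_params(cfg) / 1e9:.2f}B parameters")
```

Note how the GQA setting shows up directly: the key and value projections are a quarter the size of the query projection because `n_kv_heads` is 8 rather than 32.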
| Family | Pos. Encoding | Norm | Attention | FFN | Sparse? |
|---|---|---|---|---|---|
| GPT-2 (2019) | Learned abs. | LayerNorm | MHA | GELU 4D | Dense |
| LLaMA 1/2 | RoPE (base 10K) | RMSNorm | GQA (70B only) | SwiGLU 8/3D | Dense |
| LLaMA 3 | RoPE (base 500K) | RMSNorm | GQA-8 | SwiGLU 14336 | Dense |
| Mistral 7B | RoPE | RMSNorm | GQA + sliding window | SwiGLU | Dense |
| Mixtral 8×7B | RoPE | RMSNorm | GQA | SwiGLU experts | MoE (8, top-2) |
| DeepSeek-V3 | RoPE | RMSNorm | MLA | SwiGLU experts | MoE (256, top-8) |
| Jamba (hybrid) | None (no explicit encoding) | RMSNorm | Attn + Mamba | SwiGLU / MoE | Hybrid + MoE |
The columns are today's lesson. The direction of travel is unmistakable: every new model pushes further toward smaller KV-cache footprints (GQA → MLA), more capacity per compute dollar (dense → MoE), and longer context at manageable cost (larger RoPE base, NTK/YaRN). These are inference engineering problems wearing a training-time hat.
Companion notebook: day-12-modern-architectures.ipynb.
build_rope_cache and apply_rope(x, positions) for a (B, H, T, d_head) tensor. Verify the relative-position property numerically: the q·k score at positions (m, n) must equal that at (m+s, n+s) for any shift s.
m·θ_i for several values of i as a function of position m. Observe the high-frequency / low-frequency structure (like a Fourier basis).
nn.Modules. Confirm that SwiGLU at hidden ≈8/3·D has about the same parameter count as a GELU FFN at 4D. Measure forward-pass time for both.
n_kv_heads < n_heads by repeating KV heads. Confirm it reduces to MHA when n_kv_heads == n_heads and MQA when == 1.
kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, dtype_bytes). Print a table for MHA vs GQA-8 vs MQA for a LLaMA-3 8B-shaped model at T=8K and T=128K, batch sizes 1 and 16.
T ∈ [1K, 2K, 4K, 8K, 16K, 32K, 128K] against the linear SSM baseline.
Close the page and answer from memory. If you cannot, re-read the relevant section.
v left unrotated?
8/3·D instead of 4D?
"Most 'modern architecture' is inference engineering wearing a training hat. GQA, RoPE's long-context extensions, and MoE exist because someone has to serve the thing, on hardware that exists today."
The papers and code behind each upgrade.
RMSNorm + RoPE + SwiGLU + GQA in one influential package. The canonical reference for the modern dense transformer recipe.
The rotary position embedding, derived and motivated. Includes the proof that the attention score is a function of relative position only.
Grouped-query attention: the principled interpolation between MHA and MQA with ablations at T5 scale.
The original "share one KV head" idea, motivated by fast autoregressive decoding.
"GLU Variants Improve Transformer." Where SwiGLU comes from: empirical ablation over gating functions.
Sparse MoE at 8×7B: 47B total, 13B active. The canonical modern MoE reference.
The original mixture-of-experts layer with top-k routing and the load-balancing loss.
Selective state-space model with hardware-efficient parallel scan. The leading SSM architecture alternative to Transformers.
NTK-aware and YaRN scaling of RoPE for long-context extension without full retraining.
Production reference: RoPE + RMSNorm + SwiGLU + GQA in one readable file. Cross-reference with today's derivations.
Multi-Head Latent Attention: low-rank KV compression that reduces KV cache size far beyond GQA while recovering most of MHA quality.