LLM Inference Engineer · Day 12
Day 12 · Week 2 · Training & Architectures

Modern Architectures: LLaMA, Mistral, MoE

The GPT-2 you built in Week 1 is a 2019 design. Today you upgrade it to 2024: rotary position embeddings, grouped-query attention, RMSNorm, SwiGLU, mixture-of-experts, and a first look at state-space alternatives. Five changes separate your tiny GPT from LLaMA 3, Mistral, Mixtral, and DeepSeek — and most of them exist specifically to make inference faster, cheaper, and longer-context.

Time: ~200 min
Difficulty: Medium-Hard
Prerequisite: Days 6–7, 11
Why This Lesson

The architecture barely changed. The details changed a lot — and they're inference details.

If you opened LLaMA's source code today, the skeleton would be instantly familiar: embeddings, a stack of pre-norm decoder blocks, a final norm, a language-model head. The Day 7 picture holds. But every sub-component has been quietly upgraded since 2019, and the upgrades are not arbitrary — almost every one targets inference efficiency: smaller KV caches, longer context at the same memory budget, cheaper FLOPs per useful parameter, or better quality at a fixed serving cost. As an inference engineer, these design choices determine how much memory a served model needs, how fast it decodes, and how many concurrent requests you can handle.

This lesson traces the evolution of each component in the order you encounter it walking through a modern block: position encoding first (RoPE), then attention variants (MHA → GQA → MQA, MLA), then normalization and the FFN (RMSNorm, SwiGLU), then conditional computation (MoE), and finally a preview of architectures that try to escape quadratic attention entirely (Mamba/SSM). Throughout, we frame each change by its inference consequence.

Learning objectives

  1. Trace the evolution of positional encoding from absolute/learned to RoPE to ALiBi, and explain why RoPE became the dominant choice.
  2. Implement RoPE and verify the relative-position property mathematically and in code.
  3. Explain how MHA, GQA, and MQA differ in KV-head count, and quantify the KV-cache memory savings for a realistic model configuration.
  4. Describe MLA (DeepSeek) and why low-rank KV compression is a further step in the same direction.
  5. Implement SwiGLU and justify the 8/3·D hidden-dimension convention from first principles.
  6. Explain mixture-of-experts: routing, top-k, total vs active parameters, load balancing, and inference-time implications.
  7. Explain why state-space models are attractive as attention alternatives and where they currently fall short.
  8. Read the config of LLaMA 3, Mistral, Mixtral, or DeepSeek and map every field to a concept from this lesson.
Positional Encoding — Evolution

From "just add a lookup table" to rotation-based relative position.

Tokens in a sequence have an order that raw attention ignores — two sentences with the same words in different orders should mean different things. The original Transformer paper patched this by adding a sinusoidal position vector to each token embedding before passing it through the blocks. GPT-2 replaced that with a learned absolute embedding: a lookup table of size T_max × D, one vector per position slot, trained end-to-end. Simple, and it works — inside the training-context window.

Two problems bite you at inference time:

  1. No extrapolation. Position 8193 has no lookup entry if you trained on 8192-token contexts. Stitching in a new learned vector at run time doesn't generalize well.
  2. Position lives in the wrong place. Adding position to the residual stream mixes it with content; the attention mechanism has to learn to disentangle them. It works, but it's noisy.

ALiBi — simple relative bias, no learned parameters

ALiBi (Attention with Linear Biases) takes a different approach: instead of encoding position in the embeddings at all, it subtracts a linear penalty from every attention score proportional to the distance between query and key positions.

ALiBi score(i, j) = q_i · k_j / √d − m · (i − j)

m = head-specific slope from a geometric sequence (for 8 heads: 1/2, 1/4, …, 1/256; in general 2^(−8h/H) for head h = 1…H)
The penalty grows linearly with distance, so every head attends more to nearby tokens.

ALiBi has zero position parameters, extrapolates cleanly to longer sequences than those trained on (the penalty just grows), and is dead simple to implement. MPT and BLOOM use it. Its weakness: it doesn't encode relative position with as much expressivity as RoPE, and in practice it underperforms RoPE on tasks that require attending to content far away in the sequence.
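
The bias term is easy to build explicitly. The following is a minimal NumPy sketch, assuming the geometric slope schedule from the ALiBi paper for a power-of-two head count; the function names `alibi_slopes` and `alibi_bias` are hypothetical, and the causal mask is assumed to be applied separately.

```python
import numpy as np

def alibi_slopes(n_heads: int) -> np.ndarray:
    """Geometric slopes 2^(-8h/H) for h = 1..H (assumes H is a power of two)."""
    h = np.arange(1, n_heads + 1)
    return 2.0 ** (-8.0 * h / n_heads)

def alibi_bias(n_heads: int, seq_len: int) -> np.ndarray:
    """Return an (H, T, T) bias that is added to the attention scores before softmax."""
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]   # (T, T): i - j, positive for past keys
    distance = np.maximum(distance, 0)       # future keys are handled by the causal mask
    return -alibi_slopes(n_heads)[:, None, None] * distance

bias = alibi_bias(n_heads=8, seq_len=4)
print(alibi_slopes(8)[:3])   # first three slopes: 1/2, 1/4, 1/8
print(bias[0])               # head 0: penalty of 0.5 per token of distance
```

Because the bias depends only on `i - j`, the same function works for any `seq_len`, which is where the clean extrapolation comes from.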

RoPE — the winner

Almost every 2023-onward model uses Rotary Position Embedding (RoPE). The idea: instead of adding a position vector to the residual stream, rotate the query and key vectors by an angle proportional to their position, applied in 2D coordinate pairs. The key insight is algebraic: after rotation, the dot product q_m · k_n depends only on the content and the relative offset m − n, never on the absolute values of m and n separately. Attention is natively relative — which is what we wanted all along.

For the pair (x_{2i}, x_{2i+1}) at position m, rotate by angle m·θ_i:

θ_i = base^(−2i/d)        base = 10000 (RoFormer, LLaMA 1/2) or 500000 (LLaMA 3)

[x_{2i}  ]     [cos(m·θ_i)  −sin(m·θ_i)] [x_{2i}  ]
[x_{2i+1}]  →  [sin(m·θ_i)   cos(m·θ_i)] [x_{2i+1}]

⟨RoPE(q, m), RoPE(k, n)⟩ = f(q, k, m − n)    ← depends only on the relative offset

Applied to q and k inside every attention layer, before the dot product. v is left unrotated. No learned parameters.

Why does the math work out? Rotations compose: the dot product of q rotated by m·θ with k rotated by n·θ equals the dot product of q with k rotated by (n − m)·θ. Put differently, rotating both vectors by the same extra angle leaves their dot product unchanged, so only the difference between the two angles survives and every term that involves m or n individually cancels.

RoPE also has a natural frequency interpretation: high-frequency components (small i) encode fine-grained local position; low-frequency components (large i, near the end of the head dimension) encode coarse global position. This is analogous to Fourier series — you're using a bank of sinusoids at different frequencies to embed position in the complex plane.

[Figure: RoPE: rotating (q, k) pairs by position → relative attention]
RoPE rotates each 2D coordinate pair of q and k by an angle proportional to position. When you compute the dot product, the absolute position cancels and only the relative offset m − n survives. No learned parameters; works at any context length.
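
The relative-position property can be checked numerically. This is a small NumPy sketch under stated assumptions: it pairs adjacent coordinates exactly as in the formula above (some libraries pair coordinate i with i + d/2 instead), works on a single head, and uses hypothetical helper names `rope_angles`, `apply_rope`, and `score`.

```python
import numpy as np

def rope_angles(positions: np.ndarray, d_head: int, base: float = 10000.0) -> np.ndarray:
    """Angles m * theta_i with theta_i = base^(-2i/d); shape (T, d_head // 2)."""
    i = np.arange(d_head // 2)
    theta = base ** (-2.0 * i / d_head)
    return positions[:, None] * theta[None, :]

def apply_rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive pairs (x_{2i}, x_{2i+1}) of x, where x has shape (T, d_head)."""
    ang = rope_angles(positions, x.shape[-1], base)
    cos, sin = np.cos(ang), np.sin(ang)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 64))
k = rng.standard_normal((1, 64))

def score(m: int, n: int) -> float:
    """Attention logit between q placed at position m and k placed at position n."""
    qm = apply_rope(q, np.array([m]))
    kn = apply_rope(k, np.array([n]))
    return float(qm[0] @ kn[0])

# Same offset m - n = 3 at two different absolute positions gives the same score.
print(np.isclose(score(5, 2), score(105, 102)))
```

Changing the offset (for example `score(6, 2)`) changes the score, while shifting both positions by the same amount does not.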

Long-context extension: NTK / YaRN

RoPE's base hyperparameter sets the spread of rotation frequencies. The fastest pair (i = 0) always completes a rotation every 2π ≈ 6 tokens regardless of the base; what the base controls is the slow end, whose wavelength approaches 2π·base tokens. Beyond the training length the model meets rotation angles it never saw during training, and quality degrades. To serve a model at 4× its training context length without full retraining, you can compress the position indices back into the trained range (position interpolation), or enlarge the base so that the slow frequencies stretch while the fast ones stay almost unchanged (NTK-aware scaling). YaRN interpolates each frequency band by a different amount and adds a temperature adjustment to the attention logits. Combined with fine-tuning on only a small amount of long-context data, these tricks can let an 8K-trained model serve at 128K.
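
To make the effect of the base concrete, here is a short sketch that prints the shortest and longest rotation wavelengths for two bases and shows plain position interpolation. The helper names are hypothetical, and `d_head = 128` is an assumed head size.

```python
import math

def rope_wavelengths(d_head: int, base: float) -> list[float]:
    """Tokens per full rotation for each pair i: 2*pi / theta_i, with theta_i = base^(-2i/d)."""
    return [2 * math.pi * base ** (2 * i / d_head) for i in range(d_head // 2)]

for base in (10_000.0, 500_000.0):
    w = rope_wavelengths(128, base)
    print(f"base={base:>9.0f}  shortest={w[0]:.2f} tokens  longest={w[-1]:,.0f} tokens")

def interpolate_position(m: int, train_len: int, target_len: int) -> float:
    """Position interpolation: rescale m so that [0, target_len) maps into [0, train_len)."""
    return m * train_len / target_len

# The last position of a 32K context lands just inside the 8K range seen in training.
print(interpolate_position(32_767, train_len=8_192, target_len=32_768))
```

The shortest wavelength is identical for both bases; only the long-wavelength end moves, which is why a larger base helps with long contexts.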

| Method | Mechanism | Extrapolation | Parameters | Used by |
|---|---|---|---|---|
| Learned absolute | Lookup table, add to residual | None (hard cutoff) | T_max × D learned | GPT-2, early BERT |
| Sinusoidal | Fixed sin/cos, add to residual | Poor | 0 | Original Transformer |
| ALiBi | Linear distance penalty on attention score | Good | 0 | MPT, BLOOM |
| RoPE | Rotate q/k by position angle | Good (+ NTK/YaRN) | 0 | LLaMA, Mistral, Qwen, Gemma, DeepSeek |

Attention Variants for Inference

MHA → GQA → MQA: sharing KV heads is the single biggest inference win.

This section is the most important one for a serving engineer. Slow down here. In standard multi-head attention (MHA), each of the H heads has its own query, key, and value projection. During autoregressive generation we cache the keys and values for every past token — the KV cache. That cache stores 2 × H × d_head floats per token, per layer. At inference time, the KV cache is often the dominant memory consumer, easily exceeding the model weights for long contexts and large batches.

The fix is obvious once you see it: share the key and value heads. Queries can remain diverse (you want expressive query-side representation), but keys and values can be shared across groups of query heads without sacrificing much quality. You're compressing the part of attention that drives memory, not the part that drives expressiveness.

The three variants

  • Multi-Head Attention (MHA): H query heads, H key heads, H value heads. The original design from "Attention is All You Need". Maximum expressiveness. Maximum KV-cache size.
  • Multi-Query Attention (MQA): H query heads, 1 key head, 1 value head, shared by all queries. KV cache is H× smaller than MHA. Measurable quality loss at large scale; fine at smaller scales. Used in some fast decoders (PaLM, Falcon).
  • Grouped-Query Attention (GQA): H query heads grouped into G groups, each group sharing 1 KV head. Cache is H/G× smaller than MHA. Quality nearly matches MHA. This is the current standard — LLaMA 2 70B, LLaMA 3, Mistral, Gemma 2, Qwen 2.5 all use GQA.
[Figure: MHA (8 Q · 8 KV heads; KV cache 8×, the baseline), GQA (8 Q · 2 KV heads, groups of 4; KV cache 2×, a 4× saving vs MHA), MQA (8 Q · 1 shared KV head; KV cache 1×, an 8× saving vs MHA). KV cache bytes = 2 × L × n_kv_heads × d_head × T × B × dtype_bytes. GQA is the standard: close to MHA quality, most of the memory savings.]
MHA, GQA, and MQA differ only in how many key/value heads are kept. Each KV head must be stored for every past token in the sequence — so fewer KV heads means a smaller KV cache, directly enabling longer contexts and larger batches at the same GPU memory budget.
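
One common way to implement the sharing is to keep only `n_kv` heads in the cache and repeat them across query groups at attention time. The sketch below assumes that approach, a single sequence, and causal masking; `gqa_attention` is a hypothetical name, not a library function.

```python
import numpy as np

def gqa_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Causal grouped-query attention.

    q: (H, T, d); k, v: (n_kv, T, d), with H divisible by n_kv.
    """
    n_heads, seq_len, d_head = q.shape
    group = n_heads // k.shape[0]
    # Each cached KV head serves `group` consecutive query heads.
    k = np.repeat(k, group, axis=0)                       # (H, T, d)
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (H, T, T)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)            # hide future keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
H, n_kv, T, d = 8, 2, 5, 16
q = rng.standard_normal((H, T, d))
k = rng.standard_normal((n_kv, T, d))
v = rng.standard_normal((n_kv, T, d))
out = gqa_attention(q, k, v)
print(out.shape)                  # (8, 5, 16)
print(k.size / (H * T * d))       # cached K is n_kv / H = 0.25 of the MHA size
```

With `n_kv == H` the repeat is a no-op and this is plain MHA; with `n_kv == 1` it is MQA.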

KV cache memory: putting numbers on it

The formula is simple. For a model with L layers, n_kv KV heads each of dimension d_head, serving a batch of B sequences of length T in fp16 (2 bytes):

KV cache bytes = 2 (K+V) × L × n_kv × d_head × T × B × 2 bytes

LLaMA-3 8B: L=32, d_head=128, H=32 query heads, n_kv=8 (GQA)
At T=8192, B=16: 2×32×8×128×8192×16×2 = 17.2 GB (GQA-8)
vs MHA (n_kv=32): 4× larger = 68.7 GB. Add roughly 16 GB of fp16 weights and it no longer fits on an 80 GB GPU.
vs MQA (n_kv=1): 17.2 / 8 = 2.1 GB. Tiny, but quality suffers.

That 4× difference, 17 GB vs 69 GB, separates a cache that sits comfortably next to the weights on one 80 GB GPU from one that does not fit at all. GQA is not a minor optimization; it's a prerequisite for serving large models at scale.

| Variant | n_kv_heads | KV cache (LLaMA-3 8B, T=8K, B=16) | Quality vs MHA | Notes |
|---|---|---|---|---|
| MHA | 32 | 68.7 GB | Baseline | Original; infeasible at scale |
| GQA-8 | 8 | 17.2 GB | ~Same | LLaMA-3 standard; sweet spot |
| GQA-4 | 4 | 8.6 GB | Slight loss | More aggressive sharing |
| MQA | 1 | 2.1 GB | Noticeable loss | Fast decoders, small models |
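
The formula translates directly into a helper. This sketch uses the argument names from the exercise list and the LLaMA-3 8B shape quoted in this section; it prints decimal GB alongside binary GiB because the two are easy to confuse.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, d_head: int,
                   seq_len: int, batch: int, dtype_bytes: int = 2) -> int:
    """Bytes needed to cache keys and values (the leading 2) for every layer and token."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * dtype_bytes

# LLaMA-3 8B shape from the text: L=32, d_head=128, T=8192, B=16, fp16.
for name, n_kv in [("MHA", 32), ("GQA-8", 8), ("GQA-4", 4), ("MQA", 1)]:
    size = kv_cache_bytes(32, n_kv, 128, 8192, 16)
    print(f"{name:6s} {size / 1e9:6.1f} GB  ({size / 2**30:5.1f} GiB)")
```

For the GQA-8 row this evaluates to 17,179,869,184 bytes, which is exactly 16 GiB or about 17.2 GB.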

MLA — DeepSeek's further compression

Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2, pushes the idea further. Instead of storing full KV tensors in the cache, MLA compresses them into a low-rank latent vector that is much smaller than the full KV pair. The keys and values are reconstructed at attention time from these latent vectors via learned up-projection matrices. The cached latent is a fraction of the size of a GQA KV cache while recovering most of MHA's expressiveness. This is how DeepSeek-V3 achieves strong quality with very small cache footprints.
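
The core idea can be sketched in a few lines. This is a simplified illustration only, not DeepSeek's implementation: all sizes are made up, the weight names are hypothetical, and it leaves out the separate RoPE-carrying key component that the real design adds.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, d_head, d_latent, T = 512, 8, 64, 64, 10   # illustrative sizes only

W_down = rng.standard_normal((D, d_latent)) / np.sqrt(D)                   # compress
W_up_k = rng.standard_normal((d_latent, H * d_head)) / np.sqrt(d_latent)   # rebuild keys
W_up_v = rng.standard_normal((d_latent, H * d_head)) / np.sqrt(d_latent)   # rebuild values

x = rng.standard_normal((T, D))      # hidden states of T past tokens
latent_cache = x @ W_down            # (T, d_latent): the only tensor that is cached

# At attention time, per-head keys and values are reconstructed from the latent.
k = (latent_cache @ W_up_k).reshape(T, H, d_head)
v = (latent_cache @ W_up_v).reshape(T, H, d_head)

full_kv_per_token = 2 * H * d_head   # what MHA would cache per token per layer
print(latent_cache.shape[1], full_kv_per_token)   # 64 vs 1024 values
```

GQA shrinks the cache by storing fewer heads; this approach keeps all heads but stores a compressed representation from which they are rebuilt.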

Normalization & Activations

RMSNorm drops one of LayerNorm's two statistics. SwiGLU adds a multiplicative gate that punches above its parameter count.

Both changes predate LLaMA but were adopted together in the original LLaMA paper and have since become near-universal. They're relatively small increments over Day 7's baseline, but almost every modern model uses them, so let's understand each one precisely.

RMSNorm — drop the centering, keep the scale

LayerNorm (Day 7) normalizes each token vector to zero mean and unit variance, then re-scales with learned γ and β. RMSNorm drops the mean-subtraction step and the bias β entirely. You only divide by the root-mean-square of the activations, then scale:

LayerNorm(x) = γ · (x − μ) / √(σ² + ε) + β      # 2 stats, 2 learned vectors
RMSNorm(x)   = γ · x / √(mean(x²) + ε)          # 1 stat, 1 learned vector

Savings: one reduction over the vector instead of two, and one fewer D-dimensional parameter vector. The RMSNorm paper reports 7–64% lower running time, depending on the model.
Applied pre-norm (before attention and before FFN), every layer.

Why is dropping the mean safe? The RMSNorm paper's hypothesis is that LayerNorm's benefit comes from its re-scaling invariance rather than from re-centering, so the mean subtraction can go. Empirically, RMSNorm matches LayerNorm quality while being measurably faster. With two norm calls in every block, plus a final one, the savings add up over a forward pass.

Inference implication: norm calls are memory-bandwidth bound, not compute-bound. Removing the mean-computation reduces the number of passes over the activation tensor. At long sequence lengths (where activations are large) and when running on memory-bandwidth-limited hardware (consumer GPUs), this matters more than it looks.
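
A side-by-side NumPy sketch of the two norms makes the difference concrete. The function names are hypothetical, and `eps = 1e-5` is an assumed default.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Two statistics (mean and variance) and two learned vectors."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    """One statistic (mean of squares) and one learned vector."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 8)) + 3.0     # (B, T, D) with a deliberately nonzero mean
y = rms_norm(x, gamma=np.ones(8))
print(np.mean(y * y, axis=-1))               # close to 1 for every token
print(y.mean(axis=-1))                       # not zero: the mean is left alone
```

On inputs that already have zero mean, the two functions agree (with `beta = 0`), which is one way to see that RMSNorm only removes the centering step.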

SwiGLU — gated feed-forward network

The GPT-2 FFN was a two-matrix sandwich: project up to 4D, apply GELU, project back to D:

GELU FFN(x) = W_down · GELU(W_up · x) # 2 matrices, hidden = 4D

SwiGLU adds a third matrix and uses it as a multiplicative gate on the main path:

SwiGLU(x) = W_down · ( SiLU(W_gate · x) ⊙ (W_up · x) )

SiLU(z) = z · σ(z)      # Sigmoid-weighted Linear Unit (smooth gating)
⊙ = elementwise multiply

Gate path: W_gate projects x to hidden, then applies SiLU (≈ 0 for very negative inputs, ≈ z for large positive ones)
Value path: W_up projects x to hidden (unrestricted)
The two are multiplied elementwise, then projected back down.

Hidden-dim convention: h ≈ (8/3)·D, rounded up to a multiple of 64 or 256
Reason: 3 matrices × D × (8/3)·D = 8D² = 2 matrices × D × 4D in parameter count.

Why does gating help? The gate (SiLU output) acts as a soft switch that can suppress features that are irrelevant for the current token. This gives the FFN more selectivity: it can represent sharper, more context-dependent feature activations than a plain GELU FFN of equal parameter count. SwiGLU consistently improves perplexity at matched parameter budgets in Shazeer's ablations — not by a huge amount, but reliably.

[Figure: SwiGLU feed-forward block. x (B, T, D) feeds W_gate (D → h, followed by SiLU) and W_up (D → h) in parallel; their elementwise product passes through W_down (h → D) to give out (B, T, D). h ≈ (8/3)·D so that 3 × D×h ≈ 2 × D×4D. For D=4096: h=11008 (LLaMA-3 8B uses 14336); GELU FFN 2×4096×16384 = 134M params, SwiGLU 3×4096×11008 = 135M params.]
SwiGLU splits the input into two parallel projections: one gating path (through SiLU) and one value path. Their elementwise product selectively amplifies or suppresses features before the final projection down. Three matrices instead of two, so the hidden dimension is ~8/3·D to keep parameter parity.
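
The block and the parameter-parity argument fit in a short NumPy sketch. The rounding rule in `swiglu_hidden` (round up to a multiple of 256) is an assumption chosen to reproduce h = 11008 for D = 4096; the function names are hypothetical.

```python
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))             # z * sigmoid(z)

def swiglu_ffn(x, w_gate, w_up, w_down):
    """x: (..., D); w_gate, w_up: (D, h); w_down: (h, D)."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def swiglu_hidden(d_model: int, multiple_of: int = 256) -> int:
    """(8/3)*D rounded up to a multiple of `multiple_of`."""
    h = int(8 * d_model / 3)
    return multiple_of * ((h + multiple_of - 1) // multiple_of)

D = 4096
h = swiglu_hidden(D)
print(h)                                      # 11008
print(2 * D * 4 * D, 3 * D * h)               # GELU FFN vs SwiGLU parameter counts

rng = np.random.default_rng(0)
d, hid = 8, swiglu_hidden(8, multiple_of=4)   # tiny shapes for a quick forward pass
x = rng.standard_normal((2, 3, d))
y = swiglu_ffn(x, rng.standard_normal((d, hid)),
               rng.standard_normal((d, hid)), rng.standard_normal((hid, d)))
print(y.shape)                                # (2, 3, 8)
```

The two counts differ by under 1%, and the gap comes only from rounding the hidden size up.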

Inference implication of SwiGLU: the FFN is often the largest compute cost in a block (at batch size 1, it is the dominant memory-bandwidth consumer). SwiGLU doesn't change the asymptotic cost, but it does require a third matrix multiply, which changes how FFN layers tile onto hardware. Fused SwiGLU kernels (available in xformers, flash-attention, and vLLM's fused MLP) keep it efficient.

Mixture of Experts

Many FFN "experts", but each token uses only a few. Big capacity, small compute.

So far every change kept the model dense — every parameter participates in every token. Mixture of Experts (MoE) breaks that. It replaces the single FFN in a block with E parallel FFN "experts" plus a small router network. For each token, the router picks the top k experts (typically k=2 out of E=8 or more), and only those run. The token's output is a weighted combination of its chosen experts.

Think of it this way: a dense 47B-parameter model has one very large FFN per block. A 47B MoE model such as Mixtral has eight smaller FFN experts per block (about 5.6B parameters per expert when summed over all 32 blocks), but only two of them run per token. You get the knowledge capacity of the large model at roughly the compute cost of a 13B one.
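
The total-versus-active split can be worked out from the layer shapes. The numbers below assume Mixtral 8×7B's published shape (dim 4096, 32 blocks, GQA with 8 KV heads, expert hidden size 14336, vocabulary 32000, untied embeddings); treat them as an illustration rather than an official breakdown.

```python
# Assumed Mixtral 8x7B shape; all values are illustrative inputs to the arithmetic.
dim, n_layers, n_heads, n_kv_heads = 4096, 32, 32, 8
ffn_dim, vocab, n_experts, top_k = 14336, 32000, 8, 2

d_head = dim // n_heads
expert = 3 * dim * ffn_dim                              # one SwiGLU expert in one block
attn = 2 * dim * dim + 2 * dim * n_kv_heads * d_head    # W_q, W_o plus W_k, W_v
router = dim * n_experts
# Everything that every token uses: attention, routers, norms, embeddings, LM head.
shared = n_layers * (attn + router + 2 * dim) + 2 * vocab * dim + dim

total = shared + n_layers * n_experts * expert
active = shared + n_layers * top_k * expert

print(f"per expert, all blocks: {n_layers * expert / 1e9:.1f}B")   # 5.6B
print(f"total:  {total / 1e9:.1f}B")                               # 46.7B
print(f"active: {active / 1e9:.1f}B")                              # 12.9B
```

Only the FFN is replicated eight times, which is why the total is about 47B rather than 8 × 7B = 56B.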

router_logits = x · W_router                      # (B×T, E): one logit per expert
top-k experts = top_k(softmax(router_logits), k)
gate_weights  = renormalize(top-k probs)          # weights sum to 1
y = Σ_{e ∈ top-k} gate_e · Expert_e(x)            # only k experts compute

Example: Mixtral 8×7B, E=8 experts, k=2 active per token.
Total parameter count: ~47B, not 8×7B = 56B, because only the FFNs are replicated; attention and embeddings are shared.
Active per token: ~13B (2 experts plus the shared layers) ≈ the compute cost of a dense 13B model, with 47B parameters of capacity.
[Figure: Mixture-of-Experts: sparse routing through FFN experts. A token x (D,) enters the router, softmax(x·W_r) → (E,) probabilities; the top-k experts are active and the rest idle (k=2 of E=4 shown); the output (D,) is their gate-weighted sum. Compute vs memory: pays k experts' FLOPs per token but holds all E experts in VRAM. Mixtral 8×7B: 47B total params, ~13B active per token.]
The MoE router sends each token to its top-k experts (solid lines) while the rest stay idle (dashed). Only the active experts incur compute — but all expert weights must reside in memory. MoE decouples model capacity (total parameters) from per-token compute (active parameters).
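
The routing equations can be sketched as follows. This is a deliberately slow, loop-based NumPy illustration with hypothetical names; each toy expert is a single linear map standing in for a full SwiGLU FFN, and real systems batch tokens per expert instead of looping.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(x, w_router, experts, k=2):
    """x: (N, D) tokens; w_router: (D, E); experts: list of E callables (D,) -> (D,)."""
    probs = softmax(x @ w_router)                        # (N, E) router probabilities
    top_idx = np.argsort(-probs, axis=-1)[:, :k]         # (N, k) chosen experts
    top_p = np.take_along_axis(probs, top_idx, axis=-1)
    gates = top_p / top_p.sum(axis=-1, keepdims=True)    # renormalize so gates sum to 1
    y = np.zeros_like(x)
    for n in range(x.shape[0]):                          # loop for clarity, not speed
        for slot in range(k):
            e = top_idx[n, slot]
            y[n] += gates[n, slot] * experts[e](x[n])    # only k experts run per token
    return y, top_idx, probs

rng = np.random.default_rng(0)
N, D, E = 6, 8, 4
weights = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(E)]
experts = [lambda t, w=w: t @ w for w in weights]        # toy experts
y, top_idx, probs = moe_forward(rng.standard_normal((N, D)),
                                rng.standard_normal((D, E)), experts)
print(y.shape, top_idx[0])
```

All `E` weight matrices exist in memory, yet each token touches only `k` of them, which is the capacity-versus-compute decoupling in miniature.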

Load balancing: why you need the auxiliary loss

Without any constraint, the router quickly learns to route most tokens to a small subset of experts. This is catastrophically bad: the other experts don't get gradients, don't specialize, and become wasted parameters. The standard fix is an auxiliary load-balancing loss (introduced in Switch Transformer):

L_lb = α · E · Σ_e f_e · p_e

f_e = fraction of tokens routed to expert e (from the top-k selection)
p_e = mean softmax probability assigned to expert e (differentiable)
α = small coefficient (0.01 typical)

This loss is minimized when all experts receive equal probability mass and equal token assignments, which encourages balanced, diverse expert utilization.

The load-balancing loss does not dominate training (the coefficient α is small), but without it the router can collapse onto a few experts early in training. With it, all experts stay in use and specialize to some degree, although the specialization observed in practice tends to follow token-level or syntactic patterns more than clean topics.
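
A small NumPy sketch shows how the loss separates balanced from collapsed routing. It assumes `f_e` is normalized over all routing slots so that it sums to 1 for any k (one of several conventions), and `load_balance_loss` is a hypothetical name.

```python
import numpy as np

def load_balance_loss(probs: np.ndarray, top_idx: np.ndarray, alpha: float = 0.01) -> float:
    """alpha * E * sum_e f_e * p_e for router probs (N, E) and chosen experts (N, k)."""
    n_tokens, n_experts = probs.shape
    counts = np.bincount(top_idx.ravel(), minlength=n_experts)
    f = counts / top_idx.size          # fraction of routing slots sent to each expert
    p = probs.mean(axis=0)             # mean router probability per expert
    return float(alpha * n_experts * np.sum(f * p))

E, N = 4, 8
uniform_probs = np.full((N, E), 1.0 / E)
balanced_idx = np.arange(N)[:, None] % E                   # tokens spread evenly, k=1
print(load_balance_loss(uniform_probs, balanced_idx))      # about 0.01 (alpha * 1.0)

collapsed_probs = np.tile([0.97, 0.01, 0.01, 0.01], (N, 1))
collapsed_idx = np.zeros((N, 1), dtype=int)                # every token picks expert 0
print(load_balance_loss(collapsed_probs, collapsed_idx))   # about 0.0388 (alpha * 4 * 0.97)
```

The balanced case sits at the minimum value α, and the collapsed case is almost E times larger, which is the pressure that keeps the router spread out.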

Inference implications of MoE — the headaches

  • All experts must be in memory. You save FLOPs, not VRAM. Every expert's weights sit in GPU memory even though most are idle for any given token. Mixtral 8×7B needs the same memory as a 47B dense model (about 94 GB of weights in fp16), even though you're computing at 13B speed. This is why MoE serving usually requires model parallelism, typically expert parallelism where different GPUs hold different experts.
  • Routing is dynamic and irregular. Unlike dense models where every layer processes every token identically, MoE routing produces different experts for different tokens. This complicates batching: tokens in the same batch may need different experts, causing load imbalance across GPUs. Training systems bound the imbalance with capacity factors and token dropping; at serving time dropped tokens degrade output quality, so inference engines generally have to absorb the imbalance in their batching and kernels instead.
  • Expert parallelism adds communication overhead. With experts on different GPUs, you need all-to-all communication to dispatch tokens to the right GPU and collect results. This is the dominant overhead in large MoE serving at high batch sizes.

GPT-4 is widely believed to be a mixture-of-experts model — unconfirmed estimates suggest ~1.8T total parameters across ~16 experts with ~220B active per token. Whether or not the exact numbers are right, the pattern is now standard at the frontier: Mixtral, DeepSeek-V3, Qwen-MoE, and Grok are all MoE. Conditional computation is the mechanism by which you build a "trillion-parameter" model you can actually afford to run. DeepSeek-V3's fine-grained MoE uses 256 experts per layer with only top-8 active — taking the idea much further than Mixtral's 8×7B.

Beyond Attention — SSMs and Hybrid Models

Attention is quadratic in sequence length. State-space models try to be linear.

Attention's KV cache is one inference tax; its quadratic compute cost is another. At sequence length T, computing full attention requires O(T²·d) FLOPs, plus O(T²) memory if the attention matrix is materialized. At T=128K tokens, this is expensive: 16× the attention FLOPs of T=32K. The KV cache removes redundant recomputation during token-by-token generation (each step is O(T·d)), but prefill (processing the prompt) still runs quadratic attention over the whole sequence.

[Figure: Compute cost vs sequence length, attention vs SSM. x-axis: sequence length (0 to 128K tokens); y-axis: relative cost. Attention prefill grows as O(T²), SSM / linear attention as O(T); the gap widens rapidly beyond 32K.]
Attention's prefill cost scales quadratically with sequence length — doubling the sequence quadruples the cost. State-space models like Mamba process sequences in linear time and constant recurrent state, making them attractive for very long contexts. The tradeoff: they compress history into a fixed-size state, which may miss long-range details that attention retrieves exactly.
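
The scaling gap is easy to tabulate. This sketch counts only the two T×T matrix products of one attention layer and compares them with a toy linear-cost model; `linear_macs` and its `d_state = 16` default are assumptions for illustration, not a model of any specific SSM kernel.

```python
def attention_macs(seq_len: int, d_model: int) -> int:
    """Multiply-accumulates for QK^T plus the weighted sum over V (no causal-mask savings)."""
    return 2 * seq_len * seq_len * d_model

def linear_macs(seq_len: int, d_model: int, d_state: int = 16) -> int:
    """Toy O(T) cost model: one d_model x d_state state update and readout per token."""
    return 2 * seq_len * d_model * d_state

D = 4096
base = attention_macs(8_192, D)
for T in (8_192, 32_768, 131_072):
    print(f"T={T:>7}  attention x{attention_macs(T, D) / base:6.0f}   "
          f"linear x{linear_macs(T, D) / linear_macs(8_192, D):4.0f}")
```

Going from 8K to 128K multiplies the attention term by 256 but the linear term by only 16.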

State-Space Models (SSM) and Mamba

State-space models model a sequence as a linear dynamical system: a hidden state h_t is updated recurrently as each new input x_t arrives. At inference time this is O(1) per step (just update the state); at training time SSMs can be parallelized via parallel scan (O(T log T) or O(T) depending on implementation).

SSM update rule (simplified):

h_t = A · h_{t-1} + B · x_t      # state update (linear)
y_t = C · h_t                    # output

A, B, C are learned matrices (often structured for efficiency).
The state h_t has fixed size d_state; it compresses all history.
This is exactly an RNN, but with careful design it trains stably.
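
The recurrence can be run directly to see that the state never grows. This is a plain sequential sketch with fixed toy matrices (a scaled identity for `A` is an assumption made for stability); it is not Mamba's input-dependent parameterization or its parallel scan.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Run h_t = A h_{t-1} + B x_t and y_t = C h_t over a sequence.

    x: (T, d_in); A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state).
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                  # one fixed-cost update per token
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys), h

rng = np.random.default_rng(0)
d_in, d_state, d_out = 4, 16, 4
A = 0.9 * np.eye(d_state)          # stable toy dynamics
B = rng.standard_normal((d_state, d_in))
C = rng.standard_normal((d_out, d_state))

y_short, h_short = ssm_scan(rng.standard_normal((10, d_in)), A, B, C)
y_long, h_long = ssm_scan(rng.standard_normal((1000, d_in)), A, B, C)
print(y_long.shape, h_short.shape, h_long.shape)   # the state stays (16,) regardless of T
```

Whatever the model needs to remember about 1000 tokens has to fit in the same 16 numbers that held 10 tokens, which is both the efficiency win and the recall limitation.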

Mamba (Gu & Dao, 2023) makes SSM parameters input-dependent (selective) and adds hardware-efficient parallel scan kernels. It achieves competitive quality with Transformers at sub-quadratic cost on long sequences. Mamba-2 refines the state-space structure further.

Why attention still dominates

Despite the theoretical appeal, SSMs have not displaced attention in production. The reasons are practical:

  • Recall fidelity. Attention retrieves exact past tokens from the KV cache. An SSM compresses history into a fixed-size state — it may forget or blur content that attention would retrieve perfectly.
  • Quality gap. At the scales frontier models operate at, Transformer quality remains higher for most tasks. SSMs tend to underperform on in-context learning and retrieval-heavy benchmarks.
  • Hybrid models. The most promising direction is hybrid architectures: interleave attention layers (for precise recall) with SSM/MLP layers (for efficient processing). Jamba (AI21) and Zamba use this approach.

For inference engineers: if you serve a Mamba or hybrid model, the KV cache is replaced by or supplemented with a recurrent state cache: a fixed-size state per layer and per sequence. Its size is O(1) in sequence length, so it does not grow as more tokens are generated, which is attractive for very long generation runs.

Putting It Together

Read a real config and map every field to a concept.

Here is the shape of a modern model config (LLaMA-3 8B style). You now know what every line means.

dim:                4096      # d_model (D)
n_layers:           32        # decoder blocks
n_heads:            32        # query heads (H)
n_kv_heads:         8         # GQA: 8 KV groups → 4× smaller KV cache
vocab_size:         128256    # bigger tokenizer than GPT-2's 50k
ffn_dim:            14336     # SwiGLU hidden = 3.5·D here, wider than the (8/3)·D parity value
norm:               RMSNorm   # not LayerNorm
norm_eps:           1e-5
position:           RoPE      # rotary, not learned absolute
rope_theta:         500000    # large base → better long-context extrapolation
activation:         SiLU      # the gate nonlinearity in SwiGLU
tied_embeddings:    false     # separate LM head weights
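
| Family | Pos. Encoding | Norm | Attention | FFN | Sparse? |
|---|---|---|---|---|---|
| GPT-2 (2019) | Learned abs. | LayerNorm | MHA | GELU 4D | Dense |
| LLaMA 1/2 | RoPE (base 10K) | RMSNorm | GQA (LLaMA 2 70B only) | SwiGLU 8/3D | Dense |
| LLaMA 3 | RoPE (base 500K) | RMSNorm | GQA-8 | SwiGLU (h=14336 for 8B) | Dense |
| Mistral 7B | RoPE | RMSNorm | GQA + sliding window | SwiGLU | Dense |
| Mixtral 8×7B | RoPE | RMSNorm | GQA | SwiGLU experts | MoE (8, top-2) |
| DeepSeek-V3 | RoPE | RMSNorm | MLA | SwiGLU experts | MoE (256, top-8) |
| Jamba (hybrid) | None explicit (Mamba layers carry order) | RMSNorm | Attn + Mamba | SwiGLU / MoE | Hybrid + MoE |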

The columns are today's lesson. The direction of travel is unmistakable: every new model pushes further toward smaller KV-cache footprints (GQA → MLA), more capacity per compute dollar (dense → MoE), and longer context at manageable cost (larger RoPE base, NTK/YaRN). These are inference engineering problems wearing a training-time hat.
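
As a cross-check, the derived quantities of the LLaMA-3 8B style config can be computed from its fields. This sketch assumes no bias terms and two RMSNorm gain vectors per block plus a final norm; the dictionary keys mirror the config shown earlier.

```python
cfg = dict(dim=4096, n_layers=32, n_heads=32, n_kv_heads=8,
           vocab_size=128256, ffn_dim=14336, tied_embeddings=False)

d_head = cfg["dim"] // cfg["n_heads"]                 # 128
kv_dim = cfg["n_kv_heads"] * d_head                   # 1024: width of the K and V projections
group_size = cfg["n_heads"] // cfg["n_kv_heads"]      # 4 query heads share each KV head

attn = 2 * cfg["dim"] * cfg["dim"] + 2 * cfg["dim"] * kv_dim   # W_q, W_o plus W_k, W_v
ffn = 3 * cfg["dim"] * cfg["ffn_dim"]                           # W_gate, W_up, W_down
norms = 2 * cfg["dim"]                                          # two RMSNorm gains per block
embed = cfg["vocab_size"] * cfg["dim"]
n_embed_matrices = 1 if cfg["tied_embeddings"] else 2           # input table and LM head

total = cfg["n_layers"] * (attn + ffn + norms) + n_embed_matrices * embed + cfg["dim"]
kv_bytes_per_token = 2 * cfg["n_layers"] * kv_dim * 2           # K and V, fp16

print(f"d_head={d_head}, group_size={group_size}")
print(f"parameters = {total / 1e9:.2f}B")                           # 8.03B
print(f"KV cache per token = {kv_bytes_per_token / 1024:.0f} KiB")  # 128 KiB
```

The parameter total recovers the model's "8B" name, and the 128 KiB per token figure is the number to multiply by context length and batch size when sizing a deployment.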

Exercise

Eight exercises, all in the notebook.

Companion notebook: day-12-modern-architectures.ipynb.

  1. Implement RoPE. Write build_rope_cache and apply_rope(x, positions) for a (B, H, T, d_head) tensor. Verify the relative-position property numerically: the q·k score at positions (m, n) must equal that at (m+s, n+s) for any shift s.
  2. RoPE frequency spectrum. Plot the rotation angle m·θ_i for several values of i as a function of position m. Observe the high-frequency / low-frequency structure (like a Fourier basis).
  3. Implement RMSNorm and SwiGLU as nn.Modules. Confirm that SwiGLU at hidden ≈8/3·D has about the same parameter count as a GELU FFN at 4D. Measure forward-pass time for both.
  4. Implement GQA. Generalize Day 9 attention to n_kv_heads < n_heads by repeating KV heads. Confirm it reduces to MHA when n_kv_heads == n_heads and MQA when == 1.
  5. KV-cache arithmetic. Write kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, dtype_bytes). Print a table for MHA vs GQA-8 vs MQA for a LLaMA-3 8B-shaped model at T=8K and T=128K, batch sizes 1 and 16.
  6. Build a toy MoE layer. Implement router + E expert FFNs with top-k routing. Print the expert usage fraction per batch and compute the load-balancing loss. Show that skewing router weights causes collapse; the lb loss penalizes this.
  7. Plot attention FLOPs vs sequence length. Compute and plot the number of multiply-accumulates for full attention at T ∈ [1K, 2K, 4K, 8K, 16K, 32K, 128K] against the linear SSM baseline.
  8. Upgrade your Day 9 GPT to LLaMA-mini. Swap in RMSNorm, RoPE, SwiGLU, and GQA. Retrain on TinyShakespeare and compare the loss curve to the GPT-2-style baseline. Report whether the architecture change helps, hurts, or is neutral at this tiny scale.
Self-Check

Ten questions before moving on.

Close the page and answer from memory. If you cannot, re-read the relevant section.

  1. What property makes RoPE "relative", and to which tensors is it applied? Why is v left unrotated?
  2. What does increasing the RoPE base from 10000 to 500000 do to the rotation frequencies, and why does that help with long contexts?
  3. How does ALiBi encode relative position, and what is the inference advantage over learned absolute position embeddings?
  4. Rank MHA, GQA, MQA by KV-cache size. For H=32 query heads and G=8 KV groups, how much smaller is the GQA cache vs MHA?
  5. Write the formula for KV-cache bytes and compute it for LLaMA-3 8B (L=32, n_kv=8, d_head=128) at T=8192, B=16 in fp16.
  6. What does MLA compress, and how does it differ from GQA's approach to reducing the cache?
  7. Why does SwiGLU use a hidden dimension of ~8/3·D instead of 4D?
  8. In MoE, what is decoupled from what? For Mixtral 8×7B (E=8, k=2), roughly what are the total and active parameter counts?
  9. Why does MoE save compute but not memory, and what does that mean for multi-GPU serving?
  10. What is the load-balancing loss, why is it needed, and what goes wrong without it?

"Most 'modern architecture' is inference engineering wearing a training hat. GQA, RoPE's long-context extensions, and MoE exist because someone has to serve the thing — on hardware that exists today."

Day 12 · Modern architectures
Further Reading

Go deeper.

The papers and code behind each upgrade.

  • Touvron et al., LLaMA (paper, 2023). RMSNorm + RoPE + SwiGLU in one influential package; GQA joined the recipe with LLaMA 2. The canonical reference for the modern dense transformer recipe.
  • Su et al., RoFormer (RoPE) (paper, 2021). The rotary position embedding, derived and motivated. Includes the proof that the attention score is a function of relative position only.
  • Ainslie et al., GQA (paper, 2023). Grouped-query attention: the principled interpolation between MHA and MQA, with ablations at T5 scale.
  • Shazeer, Multi-Query Attention (paper, 2019). The original "share one KV head" idea, motivated by fast autoregressive decoding.
  • Shazeer, GLU Variants (SwiGLU) (paper, 2020). "GLU Variants Improve Transformer." Where SwiGLU comes from: an empirical ablation over gating functions.
  • Jiang et al., Mixtral of Experts (paper, 2024). Sparse MoE at 8×7B: 47B total, 13B active. The canonical modern MoE reference.
  • Shazeer et al., Sparsely-Gated MoE (paper, 2017). The original mixture-of-experts layer with top-k routing and auxiliary balancing losses.
  • Gu & Dao, Mamba (paper, 2023). Selective state-space model with hardware-efficient parallel scan. The leading SSM alternative to Transformers.
  • Peng et al., YaRN: Long-Context Extension (paper, 2023). NTK-aware and YaRN scaling of RoPE for long-context extension without full retraining.
  • modeling_llama.py (Hugging Face repo). Production reference: RoPE + RMSNorm + SwiGLU + GQA in one readable file. Cross-reference with today's derivations.
  • DeepSeek-V2, MLA (paper, 2024). Multi-Head Latent Attention: low-rank KV compression that reduces KV cache size far beyond GQA while recovering most of MHA quality.