LLM Inference Engineer · Day 01
Day 01 · Week 1 · Foundations

Math You Actually Need

You don't need to be a mathematician — you need to be fluent with a small set of operations. The way you're fluent with for-loops. Every symbol decoded the first time it appears.

Time: ~120 min
Difficulty: Gentle
Prerequisite: Python comfort
Why This Lesson

A small set of operations, used everywhere.

Every modern LLM is built from a surprisingly small set of mathematical operations, repeated at enormous scale. You do not need a degree in mathematics to understand them. You need to be fluent with a handful of ideas — the way you are fluent with for-loops and dictionaries — so that when you read transformer code or a research paper, the symbols read as plainly as Python.

Today we build that fluency from the ground up. We decode every symbol the first time it appears, work each idea through a concrete numerical example before stating the general rule, and connect each piece directly to where it shows up inside an LLM. By the end you will have traced data all the way through a forward pass — from raw text to a probability distribution over the vocabulary — thinking in shapes the whole way.

This is the foundation the entire rest of Week 1 stands on. Take your time with it.

Learning objectives

  1. Read the standard math notation used in ML papers — sets, subscripts, sums, derivatives, and gradients — and translate each symbol to its Python equivalent.
  2. Compute dot products and matrix multiplications by hand, and predict the output shape of any matmul from its input shapes.
  3. Explain a derivative as a slope and apply the chain rule to a composed function — the rule that makes backpropagation possible.
  4. Turn a vector of logits into a probability distribution with softmax, including the max-subtraction trick for numerical stability.
  5. Derive cross-entropy loss from intuition and connect it to the six-step LLM training loop.
  6. Trace the shapes of an LLM forward pass end to end, from (B, T) token IDs to (B, T, V) logits.
Notation Cheatsheet

Decoding every symbol you'll see today.

If you've never read a math paper, the symbols can look intimidating. They aren't. Here's everything you'll see in this lesson.

Variable names — just like in code:

  • x, y, a, W, θ — names. Lowercase = number or vector. Uppercase = matrix. Greek letters = often parameters or angles.

Sets of numbers:

  • ℝ — "the reals." All real numbers — Python float. So 5.0, -3.14, π are in ℝ.
  • ℕ — natural numbers (0, 1, 2, ...). Python int (non-negative).
  • ℤ — all integers, positive and negative.

Set membership:

  • ∈ — "is in" / "belongs to." Same as Python's in.
  • x ∈ ℝ — "x is a real number."
  • x ∈ ℝ^d — "x is a vector of d real numbers."
  • W ∈ ℝ^{m×n} — "W is an m-by-n matrix."

Subscripts and big operators:

  • xᵢ ("x sub i") — the i-th entry of vector x.
  • Σ (sigma) — sum. Same as Python sum(...).
  • Π (pi) — product. Same as math.prod(...).

Calculus and probability:

  • df/dx — derivative of f w.r.t. x. The slope of f at x.
  • ∂f/∂x ("partial f, partial x") — partial derivative when f has multiple inputs.
  • ∇f ("del f" or "gradient of f") — vector of all partial derivatives.
  • 𝔼[X] — expected value (mean).
  • exp(x) = e^x, log(x) = natural log.
  • ≈ — approximately equals.

You don't need to memorize this. As we encounter each symbol below, it'll be re-explained in context.
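As a rough translation of the cheatsheet into code, the subscript and the two big operators map onto plain Python. This is a minimal sketch; the list `x` and its values are made up for illustration.

```python
import math

x = [3.0, 1.0, 4.0]          # a vector in R^3 (illustrative values)

# x_i: the i-th entry (Python counts from 0)
first_entry = x[0]

# Sigma_i x_i: sum over every entry
total = sum(x)

# Pi_i x_i: product over every entry
product = math.prod(x)

# x in R^d: "x is a vector of d real numbers"
d = len(x)
is_real_vector = all(isinstance(v, float) for v in x)

print(first_entry, total, product, d, is_real_vector)
```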

The matrix algebra behind this notation (including entries written with subscripts, A_ij) was set out by Arthur Cayley in 1858, nearly a century before electronic computers existed.

Linear Algebra

Vectors, matrices, and dot products.

A vector is a 1-dimensional array of numbers. Like a Python list of floats:

x = [3.0, 1.0, 4.0]   # a vector with 3 entries

The number of entries is the vector's dimension. The example above has dimension 3, so x ∈ ℝ^3 — "x is a vector of 3 real numbers."

A matrix is a 2-dimensional array — a rectangular grid. Like a list of lists:

W = [[1, 2, 3],
     [4, 5, 6]]      # 2 rows, 3 columns

Shape (2, 3). Notation: W ∈ ℝ^{2×3}. In ML, weights are matrices.

Dot product. A single number measuring how "aligned" two vectors are. Multiply matching entries, sum them up.

Concrete example. Take a = [1, 2, 3], b = [4, 5, 6]:

a · b = (1 × 4) + (2 × 5) + (3 × 6) = 4 + 10 + 18 = 32

In Python: sum(a[i] * b[i] for i in range(len(a))). In math notation:

a · b = Σᵢ aᵢ bᵢ

Decoding: Σ = sum, i = index variable, aᵢ = i-th entry of a. So this reads "for every index i, multiply aᵢ × bᵢ and add them up." Exactly what we did.

Why we care: dot products measure similarity. Same direction → big positive. Opposite directions → big negative. Perpendicular → zero. Attention is essentially a giant pile of dot products.
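The hand computation and the three similarity cases above can be checked in a few lines. A minimal sketch in plain Python; the helper name `dot` and the 2-entry test vectors are chosen here for illustration.

```python
def dot(a, b):
    """Dot product: multiply matching entries, then sum."""
    assert len(a) == len(b), "vectors must have the same dimension"
    return sum(ai * bi for ai, bi in zip(a, b))

a = [1, 2, 3]
b = [4, 5, 6]
print(dot(a, b))              # 32, matching the hand computation

# The sign tracks how the two directions relate:
print(dot([1, 0], [2, 0]))    # 2: same direction, positive
print(dot([1, 0], [-2, 0]))   # -2: opposite directions, negative
print(dot([1, 0], [0, 3]))    # 0: perpendicular
```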

The word "softmax" was coined because it's a soft version of argmax — instead of crisply picking the maximum, it gives a smooth probability distribution. Lower temperature recovers argmax; higher temperature approaches uniform.

Linear Algebra

Matrix multiplication, by hand.

This is the operation that powers every modern neural network.

Concrete example. Take two matrices:

A = [[1, 2],     B = [[5, 6],
     [3, 4]]         [7, 8]]

Both are shape (2, 2). The product C = A @ B (we use @ for matmul, same as Python/NumPy) is computed entry-by-entry. Each entry of C is a dot product of a row of A with a column of B.

C[0,0] = [1, 2] · [5, 7] = 5 + 14  = 19
C[0,1] = [1, 2] · [6, 8] = 6 + 16  = 22
C[1,0] = [3, 4] · [5, 7] = 15 + 28 = 43
C[1,1] = [3, 4] · [6, 8] = 18 + 32 = 50

C = [[19, 22],
     [43, 50]]

The general rule. If A is shape (m, k) and B is shape (k, n), then A @ B has shape (m, n). The k's must match — that's the dimension we sum along.

  A    @    B    →    C
(m, k)   (k, n)    (m, n)
    ↑     ↑
    these must match (the k cancels)

Note: matmul is NOT commutative. A @ B ≠ B @ A in general. For non-square matrices, the shapes often won't even allow the reversed product. Order matters.
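To make the row-times-column rule and the shape rule concrete, here is a minimal sketch of matmul written out in plain Python so every loop is visible. The helper name `matmul` and the matrices `X` and `Y` are illustrative; in practice NumPy's `@` does the same job.

```python
def matmul(A, B):
    """Multiply an (m, k) matrix by a (k, n) matrix, giving (m, n)."""
    m, k = len(A), len(A[0])
    k2, n = len(B), len(B[0])
    assert k == k2, f"inner dimensions must match: {k} vs {k2}"
    # C[i][j] is the dot product of row i of A with column j of B
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))   # [[19, 22], [43, 50]]
print(matmul(B, A))   # [[23, 34], [31, 46]], a different result

# Shape rule: (2, 3) @ (3, 4) -> (2, 4)
X = [[1, 0, 2], [0, 1, 3]]
Y = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
Z = matmul(X, Y)
print(len(Z), len(Z[0]))   # 2 4
```

Calling `matmul(Y, X)` trips the assertion, because the inner dimensions (4 and 2) do not match.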

Why matmuls dominate ML. A linear layer is y = Wx + b. A transformer is hundreds of these stacked. GPUs are matmul machines — most of the silicon is dedicated to making matrix multiplication fast. An H100 can do nearly 1,000 trillion FP16 operations per second on matmul (the newer Blackwell B200 more than doubles that). Optimizing matmuls is, almost literally, optimizing ML.

Why this matters for inference. At serving time, inference is almost entirely matmul. For every generated token, the model executes a forward pass dominated by matrix multiplications against the weight matrices — typically consuming >90% of wall-clock time. There are two bottlenecks: compute (how fast you can do FLOPs) and memory bandwidth (how fast you can read the weight matrices from GPU HBM). For large models running batch size 1, bandwidth is the bottleneck — the weights are read once per token. Increasing batch size amortizes that cost and shifts back toward compute-bound. Everything in Weeks 3–4 (quantization, KV-cache, continuous batching, FlashAttention) is fundamentally about attacking one of these two bottlenecks.
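The two bottlenecks can be compared with back-of-the-envelope arithmetic. The sketch below is a rough model, not a benchmark: the 7B parameter count, the 3.35 TB/s memory bandwidth, and the rule of thumb of about 2 FLOPs per parameter per token are all assumptions chosen for illustration, and it ignores KV-cache reads, activations, and kernel overheads.

```python
params = 7e9                 # assumed model size (7B parameters)
bytes_per_param = 2          # FP16
hbm_bandwidth = 3.35e12      # bytes/s, assumed (roughly H100-class HBM)
peak_flops = 1e15            # FLOP/s, the "nearly 1,000 trillion" figure above

weight_bytes = params * bytes_per_param
flops_per_token = 2 * params     # approximation: one multiply + one add per weight

def step_time(batch_size):
    """Return (memory time, compute time) in seconds for one decode step."""
    t_memory = weight_bytes / hbm_bandwidth                 # weights are read once per step
    t_compute = batch_size * flops_per_token / peak_flops   # work grows with batch size
    return t_memory, t_compute

for B in (1, 64, 512):
    t_mem, t_cmp = step_time(B)
    bound = "bandwidth" if t_mem > t_cmp else "compute"
    print(f"B={B:4d}  memory {t_mem * 1e3:.2f} ms  compute {t_cmp * 1e3:.3f} ms  -> {bound}-bound")
```

Under these assumptions the memory time stays fixed while the compute time scales with the batch, which is the amortization effect described above.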

The Greek letter Σ (sigma) for sum was first used by Leonhard Euler in 1755. Much of the math powering modern AI is older than James Watt's steam engine.

Calculus

A derivative is a slope. That's it.

A derivative answers one question: if I change the input a tiny bit, how much does the output change? It's the slope.

Concrete example. Take f(x) = x². At x = 3, f(3) = 9. What's the slope right there?

Just measure it:

x = 3.000  →  f(x) = 9.000000
x = 3.001  →  f(x) = 9.006001

Change in f: 0.006001
Change in x: 0.001
Slope ≈ 0.006001 / 0.001 = 6.001

The exact derivative is 6. The general rule: if f(x) = x², then f'(x) = 2x. At x = 3: 2 × 3 = 6. ✓
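The "just measure it" step above is a finite difference, and it takes only a few lines to reproduce. A minimal sketch; the helper name `slope` and the step sizes are chosen here for illustration.

```python
def f(x):
    return x ** 2

def slope(f, x, h=1e-3):
    """Forward finite difference: (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h

for h in (1e-1, 1e-2, 1e-3, 1e-4):
    print(h, slope(f, 3.0, h))   # approaches the exact derivative 6 as h shrinks
```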

Notation (three ways, same thing):

  • df/dx — Leibniz way. Most common in ML papers.
  • f'(x) — Lagrange way. "f prime of x."
  • (d/dx)[f(x)] — operator way.

Common derivatives to recognize:

If f(x) = | then f'(x) =
c (constant) | 0
x | 1
x² | 2x
xⁿ | n·xⁿ⁻¹
exp(x) | exp(x) (its own derivative!)
log(x) | 1/x

Partial derivatives — when functions have multiple inputs. Notation: ∂f/∂x instead of df/dx. The "∂" is a stylized "d" called "partial."

For f(x, y) = x² + 3xy:

  • ∂f/∂x — treat y as a constant: 2x + 3y
  • ∂f/∂y — treat x as a constant: 3x

Gradient. Stack all the partials into one vector:

∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]

Read ∇f as "del f" or "gradient of f." Geometric meaning: it points uphill — the direction of steepest increase. Move opposite the gradient to go downhill — that's gradient descent in one line: params -= learning_rate * gradient.
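The partials of f(x, y) = x² + 3xy and the one-line gradient descent update can be checked numerically. A minimal sketch; the starting point (1, 2) and the learning rate 0.01 are arbitrary choices for illustration.

```python
def f(x, y):
    return x**2 + 3*x*y

def grad_f(x, y):
    """Analytic gradient from the partials above: [2x + 3y, 3x]."""
    return [2*x + 3*y, 3*x]

def numeric_grad(f, x, y, h=1e-6):
    """Central differences, nudging one input at a time."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2*h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2*h)
    return [dfdx, dfdy]

x, y = 1.0, 2.0
g = grad_f(x, y)
print(g, numeric_grad(f, x, y))     # both close to [8, 3]

# One gradient descent step: params -= learning_rate * gradient
lr = 0.01
x_new, y_new = x - lr * g[0], y - lr * g[1]
print(f(x, y), f(x_new, y_new))     # the second value is smaller
```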

The chain rule was first written down by Leibniz in 1676. It would take 300 years for it to become "the most important rule in machine learning" — but it was waiting there the whole time.

Calculus

The chain rule — by example, then abstract.

The chain rule tells you how to differentiate composed functions. It is the rule that makes backpropagation possible.

Concrete example first. Take y = (3x + 1)². What's dy/dx?

Method 1: expand. y = 9x² + 6x + 1, so dy/dx = 18x + 6 = 6(3x + 1).

Method 2: chain rule. Decompose into two simpler functions:

Let u = 3x + 1   (inner function)   →  du/dx = 3
Then y = u²      (outer function)   →  dy/du = 2u

Multiply: dy/dx = (dy/du) · (du/dx) = 2u · 3 = 6u = 6(3x + 1) ✓ (same as method 1)

The general rule:

If y = f(g(x)), then dy/dx = (df/dg) · (dg/dx)

For longer chains, derivatives just multiply along: dy/dx = (df/dg) · (dg/dh) · (dh/dx).
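Both methods for y = (3x + 1)² can be compared against a numerical slope. A minimal sketch; the evaluation point x = 2 and the step size are arbitrary choices.

```python
def u(x):                 # inner function
    return 3*x + 1

def y(x):                 # outer function applied to the inner one
    return u(x) ** 2

def dy_dx_chain(x):
    du_dx = 3             # derivative of the inner function
    dy_du = 2 * u(x)      # derivative of the outer function, evaluated at u
    return dy_du * du_dx

def dy_dx_expanded(x):
    return 18*x + 6       # from expanding to 9x^2 + 6x + 1

x0 = 2.0
h = 1e-6
numeric = (y(x0 + h) - y(x0 - h)) / (2*h)
print(dy_dx_chain(x0), dy_dx_expanded(x0), numeric)   # all about 42
```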

How this becomes backprop. A neural network is a long chain of composed functions:

input → linear1 → ReLU → linear2 → softmax → loss

Each arrow is a function. The whole network is one big composition. Backpropagation is just the chain rule applied to this composition, computed efficiently in reverse order. Day 3 we do it by hand. Day 4 we generalize. Today, just remember: every weight update in every neural network ever trained comes from applying the chain rule.
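To show the "chain rule in reverse" idea on the smallest possible case, here is a sketch of a three-step chain (linear, ReLU, loss) with a hand-written backward pass. The single weight `w`, the target `t`, and the squared-error loss are stand-ins invented for this example; a real network uses matrices and softmax with cross-entropy.

```python
def forward(w, x, t):
    a = w * x                 # linear
    r = max(0.0, a)           # ReLU
    loss = (r - t) ** 2       # squared error (stand-in for softmax + loss)
    return a, r, loss

def backward(w, x, t):
    a, r, loss = forward(w, x, t)
    # Walk the chain in reverse, multiplying local derivatives
    dloss_dr = 2 * (r - t)
    dr_da = 1.0 if a > 0 else 0.0
    da_dw = x
    return dloss_dr * dr_da * da_dw

w, x, t = 0.5, 4.0, 1.0
grad = backward(w, x, t)

# Independent check with a central finite difference on w
h = 1e-6
numeric = (forward(w + h, x, t)[2] - forward(w - h, x, t)[2]) / (2 * h)
print(grad, numeric)          # both close to 8
```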

KL divergence is named after Solomon Kullback and Richard Leibler, who introduced it in a 1951 statistics paper; both worked as cryptanalysts. About 70 years later it became a standard ingredient of RLHF.

Probability

Distributions and the softmax function.

A random variable is a quantity whose value is uncertain — like a die roll, or the next token an LLM produces.

A probability distribution is a list of how likely each value is. Two requirements: every probability is in [0, 1], and they all sum to 1.

Example: a fair die.

P(X=1) = 1/6 ≈ 0.167
P(X=2) = 1/6 ≈ 0.167
... (all six values)
                 ─────
            sum = 1.000 ✓

An LLM outputs a distribution like this every step — but with ~100,000 outcomes (one per vocabulary token) instead of 6.

Softmax turns any vector of real numbers (called logits) into a probability distribution.

Concrete example. Logits z = [2.0, 1.0, 0.0]:

Step 1: exponentiate (makes everything positive):
  exp(2.0) ≈ 7.389
  exp(1.0) ≈ 2.718
  exp(0.0) = 1.000

Step 2: divide by the total (so they sum to 1):
  total = 7.389 + 2.718 + 1.000 = 11.107
  p[0] = 7.389 / 11.107 ≈ 0.665
  p[1] = 2.718 / 11.107 ≈ 0.245
  p[2] = 1.000 / 11.107 ≈ 0.090
                          ─────
                    sum = 1.000 ✓

The formula:

softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)

The numerical stability trick. If z = [1000, 999, 998], computing exp(1000) overflows. Subtract the max — softmax is invariant to a shared constant:

softmax(z)ᵢ = exp(zᵢ - max(z)) / Σⱼ exp(zⱼ - max(z))

Now the largest exponent input becomes 0 (so exp(0) = 1) and nothing overflows. You'll see this trick again in FlashAttention on Day 21.
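Both versions of the formula can be written in a few lines of plain Python. A minimal sketch; the function names are chosen here. Note that `math.exp` raises `OverflowError` on large inputs, whereas array libraries typically return `inf` and then `nan`.

```python
import math

def softmax_naive(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax(z):
    """Numerically stable softmax: subtract max(z) before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 3) for p in softmax([2.0, 1.0, 0.0])])          # [0.665, 0.245, 0.09]
print([round(p, 3) for p in softmax([1000.0, 999.0, 998.0])])   # same probabilities

try:
    softmax_naive([1000.0, 999.0, 998.0])
except OverflowError as err:
    print("naive version failed:", err)
```

The two inputs give identical outputs because they differ only by a shared constant (998), which is exactly the invariance the trick relies on.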

Loss

Why we need a loss function.

Before we look at cross-entropy, let's zoom out and remember what we're actually doing.

The big picture. An LLM is a function: given some text, predict what comes next. To learn that function, the LLM has to be trained. Training means: we show it examples ("the cat sat on the ___") with the correct answer ("mat"), and we adjust its parameters so it gets better at predicting the right answer.

But to "adjust parameters to get better," we first need to measure how wrong the model is right now. We need a number.

That number is called the loss.

  • High loss = model is very wrong. Adjust parameters to bring it down.
  • Low loss = model is doing well. Don't change much.

The loss is what we compute gradients of (using the chain rule from a few sections back), and what we minimize using gradient descent. Backprop and gradient descent, the tools we built up, exist exactly to drive this loss number down toward zero.

So our question becomes: given a model's predicted probability distribution and the correct answer, how do we compute the loss?

That's what cross-entropy does. Let's build it from scratch — no formulas yet, just intuition.

Cross-Entropy

Building it from intuition.

Recall: softmax takes the model's logits and produces a probability distribution. So if the model is choosing among three possible next tokens, softmax might give us:

Q = [0.7, 0.2, 0.1]

(Q for "predicted distribution" — that's the convention.) Meaning: 70% chance of token 0, 20% of token 1, 10% of token 2.

Suppose the correct answer is token 0. How "wrong" was the model?

Let's reason about it intuitively:

  • If the model gave token 0 a probability of 1.0 → not wrong → loss should be 0.
  • If it gave token 0 a probability of 0.7 → mostly right → loss should be small.
  • If it gave token 0 a probability of 0.1 → mostly wrong → loss should be bigger.
  • If it gave token 0 a probability of 0.001 → confidently wrong → loss should be really big.

So we need a function loss(p) that:

  • Takes the predicted probability for the correct answer.
  • Outputs 0 when p = 1.
  • Grows as p shrinks toward 0.

Logarithm has exactly the shape we want. Look at how log(p) behaves:

log(1.0)   =  0
log(0.7)   ≈ -0.357
log(0.5)   ≈ -0.693
log(0.1)   ≈ -2.303
log(0.01)  ≈ -4.605
log(0.001) ≈ -6.908

(Reminder: log here means natural log — log base e ≈ 2.718. Same as Python's math.log.)

log(1) = 0 exactly. As p heads toward 0, log(p) heads toward negative infinity. Almost what we want — except we need loss high when p is low, but log gives us negative numbers.

Easy fix: flip the sign. Negative log gives us:

-log(1.0)   = 0
-log(0.7)   ≈ 0.357
-log(0.1)   ≈ 2.303
-log(0.001) ≈ 6.908

That's exactly the shape we want. Loss = 0 when the model is perfectly confident in the right answer. Loss grows as the model becomes less confident. Loss explodes when the model is confidently wrong.

That's cross-entropy in its simplest form:

loss = -log(probability the model assigned to the correct answer)

That's the whole core idea. Everything else is dressing.

Cross-Entropy

A worked example, walked slowly.

Let's compute cross-entropy for a real example.

Setup:

  • Vocabulary has 3 tokens (indices 0, 1, 2).
  • The correct answer is token 0.
  • The model predicts: Q = [0.7, 0.2, 0.1].

Step 1. Pick out the probability the model gave to the correct token.

Q[0] = 0.7

(If Q[0] looks unfamiliar: it's just Python-style indexing. The first entry of the list Q.)

Step 2. Take the negative log.

loss = -log(0.7) ≈ 0.357

Done. The loss is 0.357 nats.

(About "nats." When we use natural log, the unit of cross-entropy is called nats. With log base 2 it would be bits. PyTorch uses nats. Doesn't change anything qualitatively.)
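The two steps of the worked example, plus the nats-versus-bits remark, fit in a short sketch. The function name `cross_entropy` is chosen here for illustration.

```python
import math

def cross_entropy(q, correct):
    """Loss in nats: -log of the probability given to the correct token."""
    return -math.log(q[correct])

Q = [0.7, 0.2, 0.1]
loss_nats = cross_entropy(Q, 0)      # natural log
loss_bits = -math.log2(Q[0])         # same quantity measured with log base 2

print(round(loss_nats, 3))   # 0.357
print(round(loss_bits, 3))   # 0.515
```

The two numbers differ only by the constant factor log(2), which is why the choice of unit does not change anything qualitatively.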

The full spectrum of loss values. Here's what cross-entropy looks like across different model predictions, for the same correct answer (token 0):

Model predicts Q | Q[0] | Loss | Interpretation
[1.0, 0.0, 0.0] | 1.0 | 0.000 | Perfect: fully confident, fully right
[0.9, 0.05, 0.05] | 0.9 | 0.105 | Very good
[0.7, 0.2, 0.1] | 0.7 | 0.357 | Decent
[0.5, 0.3, 0.2] | 0.5 | 0.693 | Genuinely uncertain
[1/3, 1/3, 1/3] | 0.333 | 1.099 | Random guessing; note log(3) ≈ 1.099
[0.1, 0.6, 0.3] | 0.1 | 2.303 | Confidently wrong
[0.001, 0.999, 0] | 0.001 | 6.908 | Very confidently wrong

Key thresholds:

  • Perfect: loss = 0.
  • Random guesser over a vocabulary of size V: loss ≈ log(V). For 3 tokens, log(3) ≈ 1.099. For GPT-2 (V=50,257), log(50,257) ≈ 10.82.

You'll actually see this number when you start training a model from scratch. Initial loss should be near log(vocab_size) — the model is essentially guessing. Much higher → bug. Much lower → also probably bug. It's a sanity check you'll use constantly.
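The log(vocab_size) sanity check is easy to compute for any vocabulary. A minimal sketch; the helper name `uniform_loss` is made up, and the vocabulary sizes are taken from models mentioned in this lesson.

```python
import math

def uniform_loss(vocab_size):
    """Cross-entropy (in nats) of a model that gives every token probability 1/V."""
    return -math.log(1.0 / vocab_size)

for V in (3, 50257, 128256):
    print(V, round(uniform_loss(V), 2))
```

Comparing the first logged training loss against this value is a cheap check that the data pipeline and loss wiring are plausible.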

Cross-Entropy

The general formula and the training loop.

In ML papers and code, you'll see cross-entropy written like this:

H(P, Q) = -Σᵢ Pᵢ log Qᵢ

Now we can decode it carefully:

  • H(P, Q) — function name. H is conventional for entropy-related quantities (from Boltzmann's 19th-century H-theorem — long detour).
  • P — the true distribution. For classification, this is a one-hot vector — all zeros except a 1 at the correct index.
  • Q — the predicted distribution from the model.
  • Σᵢ — sum over index i. i runs over every entry of the distributions.
  • Pᵢ, log Qᵢ — i-th entries. log is natural log.

Reading: "for every index i, multiply Pᵢ by log Qᵢ, sum them all up, then negate."

Why it reduces to our simple form. When P is one-hot, only one entry of P is 1. Everything else gets multiplied by 0 and disappears. Only the term at the correct index c survives:

H(P, Q) = -Σᵢ Pᵢ log Qᵢ = -(0 · log Q₀ + ... + 1 · log Q_c + ... + 0 · log Q_{V-1}) = -log Q_c

Same as our simple -log(probability for the correct answer).
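The reduction from the general formula to the simple form can be confirmed directly. A minimal sketch; the function name `cross_entropy_full` is chosen here, and terms with Pᵢ = 0 are skipped, matching the "multiplied by 0 and disappears" reasoning above.

```python
import math

def cross_entropy_full(P, Q):
    """H(P, Q) = -sum_i P_i * log(Q_i); terms with P_i == 0 contribute nothing."""
    return -sum(p * math.log(q) for p, q in zip(P, Q) if p > 0)

Q = [0.7, 0.2, 0.1]
c = 0                                                   # index of the correct token
P = [1.0 if i == c else 0.0 for i in range(len(Q))]     # one-hot true distribution

print(cross_entropy_full(P, Q), -math.log(Q[c]))        # the two values agree
```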

Connecting back to LLM training. Now we can connect every piece:

  1. The LLM emits logits — raw scores for each possible next token.
  2. Softmax turns logits into probabilities Q.
  3. We have a correct answer — token index c.
  4. Cross-entropy computes loss = -log(Q[c]).
  5. We compute gradients of this loss w.r.t. every parameter (using the chain rule we just learned).
  6. Gradient descent nudges parameters so Q[c] is bigger next time.

Repeat across billions of training examples → you get an LLM. The full training loop in 6 bullets.
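The six steps can be run end to end on a toy problem. In this sketch the three logits themselves play the role of the parameters, which is a simplification made for illustration; it also uses the standard result that, for softmax followed by cross-entropy, the gradient of the loss with respect to logit i is Qᵢ − Pᵢ. The learning rate and number of steps are arbitrary.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

logits = [0.0, 0.0, 0.0]   # step 1: raw scores (the only "parameters" in this toy)
c = 2                      # step 3: index of the correct token
lr = 0.5                   # learning rate, chosen arbitrarily

history = []
for step in range(5):
    Q = softmax(logits)                                   # step 2
    loss = -math.log(Q[c])                                # step 4
    # step 5: d(loss)/d(logit_i) = Q_i - P_i, with P one-hot at c
    grad = [Q[i] - (1.0 if i == c else 0.0) for i in range(len(Q))]
    # step 6: gradient descent update
    logits = [z - lr * g for z, g in zip(logits, grad)]
    history.append(loss)
    print(step, round(loss, 4), round(Q[c], 4))
```

The first printed loss is log(3), the random-guess value, and it falls as Q[c] grows.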

KL divergence (briefly). Cross-entropy scores a predicted distribution Q against a true distribution P. KL divergence measures how far apart any two distributions are, for example the predictions of two models:

KL(P || Q) = Σᵢ Pᵢ log(Pᵢ / Qᵢ)

Always ≥ 0; zero when P and Q are identical. Algebraic identity: KL(P||Q) = H(P,Q) - H(P). With fixed P, minimizing cross-entropy is equivalent to minimizing KL.
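The identity KL(P||Q) = H(P,Q) - H(P) can be verified numerically. A minimal sketch; the two distributions are made-up values for illustration.

```python
import math

def entropy(P):
    return -sum(p * math.log(p) for p in P if p > 0)

def cross_entropy(P, Q):
    return -sum(p * math.log(q) for p, q in zip(P, Q) if p > 0)

def kl(P, Q):
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.3, 0.2]    # illustrative distributions
Q = [0.7, 0.2, 0.1]

print(kl(P, Q), cross_entropy(P, Q) - entropy(P))   # equal up to rounding
print(kl(P, P))                                     # 0.0 when the distributions match
```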

Where KL shows up: RLHF and DPO (Day 14). Used to keep a fine-tuned model close to its base — math for "stay close to where you started."

Tokenization

What's a tokenizer?

The core problem. Every operation we covered today — matmuls, softmax, cross-entropy — operates on numbers. But LLM input is text. There's no way to multiply a string by a matrix. So before any math happens, text gets converted into a sequence of integers. The thing that does the conversion is the tokenizer.

[Figure: a Python str, "The cat sat on the mat", passes through the tokenizer (encode: str → int[]) and comes out as a list of 6 ints, [464, 3797, 3332, 319, 262, 2603].] Text in, integers out. Decoding runs the inverse: integers back to text.

Three approaches, one winner. You could split text by character (tiny vocab, but sequences become very long — and attention is quadratic in sequence length). You could split by word (short sequences, but vocab explodes and rare words have no IDs). Modern LLMs split by subword using BPE (Byte Pair Encoding) — common words get one ID, rare words get split into chunks. Best of both. Day 5 builds BPE from scratch.

Vocabulary size = V. The fixed list of tokens a tokenizer knows. Same V you'll see below in (B, T, V) logits.

Model | Vocab size V
GPT-2 | 50,257
GPT-3.5 / GPT-4 (cl100k_base) | 100,277
GPT-4o (o200k_base) | 200,019
LLaMA 2 | 32,000
LLaMA 3 | 128,256
Mistral / Mixtral | 32,000
Gemma | 256,000

Bigger V = each token carries more meaning, but the embedding table (V, D) and LM head (D, V) grow too. Trade-off, set once at training time.

Properties to remember:

  • Deterministic + reversible. decode(encode(x)) == x.
  • Trained, not hand-coded. Learned from a corpus before LLM training begins.
  • Frozen with the model. Can't swap tokenizers on a trained model — the embedding table is indexed by its tokenizer's IDs.
  • Rule of thumb. For English, 1 token ≈ ¾ word ≈ 4 chars. So 1,000 tokens ≈ 750 words ≈ 1.5 pages.
Takeaway: a tokenizer is the deterministic bridge from text to integers. It defines V, and it's the first thing that runs — before any math. Day 5 builds one.
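To see the encode/decode contract in code, here is a toy character-level tokenizer. It is deliberately not BPE: the class name `CharTokenizer` and its methods are invented for this sketch, and its vocabulary is simply the set of characters in a sample string.

```python
class CharTokenizer:
    """Toy character-level tokenizer (not BPE); vocabulary comes from a sample text."""

    def __init__(self, corpus):
        self.itos = sorted(set(corpus))                          # id -> character
        self.stoi = {ch: i for i, ch in enumerate(self.itos)}    # character -> id

    @property
    def vocab_size(self):
        return len(self.itos)

    def encode(self, text):
        return [self.stoi[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("The cat sat on the mat")
ids = tok.encode("the cat")
print(tok.vocab_size, ids, tok.decode(ids))
```

A character that never appeared in the sample text raises `KeyError` here, which is a small-scale version of the "rare words have no IDs" problem that subword tokenizers are designed to avoid.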

GPT-3.5 once confidently said "9.11 > 9.9" — partly a tokenization artifact. "9.9" and "9.11" tokenize in ways that hide their numeric structure. Karpathy's minbpe implements BPE in ~300 lines of Python; Day 5 walks through it.

Embeddings

What's an embedding?

The problem with raw integer IDs. The tokenizer hands you a list of integers like [464, 3797, 3332]. But integer 5379 has no useful "geometric" relationship to integer 5380 — they're just different labels. Neural networks need vectors of floats (so they can dot-product, add, multiply by matrices) and they need similar tokens to be near each other in space (so the model can learn that "cat" and "dog" play similar roles). Integers carry neither.

The solution: a lookup table. The model owns a learned matrix called the embedding table, written E, of shape (V, D):

  • V = vocabulary size (from the tokenizer).
  • D = hidden dimension (a hyperparameter — GPT-2 uses 768, LLaMA-7B uses 4096).
  • Row i of E is the vector for token i — a D-dimensional point in space.
[Figure: embedding lookup. Each entry of ids (shape (T,), e.g. [9906, 11, 1917]) selects one row of E (shape (V, D)), giving x of shape (T, D): 3 rows, each a D-dim vector of floats.]

Embedding lookup: integer ID i picks row i of E. (T,) ints become (T, D) floats. For a batch, (B, T) → (B, T, D).

Embedding a sequence of token IDs is just fancy indexing:

ids = tokenizer.encode("Hello, world!")    # shape (T,)   e.g. [9906, 11, 1917, 0]
x   = E[ids]                                # shape (T, D)

For a batch, the same lookup turns (B, T) integer IDs into (B, T, D) vectors — and (B, T, D) is the shape of the "residual stream" that flows through every transformer block. This is the only place text-as-integers becomes vectors-of-floats. From here on, everything is matmul.
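The same lookup can be tried in NumPy with a random table standing in for learned weights. A minimal sketch; the toy sizes V=100 and D=8 and the token IDs are made up for illustration.

```python
import numpy as np

V, D = 100, 8                          # toy sizes, assumed for illustration
rng = np.random.default_rng(0)
E = rng.normal(size=(V, D))            # random stand-in for a learned table

ids = np.array([42, 7, 99])            # shape (T,), made-up token IDs below V
x = E[ids]                             # fancy indexing: one row of E per ID
print(x.shape)                         # (3, 8)

batch_ids = np.array([[42, 7, 99],
                      [0, 1, 2]])      # shape (B, T)
print(E[batch_ids].shape)              # (2, 3, 8)
```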

In PyTorch it's one line:

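import torch
import torch.nn as nn
V, D = 100, 8                      # toy sizes for illustration
ids = torch.randint(0, V, (2, 5))  # (B, T) integer token IDs
embed = nn.Embedding(V, D)         # creates a (V, D) learnable table
x = embed(ids)                     # shape (B, T, D) = (2, 5, 8)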

Embeddings are learned, not designed. When training starts, E is filled with random numbers. With each gradient-descent step, the rows for tokens that appeared in the training batch get nudged. After billions of steps, semantically related tokens end up clustered together in D-dimensional space, and the geometric structure of E carries real meaning. You don't write the embeddings — gradient descent does.

Tying it back to V and D. The embedding table has V × D parameters. For LLaMA-7B (V=32,000, D=4096), that's 131M parameters in the embedding table alone — meaningful but a small slice of the 7B total. Many models share the same matrix between the embedding (input) and the LM head (output) — called weight tying — to save those parameters.

Takeaway: an embedding is a learned D-dim vector for each vocab token. The table E of shape (V, D) is the bridge from token IDs to vectors — the step that turns (B, T) ints into (B, T, D) floats so the rest of the model can do math on them. Day 5 looks at it in detail.
Mental Model

What this last section is about.

You've now learned vectors, matrices, dot products, matrix multiplication, derivatives, the chain rule, softmax, and cross-entropy. You have all the math you need.

Before we close out Day 1, let's do something practical with it: trace data flowing through a real LLM, end-to-end.

The objective. By the end of this section, you should be able to say what shape the data has at every point in an LLM forward pass — from raw text input all the way to a probability distribution over the vocabulary. You should be able to predict, when you see print(x.shape), whether the printed shape makes sense.

Why this matters. When you're reading transformer code (or debugging your own), the single most useful thing you can do is track shapes. Most ML bugs are shape mismatches. Senior ML engineers think in shapes the way senior backend engineers think in HTTP status codes — automatically.

A note before we begin. This section will use words like embedding, transformer block, and LM head that we haven't fully explained yet. That's OK. You don't need to understand what each operation does yet. You just need to understand what each operation does to the shape. By Day 7 every word here will be fully explained. Today, you're installing the shape skeleton — a mental scaffolding the rest of Week 1 will fill in.

The five shape variables you'll see constantly:

Letter | Meaning | Typical value
B | Batch: sequences in flight at once | 1 to 256
T | Sequence length, in tokens | 128 to 8192+
D | Hidden dimension (vector size per token) | 512 to 16384
V | Vocabulary size | 32k to 256k
L | Number of layers stacked | 12 to 80+

A token vector lives in ℝ^D. A whole batch of sequences lives in ℝ^{B×T×D} — three dimensions: which sequence, which token in the sequence, which entry of the token's vector.

[Figure: a 3D tensor of shape (B=2, T=5, D=8), drawn as two stacked "pages". Each page is a sequence (page 0: "Hello world! This is fun"; page 1: "The cat sat on mat"), each row is a token, and each row has D = 8 entries.]

A (B, T, D) tensor: B sequences, each with T tokens, each token represented by a D-dim vector.
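Indexing into such a tensor peels off one axis at a time, which is a quick way to build intuition for "which sequence, which token, which entry". A minimal sketch using NumPy with filler values.

```python
import numpy as np

B, T, D = 2, 5, 8
x = np.arange(B * T * D, dtype=float).reshape(B, T, D)   # filler values 0, 1, 2, ...

print(x.shape)          # (2, 5, 8): the whole batch
print(x[0].shape)       # (5, 8): one sequence, T token vectors
print(x[0, 3].shape)    # (8,): one token's D-dim vector
print(x[0, 3, 7])       # 31.0: a single number
```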

Mental Model

The forward pass, step by step.

We'll use small numbers so you can hold everything in your head: B=2, T=5, D=8, V=100. (Real models are hundreds to thousands of times bigger in D and V, but the structure is identical.)

Imagine two short prompts:

Prompt 1: "Hello world! This is fun"   (5 tokens)
Prompt 2: "The cat sat on mat"           (5 tokens)

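Step 0 — Tokenize. Each prompt becomes a list of integer IDs. (The IDs shown below are illustrative values from a large real-world vocabulary; with the toy V=100 used here, valid IDs would run from 0 to 99.)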

[Figure: the two 5-token prompts pass through the BPE tokenizer (encode: str → int[]) and come out as ids of shape (B, T) = (2, 5): just integers, no vector dimension yet.]
ids[0] = [15496, 11, 995, 1212, 318]
ids[1] = [25699, 22, 887, 8888, 4444]
ids.shape = (2, 5)    # (B, T)

Step 1 — Embedding lookup. The model has an embedding table E of shape (V, D). Each row of E is a learned vector for one token.

[Figure: embedding lookup x = E[ids], with E.shape = (V, D) = (100, 8). ids of shape (2, 5) become x of shape (B, T, D) = (2, 5, 8); each integer becomes a learned 8-dim vector.]
E.shape = (100, 8)
x = E[ids]
x.shape = (2, 5, 8)    # (B, T, D) — gained a dim!

Now every integer ID has been replaced by an 8-dim vector. This is the only "lookup" in the forward pass — from here on, everything is matmul.

Step 2 — Add positional encoding. A position-dependent vector is added so the model knows token order.

[Figure: x of shape (2, 5, 8) plus positional encoding P of shape (T, D) = (5, 8) gives x of shape (2, 5, 8). The shape is unchanged; position info is added into each token vector via broadcasting.]
x.shape = (2, 5, 8)    # unchanged

Step 3 — L transformer blocks. Each block reads x, does attention + feedforward, returns a tensor of the same shape.

[Figure: x of shape (2, 5, 8) passes through L transformer blocks (attention + feedforward) and comes out with shape (2, 5, 8). Contents are transformed; the shape is preserved at every block.]
for block in blocks:    # blocks: the L transformer blocks, in order
    x = block(x)
    # x.shape is (2, 5, 8) after every block; the shape never changes

Examples: GPT-2 small has L=12. LLaMA-7B has L=32. GPT-4 reportedly L≈120.

Step 4 — Final layer norm.

[Figure: the final layer norm rescales each token vector to a controlled magnitude; x keeps shape (2, 5, 8).]
x.shape = (2, 5, 8)    # still unchanged

Step 5 — LM head: project to vocab logits. A single matmul converts each D-dim hidden vector into V scores.

[Figure: x of shape (B, T, D) = (2, 5, 8) @ W_lm of shape (D, V) = (8, 100) = logits of shape (B, T, V) = (2, 5, 100). D cancels in the matmul and the last axis becomes V: for each token there is now a score for every entry in the vocabulary.]
W_lm.shape = (8, 100)    # (D, V)
logits = x @ W_lm         # the matmul we just learned!
logits.shape = (2, 5, 100)    # (B, T, V)

(Matmul intuition: the same transformation is applied to every token row.)

[Figure: the LM head matmul, where the shape finally changes. x (B, T, D) = (2, 5, 8) @ W_lm (D, V) = (8, 100) = logits (B, T, V) = (2, 5, 100). D is x's last axis and W_lm's first axis, so those two cancel; B and T pass through unchanged.]

Shape arithmetic for the LM head: (B, T, D) @ (D, V) → (B, T, V). Same matmul rule we computed by hand earlier.

Step 6 — Softmax over the last axis.

[Figure: softmax over the last (vocabulary) axis turns logits of shape (2, 5, 100), raw scores such as [3.1, -0.7, 2.8, ...], into probs of the same shape, where every row sums to 1.]
probs = softmax(logits, axis=-1)
probs.shape = (2, 5, 100)

Each probs[b, t, :] is now a probability distribution over the vocabulary — telling us how likely each token is to come next.

Step 7 — Pick the next token (during generation).

[Figure: argmax at the last position, probs[:, -1, :].argmax(-1), picks the most likely vocab index for each sequence, giving next_token of shape (B,) = (2,), e.g. [42, 87].]
next_token_probs = probs[:, -1, :]    # last position, shape (B, V)
next_token = next_token_probs.argmax(-1)    # shape (B,)

Picked token gets appended to the input → run again → autoregressive generation.

That's the whole forward pass. Every modern LLM — GPT-4, Claude, LLaMA, Mistral, Gemini, DeepSeek — follows this same skeleton. Different sizes, different details inside the blocks, same shape walk.
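The whole shape walk can be run with random weights. This is a sketch of the skeleton only: the `block` function is a placeholder that preserves shape rather than real attention and feedforward, the layer norm omits learned scale and shift, L=3 is arbitrary, and the token IDs are drawn below the toy V so the lookup is valid.

```python
import numpy as np

B, T, D, V, L = 2, 5, 8, 100, 3        # toy sizes; L=3 is arbitrary
rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # max-subtraction trick
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def block(x, W):
    # Placeholder for attention + feedforward: any map (B, T, D) -> (B, T, D)
    return x + np.tanh(x @ W)

ids = rng.integers(0, V, size=(B, T))            # step 0: (B, T) token IDs
E = rng.normal(size=(V, D))                      # embedding table
P = rng.normal(size=(T, D))                      # positional encoding
blocks = [rng.normal(size=(D, D)) * 0.1 for _ in range(L)]
W_lm = rng.normal(size=(D, V))                   # LM head

x = E[ids]                                       # step 1: (B, T, D)
x = x + P                                        # step 2: (T, D) broadcasts over B
for W in blocks:                                 # step 3: shape preserved
    x = block(x, W)
x = layer_norm(x)                                # step 4: shape preserved
logits = x @ W_lm                                # step 5: (B, T, V)
probs = softmax(logits, axis=-1)                 # step 6: same shape, rows sum to 1
next_token = probs[:, -1, :].argmax(-1)          # step 7: (B,)

for name, arr in [("ids", ids), ("x", x), ("logits", logits),
                  ("probs", probs), ("next_token", next_token)]:
    print(f"{name:10s} {arr.shape}")
```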

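The diagram below combines all the stations into a single flow. It numbers the states of the data rather than the operations, so its labels are offset by one from the steps above. Outlined boxes mark the two places inside the model where the shape actually changes: the embedding lookup (diagram Step 2) and the LM head (diagram Step 6). Everything between them is shape-preserving.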

[Figure: the full forward pass as one flow. Text → tokenize → ids (B, T) = (2, 5) → embedding lookup E[ids] → x (B, T, D) = (2, 5, 8) → + positional encoding → × L transformer blocks → final layer norm → @ W_lm → logits (B, T, V) = (2, 5, 100) → softmax over the last axis → probs (B, T, V) → argmax / sample at the last position → next_token (B,) = (2,). Outlined boxes mark where the shape changes (embedding and LM head); filled boxes mark shape-preserving steps.]

The full Day-1 forward pass. Two shape transitions: (B,T) → (B,T,D) at embedding, and (B,T,D) → (B,T,V) at the LM head. Everything in between is shape-preserving.

The three "big" shapes to watch:

  • (B, T, D) — the "residual stream." Almost every intermediate tensor has this shape. D is the model's "channel count."
  • (D, V) — the LM head matrix. Maps hidden state → vocab logits. For LLaMA-7B (D=4096, V=32000), 131M parameters in the LM head alone.
  • (B, h, T, T) — attention scores (Day 6, h = number of heads). The T × T means every token has a similarity score with every other token. Quadratic in T — what FlashAttention attacks on Day 21.

Connecting back to loss. Now we can locate where cross-entropy fits in the pipeline:

ids → embeddings → blocks → logits → softmax → probs → cross-entropy → loss

During training: every position has a "correct next token." Cross-entropy computes a loss at each position; we average those losses, backpropagate, and update the parameters.

During inference: there is no correct answer. We take the argmax of (or sample from) the logits at the last position, append the new token, and repeat.

Same model. Same forward pass. Different post-processing.
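
A minimal sketch of the two post-processing paths, applied to one logits tensor. The logits and targets here are random stand-ins rather than real model output, and `log_softmax` is a small helper written for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
B, T, V = 2, 5, 100
logits = rng.normal(size=(B, T, V))  # stand-in for the model's output

def log_softmax(z):
    # Stable log-softmax over the last axis (max-subtraction trick).
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

# Training path: every position has a target token.
targets = rng.integers(0, V, size=(B, T))           # (B, T) correct next tokens
log_probs = log_softmax(logits)                     # (B, T, V)
picked = np.take_along_axis(log_probs, targets[..., None], axis=-1).squeeze(-1)  # (B, T)
loss = -picked.mean()                               # scalar: average cross-entropy

# Inference path: only the last position matters, and no targets exist.
next_token = logits[:, -1, :].argmax(axis=-1)       # (B,)

print("loss:", float(loss))
print("next_token shape:", next_token.shape)
```

Taking the argmax of the logits gives the same token as taking the argmax of the probabilities, because softmax preserves ordering along the vocabulary axis.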

Debugging mantra: a large share of ML bugs are shape mismatches. Add print(x.shape) liberally.
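
One way to make that habit stick is to promote the print into a check that fails loudly. The helper below, `check_shape`, is a hypothetical utility written for this example, not part of NumPy.

```python
import numpy as np

def check_shape(name, x, expected):
    """Return x unchanged, or raise a readable error if its shape is wrong."""
    expected = tuple(expected)
    if x.shape != expected:
        raise ValueError(f"{name}: expected shape {expected}, got {x.shape}")
    return x

B, T, D, V = 2, 5, 8, 100
x = np.zeros((B, T, D))
W_lm = np.zeros((D, V))

# Correct orientation: (B, T, D) @ (D, V) -> (B, T, V)
logits = check_shape("logits", x @ W_lm, (B, T, V))

# A typical bug: the weight is transposed, so the inner dimensions disagree.
caught = False
try:
    x @ W_lm.T  # (B, T, D) @ (V, D): 8 does not match 100
except ValueError as err:
    caught = True
    print("matmul refused:", err)
```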

Exercise

Compute it by hand, then in code.

Companion notebook: day-1-math.ipynb. Type the code; don't copy-paste. The point is to feel the numbers and the shapes.

  1. Dot product and matmul by hand. Compute [1, 2, 3] · [4, 5, 6] on paper, then verify with NumPy. Multiply A = [[1, 2], [3, 4]] by B = [[5, 6], [7, 8]] entry by entry, and confirm that A @ B ≠ B @ A.
  2. Derivative as a slope. For f(x) = x², estimate the slope at x = 3 by finite difference ((f(3.001) − f(3)) / 0.001) and check it approaches the exact value 2x = 6.
  3. Chain rule. For y = (3x + 1)², compute dy/dx two ways — expand first, then via the chain rule — and confirm they agree at x = 2 (you should get 42).
  4. Softmax from scratch. Implement softmax([2.0, 1.0, 0.0]) and confirm the output sums to 1. Then feed it [1000, 999, 998], watch it overflow, and fix it with the max-subtraction trick.
  5. Cross-entropy. For the prediction Q = [0.7, 0.2, 0.1] with correct answer token 0, compute −log(Q[0]). Then build the full table of loss values from the lesson and confirm a random guesser over V tokens scores about log(V).
  6. Trace the shapes. Using B=2, T=5, D=8, V=100, write out the shape after every station of the forward pass — from (B, T) token IDs to (B, T, V) logits — and mark the only two steps where the shape changes.
Self-Check

Nine questions before moving on.

Close the page and answer from memory. If you can't, re-read the relevant section.

  1. In x ∈ ℝ^d, what does each of x, ∈, ℝ, and ^d mean? Translate each symbol to Python.
  2. Why do we subtract the max in softmax? What goes wrong without it, and why does subtracting max(z) not change the output?
  3. Why is matrix multiplication not commutative? Give a shape-based argument — when would A @ B and B @ A both be legal, and do they give the same result?
  4. Compute by hand: softmax([1, 0, -1]). Sanity check that the answer sums to 1. (Hint: exp(1) ≈ 2.718, exp(0) = 1, exp(-1) ≈ 0.368.)
  5. For an LLM with input shape (B, T), walk through how it becomes (B, T, V) logits. At which two steps does the shape change, and what are the input and output shapes at each?
  6. Why is character-level tokenization expensive at inference time? (Hint: which tensor in the forward pass is quadratic in T? What does that mean for a 10× longer sequence?)
  7. Why can't an LLM operate directly on raw integer token IDs? What does the embedding table give you that integers don't, what shape is it, and what shape comes out of the lookup?
  8. A model generates tokens one at a time. At each step it runs the full forward pass and picks probs[:, -1, :].argmax(-1). Why the last position? Why not position 0?
  9. An inference engineer says "this model is memory-bandwidth-bound at batch size 1." What does that mean in terms of the matmul operations happening at each token step, and what would you do to shift it toward compute-bound?

"You don't need to be a mathematician. You need to be fluent — the way you're fluent with for-loops."

A core principle of this curriculum
Further Reading

Go deeper.

Hand-picked references for this lesson. Free where possible. Books and papers where the depth is irreplaceable.

YouTube · Free · ~3 hrs

3Blue1Brown — Essence of Linear Algebra

The single best linear algebra resource ever made. Episodes 1-7 are essential.

YouTube · Free · ~2.5 hrs

3Blue1Brown — Essence of Calculus

Beautiful animations. Episodes 1-4 cover everything we need for derivatives and chain rule.

Free PDF · Reference

Mathematics for Machine Learning

Deisenroth, Faisal, Ong. Chapters 2 (Linear Algebra), 5 (Calculus), 6 (Probability) cover everything we use.

Free book · Reference

Goodfellow et al. — Deep Learning

Chapters 2-3 are a concise math primer for neural networks.

Stanford · Cheatsheet

CS229 Linear Algebra Review

Andrew Ng's course handout. 23 dense pages. Worth bookmarking.

Reference · Math

The Matrix Cookbook

Petersen & Pedersen. Comprehensive reference for derivatives of matrix expressions.
