Day 01 — Math You Actually Need · LLM Inference Engineer Curriculum

Why This Lesson

A small set of operations, used everywhere.

Every modern LLM is built from a surprisingly small set of mathematical operations, repeated at enormous scale. You do not need a degree in mathematics to understand them. You need to be fluent with a handful of ideas — the way you are fluent with for-loops and dictionaries — so that when you read transformer code or a research paper, the symbols read as plainly as Python.

Today we build that fluency from the ground up. We decode every symbol the first time it appears, work each idea through a concrete numerical example before stating the general rule, and connect each piece directly to where it shows up inside an LLM. By the end you will have traced data all the way through a forward pass — from raw text to a probability distribution over the vocabulary — thinking in shapes the whole way.

This is the foundation the entire rest of Week 1 stands on. Take your time with it.

Learning objectives

Read the standard math notation used in ML papers — sets, subscripts, sums, derivatives, and gradients — and translate each symbol to its Python equivalent.
Compute dot products and matrix multiplications by hand, and predict the output shape of any matmul from its input shapes.
Explain a derivative as a slope and apply the chain rule to a composed function — the rule that makes backpropagation possible.
Turn a vector of logits into a probability distribution with softmax, including the max-subtraction trick for numerical stability.
Derive cross-entropy loss from intuition and connect it to the six-step LLM training loop.
Trace the shapes of an LLM forward pass end to end, from (B, T) token IDs to (B, T, V) logits.

Notation Cheatsheet

Decoding every symbol you'll see today.

If you've never read a math paper, the symbols can look intimidating. They aren't. Here's everything you'll see in this lesson.

Variable names — just like in code:

x, y, a, W, θ — names. Lowercase = number or vector. Uppercase = matrix. Greek letters = often parameters or angles.

Sets of numbers:

ℝ — "the reals." All real numbers — Python float. So 5.0, -3.14, π are in ℝ.
ℕ — natural numbers (0, 1, 2, ...). Python int (non-negative).
ℤ — all integers, positive and negative.

Set membership:

∈ — "is in" / "belongs to." Same as Python's in.
x ∈ ℝ — "x is a real number."
x ∈ ℝ^d — "x is a vector of d real numbers."
W ∈ ℝ^{m×n} — "W is an m-by-n matrix."

Subscripts and big operators:

xᵢ ("x sub i") — the i-th entry of vector x.
Σ (sigma) — sum. Same as Python sum(...).
Π (pi) — product. Same as math.prod(...).

Calculus and probability:

df/dx — derivative of f w.r.t. x. The slope of f at x.
∂f/∂x ("partial f, partial x") — partial derivative when f has multiple inputs.
∇f ("del f" or "gradient of f") — vector of all partial derivatives.
𝔼[X] — expected value (mean).
exp(x) = e^x, log(x) = natural log.
≈ — approximately equals.

You don't need to memorize this. As we encounter each symbol below, it'll be re-explained in context.
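As a quick reference, here is a minimal sketch of how a few of these symbols map onto plain Python. The vector x and the variable names are illustrative choices, and note that papers usually count indices from 1 while Python counts from 0.

```python
import math

x = [3.0, 1.0, 4.0]          # x ∈ ℝ^3: a vector of 3 real numbers

d = len(x)                    # the dimension d in ℝ^d
x_1 = x[1]                    # x_i with i = 1 (Python counts from 0)
total = sum(x)                # Σ_i x_i
product = math.prod(x)        # Π_i x_i
is_member = 4.0 in x          # ∈ written as Python's `in` (list membership here)

print(d, x_1, total, product, is_member)
```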

Linear Algebra

Vectors, matrices, and dot products.

A vector is a 1-dimensional array of numbers. Like a Python list of floats:

x = [3.0, 1.0, 4.0]   # a vector with 3 entries

The number of entries is the vector's dimension. The example above has dimension 3, so x ∈ ℝ^3 — "x is a vector of 3 real numbers."

A matrix is a 2-dimensional array — a rectangular grid. Like a list of lists:

W = [[1, 2, 3],
     [4, 5, 6]]      # 2 rows, 3 columns

Shape (2, 3). Notation: W ∈ ℝ^{2×3}. In ML, weights are matrices.

Dot product. A single number measuring how "aligned" two vectors are. Multiply matching entries, sum them up.

Concrete example. Take a = [1, 2, 3], b = [4, 5, 6]:

a · b = (1 × 4) + (2 × 5) + (3 × 6) = 4 + 10 + 18 = 32

In Python: sum(a[i] * b[i] for i in range(len(a))). In math notation:

a · b = Σᵢ aᵢ bᵢ

Decoding: Σ = sum, i = index variable, aᵢ = i-th entry of a. So this reads "for every index i, multiply aᵢ × bᵢ and add them up." Exactly what we did.

Why we care: dot products measure similarity. Same direction → big positive. Opposite directions → big negative. Perpendicular → zero. Attention is essentially a giant pile of dot products.
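The similarity claim is easy to check with a few lines of plain Python. This is a small sketch; the helper name dot and the 2-D test vectors are illustrative choices.

```python
def dot(a, b):
    """Dot product: multiply matching entries, then sum."""
    assert len(a) == len(b), "vectors must have the same dimension"
    return sum(ai * bi for ai, bi in zip(a, b))

print(dot([1, 2, 3], [4, 5, 6]))    # the worked example: 32
print(dot([1, 0], [2, 0]))          # same direction: positive (2)
print(dot([1, 0], [-2, 0]))         # opposite directions: negative (-2)
print(dot([1, 0], [0, 3]))          # perpendicular: 0
```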

Linear Algebra

Matrix multiplication, by hand.

This is the operation that powers every modern neural network.

Concrete example. Take two matrices:

A = [[1, 2],     B = [[5, 6],
     [3, 4]]         [7, 8]]

Both are shape (2, 2). The product C = A @ B (we use @ for matmul, same as Python/NumPy) is computed entry-by-entry. Each entry of C is a dot product of a row of A with a column of B.

C[0,0] = [1, 2] · [5, 7] = 5 + 14 = 19
C[0,1] = [1, 2] · [6, 8] = 6 + 16 = 22
C[1,0] = [3, 4] · [5, 7] = 15 + 28 = 43
C[1,1] = [3, 4] · [6, 8] = 18 + 32 = 50

C = [[19, 22],
     [43, 50]]

The general rule. If A is shape (m, k) and B is shape (k, n), then A @ B has shape (m, n). The k's must match — that's the dimension we sum along.

   A    @    B     →     C
(m, k)    (k, n)      (m, n)
    ↑      ↑
    these must match (the k cancels)

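Note: matmul is NOT commutative. A @ B ≠ B @ A in general. Often the shapes don't even allow the reversed product: with A of shape (m, k) and B of shape (k, n), B @ A is defined only when n = m. Even when both products exist, as with two square matrices of the same size, they usually differ. Order matters.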
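To make the rule concrete, here is a from-scratch sketch of matmul on lists of rows. The function name matmul is our own; in practice you would use NumPy's or PyTorch's @ operator.

```python
def matmul(A, B):
    """Multiply an (m, k) matrix by a (k, n) matrix, both given as lists of rows."""
    m, k = len(A), len(A[0])
    k2, n = len(B), len(B[0])
    if k != k2:
        raise ValueError(f"inner dimensions differ: ({m}, {k}) @ ({k2}, {n})")
    # C[i][j] is the dot product of row i of A with column j of B.
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))    # [[19, 22], [43, 50]]
print(matmul(B, A))    # [[23, 34], [31, 46]]
```

The two printed results differ, which is the non-commutativity in action. Passing shapes whose inner dimensions disagree raises a ValueError instead of silently producing garbage.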

Why matmuls dominate ML. A linear layer is y = Wx + b. A transformer is hundreds of these stacked. GPUs are matmul machines — most of the silicon is dedicated to making matrix multiplication fast. An H100 can do nearly 1,000 trillion FP16 operations per second on matmul (the newer Blackwell B200 more than doubles that). Optimizing matmuls is, almost literally, optimizing ML.

Why this matters for inference. At serving time, inference is almost entirely matmul. For every generated token, the model executes a forward pass dominated by matrix multiplications against the weight matrices — typically consuming >90% of wall-clock time. There are two bottlenecks: compute (how fast you can do FLOPs) and memory bandwidth (how fast you can read the weight matrices from GPU HBM). For large models running batch size 1, bandwidth is the bottleneck — the weights are read once per token. Increasing batch size amortizes that cost and shifts back toward compute-bound. Everything in Weeks 3–4 (quantization, KV-cache, continuous batching, FlashAttention) is fundamentally about attacking one of these two bottlenecks.
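A back-of-the-envelope sketch of the bandwidth bottleneck: if every weight must be read from memory once per generated token, the read time alone sets a floor on latency. The numbers below are assumptions chosen for illustration (a 7B-parameter model, FP16 weights, and a bandwidth figure of roughly 3.35 TB/s, in the range of an H100 SXM's spec sheet); the estimate ignores the KV-cache, activations, and any overlap with compute.

```python
def min_seconds_per_token(n_params, bytes_per_param, bandwidth_bytes_per_s):
    """Lower bound on decode time when every weight is read once per token."""
    weight_bytes = n_params * bytes_per_param
    return weight_bytes / bandwidth_bytes_per_s

# Assumed, illustrative numbers (not from the lesson):
N_PARAMS = 7e9          # a 7B-parameter model
BYTES_PER_PARAM = 2     # FP16 weights
BANDWIDTH = 3.35e12     # bytes per second, an assumed HBM bandwidth

t = min_seconds_per_token(N_PARAMS, BYTES_PER_PARAM, BANDWIDTH)
print(f"{t * 1e3:.2f} ms per token, at most {1 / t:.0f} tokens/s at batch size 1")
```

Under these assumptions the floor is a little over 4 ms per token. A batch of B sequences shares the same weight read, so the per-sequence cost of that read drops by roughly a factor of B, which is the amortization described above.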

Calculus

A derivative is a slope. That's it.

A derivative answers one question: if I change the input a tiny bit, how much does the output change? It's the slope.

Concrete example. Take f(x) = x². At x = 3, f(3) = 9. What's the slope right there?

Just measure it:

x = 3.000  →  f(x) = 9.000000
x = 3.001  →  f(x) = 9.006001

Change in f: 0.006001
Change in x: 0.001
Slope ≈ 0.006001 / 0.001 = 6.001

The exact derivative is 6. The general rule: if f(x) = x², then f'(x) = 2x. At x = 3: 2 × 3 = 6. ✓
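The same measurement in code, as a minimal sketch. The helper name slope and the step sizes are illustrative; the point is that the estimate closes in on 6 as the step h shrinks.

```python
def slope(f, x, h=1e-3):
    """Forward-difference estimate of the derivative of f at x."""
    return (f(x + h) - f(x)) / h

def f(x):
    return x ** 2

for h in (1e-1, 1e-3, 1e-5):
    print(h, slope(f, 3.0, h))    # estimates approach 6 as h shrinks
```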

Notation (three ways, same thing):

df/dx — Leibniz way. Most common in ML papers.
f'(x) — Lagrange way. "f prime of x."
(d/dx)[f(x)] — operator way.

Common derivatives to recognize:

If f(x) =	then f'(x) =
c (constant)	0
x	1
x²	2x
xⁿ	n·xⁿ⁻¹
exp(x)	exp(x) (its own derivative!)
log(x)	1/x

Partial derivatives — when functions have multiple inputs. Notation: ∂f/∂x instead of df/dx. The "∂" is a stylized "d" called "partial."

For f(x, y) = x² + 3xy:

∂f/∂x — treat y as a constant: 2x + 3y
∂f/∂y — treat x as a constant: 3x

Gradient. Stack all the partials into one vector:

∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]

Read ∇f as "del f" or "gradient of f." Geometric meaning: it points uphill — the direction of steepest increase. Move opposite the gradient to go downhill — that's gradient descent in one line: params -= learning_rate * gradient.
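Here is a small sketch that checks the partials of f(x, y) = x² + 3xy numerically and then takes one gradient-descent step. The evaluation point (1, 2), the step size h, and the learning rate are arbitrary illustrative choices.

```python
def f(x, y):
    return x ** 2 + 3 * x * y

def grad_f(x, y):
    """Analytic gradient from the partials above: [2x + 3y, 3x]."""
    return [2 * x + 3 * y, 3 * x]

def numeric_grad(fn, x, y, h=1e-6):
    """Central-difference estimate of [∂f/∂x, ∂f/∂y]."""
    dfdx = (fn(x + h, y) - fn(x - h, y)) / (2 * h)
    dfdy = (fn(x, y + h) - fn(x, y - h)) / (2 * h)
    return [dfdx, dfdy]

print(grad_f(1.0, 2.0))             # [8.0, 3.0]
print(numeric_grad(f, 1.0, 2.0))    # approximately the same

# One gradient-descent step: move a small distance opposite the gradient.
lr = 0.01
gx, gy = grad_f(1.0, 2.0)
x_new, y_new = 1.0 - lr * gx, 2.0 - lr * gy
print(f(1.0, 2.0), f(x_new, y_new))    # the second value is smaller
```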

Calculus

The chain rule — by example, then abstract.

The chain rule tells you how to differentiate composed functions. It is the rule that makes backpropagation possible.

Concrete example first. Take y = (3x + 1)². What's dy/dx?

Method 1: expand. y = 9x² + 6x + 1, so dy/dx = 18x + 6 = 6(3x + 1).

Method 2: chain rule. Decompose into two simpler functions:

Let u = 3x + 1  (inner function)  →  du/dx = 3
Then y = u²     (outer function)  →  dy/du = 2u

Multiply: dy/dx = (dy/du) · (du/dx) = 2u · 3 = 6u = 6(3x + 1) ✓ (same as method 1)

The general rule:

If y = f(g(x)), then dy/dx = (df/dg) · (dg/dx)

For longer chains, derivatives just multiply along: dy/dx = (df/dg) · (dg/dh) · (dh/dx).
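A quick numerical check of the chain rule on y = (3x + 1)², as a sketch. The function names are our own; the finite-difference estimate plays the role of an independent referee.

```python
def inner(x):          # u = 3x + 1
    return 3 * x + 1

def outer(u):          # y = u^2
    return u ** 2

def dy_dx_chain(x):
    """Chain rule: dy/dx = (dy/du) * (du/dx) = 2u * 3."""
    u = inner(x)
    dy_du = 2 * u
    du_dx = 3
    return dy_du * du_dx

def dy_dx_numeric(x, h=1e-6):
    """Central-difference estimate of the derivative of outer(inner(x))."""
    return (outer(inner(x + h)) - outer(inner(x - h))) / (2 * h)

print(dy_dx_chain(2.0))      # 42.0
print(dy_dx_numeric(2.0))    # approximately 42
```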

How this becomes backprop. A neural network is a long chain of composed functions:

input → linear1 → ReLU → linear2 → softmax → loss

Each arrow is a function. The whole network is one big composition. Backpropagation is just the chain rule applied to this composition, computed efficiently in reverse order. Day 3 we do it by hand. Day 4 we generalize. Today, just remember: every weight update in every neural network ever trained comes from applying the chain rule.

Probability

Distributions and the softmax function.

A random variable is a quantity whose value is uncertain — like a die roll, or the next token an LLM produces.

A probability distribution is a list of how likely each value is. Two requirements: every probability is in [0, 1], and they all sum to 1.

Example: a fair die.

P(X=1) = 1/6 ≈ 0.167
P(X=2) = 1/6 ≈ 0.167
... (all six values)
                 ─────
            sum = 1.000 ✓

An LLM outputs a distribution like this every step — but with ~100,000 outcomes (one per vocabulary token) instead of 6.

Softmax turns any vector of real numbers (called logits) into a probability distribution.

Concrete example. Logits z = [2.0, 1.0, 0.0]:

Step 1. Exponentiate (makes everything positive):
  exp(2.0) ≈ 7.389
  exp(1.0) ≈ 2.718
  exp(0.0) = 1.000

Step 2. Divide by the total (so they sum to 1):
  total = 7.389 + 2.718 + 1.000 = 11.107
  p[0] = 7.389 / 11.107 ≈ 0.665
  p[1] = 2.718 / 11.107 ≈ 0.245
  p[2] = 1.000 / 11.107 ≈ 0.090
                          ─────
                    sum = 1.000 ✓

The formula:

softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)

The numerical stability trick. If z = [1000, 999, 998], computing exp(1000) overflows. Subtract the max — softmax is invariant to a shared constant:

softmax(z)ᵢ = exp(zᵢ - max(z)) / Σⱼ exp(zⱼ - max(z))

Now the largest exponent input becomes 0 (so exp(0) = 1) and nothing overflows. You'll see this trick again in FlashAttention on Day 21.
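A from-scratch sketch of the stable version in plain Python. For context, math.exp(1000.0) raises an OverflowError in pure Python, while NumPy's exp returns inf with a warning; subtracting the max sidesteps both.

```python
import math

def softmax(z):
    """Numerically stable softmax: subtract max(z) before exponentiating."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 3) for p in softmax([2.0, 1.0, 0.0])])          # [0.665, 0.245, 0.09]
print([round(p, 3) for p in softmax([1000.0, 999.0, 998.0])])   # same values, no overflow
```

Both inputs give the same distribution because the second is just the first shifted by a constant of 998.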

Cross-Entropy

Building it from intuition.

Recall: softmax takes the model's logits and produces a probability distribution. So if the model is choosing among three possible next tokens, softmax might give us:

Q = [0.7, 0.2, 0.1]

(Q for "predicted distribution" — that's the convention.) Meaning: 70% chance of token 0, 20% of token 1, 10% of token 2.

Suppose the correct answer is token 0. How "wrong" was the model?

Let's reason about it intuitively:

If the model gave token 0 a probability of 1.0 → not wrong → loss should be 0.
If it gave token 0 a probability of 0.7 → mostly right → loss should be small.
If it gave token 0 a probability of 0.1 → mostly wrong → loss should be bigger.
If it gave token 0 a probability of 0.001 → confidently wrong → loss should be really big.

So we need a function loss(p) that:

Takes the predicted probability for the correct answer.
Outputs 0 when p = 1.
Grows as p shrinks toward 0.

Logarithm has exactly the shape we want. Look at how log(p) behaves:

log(1.0)   =  0
log(0.7)   ≈ -0.357
log(0.5)   ≈ -0.693
log(0.1)   ≈ -2.303
log(0.01)  ≈ -4.605
log(0.001) ≈ -6.908

(Reminder: log here means natural log — log base e ≈ 2.718. Same as Python's math.log.)

log(1) = 0 exactly. As p heads toward 0, log(p) heads toward negative infinity. Almost what we want — except we need loss high when p is low, but log gives us negative numbers.

Easy fix: flip the sign. Negative log gives us:

-log(1.0)   = 0
-log(0.7)   ≈ 0.357
-log(0.1)   ≈ 2.303
-log(0.001) ≈ 6.908

That's exactly the shape we want. Loss = 0 when the model is perfectly confident in the right answer. Loss grows as the model becomes less confident. Loss explodes when the model is confidently wrong.

That's cross-entropy in its simplest form:

loss = -log(probability the model assigned to the correct answer)

That's the whole core idea. Everything else is dressing.

Cross-Entropy

A worked example, walked slowly.

Let's compute cross-entropy for a real example.

Setup:

Vocabulary has 3 tokens (indices 0, 1, 2).
The correct answer is token 0.
The model predicts: Q = [0.7, 0.2, 0.1].

Step 1. Pick out the probability the model gave to the correct token.

Q[0] = 0.7

(If Q[0] looks unfamiliar: it's just Python-style indexing. The first entry of the list Q.)

Step 2. Take the negative log.

loss = -log(0.7) ≈ 0.357

Done. The loss is 0.357 nats.

(About "nats." When we use natural log, the unit of cross-entropy is called nats. With log base 2 it would be bits. PyTorch uses nats. Doesn't change anything qualitatively.)

The full spectrum of loss values. Here's what cross-entropy looks like across different model predictions, for the same correct answer (token 0):

Model predicts `Q`	`Q[0]`	Loss	Interpretation
`[1.0, 0.0, 0.0]`	1.0	0.000	Perfect — fully confident, fully right
`[0.9, 0.05, 0.05]`	0.9	0.105	Very good
`[0.7, 0.2, 0.1]`	0.7	0.357	Decent
`[0.5, 0.3, 0.2]`	0.5	0.693	Genuinely uncertain
`[0.33, 0.33, 0.34]`	0.33	1.109	Random guessing; with exactly 1/3 the loss is `log(3) ≈ 1.099`
`[0.1, 0.6, 0.3]`	0.1	2.303	Confidently wrong
`[0.001, 0.999, 0]`	0.001	6.908	Very confidently wrong

Key thresholds:

Perfect: loss = 0.
Random guesser over a vocabulary of size V: loss ≈ log(V). For 3 tokens, log(3) ≈ 1.099. For GPT-2 (V=50,257), log(50,257) ≈ 10.82.

You'll actually see this number when you start training a model from scratch. Initial loss should be near log(vocab_size) — the model is essentially guessing. Much higher → bug. Much lower → also probably bug. It's a sanity check you'll use constantly.
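The loss values and the log(V) sanity check can be reproduced in a few lines. This is a sketch; the function name cross_entropy and the uniform-guesser construction are our own illustration.

```python
import math

def cross_entropy(q, correct):
    """Loss in nats: -log of the probability given to the correct index."""
    return -math.log(q[correct])

print(round(cross_entropy([0.7, 0.2, 0.1], 0), 3))    # 0.357
print(round(cross_entropy([0.1, 0.6, 0.3], 0), 3))    # 2.303

# A uniform guesser over V tokens assigns 1/V to every token, so loss = log(V).
V = 50257
uniform = [1.0 / V] * V
print(cross_entropy(uniform, 0), math.log(V))         # the two values agree
```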

Cross-Entropy

The general formula and the training loop.

In ML papers and code, you'll see cross-entropy written like this:

H(P, Q) = -Σᵢ Pᵢ log Qᵢ

Now we can decode it carefully:

H(P, Q) — function name. H is conventional for entropy-related quantities (from Boltzmann's 19th-century H-theorem — long detour).
P — the true distribution. For classification, this is a one-hot vector — all zeros except a 1 at the correct index.
Q — the predicted distribution from the model.
Σᵢ — sum over index i. i runs over every entry of the distributions.
Pᵢ, log Qᵢ — i-th entries. log is natural log.

Reading: "for every index i, multiply Pᵢ by log Qᵢ, sum them all up, then negate."

Why it reduces to our simple form. When P is one-hot, only one entry of P is 1. Everything else gets multiplied by 0 and disappears. Only the term at the correct index c survives:

H(P, Q) = -Σᵢ Pᵢ log Qᵢ = -(0 · log Q₀ + ... + 1 · log Q_c + ... + 0 · log Q_{V-1}) = -log Q_c

Same as our simple -log(probability for the correct answer).
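A short sketch confirming the reduction numerically. The function name is our own, and the code uses the usual convention that a term with Pᵢ = 0 contributes nothing (so it never evaluates log of a zero probability for those terms).

```python
import math

def cross_entropy_full(p, q):
    """H(P, Q) = -sum_i P_i * log(Q_i), skipping terms where P_i = 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

P = [1.0, 0.0, 0.0]        # one-hot: the correct index is c = 0
Q = [0.7, 0.2, 0.1]

print(cross_entropy_full(P, Q))    # same value as -log(Q[0])
print(-math.log(Q[0]))
```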

Connecting back to LLM training. Now we can connect every piece:

The LLM emits logits — raw scores for each possible next token.
Softmax turns logits into probabilities Q.
We have a correct answer — token index c.
Cross-entropy computes loss = -log(Q[c]).
We compute gradients of this loss w.r.t. every parameter (using the chain rule we just learned).
Gradient descent nudges parameters so Q[c] is bigger next time.

Repeat across billions of training examples → you get an LLM. The full training loop in 6 bullets.
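The six steps can be sketched as a toy loop. This is a deliberate simplification, not a real training setup: the three logits themselves are treated as the parameters, there is a single training example, and the learning rate is arbitrary. It relies on the standard result that for softmax followed by cross-entropy, the gradient of the loss with respect to logit i is Qᵢ − Pᵢ.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.0, 0.0, 0.0]    # step 1: raw scores (here they are the parameters)
c = 0                       # step 3: index of the correct token
lr = 0.5                    # learning rate, chosen arbitrarily

losses = []
for step in range(20):
    q = softmax(logits)                      # step 2: probabilities Q
    losses.append(-math.log(q[c]))           # step 4: loss = -log(Q[c])
    # step 5: gradient of the loss w.r.t. each logit is Q_i - P_i (P is one-hot)
    grad = [qi - (1.0 if i == c else 0.0) for i, qi in enumerate(q)]
    # step 6: gradient descent update
    logits = [zi - lr * gi for zi, gi in zip(logits, grad)]

print(round(losses[0], 3), round(losses[-1], 3))    # loss starts at log(3) and falls
```

The first loss equals log(3) because equal logits mean a uniform guess over three tokens. In a real model the gradient continues backward from the logits through every layer via the chain rule.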

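KL divergence (briefly). Cross-entropy compares a true distribution to a predicted one. KL divergence measures how far a distribution Q is from a reference distribution P. Neither has to be a ground-truth label, so it can also compare the outputs of two models: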

KL(P || Q) = Σᵢ Pᵢ log(Pᵢ / Qᵢ)

Always ≥ 0; zero when P and Q are identical. Algebraic identity: KL(P||Q) = H(P,Q) - H(P). With fixed P, minimizing cross-entropy is equivalent to minimizing KL.

Where KL shows up: RLHF and DPO (Day 14). Used to keep a fine-tuned model close to its base — math for "stay close to where you started."
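A minimal sketch that checks the identity KL(P||Q) = H(P,Q) − H(P) numerically. The two distributions are arbitrary examples, and the helper names are our own.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(P || Q) = sum_i P_i * log(P_i / Q_i), skipping terms where P_i = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.3, 0.2]    # arbitrary example distributions
Q = [0.7, 0.2, 0.1]

print(kl(P, Q))                             # positive
print(cross_entropy(P, Q) - entropy(P))     # equals KL(P || Q) up to rounding
print(kl(P, P))                             # 0.0
```

Note that KL is not symmetric: kl(P, Q) and kl(Q, P) generally differ, so the order of the arguments matters.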

Tokenization

What's a tokenizer?

The core problem. Every operation we covered today — matmuls, softmax, cross-entropy — operates on numbers. But LLM input is text. There's no way to multiply a string by a matrix. So before any math happens, text gets converted into a sequence of integers. The thing that does the conversion is the tokenizer.

Three approaches, one winner. You could split text by character (tiny vocab, but sequences become very long — and attention is quadratic in sequence length). You could split by word (short sequences, but vocab explodes and rare words have no IDs). Modern LLMs split by subword using BPE (Byte Pair Encoding) — common words get one ID, rare words get split into chunks. Best of both. Day 5 builds BPE from scratch.

Vocabulary size = V. The fixed list of tokens a tokenizer knows. Same V you'll see below in (B, T, V) logits.

Model	Vocab size `V`
GPT-2	50,257
GPT-3.5 / GPT-4 (`cl100k_base`)	100,277
GPT-4o (`o200k_base`)	200,019
LLaMA 2	32,000
LLaMA 3	128,256
Mistral / Mixtral	32,000
Gemma	256,000

Bigger V = each token carries more meaning, but the embedding table (V, D) and LM head (D, V) grow too. Trade-off, set once at training time.

Properties to remember:

Deterministic + reversible. decode(encode(x)) == x.
Trained, not hand-coded. Learned from a corpus before LLM training begins.
Frozen with the model. Can't swap tokenizers on a trained model — the embedding table is indexed by its tokenizer's IDs.
Rule of thumb. For English, 1 token ≈ ¾ word ≈ 4 chars. So 1,000 tokens ≈ 750 words ≈ 1.5 pages.

Takeaway: a tokenizer is the deterministic bridge from text to integers. It defines V, and it's the first thing that runs — before any math. Day 5 builds one.
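To make the encode/decode contract concrete, here is a toy character-level tokenizer, the first of the three approaches above. The class name CharTokenizer and its tiny corpus are hypothetical; real LLM tokenizers use subword BPE and a far larger vocabulary.

```python
class CharTokenizer:
    """Toy character-level tokenizer (hypothetical; real LLMs use subword BPE)."""

    def __init__(self, corpus):
        # The vocabulary is the fixed, sorted set of characters seen in the corpus.
        self.vocab = sorted(set(corpus))
        self.char_to_id = {ch: i for i, ch in enumerate(self.vocab)}

    def encode(self, text):
        return [self.char_to_id[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.vocab[i] for i in ids)

tok = CharTokenizer("hello world")
ids = tok.encode("hello")
print(len(tok.vocab))      # V = 8 distinct characters
print(ids)                 # one integer per character
print(tok.decode(ids))     # "hello": decode(encode(x)) == x
```

This toy version fails on any character outside its corpus, which is the character-level cousin of the "rare words have no IDs" problem.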

Embeddings

What's an embedding?

The problem with raw integer IDs. The tokenizer hands you a list of integers like [464, 3797, 3332]. But integer 5379 has no useful "geometric" relationship to integer 5380 — they're just different labels. Neural networks need vectors of floats (so they can dot-product, add, multiply by matrices) and they need similar tokens to be near each other in space (so the model can learn that "cat" and "dog" play similar roles). Integers carry neither.

The solution: a lookup table. The model owns a learned matrix called the embedding table, written E, of shape (V, D):

V = vocabulary size (from the tokenizer).
D = hidden dimension (a hyperparameter — GPT-2 uses 768, LLaMA-7B uses 4096).
Row i of E is the vector for token i — a D-dimensional point in space.

Embedding lookup: integer ID i picks row i of E. (T,) ints become (T, D) floats. For a batch, (B, T) → (B, T, D).

Embedding a sequence of token IDs is just fancy indexing:

ids = tokenizer.encode("Hello, world!")    # shape (T,)   e.g. [9906, 11, 1917, 0]
x   = E[ids]                                # shape (T, D)

For a batch, the same lookup turns (B, T) integer IDs into (B, T, D) vectors — and (B, T, D) is the shape of the "residual stream" that flows through every transformer block. This is the only place text-as-integers becomes vectors-of-floats. From here on, everything is matmul.

In PyTorch it's one line:

import torch.nn as nn
embed = nn.Embedding(V, D)    # creates a (V, D) learnable table
x = embed(ids)                # shape (B, T, D)
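If you want to see the lookup without any library, the sketch below does it with nested lists. The sizes, the random values standing in for learned ones, and the token IDs are all made up for illustration.

```python
import random

V, D = 100, 8                     # toy sizes, chosen for illustration
random.seed(0)
# E has V rows and D columns; random values stand in for learned ones.
E = [[random.gauss(0.0, 1.0) for _ in range(D)] for _ in range(V)]

ids = [[15, 11, 95, 12, 31],      # made-up token IDs, each in [0, V)
       [25, 22, 87, 88, 44]]      # shape (B, T) = (2, 5)

# Lookup: every integer ID is replaced by its row of E.
x = [[E[token_id] for token_id in seq] for seq in ids]

shape = (len(x), len(x[0]), len(x[0][0]))
print(shape)    # (2, 5, 8), i.e. (B, T, D)
```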

Embeddings are learned, not designed. When training starts, E is filled with random numbers. With each gradient-descent step, the rows for tokens that appeared in the training batch get nudged. After billions of steps, semantically related tokens end up clustered together in D-dimensional space, and the geometric structure of E carries real meaning. You don't write the embeddings — gradient descent does.

Tying it back to V and D. The embedding table has V × D parameters. For LLaMA-7B (V=32,000, D=4096), that's 131M parameters in the embedding table alone — meaningful but a small slice of the 7B total. Many models share the same matrix between the embedding (input) and the LM head (output) — called weight tying — to save those parameters.

Takeaway: an embedding is a learned D-dim vector for each vocab token. The table E of shape (V, D) is the bridge from token IDs to vectors — the step that turns (B, T) ints into (B, T, D) floats so the rest of the model can do math on them. Day 5 looks at it in detail.

Mental Model

What this last section is about.

You've now learned vectors, matrices, dot products, matrix multiplication, derivatives, the chain rule, softmax, and cross-entropy. You have all the math you need.

Before we close out Day 1, let's do something practical with it: trace data flowing through a real LLM, end-to-end.

The objective. By the end of this section, you should be able to say what shape the data has at every point in an LLM forward pass — from raw text input all the way to a probability distribution over the vocabulary. You should be able to predict, when you see print(x.shape), whether the printed shape makes sense.

Why this matters. When you're reading transformer code (or debugging your own), the single most useful thing you can do is track shapes. Most ML bugs are shape mismatches. Senior ML engineers think in shapes the way senior backend engineers think in HTTP status codes — automatically.

A note before we begin. This section will use words like embedding, transformer block, and LM head that we haven't fully explained yet. That's OK. You don't need to understand what each operation does yet. You just need to understand what each operation does to the shape. By Day 7 every word here will be fully explained. Today, you're installing the shape skeleton — a mental scaffolding the rest of Week 1 will fill in.

The five shape variables you'll see constantly:

Letter	Meaning	Typical value
`B`	Batch — sequences in flight at once	1 to 256
`T`	Sequence length, in tokens	128 to 8192+
`D`	Hidden dimension (vector size per token)	512 to 16384
`V`	Vocabulary size	32k to 256k
`L`	Number of Layers stacked	12 to 80+

A token vector lives in ℝ^D. A whole batch of sequences lives in ℝ^{B×T×D} — three dimensions: which sequence, which token in the sequence, which entry of the token's vector.

A (B, T, D) tensor: B sequences, each with T tokens, each token represented by a D-dim vector.

Mental Model

The forward pass, step by step.

We'll use small numbers so you can hold everything in your head: B=2, T=5, D=8, V=100. (Real models use values hundreds or thousands of times larger, but the structure is identical.)

Imagine two short prompts:

Prompt 1: "Hello world! This is fun"   (5 tokens)
Prompt 2: "The cat sat on mat"           (5 tokens)

Step 0 — Tokenize. Each prompt becomes a list of integer IDs.

ids[0] = [15, 11, 95, 12, 31]    # toy IDs; every ID must be below V = 100
ids[1] = [25, 22, 87, 88, 44]
ids.shape = (2, 5)    # (B, T)

Step 1 — Embedding lookup. The model has an embedding table E of shape (V, D). Each row of E is a learned vector for one token.

E.shape = (100, 8)
x = E[ids]
x.shape = (2, 5, 8)    # (B, T, D) — gained a dim!

Now every integer ID has been replaced by an 8-dim vector. This is the only "lookup" in the forward pass — from here on, everything is matmul.

Step 2 — Add positional encoding. A position-dependent vector is added so the model knows token order.

x.shape = (2, 5, 8)    # unchanged

Step 3 — L transformer blocks. Each block reads x, does attention + feedforward, returns a tensor of the same shape.

for each of L blocks:
    x = block(x)
    # x.shape = (2, 5, 8) at every step — shape never changes!

Examples: GPT-2 small has L=12. LLaMA-7B has L=32. GPT-4 reportedly L≈120.

Step 4 — Final layer norm.

x.shape = (2, 5, 8)    # still unchanged

Step 5 — LM head: project to vocab logits. A single matmul converts each D-dim hidden vector into V scores.

W_lm.shape = (8, 100)    # (D, V)
logits = x @ W_lm         # the matmul we just learned!
logits.shape = (2, 5, 100)    # (B, T, V)

(The same weight matrix is applied independently to every token's D-dim vector, so the leading B and T axes pass through untouched.)

Shape arithmetic for the LM head: (B, T, D) @ (D, V) → (B, T, V). Same matmul rule we computed by hand earlier.

Step 6 — Softmax over the last axis.

probs = softmax(logits, axis=-1)
probs.shape = (2, 5, 100)

Each probs[b, t, :] is now a probability distribution over the vocabulary — telling us how likely each token is to come next.

Step 7 — Pick the next token (during generation).

next_token_probs = probs[:, -1, :]    # last position, shape (B, V)
next_token = next_token_probs.argmax(-1)    # shape (B,)

Picked token gets appended to the input → run again → autoregressive generation.
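The append-and-repeat loop can be sketched with a stand-in model. The function toy_logits below is hypothetical and has nothing to do with a real LLM; it simply scores "previous ID plus one" highest so the output is easy to predict. The sketch takes the argmax over logits rather than probabilities, which picks the same token because softmax preserves ordering.

```python
def toy_logits(ids, V=10):
    """Hypothetical stand-in for a model: returns (T, V) logits.

    It scores token (id + 1) % V highest at each position, so the "model"
    just counts upward. A real LLM would run the full forward pass here.
    """
    return [[1.0 if v == (tid + 1) % V else 0.0 for v in range(V)] for tid in ids]

def generate(ids, n_new):
    ids = list(ids)
    for _ in range(n_new):
        logits = toy_logits(ids)              # (T, V)
        last = logits[-1]                     # last position only, shape (V,)
        next_token = max(range(len(last)), key=last.__getitem__)    # argmax
        ids.append(next_token)                # append, then run again
    return ids

print(generate([3, 4], n_new=4))    # [3, 4, 5, 6, 7, 8]
```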

That's the whole forward pass. Every modern LLM — GPT-4, Claude, LLaMA, Mistral, Gemini, DeepSeek — follows this same skeleton. Different sizes, different details inside the blocks, same shape walk.
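The shape walk can be written down as plain tuple arithmetic, with no tensors at all. This is a sketch; matmul_shape is a helper of our own that applies the (..., k) @ (k, n) → (..., n) rule.

```python
def matmul_shape(a_shape, b_shape):
    """Shape of a @ b when b is a 2-D matrix applied to a's last axis."""
    *lead, k = a_shape
    k2, n = b_shape
    if k != k2:
        raise ValueError(f"inner dimensions differ: {a_shape} @ {b_shape}")
    return (*lead, n)

B, T, D, V = 2, 5, 8, 100

ids_shape = (B, T)                              # Step 0: token IDs
x_shape = (*ids_shape, D)                       # Step 1: embedding lookup adds D
# Steps 2-4 (positions, blocks, final norm) keep the shape (B, T, D).
logits_shape = matmul_shape(x_shape, (D, V))    # Step 5: LM head
probs_shape = logits_shape                      # Step 6: softmax keeps the shape
next_shape = (B,)                               # Step 7: one token per sequence

print(ids_shape, x_shape, logits_shape, probs_shape, next_shape)
```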

The diagram below combines all eight stations (Steps 0 through 7) into a single flow. Outlined boxes mark the only two places where the shape changes inside the model: Step 1 (embedding lookup) and Step 5 (LM head). Steps 2 through 4 and the softmax in Step 6 preserve the shape.

The full Day-1 forward pass. Two shape transitions: (B,T) → (B,T,D) at embedding, and (B,T,D) → (B,T,V) at the LM head. Everything in between is shape-preserving.

The three "big" shapes to watch:

(B, T, D) — the "residual stream." Almost every intermediate tensor has this shape. D is the model's "channel count."
(D, V) — the LM head matrix. Maps hidden state → vocab logits. For LLaMA-7B (D=4096, V=32000), 131M parameters in the LM head alone.
(B, h, T, T) — attention scores (Day 6, h = number of heads). The T × T means every token has a similarity score with every other token. Quadratic in T — what FlashAttention attacks on Day 21.
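To put a number on "quadratic in T", the sketch below counts the entries of a (B, h, T, T) score tensor at two sequence lengths. The head count h = 12 and the lengths are illustrative choices.

```python
def attention_score_count(B, h, T):
    """Number of entries in a (B, h, T, T) attention-score tensor."""
    return B * h * T * T

# h = 12 heads is an illustrative choice, not a value from the lesson.
short = attention_score_count(B=1, h=12, T=1_000)
long = attention_score_count(B=1, h=12, T=10_000)
print(short, long, long // short)    # a 10x longer sequence needs 100x the scores
```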

Connecting back to loss. Now we can locate where cross-entropy fits in the pipeline:

ids → embeddings → blocks → logits → softmax → probs → cross-entropy
                                                             │
                                                             ▼
                                                           loss

During training: every position has a "correct next token" — cross-entropy computes loss for each, average them, backprop, update parameters.

During inference: no correct answer — argmax (or sample) the logits at the last position, append, repeat.

Same model. Same forward pass. Different post-processing.
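The training-time post-processing can be sketched as an average of per-position losses. The probabilities and targets below are toy values (B=1, T=2, V=3), and the function name is our own.

```python
import math

def mean_cross_entropy(probs, targets):
    """Average -log(prob of the correct token) over all (b, t) positions."""
    losses = [-math.log(probs[b][t][targets[b][t]])
              for b in range(len(probs)) for t in range(len(probs[b]))]
    return sum(losses) / len(losses)

# Toy values: B=1 sequence, T=2 positions, V=3 tokens.
probs = [[[0.7, 0.2, 0.1],
          [0.1, 0.6, 0.3]]]
targets = [[0, 1]]    # correct next token at each position

print(round(mean_cross_entropy(probs, targets), 3))    # 0.434
```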

Debugging mantra: most ML bugs are shape mismatches. Add print(x.shape) liberally.

Exercise

Compute it by hand, then in code.

Companion notebook: day-1-math.ipynb. Type the code; don't copy-paste. The point is to feel the numbers and the shapes.

Dot product and matmul by hand. Compute [1, 2, 3] · [4, 5, 6] on paper, then verify with NumPy. Multiply A = [[1, 2], [3, 4]] by B = [[5, 6], [7, 8]] entry by entry, and confirm that A @ B ≠ B @ A.
Derivative as a slope. For f(x) = x², estimate the slope at x = 3 by finite difference ((f(3.001) − f(3)) / 0.001) and check it approaches the exact value 2x = 6.
Chain rule. For y = (3x + 1)², compute dy/dx two ways — expand first, then via the chain rule — and confirm they agree at x = 2 (you should get 42).
Softmax from scratch. Implement softmax([2.0, 1.0, 0.0]) and confirm the output sums to 1. Then feed it [1000, 999, 998], watch it overflow, and fix it with the max-subtraction trick.
Cross-entropy. For the prediction Q = [0.7, 0.2, 0.1] with correct answer token 0, compute −log(Q[0]). Then build the full table of loss values from the lesson and confirm a random guesser over V tokens scores about log(V).
Trace the shapes. Using B=2, T=5, D=8, V=100, write out the shape after every station of the forward pass — from (B, T) token IDs to (B, T, V) logits — and mark the only two steps where the shape changes.

Self-Check

Nine questions before moving on.

Close the page and answer from memory. If you can't, re-read the relevant section.

In x ∈ ℝ^d, what does each of x, ∈, ℝ, and ^d mean? Translate each symbol to Python.
Why do we subtract the max in softmax? What goes wrong without it, and why does subtracting max(z) not change the output?
Why is matrix multiplication not commutative? Give a shape-based argument — when would A @ B and B @ A both be legal, and do they give the same result?
Compute by hand: softmax([1, 0, -1]). Sanity check that the answer sums to 1. (Hint: exp(1) ≈ 2.718, exp(0) = 1, exp(-1) ≈ 0.368.)
For an LLM with input shape (B, T), walk through how it becomes (B, T, V) logits. At which two steps does the shape change, and what are the input and output shapes at each?
Why is character-level tokenization expensive at inference time? (Hint: which tensor in the forward pass is quadratic in T? What does that mean for a 10× longer sequence?)
Why can't an LLM operate directly on raw integer token IDs? What does the embedding table give you that integers don't, what shape is it, and what shape comes out of the lookup?
A model generates tokens one at a time. At each step it runs the full forward pass and picks probs[:, -1, :].argmax(-1). Why the last position? Why not position 0?
An inference engineer says "this model is memory-bandwidth-bound at batch size 1." What does that mean in terms of the matmul operations happening at each token step, and what would you do to shift it toward compute-bound?

Go deeper.

3Blue1Brown — Essence of Linear Algebra

3Blue1Brown — Essence of Calculus

Mathematics for Machine Learning

Goodfellow et al. — Deep Learning

CS229 Linear Algebra Review

The Matrix Cookbook