You don't need to be a mathematician — you need to be fluent with a small set of operations. The way you're fluent with for-loops. Every symbol decoded the first time it appears.
Every modern LLM is built from a surprisingly small set of mathematical operations, repeated at enormous scale. You do not need a degree in mathematics to understand them. You need to be fluent with a handful of ideas — the way you are fluent with for-loops and dictionaries — so that when you read transformer code or a research paper, the symbols read as plainly as Python.
Today we build that fluency from the ground up. We decode every symbol the first time it appears, work each idea through a concrete numerical example before stating the general rule, and connect each piece directly to where it shows up inside an LLM. By the end you will have traced data all the way through a forward pass — from raw text to a probability distribution over the vocabulary — thinking in shapes the whole way.
This is the foundation the entire rest of Week 1 stands on. Take your time with it.
If you've never read a math paper, the symbols can look intimidating. They aren't. Here's everything you'll see in this lesson.
Variable names, just like in code:
- x, y, a, W, θ: names. Lowercase = number or vector. Uppercase = matrix. Greek letters = often parameters or angles.

Sets of numbers:
- ℝ: "the reals." All real numbers (Python float). So 5.0, -3.14, π are in ℝ.
- ℕ: natural numbers (0, 1, 2, ...). Python int (non-negative).
- ℤ: all integers, positive and negative.

Set membership:
- ∈: "is in" / "belongs to." Same as Python's in.
- x ∈ ℝ: "x is a real number."
- x ∈ ℝ^d: "x is a vector of d real numbers."
- W ∈ ℝ^{m×n}: "W is an m-by-n matrix."

Subscripts and big operators:
- xᵢ ("x sub i"): the i-th entry of vector x.
- Σ (sigma): sum. Same as Python sum(...).
- Π (pi): product. Same as math.prod(...).

Calculus and probability:
- df/dx: derivative of f w.r.t. x. The slope of f at x.
- ∂f/∂x ("partial f, partial x"): partial derivative when f has multiple inputs.
- ∇f ("del f" or "gradient of f"): vector of all partial derivatives.
- 𝔼[X]: expected value (mean).
- exp(x) = e^x, log(x) = natural log.
- ≈: approximately equals.

You don't need to memorize this. As we encounter each symbol below, it'll be re-explained in context.
The notation we use for matrix multiplication (with subscripts: A_ij) was introduced by Arthur Cayley in 1858, nearly 90 years before the first electronic computers.
A vector is a 1-dimensional array of numbers. Like a Python list of floats:
x = [3.0, 1.0, 4.0] # a vector with 3 entries
The number of entries is the vector's dimension. The example above has dimension 3, so x ∈ ℝ^3 — "x is a vector of 3 real numbers."
A matrix is a 2-dimensional array — a rectangular grid. Like a list of lists:
W = [[1, 2, 3],
     [4, 5, 6]]   # 2 rows, 3 columns
Shape (2, 3). Notation: W ∈ ℝ^{2×3}. In ML, weights are matrices.
Dot product. A single number measuring how "aligned" two vectors are. Multiply matching entries, sum them up.
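Concrete example. Take a = [1, 2, 3], b = [4, 5, 6]:

$$a \cdot b = 1 \cdot 4 + 2 \cdot 5 + 3 \cdot 6 = 4 + 10 + 18 = 32$$

In Python: sum(a[i] * b[i] for i in range(len(a))). In math notation:

$$a \cdot b = \sum_i a_i b_i$$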
Decoding: Σ = sum, i = index variable, aᵢ = i-th entry of a. So this reads "for every index i, multiply aᵢ × bᵢ and add them up." Exactly what we did.
Why we care: dot products measure similarity. Same direction → big positive. Opposite directions → big negative. Perpendicular → zero. Attention is essentially a giant pile of dot products.
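To make the three similarity cases concrete, here is a minimal pure-Python sketch. The helper name `dot` and the extra test vectors are illustrative choices, not part of the lesson's notebook.

```python
def dot(a, b):
    """Multiply matching entries and add up the results."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

a = [1, 2, 3]
b = [4, 5, 6]

same_direction = dot(a, b)             # 1*4 + 2*5 + 3*6
opposite = dot(a, [-x for x in a])     # a against its own negation
perpendicular = dot([1, 0], [0, 1])    # the two axes of a 2-D plane

print(same_direction, opposite, perpendicular)  # 32 -14 0
```

Roughly aligned vectors give a large positive number, a vector against its negation gives a negative one, and perpendicular vectors give exactly zero.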
The word "softmax" was coined because it's a soft version of argmax — instead of crisply picking the maximum, it gives a smooth probability distribution. Lower temperature recovers argmax; higher temperature approaches uniform.
Matrix multiplication is the operation that powers every modern neural network.
Concrete example. Take two matrices:
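A = [[1, 2],
     [3, 4]]

B = [[5, 6],
     [7, 8]]

Both are shape (2, 2). The product C = A @ B (we use @ for matmul, same as Python/NumPy) is computed entry-by-entry. Each entry of C is a dot product of a row of A with a column of B:

$$C = \begin{bmatrix} 1\cdot5 + 2\cdot7 & 1\cdot6 + 2\cdot8 \\ 3\cdot5 + 4\cdot7 & 3\cdot6 + 4\cdot8 \end{bmatrix} = \begin{bmatrix} 19 & 22 \\ 43 & 50 \end{bmatrix}$$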
The general rule. If A is shape (m, k) and B is shape (k, n), then A @ B has shape (m, n). The k's must match — that's the dimension we sum along.
Note: matmul is NOT commutative. A @ B ≠ B @ A in general. Often the shapes won't even allow the reversed product: if A is (m, k) and B is (k, n), then B @ A is only defined when n = m. Order matters.
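The shape rule and the order-matters warning can both be checked with a few lines of plain Python. This is a teaching sketch: the `matmul` helper and the matrices X and Y are made up for illustration, and in real code you would use NumPy's `@`.

```python
def matmul(A, B):
    """Multiply an (m, k) matrix by a (k, n) matrix, both given as lists of rows."""
    m, k, n = len(A), len(B), len(B[0])
    if any(len(row) != k for row in A):
        raise ValueError("inner dimensions must match: (m, k) @ (k, n)")
    # Entry (i, j) is the dot product of row i of A with column j of B.
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
AB = matmul(A, B)   # [[19, 22], [43, 50]]
BA = matmul(B, A)   # [[23, 34], [31, 46]]
print(AB, BA, AB == BA)

X = [[1, 0, 2], [0, 1, 3]]                          # shape (2, 3)
Y = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]   # shape (3, 4)
XY = matmul(X, Y)                                   # shape (2, 4)
try:
    matmul(Y, X)        # (3, 4) @ (2, 3): inner dimensions 4 and 2 differ
    reversed_ok = True
except ValueError:
    reversed_ok = False
```

For the square pair both orders are legal but give different results; for the rectangular pair the reversed order is not even defined.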
Why matmuls dominate ML. A linear layer is y = Wx + b. A transformer is hundreds of these stacked. GPUs are matmul machines — most of the silicon is dedicated to making matrix multiplication fast. An H100 can do nearly 1,000 trillion FP16 operations per second on matmul (the newer Blackwell B200 more than doubles that). Optimizing matmuls is, almost literally, optimizing ML.
Why this matters for inference. At serving time, inference is almost entirely matmul. For every generated token, the model executes a forward pass dominated by matrix multiplications against the weight matrices — typically consuming >90% of wall-clock time. There are two bottlenecks: compute (how fast you can do FLOPs) and memory bandwidth (how fast you can read the weight matrices from GPU HBM). For large models running batch size 1, bandwidth is the bottleneck — the weights are read once per token. Increasing batch size amortizes that cost and shifts back toward compute-bound. Everything in Weeks 3–4 (quantization, KV-cache, continuous batching, FlashAttention) is fundamentally about attacking one of these two bottlenecks.
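A back-of-the-envelope sketch of the two bottlenecks follows. Every number in it is an assumption chosen for illustration: a hypothetical 7B-parameter model in FP16, the roughly 1,000 trillion ops/s figure quoted earlier, an assumed memory bandwidth of 3.35 TB/s, and the common approximation of about 2 FLOPs per parameter per token. It ignores the KV-cache, activations, and all overheads.

```python
# All hardware and model numbers below are illustrative assumptions.
PARAMS = 7e9              # parameter count of a hypothetical 7B model
BYTES_PER_PARAM = 2       # FP16 weights
PEAK_FLOPS = 1e15         # ~1,000 trillion FP16 ops/s
HBM_BANDWIDTH = 3.35e12   # bytes/s, assumed memory bandwidth

def step_times(batch_size):
    """Rough time for one decoding step: (read all weights once, do the matmul math)."""
    memory_time = PARAMS * BYTES_PER_PARAM / HBM_BANDWIDTH
    # Approximation: one multiply and one add per parameter per token.
    compute_time = 2 * PARAMS * batch_size / PEAK_FLOPS
    return memory_time, compute_time

for batch_size in (1, 32, 512):
    mem, comp = step_times(batch_size)
    bound = "bandwidth" if mem > comp else "compute"
    print(f"B={batch_size:>3}  memory={mem * 1e3:.2f} ms  "
          f"compute={comp * 1e3:.2f} ms  -> {bound}-bound")
```

Under these assumptions the weight read dominates at batch size 1, and the math only catches up once the batch reaches the hundreds, which is the amortization effect described above.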
The Greek letter Σ (sigma) for sum was first used by Leonhard Euler in 1755. Most of the math powering modern AI is older than the steam engine.
A derivative answers one question: if I change the input a tiny bit, how much does the output change? It's the slope.
Concrete example. Take f(x) = x². At x = 3, f(3) = 9. What's the slope right there?
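Just measure it: nudge x from 3 to 3.001 and see how far f moves.

$$\frac{f(3.001) - f(3)}{0.001} = \frac{9.006001 - 9}{0.001} = 6.001$$

Shrink the nudge and the measured slope gets closer and closer to 6.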
The exact derivative is 6. The general rule: if f(x) = x², then f'(x) = 2x. At x = 3: 2 × 3 = 6. ✓
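The "just measure it" idea is a finite difference. A small sketch, with the helper name `slope` and the step sizes chosen only for illustration:

```python
def f(x):
    return x ** 2

def slope(f, x, h):
    """Forward finite difference: rise over run across a small step h."""
    return (f(x + h) - f(x)) / h

for h in (0.1, 0.01, 0.001, 0.0001):
    print(f"h={h:<7} slope ~ {slope(f, 3.0, h):.6f}")

estimate = slope(f, 3.0, 0.001)
```

As h shrinks, the estimate approaches the exact value f'(3) = 6. For this particular f the estimate is 6 + h, up to floating-point rounding.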
Notation (three ways, same thing):
- df/dx: Leibniz way. Most common in ML papers.
- f'(x): Lagrange way. "f prime of x."
- (d/dx)[f(x)]: operator way.

Common derivatives to recognize:
| If f(x) = | then f'(x) = |
|---|---|
| c (constant) | 0 |
| x | 1 |
| x² | 2x |
| xⁿ | n·xⁿ⁻¹ |
| exp(x) | exp(x) (its own derivative!) |
| log(x) | 1/x |
Partial derivatives — when functions have multiple inputs. Notation: ∂f/∂x instead of df/dx. The "∂" is a stylized "d" called "partial."
For f(x, y) = x² + 3xy:
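- ∂f/∂x (treat y as a constant): 2x + 3y
- ∂f/∂y (treat x as a constant): 3x

Gradient. Stack all the partials into one vector:

$$\nabla f = \begin{bmatrix} \partial f / \partial x \\ \partial f / \partial y \end{bmatrix} = \begin{bmatrix} 2x + 3y \\ 3x \end{bmatrix}$$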
Read ∇f as "del f" or "gradient of f." Geometric meaning: it points uphill — the direction of steepest increase. Move opposite the gradient to go downhill — that's gradient descent in one line: params -= learning_rate * gradient.
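Here is a small sketch that checks the gradient of f(x, y) = x² + 3xy numerically and then takes a single gradient-descent step. The starting point (1, 2) and the learning rate 0.01 are arbitrary illustrative choices. Only one step is shown because this f has no minimum, so repeated steps would not settle anywhere.

```python
def f(x, y):
    return x ** 2 + 3 * x * y

def grad_f(x, y):
    """Analytic gradient: [df/dx, df/dy] = [2x + 3y, 3x]."""
    return [2 * x + 3 * y, 3 * x]

def numeric_grad(f, x, y, h=1e-6):
    """Central differences: nudge one input at a time, hold the other fixed."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return [dfdx, dfdy]

params = [1.0, 2.0]
gradient = grad_f(*params)        # [8.0, 3.0]
approx = numeric_grad(f, *params)

learning_rate = 0.01
# params -= learning_rate * gradient, written out for a plain list
new_params = [p - learning_rate * g for p, g in zip(params, gradient)]

print(gradient, approx)
print(f(*params), "->", f(*new_params))   # the value goes down after one step
```

The numeric and analytic gradients agree, and stepping against the gradient lowers f, which is the whole mechanism behind training.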
The chain rule was first written down by Leibniz in 1676. It would take 300 years for it to become "the most important rule in machine learning" — but it was waiting there the whole time.
The chain rule tells you how to differentiate composed functions. It is the rule that makes backpropagation possible.
Concrete example first. Take y = (3x + 1)². What's dy/dx?
Method 1: expand. y = 9x² + 6x + 1, so dy/dx = 18x + 6 = 6(3x + 1).
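Method 2: chain rule. Decompose into two simpler functions: an inner g(x) = 3x + 1 and an outer f(g) = g², so that y = f(g(x)). Differentiate each piece, then multiply:

$$\frac{df}{dg} = 2g, \qquad \frac{dg}{dx} = 3, \qquad \frac{dy}{dx} = 2g \cdot 3 = 6(3x + 1)$$

Same answer as Method 1.

The general rule, for y = f(g(x)):

$$\frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx}$$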
For longer chains, derivatives just multiply along: dy/dx = (df/dg) · (dg/dh) · (dh/dx).
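A quick way to convince yourself is to compute dy/dx for y = (3x + 1)² three ways and compare. This sketch uses illustrative function names and checks the point x = 2.

```python
def g(x):            # inner function
    return 3 * x + 1

def f(u):            # outer function
    return u ** 2

def dy_dx_chain(x):
    """Chain rule: (df/dg evaluated at g(x)) times (dg/dx)."""
    df_dg = 2 * g(x)
    dg_dx = 3
    return df_dg * dg_dx

def dy_dx_expanded(x):
    """Derivative of the expanded form 9x^2 + 6x + 1."""
    return 18 * x + 6

x = 2
h = 1e-6
numeric = (f(g(x + h)) - f(g(x - h))) / (2 * h)   # central difference
print(dy_dx_chain(x), dy_dx_expanded(x), round(numeric, 4))
```

All three agree at 42. Backprop does the same multiplication of local derivatives, just over a much longer chain.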
How this becomes backprop. A neural network is a long chain of composed functions:

input → layer 1 → layer 2 → ... → output → loss
Each arrow is a function. The whole network is one big composition. Backpropagation is just the chain rule applied to this composition, computed efficiently in reverse order. Day 3 we do it by hand. Day 4 we generalize. Today, just remember: every weight update in every neural network ever trained comes from applying the chain rule.
KL divergence is named after Solomon Kullback and Richard Leibler, who introduced it in a 1951 statistics paper, "On Information and Sufficiency." About 70 years later it quietly became a core ingredient of modern RLHF.
A random variable is a quantity whose value is uncertain — like a die roll, or the next token an LLM produces.
A probability distribution is a list of how likely each value is. Two requirements: every probability is in [0, 1], and they all sum to 1.
Example: a fair die.
P(X=1) = 1/6 ≈ 0.167
P(X=2) = 1/6 ≈ 0.167
... (all six values)
─────
sum = 1.000 ✓
An LLM outputs a distribution like this every step — but with ~100,000 outcomes (one per vocabulary token) instead of 6.
Softmax turns any vector of real numbers (called logits) into a probability distribution.
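Concrete example. Logits z = [2.0, 1.0, 0.0]:

1. Exponentiate each entry: exp(2.0) ≈ 7.389, exp(1.0) ≈ 2.718, exp(0.0) = 1.000.
2. Sum them: 7.389 + 2.718 + 1.000 ≈ 11.107.
3. Divide each by the sum: [0.665, 0.245, 0.090].

Every entry is positive and they sum to 1, so the result is a valid probability distribution.

The formula:

$$\mathrm{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)}$$

The numerical stability trick. If z = [1000, 999, 998], computing exp(1000) overflows. Subtract the max first; softmax is invariant to adding the same constant to every logit:

$$\mathrm{softmax}(z)_i = \frac{\exp(z_i - \max(z))}{\sum_j \exp(z_j - \max(z))}$$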
Now the largest exponent input becomes 0 (so exp(0) = 1) and nothing overflows. You'll see this trick again in FlashAttention on Day 21.
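The overflow and its fix can be reproduced with the standard library alone. In this sketch the function names are illustrative. Note that Python's `math.exp` raises `OverflowError` on huge inputs, whereas NumPy would instead return `inf` and then `nan`.

```python
import math

def softmax_naive(z):
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax(z):
    """Subtract max(z) first so the largest exponent input is 0."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 3) for p in softmax([2.0, 1.0, 0.0])])   # [0.665, 0.245, 0.09]

try:
    softmax_naive([1000, 999, 998])
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True      # math.exp(1000) is too large for a float

stable = softmax([1000, 999, 998])
```

Because only the differences between logits matter, `stable` equals the result for [2.0, 1.0, 0.0]: shifting every logit by the same amount changes nothing.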
Before we look at cross-entropy, let's zoom out and remember what we're actually doing.
The big picture. An LLM is a function: given some text, predict what comes next. To learn that function, the LLM has to be trained. Training means: we show it examples ("the cat sat on the ___") with the correct answer ("mat"), and we adjust its parameters so it gets better at predicting the right answer.
But to "adjust parameters to get better," we first need to measure how wrong the model is right now. We need a number.
That number is called the loss.
The loss is what we compute gradients of (using the chain rule from a few sections back) and what we minimize using gradient descent. Backprop and gradient descent, the tools we built up, exist exactly to drive this loss number toward zero.
So our question becomes: given a model's predicted probability distribution and the correct answer, how do we compute the loss?
That's what cross-entropy does. Let's build it from scratch — no formulas yet, just intuition.
Recall: softmax takes the model's logits and produces a probability distribution. So if the model is choosing among three possible next tokens, softmax might give us:
Q = [0.7, 0.2, 0.1]
(Q for "predicted distribution" — that's the convention.) Meaning: 70% chance of token 0, 20% of token 1, 10% of token 2.
Suppose the correct answer is token 0. How "wrong" was the model?
Let's reason about it intuitively:
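If the model gave the correct token a probability of:
- 1.0 → not wrong → loss should be 0.
- 0.7 → mostly right → loss should be small.
- 0.1 → mostly wrong → loss should be bigger.
- 0.001 → confidently wrong → loss should be really big.

So we need a function loss(p) that:
- Equals 0 when p = 1.
- Grows as p shrinks toward 0.

Logarithm has exactly the shape we want. Look at how log(p) behaves:

| p | log(p) |
|---|---|
| 1.0 | 0.000 |
| 0.7 | −0.357 |
| 0.1 | −2.303 |
| 0.001 | −6.908 |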
(Reminder: log here means natural log — log base e ≈ 2.718. Same as Python's math.log.)
log(1) = 0 exactly. As p heads toward 0, log(p) heads toward negative infinity. Almost what we want — except we need loss high when p is low, but log gives us negative numbers.
Easy fix: flip the sign. Negative log gives us:

| p | −log(p) |
|---|---|
| 1.0 | 0.000 |
| 0.7 | 0.357 |
| 0.1 | 2.303 |
| 0.001 | 6.908 |
That's exactly the shape we want. Loss = 0 when the model is perfectly confident in the right answer. Loss grows as the model becomes less confident. Loss explodes when the model is confidently wrong.
That's cross-entropy in its simplest form:

$$\text{loss} = -\log(\text{probability the model gave to the correct answer})$$
That's the whole core idea. Everything else is dressing.
Let's compute cross-entropy for a real example.
Setup:
- The model predicts Q = [0.7, 0.2, 0.1].
- The correct answer is token 0.

Step 1. Pick out the probability the model gave to the correct token.
Q[0] = 0.7
(If Q[0] looks unfamiliar: it's just Python-style indexing. The first entry of the list Q.)
Step 2. Take the negative log.
loss = -log(0.7) ≈ 0.357
Done. The loss is 0.357 nats.
(About "nats." When we use natural log, the unit of cross-entropy is called nats. With log base 2 it would be bits. PyTorch uses nats. Doesn't change anything qualitatively.)
The full spectrum of loss values. Here's what cross-entropy looks like across different model predictions, for the same correct answer (token 0):
| Model predicts Q | Q[0] | Loss | Interpretation |
|---|---|---|---|
| [1.0, 0.0, 0.0] | 1.0 | 0.000 | Perfect: fully confident, fully right |
| [0.9, 0.05, 0.05] | 0.9 | 0.105 | Very good |
| [0.7, 0.2, 0.1] | 0.7 | 0.357 | Decent |
| [0.5, 0.3, 0.2] | 0.5 | 0.693 | Genuinely uncertain |
| [1/3, 1/3, 1/3] | 0.333 | 1.099 | Random guessing; note log(3) ≈ 1.099 |
| [0.1, 0.6, 0.3] | 0.1 | 2.303 | Confidently wrong |
| [0.001, 0.999, 0] | 0.001 | 6.908 | Very confidently wrong |
Key thresholds:
- Perfect prediction (probability 1 on the correct token): loss = 0.
- Random guessing over a vocabulary of size V: loss ≈ log(V). For 3 tokens, log(3) ≈ 1.099. For GPT-2 (V=50,257), log(50,257) ≈ 10.82.

You'll actually see this number when you start training a model from scratch. Initial loss should be near log(vocab_size), because the model is essentially guessing. Much higher → bug. Much lower → also probably a bug. It's a sanity check you'll use constantly.
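The loss values and the log(V) sanity check are easy to reproduce. A minimal sketch with illustrative helper names follows; the uniform row uses exactly 1/3 so that the loss lands on log(3).

```python
import math

def cross_entropy(Q, correct):
    """Loss in nats for a predicted distribution Q and the correct index."""
    # "+ 0.0" turns the -0.0 you get at Q[correct] == 1 into a plain 0.0.
    return -math.log(Q[correct]) + 0.0

predictions = [
    [1.0, 0.0, 0.0],
    [0.9, 0.05, 0.05],
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [1 / 3, 1 / 3, 1 / 3],
    [0.1, 0.6, 0.3],
    [0.001, 0.999, 0.0],
]
losses = [cross_entropy(Q, correct=0) for Q in predictions]
for Q, loss in zip(predictions, losses):
    print(f"Q[0]={Q[0]:.3f}  loss={loss:.3f}")

def uniform_loss(V):
    """Loss of a model that spreads probability evenly over V tokens."""
    return cross_entropy([1 / V] * V, correct=0)

print(round(uniform_loss(3), 3), round(uniform_loss(50257), 2))
```

If a freshly initialized model reports a loss far from `uniform_loss(vocab_size)`, that is a hint to look for a bug before training further.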
In ML papers and code, you'll see cross-entropy written like this:

$$H(P, Q) = -\sum_i P_i \log Q_i$$

Now we can decode it carefully:
- H(P, Q): function name. H is conventional for entropy-related quantities (from Boltzmann's 19th-century H-theorem; long detour).
- P: the true distribution. For classification, this is a one-hot vector: all zeros except a 1 at the correct index.
- Q: the predicted distribution from the model.
- Σᵢ: sum over index i. i runs over every entry of the distributions.
- Pᵢ, log Qᵢ: i-th entries. log is natural log.

Reading: "for every index i, multiply Pᵢ by log Qᵢ, sum them all up, then negate."

Why it reduces to our simple form. When P is one-hot, only one entry of P is 1. Everything else gets multiplied by 0 and disappears. Only the term at the correct index c survives:

$$H(P, Q) = -\sum_i P_i \log Q_i = -1 \cdot \log Q_c = -\log Q_c$$
Same as our simple -log(probability for the correct answer).
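Connecting back to LLM training. Now we can connect every piece:
- The model reads the input text and produces logits.
- Softmax turns the logits into a predicted distribution Q.
- The training data supplies the correct next token, at index c.
- Compute loss = -log(Q[c]).
- Backpropagation (the chain rule) gives the gradient of the loss with respect to every parameter.
- Gradient descent nudges the parameters so that Q[c] is bigger next time.

Repeat across billions of training examples → you get an LLM. The full training loop in 6 bullets.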
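KL divergence (briefly). Cross-entropy scores a predicted distribution against the true one. KL divergence measures how far one distribution Q is from another distribution P, and in LLM work both are often model outputs:

$$\mathrm{KL}(P \,\|\, Q) = \sum_i P_i \log \frac{P_i}{Q_i}$$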
Always ≥ 0; zero when P and Q are identical. Algebraic identity: KL(P||Q) = H(P,Q) - H(P). With fixed P, minimizing cross-entropy is equivalent to minimizing KL.
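The identity KL(P||Q) = H(P,Q) - H(P) can be verified numerically. In this sketch the distribution P = [0.5, 0.3, 0.2] is an arbitrary example and the function names are illustrative. Terms with Pᵢ = 0 are skipped, following the usual convention that 0 · log 0 counts as 0.

```python
import math

def entropy(P):
    return -sum(p * math.log(p) for p in P if p > 0)

def cross_entropy(P, Q):
    return -sum(p * math.log(q) for p, q in zip(P, Q) if p > 0)

def kl_divergence(P, Q):
    return sum(p * math.log(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.3, 0.2]
Q = [0.7, 0.2, 0.1]

kl = kl_divergence(P, Q)
identity_gap = kl - (cross_entropy(P, Q) - entropy(P))   # should be ~0

# With a one-hot P, H(P) = 0, so KL and cross-entropy coincide.
one_hot = [1.0, 0.0, 0.0]
print(round(kl, 4), round(kl_divergence(one_hot, Q), 3))
```

That last line is why, for one-hot targets, minimizing cross-entropy and minimizing KL are the same thing.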
Where KL shows up: RLHF and DPO (Day 14). Used to keep a fine-tuned model close to its base — math for "stay close to where you started."
The core problem. Every operation we covered today — matmuls, softmax, cross-entropy — operates on numbers. But LLM input is text. There's no way to multiply a string by a matrix. So before any math happens, text gets converted into a sequence of integers. The thing that does the conversion is the tokenizer.
Three approaches, one winner. You could split text by character (tiny vocab, but sequences become very long — and attention is quadratic in sequence length). You could split by word (short sequences, but vocab explodes and rare words have no IDs). Modern LLMs split by subword using BPE (Byte Pair Encoding) — common words get one ID, rare words get split into chunks. Best of both. Day 5 builds BPE from scratch.
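The trade-off between the first two approaches shows up even on a single sentence. Below is a toy sketch with a made-up example sentence and hand-rolled vocabularies. It is not BPE and not any real tokenizer's API.

```python
text = "the cat sat on the mat"

# Character-level: tiny vocabulary, long sequence.
char_vocab = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(char_vocab)}
char_ids = [char_to_id[ch] for ch in text]

# Word-level: short sequence, but every new word needs a new ID.
words = text.split(" ")
word_vocab = sorted(set(words))
word_to_id = {w: i for i, w in enumerate(word_vocab)}
word_ids = [word_to_id[w] for w in words]

print(len(char_vocab), len(char_ids))   # 10 22
print(len(word_vocab), len(word_ids))   # 5 6

# Decoding inverts the lookup, so encode -> decode returns the original text.
decoded = "".join(char_vocab[i] for i in char_ids)

# A word that never appeared in the vocabulary has no ID at all.
unknown_word_has_id = "dog" in word_to_id
```

Characters give nearly four times as many tokens here, while the word vocabulary cannot represent "dog". Subword tokenization sits between the two.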
Vocabulary size = V. The fixed list of tokens a tokenizer knows. Same V you'll see below in (B, T, V) logits.
| Model | Vocab size V |
|---|---|
| GPT-2 | 50,257 |
| GPT-3.5 / GPT-4 (cl100k_base) | 100,277 |
| GPT-4o (o200k_base) | 200,019 |
| LLaMA 2 | 32,000 |
| LLaMA 3 | 128,256 |
| Mistral / Mixtral | 32,000 |
| Gemma | 256,000 |
Bigger V = each token carries more meaning, but the embedding table (V, D) and LM head (D, V) grow too. Trade-off, set once at training time.
Properties to remember:
- Tokenization round-trips: decode(encode(x)) == x.
- Rule of thumb for English text: 1 token ≈ ¾ word ≈ 4 chars. So 1,000 tokens ≈ 750 words ≈ 1.5 pages.

GPT-3.5 once confidently said "9.11 > 9.9", partly a tokenization artifact. "9.9" and "9.11" tokenize in ways that hide their numeric structure. Karpathy's minbpe implements BPE in ~300 lines of Python; Day 5 walks through it.
The problem with raw integer IDs. The tokenizer hands you a list of integers like [464, 3797, 3332]. But integer 5379 has no useful "geometric" relationship to integer 5380 — they're just different labels. Neural networks need vectors of floats (so they can dot-product, add, multiply by matrices) and they need similar tokens to be near each other in space (so the model can learn that "cat" and "dog" play similar roles). Integers carry neither.
The solution: a lookup table. The model owns a learned matrix called the embedding table, written E, of shape (V, D):
- V = vocabulary size (from the tokenizer).
- D = hidden dimension (a hyperparameter; GPT-2 uses 768, LLaMA-7B uses 4096).
- Row i of E is the vector for token i, a D-dimensional point in space.

Embedding lookup: integer ID i picks row i of E. (T,) ints become (T, D) floats. For a batch, (B, T) → (B, T, D).
Embedding a sequence of token IDs is just fancy indexing:
ids = tokenizer.encode("Hello, world!") # shape (T,) e.g. [9906, 11, 1917, 0]
x = E[ids] # shape (T, D)
For a batch, the same lookup turns (B, T) integer IDs into (B, T, D) vectors — and (B, T, D) is the shape of the "residual stream" that flows through every transformer block. This is the only place text-as-integers becomes vectors-of-floats. From here on, everything is matmul.
In PyTorch it's one line:
import torch.nn as nn
embed = nn.Embedding(V, D) # creates a (V, D) learnable table
x = embed(ids) # shape (B, T, D)
Embeddings are learned, not designed. When training starts, E is filled with random numbers. With each gradient-descent step, the rows for tokens that appeared in the training batch get nudged. After billions of steps, semantically related tokens end up clustered together in D-dimensional space, and the geometric structure of E carries real meaning. You don't write the embeddings — gradient descent does.
Tying it back to V and D. The embedding table has V × D parameters. For LLaMA-7B (V=32,000, D=4096), that's 131M parameters in the embedding table alone — meaningful but a small slice of the 7B total. Many models share the same matrix between the embedding (input) and the LM head (output) — called weight tying — to save those parameters.
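Here is a NumPy sketch of the lookup and the parameter arithmetic. The toy sizes and token IDs are made up for illustration, and the table is random because nothing has been trained.

```python
import numpy as np

V, D = 100, 8                        # toy sizes, not from any real model
rng = np.random.default_rng(0)
E = rng.normal(size=(V, D))          # random at initialization

ids = np.array([[15, 11, 99, 12, 31],
                [25, 22, 87, 88, 44]])    # shape (B, T) = (2, 5), made-up IDs below V
x = E[ids]                           # fancy indexing: one row of E per ID
print(ids.shape, "->", x.shape)      # (2, 5) -> (2, 5, 8)

# Row lookup is all that happens: position (0, 2) holds row 99 of E.
same_row = np.array_equal(x[0, 2], E[99])

# Parameter count is just V * D, here with the LLaMA-7B sizes quoted above.
llama7b_embedding_params = 32_000 * 4096   # 131,072,000

# With weight tying, the LM head reuses E transposed, giving shape (D, V).
tied_lm_head = E.T
```

Note that every ID must be smaller than V, otherwise the index falls off the end of the table.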
You've now learned vectors, matrices, dot products, matrix multiplication, derivatives, the chain rule, softmax, and cross-entropy. You have all the math you need.
Before we close out Day 1, let's do something practical with it: trace data flowing through a real LLM, end-to-end.
The objective. By the end of this section, you should be able to say what shape the data has at every point in an LLM forward pass — from raw text input all the way to a probability distribution over the vocabulary. You should be able to predict, when you see print(x.shape), whether the printed shape makes sense.
Why this matters. When you're reading transformer code (or debugging your own), the single most useful thing you can do is track shapes. Most ML bugs are shape mismatches. Senior ML engineers think in shapes the way senior backend engineers think in HTTP status codes — automatically.
A note before we begin. This section will use words like embedding, transformer block, and LM head that we haven't fully explained yet. That's OK. You don't need to understand what each operation does yet. You just need to understand what each operation does to the shape. By Day 7 every word here will be fully explained. Today, you're installing the shape skeleton — a mental scaffolding the rest of Week 1 will fill in.
The five shape variables you'll see constantly:

| Letter | Meaning | Typical value |
|---|---|---|
| B | Batch: sequences in flight at once | 1 to 256 |
| T | Sequence length, in tokens | 128 to 8192+ |
| D | Hidden dimension (vector size per token) | 512 to 16384 |
| V | Vocabulary size | 32k to 256k |
| L | Number of layers stacked | 12 to 80+ |
A token vector lives in ℝ^D. A whole batch of sequences lives in ℝ^{B×T×D} — three dimensions: which sequence, which token in the sequence, which entry of the token's vector.
A (B, T, D) tensor: B sequences, each with T tokens, each token represented by a D-dim vector.
We'll use small numbers so you can hold everything in your head: B=2, T=5, D=8, V=100. (Real models are orders of magnitude bigger, for example D=4096 and V=32,000 for LLaMA-7B, but the structure is identical.)
Imagine two short prompts:
Prompt 1: "Hello world! This is fun" (5 tokens)
Prompt 2: "The cat sat on mat" (5 tokens)
Step 0 — Tokenize. Each prompt becomes a list of integer IDs.
ids[0] = [15, 11, 99, 12, 31]   # illustrative IDs; each must be below V = 100
ids[1] = [25, 22, 87, 88, 44]
ids.shape = (2, 5) # (B, T)
Step 1 — Embedding lookup. The model has an embedding table E of shape (V, D). Each row of E is a learned vector for one token.
E.shape = (100, 8)
x = E[ids]
x.shape = (2, 5, 8) # (B, T, D) — gained a dim!
Now every integer ID has been replaced by an 8-dim vector. This is the only "lookup" in the forward pass — from here on, everything is matmul.
Step 2 — Add positional encoding. A position-dependent vector is added so the model knows token order.
x.shape = (2, 5, 8) # unchanged
Step 3 — L transformer blocks. Each block reads x, does attention + feedforward, returns a tensor of the same shape.
for block in blocks:   # L blocks in total
    x = block(x)
    # x.shape = (2, 5, 8) after every block: the shape never changes!
Examples: GPT-2 small has L=12. LLaMA-7B has L=32. GPT-4 reportedly L≈120.
Step 4 — Final layer norm.
x.shape = (2, 5, 8) # still unchanged
Step 5 — LM head: project to vocab logits. A single matmul converts each D-dim hidden vector into V scores.
W_lm.shape = (8, 100) # (D, V)
logits = x @ W_lm # the matmul we just learned!
logits.shape = (2, 5, 100) # (B, T, V)
(Matmul intuition: the same (D, V) transformation is applied independently to every token's row.)
Shape arithmetic for the LM head: (B, T, D) @ (D, V) → (B, T, V). Same matmul rule we computed by hand earlier.
Step 6 — Softmax over the last axis.
probs = softmax(logits, axis=-1)
probs.shape = (2, 5, 100)
Each probs[b, t, :] is now a probability distribution over the vocabulary — telling us how likely each token is to come next.
Step 7 — Pick the next token (during generation).
next_token_probs = probs[:, -1, :] # last position, shape (B, V)
next_token = next_token_probs.argmax(-1) # shape (B,)
Picked token gets appended to the input → run again → autoregressive generation.
That's the whole forward pass. Every modern LLM — GPT-4, Claude, LLaMA, Mistral, Gemini, DeepSeek — follows this same skeleton. Different sizes, different details inside the blocks, same shape walk.
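The whole shape walk fits in a short NumPy sketch. It only tracks shapes, so treat it as a skeleton. The weights are random, `block` is a placeholder that merely preserves (B, T, D) rather than real attention and feedforward, the positional vectors `P` are a stand-in, and L=3 is arbitrary.

```python
import numpy as np

B, T, D, V, L = 2, 5, 8, 100, 3
rng = np.random.default_rng(0)

E = rng.normal(size=(V, D))          # embedding table
P = rng.normal(size=(T, D))          # stand-in positional vectors
W_lm = rng.normal(size=(D, V))       # LM head

def block(x):
    """Placeholder for attention + feedforward: keeps the (B, T, D) shape."""
    return x + 0.1 * np.tanh(x)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

ids = rng.integers(0, V, size=(B, T))     # Step 0: (B, T)
x = E[ids]                                # Step 1: (B, T, D)
x = x + P                                 # Step 2: (B, T, D), P broadcasts over B
for _ in range(L):                        # Step 3: (B, T, D) every time
    x = block(x)
x = layer_norm(x)                         # Step 4: (B, T, D)
logits = x @ W_lm                         # Step 5: (B, T, V)
probs = softmax(logits, axis=-1)          # Step 6: (B, T, V)
next_token = probs[:, -1, :].argmax(-1)   # Step 7: (B,)

for name, arr in [("ids", ids), ("x", x), ("logits", logits),
                  ("probs", probs), ("next_token", next_token)]:
    print(f"{name:<10} {arr.shape}")
```

Swapping the placeholder for real transformer blocks changes the numbers but not a single printed shape.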
The diagram below combines all eight stations (Steps 0 through 7) into a single flow. Outlined boxes mark the only two places where the shape of the main tensor actually changes: Step 1 (the embedding lookup) and Step 5 (the LM head). Everything in between is shape-preserving.
The full Day-1 forward pass. Two shape transitions: (B,T) → (B,T,D) at embedding, and (B,T,D) → (B,T,V) at the LM head. Everything in between is shape-preserving.
The three "big" shapes to watch:
- (B, T, D): the "residual stream." Almost every intermediate tensor has this shape. D is the model's "channel count."
- (D, V): the LM head matrix. Maps hidden state → vocab logits. For LLaMA-7B (D=4096, V=32000), 131M parameters in the LM head alone.
- (B, h, T, T): attention scores (Day 6, h = number of heads). The T × T means every token has a similarity score with every other token. Quadratic in T, which is what FlashAttention attacks on Day 21.

Connecting back to loss. Now we can locate where cross-entropy fits in the pipeline:
During training: every position has a "correct next token" — cross-entropy computes loss for each, average them, backprop, update parameters.
During inference: no correct answer — argmax (or sample) the logits at the last position, append, repeat.
Same model. Same forward pass. Different post-processing.
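The two kinds of post-processing can be sketched on a (B, T, V) logits tensor. The targets and logits here are random stand-ins and the helper name is hypothetical. The all-zero logits mimic an untrained model that scores every token equally.

```python
import numpy as np

B, T, V = 2, 5, 100

def cross_entropy_from_logits(logits, targets):
    """Mean of -log(softmax(logits)[target]) over all B*T positions."""
    z = logits - logits.max(axis=-1, keepdims=True)           # stability trick
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    b_idx, t_idx = np.indices(targets.shape)
    picked = log_probs[b_idx, t_idx, targets]                 # shape (B, T)
    return -picked.mean()

rng = np.random.default_rng(0)
targets = rng.integers(0, V, size=(B, T))    # made-up "correct next token" IDs

# Training-style use: loss at every position, averaged.
uniform_logits = np.zeros((B, T, V))
loss = cross_entropy_from_logits(uniform_logits, targets)
print(round(float(loss), 3), round(float(np.log(V)), 3))      # 4.605 4.605

# Inference-style use: no targets, just pick from the last position.
logits = rng.normal(size=(B, T, V))
next_token = logits[:, -1, :].argmax(-1)                      # shape (B,)
```

The uniform case lands exactly on log(V), the same sanity-check value discussed for freshly initialized models.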
Debugging mantra: most ML bugs are shape mismatches. Add print(x.shape) liberally.
Companion notebook: day-1-math.ipynb. Type the code; don't copy-paste. The point is to feel the numbers and the shapes.
1. Compute [1, 2, 3] · [4, 5, 6] on paper, then verify with NumPy. Multiply A = [[1, 2], [3, 4]] by B = [[5, 6], [7, 8]] entry by entry, and confirm that A @ B ≠ B @ A.
2. For f(x) = x², estimate the slope at x = 3 by finite difference ((f(3.001) − f(3)) / 0.001) and check it approaches the exact value 2x = 6.
3. For y = (3x + 1)², compute dy/dx two ways (expand first, then via the chain rule) and confirm they agree at x = 2 (you should get 42).
4. Compute softmax([2.0, 1.0, 0.0]) and confirm the output sums to 1. Then feed it [1000, 999, 998], watch it overflow, and fix it with the max-subtraction trick.
5. For Q = [0.7, 0.2, 0.1] with correct answer token 0, compute −log(Q[0]). Then build the full table of loss values from the lesson and confirm a random guesser over V tokens scores about log(V).
6. With B=2, T=5, D=8, V=100, write out the shape after every station of the forward pass, from (B, T) token IDs to (B, T, V) logits, and mark the only two steps where the shape changes.

Close the page and answer from memory. If you can't, re-read the relevant section.

1. In x ∈ ℝ^d, what does each of x, ∈, ℝ, and ^d mean? Translate each symbol to Python.
2. Why does subtracting max(z) inside softmax not change the output?
3. When can A @ B and B @ A both be legal, and do they give the same result?
4. Compute softmax([1, 0, -1]). Sanity check that the answer sums to 1. (Hint: exp(1) ≈ 2.718, exp(0) = 1, exp(-1) ≈ 0.368.)
5. Starting from token IDs of shape (B, T), walk through how it becomes (B, T, V) logits. At which two steps does the shape change, and what are the input and output shapes at each?
6. How does the cost of attention grow with T? What does that mean for a 10× longer sequence?
7. During generation we take probs[:, -1, :].argmax(-1). Why the last position? Why not position 0?

"You don't need to be a mathematician. You need to be fluent, the way you're fluent with for-loops."
Hand-picked references for this lesson. Free where possible. Books and papers where the depth is irreplaceable.
- The single best linear algebra resource ever made. Episodes 1-7 are essential.
- Beautiful animations. Episodes 1-4 cover everything we need for derivatives and chain rule.
- Deisenroth, Faisal, Ong. Chapters 2 (Linear Algebra), 5 (Calculus), 6 (Probability) cover everything we use.
- Chapters 2-3 are a concise math primer for neural networks.
- Petersen & Pedersen. Comprehensive reference for derivatives of matrix expressions.