Last week you built a decoder-only Transformer. Its weights are random — it generates noise. Pre-training is the process that turns random weights into a model that has absorbed the statistical structure of a trillion-token corpus. Before you can run the training loop on Day 9, you need three things cold: the exact objective being optimized, how raw text becomes packed tensor blocks, and the scaling laws that tell you how big a model and how much data your compute budget actually buys.
You finished Week 1 holding a complete decoder-only Transformer. It produces logits, but its weights are random — it generates noise. Pre-training is the process that turns those random weights into a model that has absorbed the statistical structure of a large text corpus. This is where the overwhelming majority of an LLM's capability comes from. Fine-tuning and alignment (Days 13–14) only steer a model that already knows how to model language; pre-training is what teaches it language in the first place.
Today is conceptual and quantitative, with just enough code to make the ideas concrete. We answer three questions.
What exactly do we optimize? The causal language-modeling objective, its loss, and the train/inference gap it creates.
What do we feed it? How a raw text corpus becomes packed token blocks: collection, deduplication, filtering, tokenization, and packing.
How big should the model and dataset be? Scaling laws, the single most important quantitative tool for planning a training run.
Tomorrow, on Day 9, you apply all of this to actually train your own GPT.
C ≈ 6ND rule and apply it to GPT-3 and to your Day 9 model.
A decoder-only LLM is trained with one objective: given all previous tokens, predict the next one. This is called causal (or autoregressive) language modeling. There are no labels to collect: the "label" for each position is simply the token that actually came next in the text. This is why pre-training is called self-supervised: the supervision signal is generated for free from raw text, with no human annotation required.
Formally, for a sequence of tokens x₁, x₂, …, x_T, the model defines a probability distribution over the whole sequence by factorizing it left to right:
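In standard notation, with θ denoting the model parameters (a symbol assumed here, not taken from the text above), the factorization and the average next-token loss L can be written as:

```latex
p_\theta(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} p_\theta\!\left(x_t \mid x_1, \dots, x_{t-1}\right)

L(\theta) \;=\; -\frac{1}{T-1} \sum_{t=2}^{T} \log p_\theta\!\left(x_t \mid x_1, \dots, x_{t-1}\right)
```

Minimizing L is the same as maximizing the log-likelihood of the observed text. The sum is written from t = 2 on the assumption that the first token has no preceding context to be predicted from, which gives T−1 scored positions per sequence.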
That loss L is exactly cross-entropy between the model's predicted distribution and the one-hot "true next token". You implemented cross-entropy on Day 1 and used it on Days 3–4. Nothing new is needed — an LLM is a classifier with a vocabulary-sized output, run once per position, with the targets shifted by one.
In code, the targets are just the inputs shifted left by one position. If the model sees tokens [The, cat, sat, on, the], the targets are [cat, sat, on, the, mat] — at each position, the answer is simply the token that comes next. The causal mask you built on Day 6 stops position t from seeing anything past itself, so the model can never peek at the answer. One forward pass scores a prediction at every position simultaneously, and the loss averages over all of them. That's a massive data efficiency gain: a sequence of length T provides T−1 training examples per forward pass.
```python
import torch
import torch.nn.functional as F

# logits: (B, T, V) from the model; tokens: (B, T) input IDs
def lm_loss(logits, tokens):
    # The logit at position t predicts token t+1, so drop the last logit
    # (nothing follows it) and the first token (nothing predicts it).
    logits = logits[:, :-1, :]   # (B, T-1, V)
    targets = tokens[:, 1:]      # (B, T-1)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # (B*(T-1), V)
        targets.reshape(-1),                  # (B*(T-1),)
    )
```
During training the model receives the ground-truth previous tokens at every position, regardless of what it would have predicted. This technique is called teacher forcing. It makes training stable and parallelizable — every position in the sequence is scored in a single forward pass. But it creates a subtle divergence from inference time.
During inference there is no ground-truth sequence to condition on. The model generates token by token, feeding each predicted token back as the input to the next step. If the model makes a mistake at step t, that mistake becomes part of the input at step t+1, which can compound. This gap between training distribution (ground truth context) and inference distribution (model's own context) is called exposure bias.
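The difference between the two regimes can be seen in a toy setting. The sketch below is not a real language model: it assumes a hypothetical deterministic bigram table (`BIGRAM`) with one deliberately wrong entry, and compares predictions made from ground-truth context against predictions made from the model's own outputs.

```python
# Toy stand-in for a language model: a deterministic bigram lookup table.
# The table is hypothetical and deliberately contains one wrong entry (cat -> on).
BIGRAM = {"the": "cat", "cat": "on", "sat": "on", "on": "the", "mat": "<eos>"}

def predict_next(prev_token):
    """Return the toy model's prediction given only the previous token."""
    return BIGRAM.get(prev_token, "<eos>")

def teacher_forced(gold):
    """Predict each next token from the ground-truth previous token."""
    return [predict_next(tok) for tok in gold[:-1]]

def free_running(first_token, steps):
    """Predict each next token from the model's own previous output."""
    out, prev = [], first_token
    for _ in range(steps):
        prev = predict_next(prev)
        out.append(prev)
    return out

def accuracy(preds, targets):
    """Fraction of positions where the prediction equals the target."""
    return sum(p == t for p, t in zip(preds, targets)) / len(targets)

gold = ["the", "cat", "sat", "on", "the", "mat"]
targets = gold[1:]
tf_preds = teacher_forced(gold)
fr_preds = free_running(gold[0], len(targets))
print("teacher forcing:", tf_preds, accuracy(tf_preds, targets))
print("free running:   ", fr_preds, accuracy(fr_preds, targets))
```

Under teacher forcing the wrong `cat -> on` entry costs one position and the next prediction starts again from correct context. In free-running mode the same error becomes the next input, and every later position is derailed.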
In practice, pre-trained LLMs handle this remarkably well, thanks to scale and diverse data. But it is an important nuance for inference engineers. Techniques like scheduled sampling (sometimes feed the model's own prediction during training), nucleus sampling strategies to avoid error compounding, and speculative decoding (Day 23) all connect back to this train/inference mismatch. Understanding that training runs under teacher forcing is essential context for understanding generation quality issues you will debug later.
That is the entire training objective. Everything else today — data, scaling — is about doing this at scale, efficiently, with the right amount of data.
The raw cross-entropy loss is in nats (natural-log units). Two derived metrics are easier to reason about, and one of them — bits-per-byte — is the go-to in scaling-law papers because it is tokenizer-independent.
Perplexity is the exponential of the cross-entropy loss. Intuitively, it is the model's "effective branching factor" — the number of equally-likely next tokens the model is, on average, choosing between. A perplexity of 1 means perfect prediction; a perplexity equal to the vocabulary size means the model has learned nothing and is guessing uniformly. Any decent model on English text lands between these two extremes.
Perplexity depends on the tokenizer. A model with a 128 000-token vocabulary covers the same text in fewer, more information-dense tokens than one with 32 000, so their per-token losses and perplexities are not directly comparable. To compare models fairly across tokenizers, researchers use bits-per-byte (bpb): convert the loss from nats to bits, then normalize by the number of raw UTF-8 bytes (not tokens). It is tokenizer-independent by construction.
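The conversions above are one-liners. This is a minimal sketch; the function names are illustrative, and `bytes_per_token` is assumed to be the average number of UTF-8 bytes per token measured on the evaluation text.

```python
import math

def perplexity(loss_nats):
    """Perplexity is the exponential of the cross-entropy loss in nats."""
    return math.exp(loss_nats)

def bits_per_token(loss_nats):
    """Convert nats to bits by dividing by ln(2)."""
    return loss_nats / math.log(2)

def bits_per_byte(loss_nats, bytes_per_token):
    """Normalize by raw UTF-8 bytes instead of tokens."""
    return bits_per_token(loss_nats) / bytes_per_token

def uniform_loss(vocab_size):
    """Loss of a model that guesses uniformly over the vocabulary: ln(V)."""
    return math.log(vocab_size)

loss = 3.38  # nats per token
print(f"perplexity     = {perplexity(loss):.2f}")
print(f"bits per token = {bits_per_token(loss):.3f}")
print(f"bits per byte  = {bits_per_byte(loss, bytes_per_token=4):.3f}")
print(f"uniform baseline for V=65: {uniform_loss(65):.3f} nats")
```

Note that `perplexity(uniform_loss(V))` returns V, which is the "guessing uniformly" end of the range described above.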
A language model that perfectly predicted English text would compress it to its true entropy — Shannon estimated English at roughly 1 bit per character in 1951. Modern LLMs get remarkably close, which is why "compression is intelligence" is more than a slogan: the pre-training objective is literally lossless compression of the training corpus.
When you watch a training run, you stare at the loss curve. Understanding its shape helps you distinguish a healthy run from a broken one before you waste GPU-days.
Two specific danger signs to watch for. Loss stuck at ln(V) (the uniform-predictor baseline) from step zero: the gradient is not flowing, or the learning rate is zero. Loss that explodes or oscillates wildly: learning rate too high, or gradient clipping not in place. A slow but steady improvement, even tiny, is always a healthier sign than flat.
The objective is simple; the data work is where most of the real effort and most of the quality gains live. Frontier teams will tell you that data curation matters more than architecture tweaks. The pipeline that turns the open web into training-ready tensors has several stages.
The corpus starts with web crawls (primarily Common Crawl), augmented by curated sources: Wikipedia, books, academic papers, and code from GitHub. Each contributes different characteristics — books supply long-range coherence; code supplies structured, exact reasoning; Wikipedia supplies factual density. The mix matters enormously for capability. Language identification (typically fastText) separates languages so you can apply per-language budgets; most frontier models weight English heavily and include dozens of other languages at lower proportions.
Raw web data is overwhelmingly garbage: spam, boilerplate, auto-generated SEO text, adult content, and incoherent fragments. Filtering uses heuristics (minimum word count, ratio of alphabetic to total characters, perplexity against a reference model) and sometimes a quality classifier trained to distinguish Wikipedia-quality writing from forum posts. The C4 dataset pioneered this approach; FineWeb and Dolma are more recent examples with well-documented pipelines.
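Two of the heuristics named above fit in a few lines. This is a sketch under stated assumptions: the thresholds (`min_words=50`, `min_alpha_ratio=0.7`) are placeholder values chosen for illustration, not numbers taken from C4, FineWeb, or Dolma, and the perplexity and classifier filters are left out because they need a reference model.

```python
def passes_heuristics(text, min_words=50, min_alpha_ratio=0.7):
    """Return True if a document survives two simple quality heuristics.

    Thresholds are hypothetical; real pipelines tune them per language and source.
    """
    # Heuristic 1: minimum word count (whitespace-separated words).
    words = text.split()
    if len(words) < min_words:
        return False

    # Heuristic 2: ratio of alphabetic characters to all non-whitespace characters.
    non_space = [c for c in text if not c.isspace()]
    if not non_space:
        return False
    alpha_ratio = sum(c.isalpha() for c in non_space) / len(non_space)
    return alpha_ratio >= min_alpha_ratio

prose = "language models learn from text " * 12    # 60 words, all alphabetic
symbols = " ".join(["$$$ 123"] * 60)                 # long, but no letters at all
print(passes_heuristics(prose))          # long enough and mostly letters
print(passes_heuristics("hello world"))  # too short
print(passes_heuristics(symbols))        # fails the alphabetic ratio
```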
Exact deduplication removes documents that hash identically (SHA-256 of the text). Easy and fast. Near-deduplication is harder but more important: the web is full of near-identical copies of news articles, lightly paraphrased boilerplate, and template pages. MinHash / LSH (locality-sensitive hashing) is the standard approach: represent each document as a set of character or token n-grams, generate a MinHash signature, and bucket signatures with LSH to find candidates for deduplication at scale. Lee et al. (2022) showed that deduplication consistently improves downstream benchmark performance, sometimes dramatically — duplicated data trains the model to memorize rather than generalize.
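The two deduplication passes can be sketched with the standard library alone. Everything specific here is an assumption made for illustration: word-level 5-gram shingles, 128 seeded BLAKE2b hashes standing in for random permutations, and 32 LSH bands. Production pipelines typically use a dedicated library and tune these values.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def exact_key(text):
    """SHA-256 of the text; byte-identical documents share a key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text, n=5):
    """Set of word n-grams (word-level shingling is an assumption here)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def _hash(shingle, seed):
    """64-bit hash of a shingle; each seed acts as a different hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")

def minhash(shingle_set, num_perm=128):
    """Signature: the minimum hash value under each seeded hash function."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_perm)]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def candidate_pairs(signatures, bands=32):
    """LSH banding: documents that agree on every row of any band are candidates."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

docs = {
    "a": "the quick brown fox jumps over the lazy dog near the quiet river bank at dawn",
    "b": "the quick brown fox jumps over the lazy dog near the quiet river bank at dusk",
    "c": "stock markets closed higher today after the central bank held interest rates steady again",
}
sigs = {doc_id: minhash(shingles(text)) for doc_id, text in docs.items()}
print(candidate_pairs(sigs))
print(round(estimated_jaccard(sigs["a"], sigs["b"]), 2))
```

The number of bands and rows per band sets the similarity level at which two documents are likely to collide. Candidates are then usually confirmed with the estimated (or exact) Jaccard score before one copy is dropped.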
Before training, n-grams from held-out evaluation sets (MMLU, HellaSwag, HumanEval, etc.) are removed from training data to prevent benchmark contamination. Typically a 13-gram overlap is used as the threshold. This is often skipped for speed in small experiments but is mandatory for serious reported benchmarks.
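A 13-gram overlap check can be sketched as a set lookup. The assumptions are loud ones: whitespace tokenization with lowercasing, and flagging the whole document rather than excising the overlapping span. Real decontamination pipelines differ on both points.

```python
def ngrams(tokens, n=13):
    """All contiguous n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_eval_index(eval_texts, n=13):
    """Collect every n-gram that appears in any held-out evaluation example."""
    index = set()
    for text in eval_texts:
        index |= ngrams(text.lower().split(), n)
    return index

def is_contaminated(doc_text, eval_index, n=13):
    """True if the document shares at least one n-gram with the evaluation sets."""
    return not ngrams(doc_text.lower().split(), n).isdisjoint(eval_index)

# Hypothetical evaluation item and training documents, for illustration only.
eval_item = ("which of the following best describes the main function of the "
             "mitochondria in a typical animal cell")
index = build_eval_index([eval_item])

leaked = "quiz dump: " + eval_item + " answer: energy production"
clean = "mitochondria are organelles found in most cells of plants and animals and fungi"
print(is_contaminated(leaked, index))
print(is_contaminated(clean, index))
```

Documents shorter than 13 tokens produce no n-grams and are never flagged by this check.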
After filtering, you are left with a corpus that is still mostly web text. But web text is not the highest-quality language for every capability you want. Practitioners upsample high-value domains (code, curated books, academic papers) beyond their natural frequency in the corpus, and downsample low-quality domains like comment threads and forums. Getting this mix right is a large source of differentiation between pre-trained models. The exact mix is often proprietary; open datasets publish theirs as a research contribution.
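One simple way to express a data mix is a per-domain multiplier applied to each domain's natural token count. The sketch below uses invented token counts and multipliers purely for illustration; they do not describe any published dataset.

```python
def mixture_weights(token_counts, multipliers):
    """Turn natural token counts and per-domain multipliers into sampling weights.

    A multiplier above 1 upsamples a domain, below 1 downsamples it.
    Domains without a multiplier keep their natural frequency.
    """
    effective = {d: n * multipliers.get(d, 1.0) for d, n in token_counts.items()}
    total = sum(effective.values())
    return {d: v / total for d, v in effective.items()}

# Hypothetical corpus, in billions of tokens, with hypothetical multipliers.
counts = {"web": 800, "code": 100, "books": 50, "forums": 50}
mults = {"code": 2.0, "books": 3.0, "forums": 0.5}

weights = mixture_weights(counts, mults)
for domain, w in weights.items():
    natural = counts[domain] / sum(counts.values())
    print(f"{domain:7s} natural={natural:.3f} sampled={w:.3f}")
```

A multiplier of 3.0 on a domain means that, over one pass of the mixed stream, each of its documents is seen about three times, which ties this choice back to how much repetition the run can tolerate.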
| Dataset | Size (tokens) | Key trait | Dedup strategy |
|---|---|---|---|
| C4 (2020) | ~156B | Quality-filtered Common Crawl; foundational | Heuristic + line-level |
| The Pile (2021) | ~825 GiB of text (a few hundred billion tokens) | 22 diverse domains; first principled mix | Fuzzy + exact per domain |
| RedPajama v2 (2023) | ~30T (raw) | Reproducible, quality-annotated CC + curated | MinHash near-dup |
| FineWeb (2024) | ~15T | Best-in-class CC cleaning; well-documented | MinHash + quality classifier |
| Dolma (2024) | ~3T | Open, documented pipeline; OLMo's corpus | MinHash + content filtering |
Each document is tokenized with the BPE tokenizer from Day 5. Then comes packing: rather than treating each document as one training example and padding to a fixed length (which wastes compute on pad tokens), the entire corpus is concatenated into a single long stream of token IDs, with a special <eos> / document-separator token between documents. That 1-D stream is then reshaped into (num_blocks, block_size). Every block is full, every token contributes to the loss, and a batch is just a random selection of blocks.
<eos> separators into one flat token stream, then reshaped into fixed-length blocks of size T. A document may straddle a block boundary (block 1 above), which is acceptable: the model sees the <eos> token and can learn to treat it as a context break. This approach eliminates all padding waste.
A nuance: in the packed setup, token positions from two different documents sit in the same block. Without additional care, the causal mask allows the first token of Document B to attend to the last token of Document A, even though they are semantically unrelated. For most training runs this is accepted as an approximation and causes minimal harm. Some implementations, particularly those caring about very long context, insert a document-boundary attention mask that prevents cross-document attention within a block. This is more correct but adds implementation complexity and communication overhead.
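Packing and the optional document-boundary mask can be sketched with plain lists. The assumptions: documents are already token-ID lists, `EOS = 0` is a hypothetical separator ID, leftover tokens that do not fill a block are dropped, and a real implementation would use tensors instead of nested lists.

```python
import random

EOS = 0  # hypothetical separator token ID

def pack(docs, block_size, eos_id=EOS):
    """Concatenate documents with EOS separators and cut into full blocks."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)
    num_blocks = len(stream) // block_size  # the ragged tail is dropped
    return [stream[i * block_size:(i + 1) * block_size] for i in range(num_blocks)]

def get_batch(blocks, batch_size, rng):
    """Sample random blocks; targets are the inputs shifted left by one."""
    chosen = rng.sample(blocks, batch_size)  # needs batch_size <= len(blocks)
    x = [b[:-1] for b in chosen]
    y = [b[1:] for b in chosen]
    return x, y

def same_document_mask(block, eos_id=EOS):
    """mask[i][j] is True if position i may attend to position j:
    j must not be in the future and must belong to the same document."""
    doc_ids, current = [], 0
    for tok in block:
        doc_ids.append(current)
        if tok == eos_id:
            current += 1  # tokens after an EOS start a new document
    T = len(block)
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(T)] for i in range(T)]

docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
blocks = pack(docs, block_size=4)
print(blocks)

x, y = get_batch(blocks, batch_size=2, rng=random.Random(0))
mask = same_document_mask(blocks[1])  # block [8, 9, 0, 10] spans two documents
```

In `blocks[1]`, the final token belongs to the third document, so the mask stops it from attending to the three earlier positions, while the plain causal mask would have allowed it.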
A subtle but important point: frontier models often train for roughly one pass over their data, sometimes less. Because the corpus is so large (trillions of tokens), the model rarely sees the same token sequence twice. This is the opposite of the many-epochs regime from supervised deep learning, and it changes how you think about regularization. With deduplicated data and a single epoch, classic overfitting is much less of a concern than data quality and quantity.
There is a modern nuance called data-constrained scaling: what happens when you run out of unique data before reaching the compute-optimal token count? Recent work (Muennighoff et al., 2023) shows that repeating data a small number of times (~4× at most) causes only modest degradation, but heavy repetition is clearly harmful. This matters for anyone trying to scale up on a limited corpus — repeating is better than stopping, but not as good as fresh data.
The defining empirical discovery of the LLM era is that loss falls as a smooth power law in three quantities: model size N (parameters), dataset size D (tokens), and compute C (FLOPs). Plot loss against any of them on log-log axes and you get a straight line over many orders of magnitude. This is what makes large training runs plannable — you can fit the curve on small, cheap runs and extrapolate to predict the loss of a run you cannot yet afford.
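Fitting and extrapolating such a curve is ordinary least squares in log-log space. The sketch below assumes a pure power law, loss = a · x^(−b), and uses synthetic points generated from known constants so the fit can be checked. Real fits usually add an irreducible-loss offset, which makes the line bend at large scale.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) - b*log(x). Returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    slope = (sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
             / sum((u - mean_x) ** 2 for u in lx))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

def predict(a, b, x):
    """Evaluate the fitted power law at x."""
    return a * x ** (-b)

# Synthetic "small runs": compute budgets in FLOPs and losses from a known law.
TRUE_A, TRUE_B = 10.0, 0.05
compute = [1e15, 1e16, 1e17, 1e18]
losses = [TRUE_A * c ** (-TRUE_B) for c in compute]

a, b = fit_power_law(compute, losses)
print(f"fitted a={a:.3f}, b={b:.4f}")
print(f"extrapolated loss at 1e21 FLOPs: {predict(a, b, 1e21):.3f}")
```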
OpenAI's original scaling-laws paper (Kaplan et al., 2020) established the power-law form. Its key claim, since revised, was that when compute is the bottleneck you should spend most of it on a bigger model and comparatively less on more data: in that experimental setup, parameters scaled more favorably than tokens at the compute-optimal frontier. This drove the "make it huge" era, with GPT-3 at 175 billion parameters trained on only ~300 billion tokens. That is roughly 1.7 tokens per parameter, far below the ~20 tokens per parameter that Chinchilla later recommended.
DeepMind revisited the question in the Chinchilla paper (Hoffmann et al., 2022) with a more careful experimental setup: more controlled iso-FLOP experiments, a richer functional form, and fewer of the statistical artifacts of the Kaplan et al. setup. The conclusion: most large models of that era were significantly undertrained. For a fixed compute budget, model size and data should scale together, in roughly equal proportion. The famous rule of thumb is D ≈ 20N, about 20 training tokens per parameter.
The clean interpretation: at a fixed FLOP budget, a model of size N/2 trained on 2D tokens costs the same compute as a model of size 2N trained on D/2 tokens, because both have the same product N·D and therefore sit on the same iso-FLOP curve. Their losses land in the same neighborhood, but the Chinchilla point is where the loss along that curve is lowest.
D ≈ 20N. The inference-optimal regime (lower-right, small model heavily trained) is where LLaMA-3 and Mistral live: they trade extra training compute for cheaper serving.
Chinchilla optimizes for training compute. But a model you deploy is trained once and run billions of times. If you will serve it heavily, it pays to train a smaller model on far more data than compute-optimal: you spend extra FLOPs at training time to get a cheaper, faster, more memory-efficient model forever after. The marginal cost of extra training tokens is low compared to the cumulative inference cost of running a 2× larger model on every request.
This is exactly why LLaMA models are trained well past the 20× ratio: LLaMA-3 8B saw roughly 15 trillion tokens, about 1 875 tokens per parameter. Mistral 7B is generally assumed to sit in the same regime, although its training token count has not been published. For an inference engineer, this is the key takeaway: the model you serve was sized for serving, not for a training-compute leaderboard. You will be running a "Chinchilla-overtrained" model, which means its weights pack more knowledge per parameter than a Chinchilla-optimal equivalent, a good thing for serving cost.
You can estimate the total floating-point operations to train a dense Transformer with a beautifully simple rule: about 6 FLOPs per parameter per token. This factor breaks down into 2 for the forward-pass multiply-accumulate (one multiply + one add per weight, each time a token visits a parameter) and 4 for the backward pass, which is roughly twice the cost of the forward (it computes both the weight gradient and the input gradient).
The C ≈ 6ND rule decomposes as 2 FLOPs for the forward pass, 2 for the weight gradient (backward), and 2 for the input gradient (backward). The approximation ignores small terms (embedding layers, layer norms, biases) and is accurate to within ~10% for large dense Transformers.
You will use this constantly: to sanity-check published training costs, to estimate how long your own run will take, and to decide what is even feasible on your hardware. Commit C ≈ 6ND and D ≈ 20N to memory; together they let you plan any training run on the back of an envelope.
The C ≈ 6ND rule counts multiply-accumulate ops in the weight matrices, not activations or normalization. For a mixture-of-experts model where only a fraction k/E of experts activate per token, the effective count scales down accordingly — this is one reason MoE models (like Mixtral) can serve at much lower inference FLOP cost than their total parameter count suggests.
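The rule is short enough to keep as a helper. In this sketch the function names are illustrative, and the sustained throughput of 4×10¹⁴ FLOP/s is an assumed figure (roughly one H100 at ~40% utilization); substitute your own hardware's measured number.

```python
def train_flops(n_params, n_tokens):
    """Approximate training FLOPs for a dense Transformer: C = 6 * N * D."""
    return 6.0 * n_params * n_tokens

def train_seconds(n_params, n_tokens, sustained_flops_per_s):
    """Wall-clock estimate: total FLOPs divided by sustained throughput."""
    return train_flops(n_params, n_tokens) / sustained_flops_per_s

# GPT-3: 175B parameters, ~300B tokens.
gpt3 = train_flops(175e9, 300e9)
print(f"GPT-3  : {gpt3:.2e} FLOPs")

# Day 9 model: ~10M parameters at the compute-optimal 200M tokens.
day9 = train_flops(10e6, 200e6)
print(f"Day 9  : {day9:.2e} FLOPs")

# Assumed sustained throughput; replace with a measured value.
ASSUMED_FLOPS_PER_S = 4e14
print(f"Day 9 wall-clock: {train_seconds(10e6, 200e6, ASSUMED_FLOPS_PER_S):.0f} s")
```

For a mixture-of-experts model, pass the parameters that are active per token as `n_params` instead of the total count.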
Let's apply today's tools to the Day 9 build. We want a tiny GPT of roughly N ≈ 10M parameters. What does the theory say?
| Quantity | Formula | Day 9 value | Notes |
|---|---|---|---|
| Compute-optimal tokens | D = 20N | ~200M tokens | TinyShakespeare has only ~1.1M characters (~300K BPE tokens), so we are in multi-epoch territory |
| Training FLOPs | C = 6ND | ~1.2×10¹⁶ | About 30 s at a sustained 4×10¹⁴ FLOP/s (roughly one H100 at ~40% utilization); longer on laptop GPUs or Apple Silicon |
| Expected final loss | Power-law extrapolation | ~1.4–1.6 nats | Over a 65-char vocabulary; ppl ~4–5 |
| Expected bits-per-byte | bpb = loss / ln(2) | ~2.0–2.3 bpb | One character is one byte here; higher than a BPE model because char-level is harder per byte |
The key pedagogical contrast: our tiny run is in a "small data, many epochs" regime — not the single-epoch frontier world described by Chinchilla. That is deliberate. Running in a regime we can actually afford on a laptop teaches the mechanics. Watching the loss curve hit those expected values tomorrow will confirm the theory is real, not just math on paper.
One more planning calculation: what model could a real compute budget buy? With a single H100-day (~3.5×10¹⁹ FLOPs at ~40% utilization), the Chinchilla-optimal model is roughly N ≈ 540M params trained on D ≈ 10.8B tokens (since N≈sqrt(C/120), D≈20N). A week of H100 time gives you roughly a 1.4B-param model on ~28B tokens. These are sanity-check anchors worth remembering.
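The planning calculation generalizes to any budget and any tokens-per-parameter ratio. This sketch simply combines C = 6ND with D = r·N; the function name and the `tokens_per_param` parameter are illustrative, and the H100-day figure is the one quoted above.

```python
import math

def compute_optimal(c_flops, tokens_per_param=20.0):
    """Solve C = 6*N*D with D = r*N, so N = sqrt(C / (6*r)). Returns (N, D)."""
    n_params = math.sqrt(c_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

H100_DAY = 3.5e19  # FLOPs, assuming ~40% utilization as in the text

for label, budget in [("1 H100-day", H100_DAY), ("1 H100-week", 7 * H100_DAY)]:
    n, d = compute_optimal(budget)
    print(f"{label:12s} N = {n / 1e6:7.0f}M params, D = {d / 1e9:5.1f}B tokens")

# Same one-day budget at a LLaMA-style ratio of 200 tokens per parameter.
n200, d200 = compute_optimal(H100_DAY, tokens_per_param=200)
print(f"at 200 tok/param: N = {n200 / 1e6:.0f}M params, D = {d200 / 1e9:.1f}B tokens")
```

Raising the ratio by 10× shrinks the model by a factor of √10 ≈ 3.2 at the same compute. The smaller model is cheaper to serve, but it is not expected to reach the same loss as the compute-optimal one.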
Companion notebook: day-8-pretraining.ipynb.
F.cross_entropy) and verify it matches F.cross_entropy on a random (B, T, V) logit tensor. Use an assertion to confirm they agree to 1e-5.
V tokens gives loss ln(V) and perplexity V. Compute bpb for a loss = 3.38 model with average 4 bytes/token.
<eos> between documents, and reshape into (N, T) blocks. Write a get_batch() that returns random (x, y) pairs with y shifted by one.
train_flops(N, D) using 6ND. Reproduce the GPT-3 estimate from scratch. Compute the FLOPs for your Day 9 model and divide by your hardware's throughput to predict wall-clock time. Check tomorrow.
C, solve for the compute-optimal N and D. What model and token count does a single H100-day buy? What about a week?
tokens_per_param = 200 (LLaMA-style overtraining). For the same compute budget, how much smaller is the model? Estimate the inference speedup from halving the model's parameter count.
Close the page and answer from memory. If you can't, re-read the relevant section.
L = 3.38 nats to perplexity and bits-per-token.
"The model isn't memorizing the internet; it's compressing it. Compression to the entropy floor is the whole objective."
Hand-picked references for objectives, data, and scaling laws.
The original power-law paper. Establishes loss vs. N, D, C and the "go big" conclusion.
Compute-optimal training. The 20 tokens/param rule and why Gopher was undertrained.
175B params, 300B tokens. The "make it huge" data point, and the scale-to-capability story.
A modern, well-documented open pre-training dataset and its full curation recipe.
800 GB diverse text across 22 domains. A canonical, well-described open corpus.
3.5-hour overview; the pre-training and data sections are the best free explainer available.
Data and scaling lectures go far deeper than we can here. Assignment 2 implements the full pipeline.
Why dedup makes models better. Concrete, measurable gains. The MinHash/LSH methodology.
What happens when you repeat data. Up to ~4× repeats is survivable; beyond that, quality degrades sharply.
The OLMo pre-training corpus. Fully open pipeline documentation, a rare and valuable reference.