LLM Inference Engineer · Day 08
Day 08 · Week 2 · Training & Architectures

Pre-training: Objective, Data, Scale

Last week you built a decoder-only Transformer. Its weights are random — it generates noise. Pre-training is the process that turns random weights into a model that has absorbed the statistical structure of a trillion-token corpus. Before you can run the training loop on Day 9, you need three things cold: the exact objective being optimized, how raw text becomes packed tensor blocks, and the scaling laws that tell you how big a model and how much data your compute budget actually buys.

Time: ~150 min
Difficulty: Medium
Prerequisite: Days 1–7
Why This Lesson

Architecture is the easy part. Training is data, objective, and budget.

You finished Week 1 holding a complete decoder-only Transformer. It produces logits, but its weights are random — it generates noise. Pre-training is the process that turns those random weights into a model that has absorbed the statistical structure of a large text corpus. This is where the overwhelming majority of an LLM's capability comes from. Fine-tuning and alignment (Days 13–14) only steer a model that already knows how to model language; pre-training is what teaches it language in the first place.

Today is conceptual and quantitative, with just enough code to make the ideas concrete. We answer three questions:

  1. What exactly do we optimize? The causal language-modeling objective, its loss, and the train/inference gap it creates.
  2. What do we feed it? How a raw text corpus becomes packed token blocks: collection, deduplication, filtering, tokenization, and packing.
  3. How big should the model and dataset be? Scaling laws, the single most important quantitative tool for planning a training run.

Tomorrow, on Day 9, you apply all of this to actually train your own GPT.

Learning objectives

  1. State the causal language-modeling objective precisely; explain self-supervision, teacher forcing, and exposure bias.
  2. Compute and interpret cross-entropy loss, perplexity, and bits-per-byte — and know what values are "good".
  3. Read and diagnose a training loss curve: the steep early drop, the slow log decline, the entropy floor, and what a broken curve looks like.
  4. Describe the full pre-training data pipeline: collection, language ID, quality filtering, exact and near-duplicate dedup, decontamination, domain mixing, tokenization, and packing into fixed-length blocks.
  5. Explain Kaplan vs. Chinchilla scaling laws, the compute-optimal 20 tokens/param rule, and the inference-optimal override used by LLaMA-3.
  6. Derive the C ≈ 6ND rule and apply it to GPT-3 and to your Day 9 model.
The Objective

Predict the next token. That single task is the whole game.

A decoder-only LLM is trained with one objective: given all previous tokens, predict the next one. This is called causal (or autoregressive) language modeling. There are no labels to collect — the "label" for each position is simply the token that actually came next in the text. This is why pre-training is called self-supervised: the supervision signal is generated for free from raw text, with no human annotation required.

Formally, for a sequence of tokens x₁, x₂, …, x_T, the model defines a probability distribution over the whole sequence by factorizing it left to right:

    p(x₁, …, x_T) = ∏_{t=1}^{T} p(x_t | x₁, …, x_{t−1})

We maximize the log-likelihood of the training data, or equivalently minimize the average negative log-likelihood per token:

    L = − (1/T) · Σ_{t=1}^{T} log p(x_t | x_{<t})

That loss L is exactly cross-entropy between the model's predicted distribution and the one-hot "true next token". You implemented cross-entropy on Day 1 and used it on Days 3–4. Nothing new is needed — an LLM is a classifier with a vocabulary-sized output, run once per position, with the targets shifted by one.
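As a small numeric illustration of the two formulas above, the sketch below takes made-up per-position probabilities p(x_t | x_{<t}) (the values are hypothetical, not from any model) and checks that the log of the factorized sequence probability equals the sum of the per-token log-probabilities. The loss is then that sum divided by −T.

```python
import math

# Hypothetical probabilities the model assigned to each true next token.
step_probs = [0.20, 0.05, 0.60, 0.90]

# Chain rule: the sequence probability is the product of the per-step terms.
seq_prob = math.prod(step_probs)

# Loss: average negative log-likelihood per token, in nats.
mean_nll = -sum(math.log(p) for p in step_probs) / len(step_probs)

print(f"p(sequence) = {seq_prob:.4f}")
print(f"mean NLL    = {mean_nll:.3f} nats")
print(f"check       = {-math.log(seq_prob) / len(step_probs):.3f} nats")
```

Working with the sum of logs rather than the raw product is also what keeps the computation numerically stable for long sequences, where the product would underflow.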

The shift-by-one trick

In code, the targets are just the inputs shifted left by one position. If the model sees tokens [The, cat, sat, on, the], the targets are [cat, sat, on, the, mat] — at each position, the answer is simply the token that comes next. The causal mask you built on Day 6 stops position t from seeing anything past itself, so the model can never peek at the answer. One forward pass scores a prediction at every position simultaneously, and the loss averages over all of them. That's a massive data efficiency gain: a sequence of length T provides T−1 training examples per forward pass.

[Figure: At every position, predict the next token. Input row: The, cat, sat, on, the. Target row: cat, sat, on, the, mat. The target row is just the input row slid left by one; the label is free.]
Next-token prediction. Each input position is trained to predict the token immediately to its right, so the targets are simply the inputs shifted left by one. No human labels are involved — the text supervises itself.
import torch
import torch.nn.functional as F

# logits: (B, T, V) from the model; tokens: (B, T) input IDs
def lm_loss(logits, tokens):
    # The logit at position t (computed from tokens 0..t) predicts token t+1 -> drop last logit, skip first token.
    logits = logits[:, :-1, :]          # (B, T-1, V)
    targets = tokens[:, 1:]             # (B, T-1)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (B*(T-1), V)
        targets.reshape(-1),                    # (B*(T-1),)
    )

Teacher forcing vs. autoregressive generation — and the train/inference mismatch

During training the model receives the ground-truth previous tokens at every position, regardless of what it would have predicted. This technique is called teacher forcing. It makes training stable and parallelizable — every position in the sequence is scored in a single forward pass. But it creates a subtle divergence from inference time.

During inference there is no ground-truth sequence to condition on. The model generates token by token, feeding each predicted token back as the input to the next step. If the model makes a mistake at step t, that mistake becomes part of the input at step t+1, which can compound. This gap between training distribution (ground truth context) and inference distribution (model's own context) is called exposure bias.

In practice, pre-trained LLMs handle this remarkably well, thanks to scale and diverse data. But it is an important nuance for inference engineers. Techniques like scheduled sampling (sometimes feed the model's own prediction during training), nucleus sampling strategies to avoid error compounding, and speculative decoding (Day 23) all connect back to this train/inference mismatch. Understanding that training runs under teacher forcing is essential context for understanding generation quality issues you will debug later.

[Figure: two panels. Training (teacher forcing): ground-truth tokens are always fed as input, and one parallel forward pass yields a loss at each position. Efficient and stable, but the model never sees its own mistakes. Inference (autoregressive): the model's own output is fed back as the next input, sequentially, one token at a time; errors compound (exposure bias). Connects to speculative decoding (Day 23).]
Teacher forcing (left) feeds ground-truth tokens at every position, enabling a parallel forward pass. Autoregressive inference (right) feeds the model's own predictions, which can compound errors — this is exposure bias, a fundamental train/inference mismatch.
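The contrast can be made concrete with a toy stand-in for the model. In the sketch below, `BIGRAM` is a hypothetical lookup table of next-token probabilities (an assumption for illustration, not a Transformer), `teacher_forced_loss` scores every position against the ground-truth prefix, and `generate` feeds each prediction back in as the next input.

```python
import math

# Hypothetical toy "model": next-token probabilities given only the previous token.
BIGRAM = {
    "The": {"cat": 0.7, "dog": 0.3},
    "cat": {"sat": 0.6, "ran": 0.4},
    "dog": {"ran": 0.8, "sat": 0.2},
    "sat": {"on": 1.0},
    "ran": {"on": 1.0},
    "on":  {"the": 1.0},
}

def teacher_forced_loss(tokens):
    """Training view: every position is conditioned on the TRUE previous token."""
    nll = [-math.log(BIGRAM[prev][nxt]) for prev, nxt in zip(tokens[:-1], tokens[1:])]
    return sum(nll) / len(nll)

def generate(first_token, steps):
    """Inference view: greedy decoding, each output becomes the next input."""
    out = [first_token]
    for _ in range(steps):
        dist = BIGRAM[out[-1]]
        out.append(max(dist, key=dist.get))
    return out

truth = ["The", "dog", "ran", "on", "the"]
print("teacher-forced loss:", round(teacher_forced_loss(truth), 3))
print("model's own rollout:", generate("The", 4))
```

During training the toy model is conditioned on "dog" at the second step even though its own greedy choice would have been "cat". At inference it only ever sees its own choices, which is the distribution shift the text calls exposure bias.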

That is the entire training objective. Everything else today — data, scaling — is about doing this at scale, efficiently, with the right amount of data.

Measuring Progress

Loss, perplexity, bits-per-byte — three views of the same number.

The raw cross-entropy loss is in nats (natural-log units). Two derived metrics are easier to reason about, and one of them — bits-per-byte — is the go-to in scaling-law papers because it is tokenizer-independent.

Cross-entropy → perplexity

Perplexity is the exponential of the cross-entropy loss. Intuitively, it is the model's "effective branching factor" — the number of equally-likely next tokens the model is, on average, choosing between. A perplexity of 1 means perfect prediction; a perplexity equal to the vocabulary size means the model has learned nothing and is guessing uniformly. Any decent model on English text lands between these two extremes.

    perplexity = exp(L)        # L is the mean per-token cross-entropy in nats

Worked examples:

    uniform over V = 50 257 tokens          → L = ln(50257) ≈ 10.83 → ppl ≈ 50 257
    GPT-2 (117M), zero-shot on WikiText-2   → ppl ≈ 29.4
    GPT-2 (1.5B), zero-shot on WikiText-2   → ppl ≈ 18.3
    GPT-4-class models on English           → ppl ≈ 3–8
    theoretical English entropy floor       → ppl ≈ e^1.0 ≈ 2.7
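The conversion is a one-liner; a minimal sketch follows, with `perplexity` and `uniform_loss` as illustrative helper names. It reproduces the uniform-predictor baseline for a 50 257-token vocabulary.

```python
import math

def perplexity(loss_nats):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(loss_nats)

def uniform_loss(vocab_size):
    """Loss of a model that assigns probability 1/V to every token."""
    return math.log(vocab_size)

baseline = uniform_loss(50257)
print(f"uniform loss = {baseline:.2f} nats, perplexity = {perplexity(baseline):.0f}")
print(f"loss 3.38    -> perplexity = {perplexity(3.38):.1f}")
```

A freshly initialized model should start close to the uniform baseline, which makes `uniform_loss(V)` a handy first sanity check on a training run.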

Perplexity → bits-per-byte

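Perplexity depends on the tokenizer. A model with a vocabulary of 128 000 tokens splits the same text into fewer, longer tokens than one with 32 000, so each token carries more information and the per-token losses are not directly comparable: the large-vocabulary model can look worse per token even when it predicts the text equally well. To compare models fairly across tokenizers, researchers use bits-per-byte (bpb): convert the loss from nats to bits, then normalize by the number of raw UTF-8 bytes (not tokens). Because the byte count of a text does not depend on how it was tokenized, bpb is comparable across tokenizers.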

    bits-per-token = L / ln(2)                           # 1 nat = 1/ln(2) ≈ 1.443 bits
    bits-per-byte  = bits-per-token × (tokens / bytes)   # tokens/bytes is the compression ratio

Worked example (GPT-2, BPE, ~4 bytes/token average):

    L = 3.38 nats      → bits/token = 3.38 / 0.693 ≈ 4.88
    bytes/token ≈ 4.0  → bits/byte ≈ 4.88 / 4.0 ≈ 1.2

Reference points:

    optimal English compression (LLaMA-3 70B range)  ≈ 0.9–1.1 bpb
    Shannon's 1951 English estimate                  ≈ 1.0 bpb
    a mediocre model                                 ≈ 3–5 bpb

A language model that perfectly predicted English text would compress it to its true entropy — Shannon estimated English at roughly 1 bit per character in 1951. Modern LLMs get remarkably close, which is why "compression is intelligence" is more than a slogan: the pre-training objective is literally lossless compression of the training corpus.
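To see why bits-per-byte is comparable across tokenizers, the sketch below scores the same 4 000-byte text under two hypothetical tokenizers. The token counts and losses are made-up numbers, chosen so that both models assign the text the same total negative log-likelihood.

```python
import math

def bits_per_token(loss_nats):
    return loss_nats / math.log(2)

def bits_per_byte(loss_nats, n_tokens, n_bytes):
    """Total bits spent on the text, divided by its raw UTF-8 byte count."""
    return bits_per_token(loss_nats) * n_tokens / n_bytes

n_bytes = 4000                      # same underlying text for both models

# Tokenizer A: 1000 tokens at 3.38 nats/token (total 3380 nats).
bpb_a = bits_per_byte(3.38, n_tokens=1000, n_bytes=n_bytes)

# Tokenizer B: larger vocabulary, only 800 tokens, same total 3380 nats.
bpb_b = bits_per_byte(3380 / 800, n_tokens=800, n_bytes=n_bytes)

print(f"per-token loss: A = 3.38, B = {3380 / 800:.3f}")
print(f"bits-per-byte:  A = {bpb_a:.3f}, B = {bpb_b:.3f}")
```

The per-token losses differ, yet the two models predict the text equally well, and bits-per-byte reports exactly that.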

Reading the training loss curve

When you watch a training run, you stare at the loss curve. Understanding its shape helps you distinguish a healthy run from a broken one before you waste GPU-days.

[Figure: loss (nats, ticks at 10, 5, 2) against training steps (log scale). The healthy run shows a steep drop (token frequencies, bigrams), then a slow log decline (syntax, facts, long-range dependencies) toward the entropy floor. The broken run stays flat. Flat curve: check LR, data pipeline, target shift.]
A healthy loss curve drops steeply in early steps as the model learns token frequencies and common bigrams, then enters a long slow logarithmic decline as it acquires syntax, world knowledge, and long-range dependencies. It asymptotes toward the entropy floor of the data — the irreducible uncertainty in natural language. A flat curve from step zero usually means a broken learning rate, incorrect target shift, or a data pipeline bug.

Two specific danger signs to watch for:

  1. Loss stuck at ln(V) (the uniform-predictor baseline) from step zero: the gradient is not flowing, or the learning rate is zero.
  2. Loss that explodes or oscillates wildly: the learning rate is too high, or gradient clipping is not in place.

A slow but steady improvement, even a tiny one, is always a healthier sign than a flat curve.
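These checks are easy to automate on logged losses. The function below is a rough sketch under stated assumptions: the name `diagnose`, the 0.05-nat tolerance, and the 1.5× blow-up threshold are all illustrative choices, not established rules.

```python
import math

def diagnose(losses, vocab_size, tol=0.05, blowup=1.5):
    """Very rough health check on a list of logged training losses (nats)."""
    baseline = math.log(vocab_size)            # uniform-predictor loss ln(V)
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        return "diverged"
    if losses[-1] > blowup * losses[0]:        # hypothetical threshold
        return "exploding"
    if all(abs(x - baseline) < tol for x in losses):
        return "stuck at ln(V)"
    if losses[-1] < losses[0]:
        return "improving"
    return "flat"

print(diagnose([10.80, 10.83, 10.82], vocab_size=50257))
print(diagnose([10.80, 7.00, 5.10], vocab_size=50257))
print(diagnose([10.80, 12.00, 40.00], vocab_size=50257))
```

In a real run you would smooth the curve first and look at a window of recent steps rather than the endpoints, since single-step losses are noisy.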

The Data Pipeline

Raw web text in, packed token blocks out.

The objective is simple; the data work is where most of the real effort and most of the quality gains live. Frontier teams will tell you that data curation matters more than architecture tweaks. The pipeline that turns the open web into training-ready tensors has several stages.

[Figure: pipeline stages, left to right. Raw corpus (CC / books / code / …) → language ID + quality filter (fastText / classifier) → exact + near dedup (SHA-256 / MinHash) → decontamination (remove eval n-grams) → tokenize (BPE → IDs) → pack into blocks (concat + <eos>, reshape to (N, T)). Every stage reduces volume. Typical attrition: 100× raw → 1× clean tokens after dedup + filter. Mixture weights then upsample high-quality domains (code, books) and downsample low-quality ones (forums).]
The pre-training data pipeline. Raw crawl data is reduced by roughly 100× through language identification, quality filtering, and deduplication before a single token is used for training. The final packing step ensures every position in every block is a real token — no compute wasted on padding.

Collection and language identification

The corpus starts with web crawls (primarily Common Crawl), augmented by curated sources: Wikipedia, books, academic papers, and code from GitHub. Each contributes different characteristics — books supply long-range coherence; code supplies structured, exact reasoning; Wikipedia supplies factual density. The mix matters enormously for capability. Language identification (typically fastText) separates languages so you can apply per-language budgets; most frontier models weight English heavily and include dozens of other languages at lower proportions.

Quality filtering

Raw web data is overwhelmingly garbage: spam, boilerplate, auto-generated SEO text, adult content, and incoherent fragments. Filtering uses heuristics (minimum word count, ratio of alphabetic to total characters, perplexity against a reference model) and sometimes a quality classifier trained to distinguish Wikipedia-quality writing from forum posts. The C4 dataset pioneered this approach; FineWeb and Dolma are more recent examples with well-documented pipelines.
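Two of the heuristics named above can be sketched in a few lines. The function name and both thresholds (`min_words=50`, `min_alpha_ratio=0.7`) are assumptions for illustration, and the ratio here is taken over non-whitespace characters; production pipelines tune such values per language and source.

```python
def passes_heuristics(text, min_words=50, min_alpha_ratio=0.7):
    """Keep a document only if it is long enough and mostly alphabetic."""
    words = text.split()
    if len(words) < min_words:
        return False
    non_space = [c for c in text if not c.isspace()]
    if not non_space:
        return False
    alpha_ratio = sum(c.isalpha() for c in non_space) / len(non_space)
    return alpha_ratio >= min_alpha_ratio

prose = "The quick brown fox jumps over the lazy dog. " * 10
spam = "$$$ 1234 !!! 5678 ### " * 30

print(passes_heuristics(prose))          # long, mostly letters
print(passes_heuristics(spam))           # long, but almost no letters
print(passes_heuristics("hello world"))  # too short
```

Cheap rules like these run first because they are fast; classifier-based or perplexity-based filters are usually applied only to what survives.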

Deduplication — exact and near

Exact deduplication removes documents that hash identically (SHA-256 of the text). Easy and fast. Near-deduplication is harder but more important: the web is full of near-identical copies of news articles, lightly paraphrased boilerplate, and template pages. MinHash / LSH (locality-sensitive hashing) is the standard approach: represent each document as a set of character or token n-grams, generate a MinHash signature, and bucket signatures with LSH to find candidates for deduplication at scale. Lee et al. (2022) showed that deduplication consistently improves downstream benchmark performance, sometimes dramatically — duplicated data trains the model to memorize rather than generalize.
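Both ideas fit in a short standard-library sketch. `exact_key` is the SHA-256 check; `minhash_signature` builds a signature from character 5-grams using seeded BLAKE2b hashes as the hash family (an implementation choice made here for illustration). The LSH bucketing step that makes this scale is left out, and all function names are hypothetical.

```python
import hashlib

def exact_key(text):
    """Identical documents produce identical SHA-256 digests."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text, n=5):
    """Set of overlapping character n-grams."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

def minhash_signature(text, num_perm=64, n=5):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    grams = shingles(text, n)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots estimates the Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = doc_a + "!"                       # near-duplicate: one extra character
doc_c = "completely different sentence about gradient descent and GPUs"

sig_a, sig_b, sig_c = (minhash_signature(d) for d in (doc_a, doc_b, doc_c))
print("exact match a/b:", exact_key(doc_a) == exact_key(doc_b))
print("similarity a/b: ", estimated_jaccard(sig_a, sig_b))
print("similarity a/c: ", estimated_jaccard(sig_a, sig_c))
```

The exact hash misses the near-duplicate entirely, while the MinHash estimate flags it. At corpus scale the signatures are split into bands and bucketed with LSH, so only documents that collide in some band are compared.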

Decontamination

Before training, n-grams from held-out evaluation sets (MMLU, HellaSwag, HumanEval, etc.) are removed from training data to prevent benchmark contamination. Typically a 13-gram overlap is used as the threshold. This is often skipped for speed in small experiments but is mandatory for serious reported benchmarks.
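A minimal version of the n-gram check looks like this. It is a sketch: whitespace-split words stand in for tokens, the sample strings are invented, and the helper names are hypothetical. The 13-gram window follows the threshold mentioned above.

```python
def ngrams(tokens, n):
    """All contiguous n-token windows, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(doc_tokens, eval_ngrams, n=13):
    """True if the document shares at least one n-gram with the eval set."""
    return not ngrams(doc_tokens, n).isdisjoint(eval_ngrams)

# Hypothetical benchmark item (14 words, so it yields two 13-grams).
eval_item = ("what is the capital of france the capital of france "
             "is paris of course").split()
eval_ngrams = ngrams(eval_item, 13)

leaked = "some forum post quoting".split() + eval_item
clean = ("an unrelated document about packing token blocks for "
         "training runs on a laptop gpu today").split()

print(is_contaminated(leaked, eval_ngrams))
print(is_contaminated(clean, eval_ngrams))
```

Whether a flagged document is dropped whole or only the overlapping span is removed is a policy choice that varies between pipelines.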

Domain mixture and upsampling

After filtering, you are left with a corpus that is still mostly web text. But web text is not the highest-quality language for every capability you want. Practitioners upsample high-value domains (code, curated books, academic papers) beyond their natural frequency in the corpus, and downsample low-quality domains like comment threads and forums. Getting this mix right is a large source of differentiation between pre-trained models. The exact mix is often proprietary; open datasets publish theirs as a research contribution.
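Mechanically, a mixture is just a set of sampling probabilities. In the sketch below every number is hypothetical: the domain sizes, the multipliers, and the function names are invented to show the arithmetic, not taken from any published recipe.

```python
import random

def mixture_probs(domain_tokens, multipliers):
    """Scale each domain's natural size by its multiplier, then normalize."""
    weighted = {d: n * multipliers.get(d, 1.0) for d, n in domain_tokens.items()}
    total = sum(weighted.values())
    return {d: w / total for d, w in weighted.items()}

def sample_domain(probs, rng):
    """Pick the domain that the next training block is drawn from."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

# Hypothetical corpus, sizes in billions of tokens.
domain_tokens = {"web": 800, "forums": 100, "code": 50, "books": 50}
multipliers = {"code": 3.0, "books": 2.0, "forums": 0.5}   # web keeps 1.0

probs = mixture_probs(domain_tokens, multipliers)
for domain, p in probs.items():
    natural = domain_tokens[domain] / sum(domain_tokens.values())
    print(f"{domain:7s} natural {natural:.3f} -> sampled {p:.3f}")

print(sample_domain(probs, random.Random(0)))
```

A multiplier above 1 means blocks from that domain are seen more than once per pass over the corpus, which ties the mixture back to the repetition limits discussed under "How many epochs?".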

| Dataset | Size | Key trait | Dedup strategy |
|---|---|---|---|
| C4 (2020) | ~156B tokens | Quality-filtered Common Crawl; foundational | Heuristic + line-level |
| The Pile (2021) | ~825 GiB of text | 22 diverse domains; first principled mix | Fuzzy + exact per domain |
| RedPajama v2 (2023) | ~30T tokens (raw) | Reproducible, quality-annotated CC + curated | MinHash near-dup |
| FineWeb (2024) | ~15T tokens | Best-in-class CC cleaning; well-documented | MinHash + quality classifier |
| Dolma (2024) | ~3T tokens | Open, documented pipeline; OLMo's corpus | MinHash + content filtering |

Tokenization and packing

Each document is tokenized with the BPE tokenizer from Day 5. Then comes packing: rather than treating each document as one training example and padding to a fixed length (which wastes compute on pad tokens), the entire corpus is concatenated into a single long stream of token IDs, with a special <eos> / document-separator token between documents. That 1-D stream is then reshaped into (num_blocks, block_size). Every block is full, every token contributes to the loss, and a batch is just a random selection of blocks.

[Figure: variable-length documents (Doc A, 140 tokens; <eos>; Doc B, 200 tokens; <eos>; Doc C, continues …) are concatenated into one flat token stream, then reshaped into (N, T) blocks. Block 0 holds Doc A; block 1 holds the tail of A, an <eos>, and the head of B (a document boundary); blocks 2 and 3 hold Doc B and Doc C. Pseudocode shown in the figure:

    all_ids = concat([doc_tokens + [EOS_ID] for doc in corpus])
    blocks = torch.tensor(all_ids).reshape(-1, T)   # (N, T)

A document can span a block boundary, and that is fine: the model learns to recognize <eos> and reset its state. No padding, no wasted compute.]
Packing. Documents are concatenated with <eos> separators into one flat token stream, then reshaped into fixed-length blocks of size T. A document may straddle a block boundary (block 1 above), which is acceptable — the model sees the <eos> token and can learn to treat it as a context break. This approach eliminates all padding waste.
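A runnable version of the packing step, written with plain Python lists so it has no dependencies. The function name `pack` and the toy token IDs are assumptions. One detail the description glosses over is handled explicitly: the stream length is rarely a multiple of T, so the ragged tail is dropped before reshaping.

```python
def pack(docs_token_ids, eos_id, block_size):
    """Concatenate documents with <eos> separators and cut into full blocks."""
    stream = []
    for ids in docs_token_ids:
        stream.extend(ids)
        stream.append(eos_id)            # separator after every document
    n_blocks = len(stream) // block_size # drop the ragged tail
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

# Three toy "documents" of different lengths; 0 plays the role of <eos>.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

for block in pack(docs, eos_id=0, block_size=4):
    print(block)
```

The second block mixes the end of one document with the start of the next, which is the straddling case described above. With tensors, the same result comes from truncating the stream to `n_blocks * block_size` elements and reshaping to `(n_blocks, block_size)`.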

Document-boundary attention masking

A nuance: in the packed setup, token positions from two different documents sit in the same block. Without additional care, the causal mask allows the first token of Document B to attend to the last token of Document A, even though they are semantically unrelated. For most training runs this is accepted as an approximation and causes minimal harm. Some implementations, particularly those that care about very long context, insert a document-boundary attention mask that prevents cross-document attention within a block. This is more correct but adds implementation complexity and some attention-masking overhead.
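One way to build such a mask is to number the documents inside a block and allow attention only when two positions are in causal order and share a document number. The sketch below returns a nested list of booleans for readability; the function name is hypothetical, and treating each <eos> as the last token of the document it closes is an assumption.

```python
def document_causal_mask(block, eos_id):
    """mask[i][j] is True when position i may attend to position j."""
    doc_ids, current = [], 0
    for tok in block:
        doc_ids.append(current)
        if tok == eos_id:
            current += 1      # tokens after <eos> belong to the next document
    size = len(block)
    return [
        [j <= i and doc_ids[i] == doc_ids[j] for j in range(size)]
        for i in range(size)
    ]

# Block from a packed stream: end of one document, <eos> (id 0), start of the next.
mask = document_causal_mask([4, 5, 0, 6], eos_id=0)
for row in mask:
    print(["x" if allowed else "." for allowed in row])
```

The last row shows the effect: the first token of the new document attends only to itself, not to the three positions from the previous document.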

How many epochs?

A subtle but important point: frontier models often train for roughly one pass over their data, sometimes less. Because the corpus is so large (trillions of tokens), the model rarely sees the same token sequence twice. This is the opposite of the many-epochs regime from supervised deep learning, and it changes how you think about regularization. With deduplicated data and a single epoch, classic overfitting is much less of a concern than data quality and quantity.

There is a modern nuance called data-constrained scaling: what happens when you run out of unique data before reaching the compute-optimal token count? Recent work (Muennighoff et al., 2023) shows that repeating data a small number of times (~4× at most) causes only modest degradation, but heavy repetition is clearly harmful. This matters for anyone trying to scale up on a limited corpus — repeating is better than stopping, but not as good as fresh data.

Scaling Laws

Bigger and more data both help — predictably. Scaling laws quantify how.

The defining empirical discovery of the LLM era is that loss falls as a smooth power law in three quantities: model size N (parameters), dataset size D (tokens), and compute C (FLOPs). Plot loss against any of them on log-log axes and you get a straight line over many orders of magnitude. This is what makes large training runs plannable — you can fit the curve on small, cheap runs and extrapolate to predict the loss of a run you cannot yet afford.
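The "fit on small runs, extrapolate to large ones" workflow reduces to a straight-line fit in log-log space. The sketch below uses synthetic data generated from loss = 20 · N^(−0.1); those numbers are invented purely so the fit can be checked. A pure power law is itself a simplification, since published fits often add an irreducible-loss constant.

```python
import math

def fit_power_law(xs, losses):
    """Least-squares line through (log x, log loss).

    Returns (a, alpha) such that loss ≈ a * x ** (-alpha).
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in losses]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    slope = (sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
             / sum((u - mean_x) ** 2 for u in lx))
    a = math.exp(mean_y - slope * mean_x)
    return a, -slope

# Synthetic "small runs": parameter counts and their final losses.
params = [1e6, 3e6, 1e7, 3e7]
losses = [20 * n ** -0.1 for n in params]

a, alpha = fit_power_law(params, losses)
predicted = a * 1e9 ** (-alpha)          # extrapolate to a 1B-param run
print(f"a = {a:.2f}, alpha = {alpha:.3f}, predicted loss at 1B = {predicted:.2f}")
```

With real measurements the points scatter around the line, and the quality of the extrapolation depends on how many orders of magnitude the small runs cover.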

Kaplan et al. (2020) — scale works, and go big

OpenAI's original scaling-laws paper established the power-law form. Its key — and now controversial — claim: when compute is the bottleneck, you should spend most of it on a bigger model and comparatively less on more data. The paper showed that parameters scale more favorably than tokens at the compute-optimal frontier (given the experimental setup). This drove the "make it huge" era — GPT-3 at 175 billion parameters trained on only ~300 billion tokens. As a ratio that is roughly 1.7 tokens per parameter, far below the Chinchilla number.

Chinchilla (Hoffmann et al., 2022) — balance model and data

DeepMind revisited the question with a more careful experimental setup — running more controlled iso-FLOP experiments, fitting a richer functional form, and avoiding some statistical artifacts of the Kaplan setup. The conclusion: most large models of that era were significantly undertrained. For a fixed compute budget, model size and data should scale together, in roughly equal proportion. The famous rule of thumb:

Compute-optimal (Chinchilla rule):

    D ≈ 20 · N        (tokens ≈ 20 × parameters)

Examples:

    1B-param model   → ~20B tokens
    7B-param model   → ~140B tokens
    70B-param model  → ~1.4T tokens

Chinchilla itself: 70B params × 1.4T tokens. It beat the 280B-param Gopher, trained on only 300B tokens, using the same compute budget.

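The clean interpretation: training compute is proportional to N · D, so for a fixed FLOP budget, halving the model while doubling the data (N/2, 2D) costs the same as doubling the model while halving the data (2N, D/2). Both points lie on the same iso-FLOP curve, but they do not reach the same loss. The Chinchilla point is the one on that curve with the lowest loss.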

[Figure: iso-FLOP curves C₁, C₂, C₃ (dashed) on axes log params N vs. log tokens D, with the Chinchilla-optimal line D ≈ 20N. Marked points: GPT-3 (175B, 300B), undertrained (Kaplan era); Chinchilla (70B, 1.4T), compute-optimal; LLaMA-3 8B, inference-optimal at ~1800 tokens/param. Side panel, Kaplan vs Chinchilla: Kaplan (2020), big N, less D, ~1.7 tok/param. Chinchilla (2022), equal N and D, ~20 tok/param. Inference-optimal, small N, much more D, >100–1800 tok/param; cheaper to serve.]
Iso-FLOP curves (dashed) are curves of constant N · D, which plot as straight lines of slope −1 in log(N) vs. log(D) space. Kaplan-era sizing placed GPT-3 in the upper-left (big model, few tokens). Chinchilla's finding: the loss-minimizing point on each iso-FLOP curve lies along the diagonal D ≈ 20N. The inference-optimal regime (lower-right, small model heavily trained) is where LLaMA-3 and Mistral live: they trade extra training compute for cheaper serving.

The inference caveat — why LLaMA "overtrains"

Chinchilla optimizes for training compute. But a model you deploy is trained once and run billions of times. If you will serve it heavily, it pays to train a smaller model on far more data than compute-optimal — you spend extra FLOPs at training time to get a cheaper, faster, more memory-efficient model forever after. The marginal cost of extra training tokens is low compared to the cumulative inference cost of running a 2× larger model on every request.

This is exactly why LLaMA models are trained well past the 20× ratio: LLaMA-3 8B saw roughly 15 trillion tokens, about 1 875 tokens per parameter. Mistral 7B is generally assumed to follow the same recipe, although its training token count has not been published. For an inference engineer, this is the key takeaway: the model you serve was sized for serving, not for a training-compute leaderboard. You will be running a "Chinchilla-overtrained" model, which means its weights pack more knowledge per parameter than a Chinchilla-optimal equivalent, a good thing for serving cost.
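The trade-off can be put in numbers with a back-of-the-envelope accounting. This sketch assumes about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per served token (a forward pass only, ignoring attention and KV-cache costs). It compares compute only and says nothing about whether the two models reach the same quality. The function names are hypothetical.

```python
def lifetime_flops(n_params, train_tokens, served_tokens):
    """Training cost (6ND) plus inference cost (about 2N per served token)."""
    return 6 * n_params * train_tokens + 2 * n_params * served_tokens

def break_even_served_tokens(n_small, d_small, n_large, d_large):
    """Served-token count at which the small, heavily trained model becomes cheaper overall."""
    extra_training = 6 * n_small * d_small - 6 * n_large * d_large
    saving_per_token = 2 * (n_large - n_small)
    return extra_training / saving_per_token

# Chinchilla-style 70B on 1.4T tokens vs. LLaMA-3-8B-style 8B on 15T tokens.
large = (70e9, 1.4e12)
small = (8e9, 15e12)

tokens = break_even_served_tokens(*small, *large)
print(f"break-even after ~{tokens:.2e} served tokens")

for served in (0, 1e13):
    print(f"served {served:.0e}: "
          f"large {lifetime_flops(*large, served):.2e} FLOPs, "
          f"small {lifetime_flops(*small, served):.2e} FLOPs")
```

Under these assumptions the extra training cost of the small model is paid back after roughly a trillion served tokens, a volume a heavily used deployment would reach quickly.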

The FLOPs rule: C ≈ 6ND

You can estimate the total floating-point operations to train a dense Transformer with a beautifully simple rule: about 6 FLOPs per parameter per token. This factor breaks down into 2 for the forward-pass multiply-accumulate (one multiply + one add per weight, each time a token visits a parameter) and 4 for the backward pass, which is roughly twice the cost of the forward (it computes both the weight gradient and the input gradient).

[Figure: where do the 6 FLOPs per parameter per token come from? Forward pass: 2 FLOPs (1 multiply + 1 add per weight per token). Backward, weight gradient ∂L/∂W: 2 FLOPs. Backward, input gradient ∂L/∂x (for the previous layer): 2 FLOPs. Total ≈ 6 FLOPs per param per token → C ≈ 6 · N · D. Approximation: ignores embeddings, biases, norms. Accurate to within ~10% for large dense Transformers.]
The C ≈ 6ND rule decomposes as 2 FLOPs for the forward pass, 2 for the weight gradient (backward), and 2 for the input gradient (backward). The approximation ignores small terms (embedding layers, layer norms, biases) and is accurate to within ~10% for large dense Transformers.
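    C ≈ 6 · N · D        N = parameters, D = training tokens

GPT-3 sanity check (175B params, 300B tokens):

    C ≈ 6 × 175×10⁹ × 300×10⁹ = 3.15 × 10²³ FLOPs

This agrees with the ~3.14 × 10²³ FLOPs reported in the GPT-3 paper. As a wall-clock estimate on a hypothetical A100 cluster (FP16 peak ~3.12×10¹⁴ FLOP/s, realistic ~40% MFU, so ~1.25×10¹⁴ sustained):

    single GPU  ≈ 3.15×10²³ / 1.25×10¹⁴ ≈ 2.5×10⁹ s ≈ 80 GPU-years
    1024 A100s  ≈ 80 × 365 / 1024 ≈ 29 days

That is about a month on a thousand GPUs. GPT-3 itself was trained on V100s, so treat the wall-clock figure as an order-of-magnitude check, not a reproduction of the actual run.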

You will use this constantly: to sanity-check published training costs, to estimate how long your own run will take, and to decide what is even feasible on your hardware. Commit C ≈ 6ND and D ≈ 20N to memory — together they let you plan any training run on the back of an envelope.
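A small estimator makes the rule reusable. The function names are illustrative, and the hardware numbers (A100 FP16 peak of 3.12×10¹⁴ FLOP/s, 40% utilization, 1024 GPUs) are the same assumptions used in the GPT-3 sanity check.

```python
def train_flops(n_params, n_tokens):
    """C ≈ 6 · N · D for a dense Transformer."""
    return 6 * n_params * n_tokens

def wall_clock_days(flops, n_gpus, peak_flops_per_gpu, mfu):
    """Days of training at a given model FLOPs utilization (MFU)."""
    sustained = n_gpus * peak_flops_per_gpu * mfu
    return flops / sustained / 86_400

gpt3 = train_flops(175e9, 300e9)
days = wall_clock_days(gpt3, n_gpus=1024, peak_flops_per_gpu=3.12e14, mfu=0.40)
print(f"GPT-3: {gpt3:.2e} FLOPs, ~{days:.1f} days on 1024 A100s at 40% MFU")

tiny = train_flops(10e6, 200e6)          # the ~10M-param Day 9 model at D = 20N
print(f"Day 9 model: {tiny:.1e} FLOPs")
```

Swap in your own GPU's peak throughput and a utilization you have actually measured; MFU varies widely with model size, batch size, and kernel quality.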

The C ≈ 6ND rule counts multiply-accumulate ops in the weight matrices, not activations or normalization. For a mixture-of-experts model where only a fraction k/E of experts activate per token, the effective count scales down accordingly — this is one reason MoE models (like Mixtral) can serve at much lower inference FLOP cost than their total parameter count suggests.

Putting It Together

A worked plan for the model we train tomorrow.

Let's apply today's tools to the Day 9 build. We want a tiny GPT of roughly N ≈ 10M parameters. What does the theory say?

| Quantity | Formula | Day 9 value | Notes |
|---|---|---|---|
| Compute-optimal tokens | D = 20N | ~200M tokens | TinyShakespeare has only ~1.1M characters (one token each at char level), so we are in multi-epoch territory |
| Training FLOPs | C = 6ND | ~1.2×10¹⁶ | Minutes on a modern GPU; longer on a laptop |
| Expected final loss | Power-law extrapolation | ~1.4–1.6 nats | Over a 65-char vocabulary; ppl ~4–5 |
| Expected bits-per-byte | bpb = loss / ln(2) (1 byte per char token) | ~2.0–2.3 bpb | Higher than a BPE model; char-level is harder per byte |

The key pedagogical contrast: our tiny run is in a "small data, many epochs" regime — not the single-epoch frontier world described by Chinchilla. That is deliberate. Running in a regime we can actually afford on a laptop teaches the mechanics. Watching the loss curve hit those expected values tomorrow will confirm the theory is real, not just math on paper.

One more planning calculation: what model could a real compute budget buy? With a single H100-day (~3.5×10¹⁹ FLOPs at ~40% utilization), the Chinchilla-optimal model is roughly N ≈ 540M params trained on D ≈ 10.8B tokens (since N≈sqrt(C/120), D≈20N). A week of H100 time gives you roughly a 1.4B-param model on ~28B tokens. These are sanity-check anchors worth remembering.
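The calculation above follows from substituting D = r · N into C = 6ND, which gives N = sqrt(C / (6r)). A minimal sketch, with `chinchilla_optimal` and `tokens_per_param` as illustrative names and the ~3.5×10¹⁹ FLOPs per H100-day figure taken from the paragraph above:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = math.sqrt(compute_flops / (6 * tokens_per_param))
    return n_params, tokens_per_param * n_params

H100_DAY = 3.5e19    # assumed FLOPs per H100-day at ~40% utilization

for label, budget in [("1 H100-day", H100_DAY), ("1 H100-week", 7 * H100_DAY)]:
    n, d = chinchilla_optimal(budget)
    print(f"{label}: N ≈ {n / 1e6:.0f}M params, D ≈ {d / 1e9:.1f}B tokens")
```

Raising `tokens_per_param` above 20 moves the same budget toward a smaller, longer-trained model, which is the inference-optimal direction discussed earlier.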

Exercise

Seven exercises, all in the notebook.

Companion notebook: day-8-pretraining.ipynb.

  1. Implement the LM loss. Write the shift-by-one cross-entropy from scratch (no F.cross_entropy) and verify it matches F.cross_entropy on a random (B, T, V) logit tensor. Use an assertion to confirm they agree to 1e-5.
  2. Loss → perplexity → bits-per-byte. Write the three conversion functions. Confirm that a uniform distribution over V tokens gives loss ln(V) and perplexity V. Compute bpb for a loss = 3.38 model with average 4 bytes/token.
  3. Pack a dataset. Take a text file, tokenize it (char-level is fine), concatenate into one stream with <eos> between documents, and reshape into (N, T) blocks. Write a get_batch() that returns random (x, y) pairs with y shifted by one.
  4. Scaling-law plot. Train a few tiny models of increasing width on the same data, record final loss, and plot loss vs. parameter count on log-log axes. Observe the (rough) straight line.
  5. FLOPs estimator. Write train_flops(N, D) using 6ND. Reproduce the GPT-3 estimate from scratch. Compute the FLOPs for your Day 9 model and divide by your hardware's throughput to predict wall-clock time. Check tomorrow.
  6. Chinchilla calculator. Given a compute budget C, solve for the compute-optimal N and D. What model and token count does a single H100-day buy? What about a week?
  7. Inference-optimal trade-off. Re-run the Chinchilla calculator with tokens_per_param = 200 (LLaMA-style overtraining). For the same compute budget, how much smaller is the model? Estimate the inference speedup from halving the model's parameter count.
Self-Check

Ten questions before moving on.

Close the page and answer from memory. If you can't, re-read the relevant section.

  1. Write the causal LM objective as a factorized probability over a sequence. Why is it "self-supervised"?
  2. What is teacher forcing? What problem does it solve during training, and what problem does it create at inference?
  3. Define exposure bias. Name one technique that tries to address it.
  4. Convert a loss of L = 3.38 nats to perplexity and bits-per-token.
  5. Why do scaling-law papers prefer bits-per-byte over perplexity?
  6. What does a healthy training loss curve look like? Name two shapes that indicate a broken run.
  7. What is "packing" and why is it preferred over padding each document to a fixed length?
  8. State the Chinchilla rule of thumb. How does it differ from the Kaplan conclusion?
  9. Why do deployed models like LLaMA-3 8B train far past the compute-optimal token count?
  10. Estimate the training FLOPs for a 7B model trained on 2T tokens, and convert to H100 wall-clock time assuming 40% utilization.

"The model isn't memorizing the internet — it's compressing it. Compression to the entropy floor is the whole objective."

Day 8 · Pre-training
Further Reading

Go deeper.

Hand-picked references for objectives, data, and scaling laws.

Paper · 2020

Kaplan et al. — Scaling Laws for Neural LMs

The original power-law paper. Establishes loss vs. N, D, C and the "go big" conclusion.

Open paper
Paper · 2022

Hoffmann et al. — Chinchilla

Compute-optimal training. The 20 tokens/param rule and why Gopher was undertrained.

Open paper
Paper · 2020

Brown et al. — GPT-3

175B params, 300B tokens. The "make it huge" data point, and the scale-to-capability story.

Open paper
Dataset · 2024

FineWeb & the FineWeb report

A modern, well-documented open pre-training dataset and its full curation recipe.

Read report
Dataset · 2020

Gao et al. — The Pile

800 GB diverse text across 22 domains. A canonical, well-described open corpus.

Open paper
Video · Karpathy

Deep Dive into LLMs like ChatGPT

3.5-hour overview; the pre-training and data sections are the best free explainer available.

Watch on YouTube
Course · Stanford

CS336 — LMs from Scratch

Data and scaling lectures go far deeper than we can here. Assignment 2 implements the full pipeline.

Open course
Paper · 2022

Lee et al. — Deduplicating Training Data

Why dedup makes models better. Concrete, measurable gains. The MinHash/LSH methodology.

Open paper
Paper · 2023

Muennighoff et al. — Scaling Data-Constrained LMs

What happens when you repeat data. Up to ~4× repeats is survivable; beyond that, quality degrades sharply.

Open paper
Dataset · 2024

Soldaini et al. — Dolma

The OLMo pre-training corpus. Fully open pipeline documentation — a rare and valuable reference.

Open paper