LLM Inference Engineer · Day 14
Day 14 · Week 2 · Training & Architectures

Alignment: RLHF, PPO, DPO

A capable pretrained + fine-tuned model still isn't necessarily helpful, honest, or harmless — because next-token likelihood says nothing about which response a human would prefer. Alignment is the final training stage that teaches a model to distinguish a good answer from a merely fluent one. Today you build intuition for reward modeling and the Bradley-Terry objective, understand why RLHF with PPO works and why it's brutally complex, then derive DPO — the elegant shortcut that collapses a four-model training rig into a single supervised loss. This is the Week 2 capstone.

Time: ~150 min
Difficulty: Hard
Prerequisite: Day 13
Why This Lesson

SFT teaches the model what an answer looks like. Alignment teaches it which answer is better.

The gap between capability and preference

After pretraining, a language model can generate plausible text about almost any topic. After supervised fine-tuning (Day 13), it can follow instructions and produce well-formatted responses. What it cannot do yet is distinguish a response that is truly helpful from one that is merely fluent, or distinguish a response that is honest from one that sounds confident but is wrong. The objective it was trained on — maximize next-token log-likelihood on human-curated demonstrations — gives it no direct signal about this.

Consider two responses to "What is the capital of Australia?" Both might be grammatically perfect. One says "Canberra." The other says "Sydney — it is the largest and most famous city." The second is wrong, but a next-token model trained on scraped text might assign it equal or higher probability because "Sydney" appears near "Australia" far more often in natural text. Likelihood and preference are not the same thing. Alignment is the engineering discipline that closes that gap.

Why this matters for inference engineers

You will rarely run alignment yourself — it is expensive, slow, and usually done once by the model's creator. But you must understand what the model you serve has been through:

  • An aligned model carries a learned notion of preference that shapes every generation, including refusals, hedges, and format choices.
  • The system prompt is the primary lever to activate or redirect that preference. Understanding alignment helps you write better system prompts.
  • Behavioral guardrails are a product of alignment, not magic. They can be incomplete, inconsistent, or miscalibrated — knowing why helps you debug generation quality issues.
  • The reference model is needed only at training time. At inference, you serve just the policy. Understanding this clarifies what "the model" is and why swapping system prompts doesn't change the weights.

Learning objectives

  1. Articulate why next-token likelihood is insufficient for alignment and why pairwise preferences are a better training signal.
  2. Lay out the three-stage pipeline: pretrain → SFT → preference optimization.
  3. Explain preference data collection: what a (prompt, chosen, rejected) triple looks like and why comparisons beat ratings.
  4. Implement and explain the Bradley-Terry reward-model objective.
  5. Describe RLHF with PPO: the KL penalty, reward hacking, and the four-model training rig.
  6. Derive the intuition behind DPO and write its loss with a plain reading.
  7. Contrast RLHF/PPO with DPO on complexity, stability, compute, and memory.
  8. Place RLAIF, Constitutional AI, GRPO, and DPO variants (IPO, KTO) as variations on the same theme.
  9. Connect alignment to inference: what the served model is, and how alignment affects decoding.
The Alignment Problem

Pretrained models are capable but uncalibrated. Likelihood ≠ preference.

A pretrained model learns the distribution of text on the internet. That distribution includes helpful answers, wrong answers, spam, sycophancy, toxic content, and everything in between. The model has no internal notion that one of these is better than another — it only knows which continuations are probable. SFT narrows the distribution to the format of good responses, but it still does not teach the model to rank candidate outputs.

The alignment problem, stated precisely: given a prompt x, there are many valid continuations. We want the model's generation policy to prefer the continuations a thoughtful human would prefer, the goal Anthropic's researchers summarized as helpful, honest, and harmless (HHH). We need a training signal for this, and raw log-likelihood on demonstrations doesn't provide it because demonstrations only teach "what", not "which."

Why not just label more demonstrations?

A natural first instinct is to simply collect more high-quality demonstrations and fine-tune on them. This works partially — it is what SFT does. But it fails to scale for two reasons. First, writing a gold-standard response is expensive and slow; asking a human to compare two candidate responses is much faster and surprisingly consistent. Second, comparison is easier to do correctly: humans disagree on what an ideal answer looks like, but they agree much more on which of two given answers is better. Pairwise comparisons give denser signal per human-hour.

[Figure: the alignment gap. Left panel, pretrained + SFT model (maximizes P(next token | context)): fluent and on-format, but may be verbose or evasive, may be confidently wrong, may be sycophantic, and has no sense of "better". Right panel, aligned model (policy, maximizes E[r(x, y)] − β·KL): fluent and on-format, appropriately concise, honest about uncertainty, calibrated refusals, prefers responses humans prefer.]
The alignment gap. A pretrained + SFT model is fluent but has no notion of which response is better. Alignment trains the model to prefer responses that score high on human preferences, while keeping it from drifting into incoherence (the KL anchor).
The Pipeline

Three stages: pretrain, imitate, then learn what's preferred.

The canonical alignment recipe, introduced at scale by InstructGPT (Ouyang et al. 2022), has three stages. You already built the first two.

[Figure: the pipeline. Stage 0, Pretrain (Days 8–9): next-token prediction on trillions of tokens of web, books, and code. Stage 1, SFT (Day 13): imitate 10k–100k (prompt, response) demonstrations. Stage 2, Reward model (today, RLHF path): learn preferences (chosen ≻ rejected) with the Bradley-Terry loss. Stage 3, Policy optimization (today, main topic): maximize reward − β·KL with PPO or DPO; outputs the served policy. The DPO shortcut collapses stages 2 and 3 into one supervised loss.]
The four-stage full pipeline (stages 0–3). Classic RLHF trains an explicit reward model then optimizes with PPO. DPO (dashed gold arrow) fuses stages 2 and 3 into a single supervised loss on preference pairs — no separate reward model, no RL loop.
  1. Pretraining (Days 8–9). Train on trillions of tokens with next-token prediction. Builds world knowledge, language modeling, and zero-shot capabilities. The model at this stage is powerful but raw.
  2. Supervised fine-tuning / SFT (Day 13). Fine-tune on a curated set of (instruction, ideal response) demonstrations, often 10k–100k pairs. Teaches the model the format and style of helpful responses. Still no notion of ranking.
  3. Reward modeling + policy optimization (today). Collect human (or AI) pairwise comparisons. Train a reward model on them. Then update the language model to maximize that reward — either with PPO (RLHF) or directly with DPO. The output of this stage is the model you serve.
Preference Data

Humans rank pairs. The dataset is (prompt, chosen, rejected) triples.

Asking a human for an absolute quality score on a scale of 1–10 is noisy and poorly calibrated — one annotator's 7 is another's 5. Asking them to compare two responses — "is A better than B?" — is far more reliable and faster. So preference datasets are collections of triples: a prompt x, a chosen response y_w (the winner, "w" for "winning"), and a rejected response y_l (the loser, "l" for "losing").

[Figure: a preference triple. Prompt x: "Explain gradient descent in one sentence." Two responses are generated. Chosen y_w: "Gradient descent iteratively adjusts parameters in the direction that reduces the loss most steeply." (concise, accurate, one sentence; the human prefers this response). Rejected y_l: "Gradient descent is an optimization algorithm. It works by computing gradients. Gradients tell us the slope. We can then..." (verbose, repetitive, misses the instruction; the human disprefers this response). One triple = (x, y_w, y_l); datasets contain 10k–300k such triples.]
A preference data triple. The annotator sees both responses and marks one as preferred. Notice the instruction says "one sentence" — the rejected response violates it. This kind of violation is easy for humans to judge even when they couldn't write a perfect answer themselves.
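As an illustration of the data shape, here is a minimal sketch of how one annotator judgment could be stored as a triple. The class name `PreferenceTriple` and the helper `from_comparison` are hypothetical names chosen for this example, not the schema of any particular dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferenceTriple:
    """One preference record (hypothetical schema for illustration)."""
    prompt: str    # x
    chosen: str    # y_w, the response the annotator preferred
    rejected: str  # y_l, the response the annotator dispreferred


def from_comparison(prompt, response_a, response_b, a_is_better):
    """Turn one pairwise judgment ("is A better than B?") into a triple."""
    if a_is_better:
        return PreferenceTriple(prompt, response_a, response_b)
    return PreferenceTriple(prompt, response_b, response_a)


triple = from_comparison(
    prompt="Explain gradient descent in one sentence.",
    response_a="Gradient descent is an optimization algorithm. It works by computing gradients.",
    response_b="Gradient descent iteratively adjusts parameters in the direction "
               "that reduces the loss most steeply.",
    a_is_better=False,  # the annotator picked B: it follows the one-sentence instruction
)
print(triple.chosen)
```

Note that the record stores only the outcome of the comparison, not a score: the annotator never has to say how much better the chosen response is.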

In practice, labelers are given rubrics that operationalize helpfulness, harmlessness, and honesty. Responses are compared on criteria like accuracy, instruction-following, appropriate length, and safety. Multiple annotators often rate the same pair, and inter-annotator agreement is tracked. Some datasets (Anthropic HH, OpenAI WebGPT comparisons, UltraFeedback) are public. The quality and coverage of this preference data is the single biggest lever on final model quality — more than the specific algorithm used in stage 3.

Increasingly, AI models (strong frontier LLMs) serve as labelers — this is RLAIF (Reinforcement Learning from AI Feedback). This scales data collection by orders of magnitude but introduces a circularity risk: you are using one model's preferences to train another. Careful calibration and human spot-checking are essential.

Reward Modeling

Train a model to score responses. The Bradley-Terry objective.

The reward model r_φ(x, y) is a neural network that takes a (prompt, response) pair and outputs a single scalar — the "quality" of that response to that prompt. It is usually the SFT model with its language-modeling head (the unembedding matrix over vocabulary) replaced by a single linear projection to a scalar. This gives it a head start: it already understands language and instruction-following before it learns to rank.
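The head swap can be sketched in a few lines. This toy version assumes the backbone has already produced one hidden vector per token (random numbers stand in for it here) and that the reward is read from the final token's hidden state; both are assumptions made for illustration, and `reward_head` is a hypothetical name.

```python
import random


def reward_head(hidden_states, w, b):
    """Project the final token's hidden state to a single scalar reward.

    hidden_states: list of per-token vectors (stand-in for backbone output).
    w, b: weights and bias of the single linear projection replacing the LM head.
    """
    last = hidden_states[-1]  # assumption: pool at the last position (e.g. end-of-sequence)
    return sum(wi * hi for wi, hi in zip(w, last)) + b


random.seed(0)
d_model = 8
hidden = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(5)]  # 5 fake tokens
w = [random.gauss(0, 0.1) for _ in range(d_model)]

r = reward_head(hidden, w, b=0.0)
print(type(r).__name__)  # one float per (prompt, response), not a vocabulary-sized vector
```

The language-modeling head would map the same hidden vector to one logit per vocabulary entry; the reward head maps it to one number.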

The Bradley-Terry preference model

The Bradley-Terry model is a classical statistical model for pairwise comparisons. Its key assertion: the probability that y_w beats y_l in a comparison depends only on the difference of their underlying "strength" scores. Applied to rewards:

P(y_w ≻ y_l | x) = σ( r(x, y_w) − r(x, y_l) )

where σ(z) = 1 / (1 + e^{−z})   [the logistic sigmoid]

Worked example:
  r(x, y_w) = 2.0,  r(x, y_l) = 0.5
  difference = 1.5
  P(y_w wins) = σ(1.5) = 1 / (1 + e^{−1.5}) ≈ 0.818

So the reward model is 82% confident the chosen response is better. If the margin is 0, it is 50%: indifferent. If negative, it prefers the "rejected" one.
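The worked example can be checked numerically. This is a small sketch; `bt_prob` is a name chosen for illustration.

```python
import math


def bt_prob(r_w, r_l):
    """P(y_w beats y_l) under Bradley-Terry: sigmoid of the reward margin."""
    return 1.0 / (1.0 + math.exp(-(r_w - r_l)))


print(round(bt_prob(2.0, 0.5), 3))    # margin 1.5 -> 0.818
print(bt_prob(1.0, 1.0))              # margin 0   -> 0.5
print(round(bt_prob(12.0, 10.5), 3))  # same margin, shifted rewards -> 0.818
```

The last line shows why only the difference matters: adding the same constant to both rewards leaves the predicted preference unchanged, so reward values are meaningful only relative to each other.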

The reward-model loss maximizes this probability over all training triples — equivalently, minimizes the negative log-probability:

L_RM(φ) = − E_{(x, y_w, y_l) ~ D}[ log σ( r_φ(x, y_w) − r_φ(x, y_l) ) ]

This is binary cross-entropy applied to the reward difference. The loss pushes r(x, y_w) − r(x, y_l) to be large and positive for every training triple.
[Figure: reward model forward pass. (prompt + y_w) and (prompt + y_l) each pass through the same SFT-backbone Transformer; the [EOS] embedding feeds a scalar head, giving r_w = 2.0 and r_l = 0.5. Δ = 1.5, σ(1.5) ≈ 0.82, loss = −log σ(r_w − r_l) = −log σ(1.5) ≈ 0.20. Backprop pushes r_w up and r_l down until the margin is large. Both paths share the same Transformer weights and are scored in two forward passes.]
Reward model forward pass. The same SFT-backbone transformer processes (prompt + chosen) and (prompt + rejected) separately, outputting scalars r_w and r_l. The Bradley-Terry loss minimizes −log σ(r_w − r_l), pushing the chosen score higher and the rejected score lower on every training step.
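A minimal sketch of the loss for a single pair, together with its gradient with respect to the two scores. It is written as softplus(−margin), which is algebraically equal to −log σ(margin) but avoids overflow for large negative margins. The function names are illustrative.

```python
import math


def bt_loss(r_chosen, r_rejected):
    """-log sigmoid(margin), computed as softplus(-margin) for numerical stability."""
    m = r_chosen - r_rejected
    # softplus(-m) = log(1 + exp(-m)) = max(-m, 0) + log1p(exp(-|m|))
    return max(-m, 0.0) + math.log1p(math.exp(-abs(m)))


def bt_grad(r_chosen, r_rejected):
    """Gradient of the loss w.r.t. (r_chosen, r_rejected), for moderate margins."""
    m = r_chosen - r_rejected
    p_wrong = 1.0 / (1.0 + math.exp(m))  # sigmoid(-m): probability mass on the wrong order
    return -p_wrong, p_wrong


print(round(bt_loss(2.0, 0.5), 2))  # margin 1.5 -> 0.2
print(round(bt_loss(0.0, 0.0), 4))  # margin 0   -> 0.6931 (ln 2)
g_chosen, g_rejected = bt_grad(2.0, 0.5)
print(g_chosen < 0 < g_rejected)    # descent raises r_chosen, lowers r_rejected -> True
```

The gradient magnitude is σ(−margin): pairs the model already ranks confidently contribute almost nothing, while misranked pairs dominate the update.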

After training, the reward model is a fully automatable proxy for human judgment. Feed it any (prompt, response) pair and get back a quality scalar in milliseconds — far faster than asking a human. This is what makes stage 3 possible: you can generate millions of responses from the evolving policy and score them without a human in the loop. The reward model's fidelity to human preferences is the bottleneck for alignment quality, which is why preference data curation is so carefully managed at frontier labs.

The reward model sees the same text as the language model but predicts one number instead of a distribution over tokens. This architectural simplicity is the point — it collapses the complexity of preference into a single comparable score. The famous Goodhart's Law applies immediately: the reward model is only a proxy for human preference. Any model trained to maximize a proxy will eventually find edge cases where high proxy score ≠ high actual quality. This is the origin of reward hacking.

RLHF with PPO

Treat generation as RL: the LLM is a policy, the reward model is the environment.

With a reward model trained, stage 3 of RLHF frames text generation as a reinforcement learning problem. The language model is the policy π_θ. Generating a response is taking a sequence of actions — one token at a time — each drawn from the policy's probability distribution. The episode ends when the response is complete; the reward model then scores the entire response and returns a scalar reward. We want to optimize θ so the expected reward increases.

The PPO objective with KL penalty

Using raw RL (REINFORCE or vanilla policy gradient) on language models is disastrously unstable — the policy wanders into incoherence chasing marginal reward improvements. Proximal Policy Optimization (PPO) addresses this by clipping large policy updates so no single gradient step moves too far. But for LLMs there is a second, crucial stabilizer: the KL penalty:

Maximize  J(θ) = E_{x ~ D, y ~ π_θ(·|x)}[ r_φ(x, y) ] − β · KL( π_θ(y|x) ‖ π_ref(y|x) )

  π_θ   = the policy we are training (the LLM)
  π_ref = the frozen SFT model (the reference)
  r_φ   = the trained reward model
  β     = KL coefficient (typically 0.01–0.1)
  KL(P‖Q) = Σ_y P(y) log[ P(y) / Q(y) ]   [per-token, summed over the sequence]

Worked example:
  r_φ(x, y) = 3.2,  KL = 2.1,  β = 0.1
  J = 3.2 − 0.1 · 2.1 = 3.2 − 0.21 = 2.99

The policy gains from high reward but is penalized for drifting from the reference.

The KL term is estimated per token on the sampled response, summed over the sequence, scaled by β, and subtracted from the reward-model score. In practice the total reward for a complete response is:

R(x, y) = r_φ(x, y) − β · Σ_{t=1}^{T} log[ π_θ(y_t | x, y_{<t}) / π_ref(y_t | x, y_{<t}) ]
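The formula translates directly into code. In this sketch the per-token log-probabilities are hypothetical numbers chosen so that the summed log-ratio is 2.1, matching the worked example with r_φ = 3.2 and β = 0.1; `kl_shaped_reward` is an illustrative name.

```python
def kl_shaped_reward(rm_score, policy_logps, ref_logps, beta):
    """R(x, y) = r_phi(x, y) - beta * sum_t [log pi_theta(y_t|...) - log pi_ref(y_t|...)]."""
    assert len(policy_logps) == len(ref_logps)
    kl_estimate = sum(p - q for p, q in zip(policy_logps, ref_logps))
    return rm_score - beta * kl_estimate


# Hypothetical per-token log-probs for a 3-token sampled response.
policy_logps = [-0.2, -0.5, -0.1]  # log pi_theta(y_t | x, y_<t)
ref_logps = [-0.9, -1.2, -0.8]     # log pi_ref(y_t | x, y_<t)

R = kl_shaped_reward(rm_score=3.2, policy_logps=policy_logps, ref_logps=ref_logps, beta=0.1)
print(round(R, 2))  # 3.2 - 0.1 * 2.1 = 2.99
```

Because the response was sampled from the policy, the summed log-ratio is a single-sample estimate of the sequence-level KL rather than its exact value.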

Why the KL term is non-negotiable

Without the KL penalty, the policy will reward-hack. The reward model is an imperfect proxy — it has blind spots, biases, and systematic errors introduced by the preference labelers. A relentless optimizer like PPO discovers those blind spots and exploits them. Classic reward-hacking failure modes:

  • Length bias. If the reward model has a latent preference for longer responses (because humans often perceive longer as more thorough), the policy learns to pad responses with content-free repetition.
  • Sycophancy. If human labelers preferred responses that agreed with their (sometimes wrong) prior beliefs, the policy learns to tell users what they want to hear.
  • Safe non-answers. Refusals are always "safe" in the sense that they are unlikely to contain factual errors. If refusal is rewarded over an uncertain answer, the policy learns to refuse too broadly.

The KL penalty confines the policy to a neighborhood of the SFT reference. Within that neighborhood, real improvements are possible (higher reward without dramatic drift). Outside it, the policy is penalized — which prevents it from exploiting reward-model blind spots that exist far from the sensible SFT distribution.
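A toy calculation makes the effect concrete. The sketch assumes only three candidate responses exist for a prompt, a reference that rarely pads, and a flawed reward that favors the padded answer; all numbers are invented for illustration.

```python
import math


def objective(policy, ref, rewards, beta):
    """J = E_policy[r] - beta * KL(policy || ref) over a small discrete response set."""
    expected_r = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref) if p > 0)
    return expected_r - beta * kl


# Three hypothetical candidate responses: concise, average, padded.
ref = [0.50, 0.45, 0.05]     # the SFT reference rarely pads
rewards = [1.0, 1.2, 3.0]    # flawed reward model with a length bias
hacked = [0.01, 0.01, 0.98]  # a policy that almost always pads

for beta in (0.0, 1.0):
    j_ref = objective(ref, ref, rewards, beta)
    j_hacked = objective(hacked, ref, rewards, beta)
    print(f"beta={beta}: stay at reference J={j_ref:.2f}, always pad J={j_hacked:.2f}")
```

With β = 0 the padding policy scores far higher than staying at the reference, so an optimizer is pulled toward it. With β = 1 the KL cost of moving almost all probability onto a response the reference considers unlikely outweighs the extra reward, and the padding policy scores lower than the reference.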

[Figure: the RLHF/PPO training loop. A prompt x from the dataset goes to the policy π_θ, which generates y ~ π_θ(·|x) and is updated by PPO. The reward model r_φ (frozen after stage 2) scores the response. The reference π_ref (frozen SFT model, never updated) is used to compute KL( π_θ ‖ π_ref ). The critic V (value network) is the fourth model. The PPO update turns reward − β·KL into a clipped gradient update to π_θ. Four models are in memory simultaneously: policy, reference, reward model, critic.]
The RLHF/PPO training loop. The policy generates responses on-policy; the reward model scores them; the PPO update uses the reward minus a KL penalty relative to the frozen reference. A separate critic (value) network estimates baselines. All four models must fit in GPU memory simultaneously — the primary reason RLHF is expensive.
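The "PPO clip" in the update can be sketched for a single token. This follows the clipped surrogate from Schulman et al.; the clip range of 0.2 is an assumed default, and the advantage would come from the critic in a real rig rather than being passed in by hand.

```python
import math


def ppo_clip_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate for one token: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    ratio = math.exp(logp_new - logp_old)  # pi_theta(a) / pi_old(a)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


# Good action (A > 0): credit is capped once the ratio exceeds 1 + eps.
print(round(ppo_clip_term(math.log(1.5), 0.0, advantage=1.0), 3))   # 1.2, not 1.5
# Bad action (A < 0) made more likely: the full penalty is kept.
print(round(ppo_clip_term(math.log(1.5), 0.0, advantage=-1.0), 3))  # -1.5
# Bad action already made much less likely: no further incentive below 1 - eps.
print(round(ppo_clip_term(math.log(0.5), 0.0, advantage=-1.0), 3))  # -0.8
```

Taking the minimum makes the objective pessimistic: the policy gets no extra credit for moving far from the sampling policy in a single step, but it is still fully penalized for moving in the wrong direction.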

Operational complexity

RLHF with PPO is operationally brutal. Four models must fit in GPU memory simultaneously: the policy π_θ (being trained), the frozen reference π_ref, the frozen reward model r_φ, and a separate critic / value network V_ψ used to estimate advantage baselines for PPO. For a 7B-parameter model, that is roughly 4× the weight memory of a single model, before counting gradients and optimizer state for the two models being trained (policy and critic). The RL loop is notoriously hyperparameter-sensitive: KL coefficient β, PPO clip ratio, learning rate schedule, and rollout length all interact in nonlinear ways. Training instabilities, reward hacking, and mode collapse are common failure modes. This is the most finicky part of the entire LLM stack.

DPO

Skip the reward model and the RL loop. Optimize preferences directly.

Direct Preference Optimization (DPO, Rafailov et al. 2023) is the result that reshaped alignment. The key insight: the RLHF objective — maximize reward under a KL constraint — has a closed-form optimal policy. Given a reward function r and reference model π_ref, the policy that maximizes the constrained objective is:

π*(y|x) ∝ π_ref(y|x) · exp( r(x,y) / β )

This is the Boltzmann/softmax distribution over responses, weighted by reward. The key: we can rearrange this to express r in terms of π and π_ref:

r(x,y) = β · log[ π*(y|x) / π_ref(y|x) ] + β · log Z(x)

where Z(x) is a partition function (the normalizing constant, same for y_w and y_l). Substituting into the Bradley-Terry loss, where Z(x) cancels, gives:

L_DPO(θ) = − E_{(x, y_w, y_l) ~ D}[ log σ( β · ( log[ π_θ(y_w|x) / π_ref(y_w|x) ] − log[ π_θ(y_l|x) / π_ref(y_l|x) ] ) ) ]

  log[ π_θ(y_w|x) / π_ref(y_w|x) ] = log-ratio of policy vs ref on CHOSEN
  log[ π_θ(y_l|x) / π_ref(y_l|x) ] = log-ratio of policy vs ref on REJECTED
  β = temperature / KL anchor strength
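The cancellation of Z(x) can be verified numerically on a toy problem. The sketch assumes a prompt with only three possible responses, so the partition function can be summed exactly; the probabilities and rewards are invented for illustration.

```python
import math


def optimal_policy(ref_probs, rewards, beta):
    """pi*(y|x) proportional to pi_ref(y|x) * exp(r(x,y) / beta), on a small response set."""
    unnorm = [q * math.exp(r / beta) for q, r in zip(ref_probs, rewards)]
    z = sum(unnorm)  # partition function Z(x)
    return [u / z for u in unnorm], z


ref_probs = [0.5, 0.3, 0.2]  # hypothetical pi_ref over three responses
rewards = [2.0, 0.5, -1.0]   # hypothetical r(x, y)
beta = 0.5

pi_star, z = optimal_policy(ref_probs, rewards, beta)

# Implicit reward beta * log(pi*/pi_ref) equals r - beta * log Z for every response.
implicit = [beta * math.log(p / q) for p, q in zip(pi_star, ref_probs)]

# Differences of implicit rewards recover differences of true rewards: Z(x) drops out.
print(round(implicit[0] - implicit[1], 6), rewards[0] - rewards[1])  # 1.5 1.5
```

Each implicit reward is off from the true reward by the same constant β · log Z(x), and Bradley-Terry only ever looks at differences, which is why the loss never needs Z(x).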

Plain reading: the DPO loss pushes the policy to assign relatively higher probability (relative to the reference) to the chosen response and relatively lower probability to the rejected response. The reference model provides the baseline. The β parameter plays the same KL-anchoring role as in PPO — it is now baked directly into the loss rather than added as a separate term. The "implicit reward" of a DPO-trained policy is β · log(π_θ / π_ref).
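For a single preference pair the loss is a few lines of arithmetic on sequence-level log-probabilities. This is a minimal scalar sketch, not a batched training implementation; the argument names follow the exercise signature used later in this lesson, and the log-prob values are hypothetical.

```python
import math


def dpo_loss(pol_chosen_logp, pol_rejected_logp, ref_chosen_logp, ref_rejected_logp, beta):
    """DPO loss for one pair, from summed (sequence-level) log-probabilities."""
    chosen_logratio = pol_chosen_logp - ref_chosen_logp        # log pi_theta/pi_ref on y_w
    rejected_logratio = pol_rejected_logp - ref_rejected_logp  # log pi_theta/pi_ref on y_l
    z = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(z) = softplus(-z), computed stably
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: both log-ratios are 0, loss is ln 2.
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0, beta=0.1), 4))  # 0.6931
# Policy raised the chosen response and lowered the rejected one: loss falls.
print(round(dpo_loss(-9.0, -13.0, -10.0, -12.0, beta=0.1), 3))   # 0.598
```

Only the log-ratios enter the loss, so a response the reference already considers likely earns no credit unless the policy makes it more likely than the reference does.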

What DPO eliminates

Compared with RLHF/PPO, DPO needs:

  • No separate reward model. The reward is implicit in the policy ratio. Stage 2 of the RLHF pipeline disappears entirely.
  • No on-policy sampling. The preference pairs are fixed training data, just like SFT. No need to generate responses during training.
  • No RL algorithm. The loss is a standard supervised (cross-entropy-like) loss. Stable, predictable, easy to implement.
  • Two models in memory (policy + frozen reference), not four.
[Figure: RLHF + PPO versus DPO. RLHF + PPO: preference pairs (x, y_w, y_l) → train reward model r_φ (Bradley-Terry loss) → sample y ~ π_θ (on-policy generation) → PPO update with r_φ(x, y) − β·KL( π_θ ‖ π_ref ) → aligned policy π_θ (4 models used). DPO: preference pairs (x, y_w, y_l) → DPO loss −log σ(β·(log π_θ/π_ref|_w − log π_θ/π_ref|_l)), a supervised loss on fixed preference pairs → aligned policy π_θ (2 models used).]
RLHF/PPO versus DPO. RLHF requires four sequential steps including a separate reward model and on-policy sampling. DPO collapses this to a single supervised loss on the same preference pairs — the reward model is implicit in the log-probability ratio.

The practical difference is enormous. DPO is stable, debuggable, and fast to iterate. For most teams — running models in the 7B–70B range, with hundreds of thousands of preference pairs — DPO or one of its close variants is the default. RLHF/PPO is still used at the frontier where its extra control (and ability to use a separately trained reward model for filtering) is worth the pain. But DPO democratized alignment the way LoRA democratized fine-tuning.

| Dimension | RLHF + PPO | DPO |
| --- | --- | --- |
| Reward model needed? | Yes: separate, explicitly trained | No: implicit in log-ratio |
| RL loop required? | Yes: on-policy PPO | No: supervised loss |
| On-policy sampling? | Yes: throughout training | No: fixed preference pairs |
| Models in memory | 4 (policy, ref, RM, critic) | 2 (policy, frozen ref) |
| Training stability | Finicky: many hyperparameters | Stable: like fine-tuning |
| Implementation complexity | High | Low |
| Output quality (frontier) | Competitive, max control | Slightly below PPO at scale |
| Typical adopters | OpenAI, Anthropic, DeepMind | Most open-weight models |
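The "models in memory" row can be turned into a back-of-envelope estimate. The sketch assumes every model in the rig has 7B parameters stored in 16-bit precision and counts weights only; gradients, optimizer state, activations, and KV cache for the trained models are deliberately left out, so real footprints are larger.

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    """Memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


n = 7e9                          # assumed: all models in the rig are 7B parameters
per_model = weight_memory_gb(n)  # assumed: bf16/fp16 weights, 2 bytes each
ppo_rig = 4 * per_model          # policy + reference + reward model + critic
dpo_rig = 2 * per_model          # policy + frozen reference

print(per_model, ppo_rig, dpo_rig)  # 14.0 56.0 28.0
```

Even on this weights-only count, the PPO rig does not fit on a single 40 GB accelerator while the DPO rig does, which is a large part of why DPO is easier to run on modest hardware.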
The Wider Family

RLAIF, GRPO, Constitutional AI, KTO — the same principle, different angles.

Alignment is one of the most active research areas in ML. The following variations appear constantly in papers, blog posts, and model cards. The core pattern is always the same: a preference signal (from humans, AI, or verifiers) and a training algorithm that uses it to push the policy toward preferred behavior. The axes of variation are where the signal comes from and how you optimize against it.

Variations in the feedback source

  • RLAIF (Reinforcement Learning from AI Feedback). Replace expensive human preference labels with judgments from a strong LLM (often a "constitutional" model). This can scale preference data collection by orders of magnitude; most modern post-training mixes human and AI feedback. The risk is feedback laundering: the AI labeler's biases become the policy's biases.
  • Constitutional AI (Anthropic, Bai et al. 2022). A structured form of RLAIF where the AI critiques and revises its own outputs against a written set of principles (a "constitution"), then generates preference data from the revised outputs. The constitution makes the value system explicit and auditable — a key advantage over opaque human labeling.
  • RL with verifiable rewards. For tasks with ground-truth answers — math, code, formal logic — you can reward the model directly for correctness without a trained reward model. This is the basis of reasoning-model training (DeepSeek-R1, o1-style). GRPO (below) is the typical optimization algorithm here.

Variations in the optimization algorithm

  • GRPO (Group Relative Policy Optimization, Shao et al. 2024). A PPO variant that drops the separate critic/value network. Instead of computing a value baseline from a separate model, GRPO samples a group of responses to the same prompt and estimates the advantage of each response relative to the group mean. This cuts the rig from four models to three (policy, reference, reward model), and with verifiable rewards the trained reward model can be replaced by a checker. It is often reported to be more stable than PPO. Used heavily in DeepSeek-R1 and other reasoning models.
  • IPO (Identity Preference Optimization). A DPO variant that adds a regularization term to prevent overfitting to the preference pairs. When DPO is trained for many steps, the policy can overfit to the specific chosen/rejected pairs in the dataset; IPO's regularizer prevents this.
  • KTO (Kahneman-Tversky Optimization). A DPO variant that relaxes the requirement for paired comparisons. KTO can train on unpaired good/bad labels — "response A was good" without needing a corresponding bad response. Named after Kahneman-Tversky prospect theory because it models the asymmetry in how humans weight gains vs losses.
  • ORPO (Odds Ratio Preference Optimization). Folds preference optimization directly into SFT by modifying the standard language modeling loss. No need for a separate stage 3 at all — you run SFT and preference optimization in a single training run.
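GRPO's group-relative baseline from the list above can be sketched as follows. Subtracting the group mean is as described; dividing by the group standard deviation follows the normalization in the GRPO paper, and the use of the population standard deviation plus a small epsilon is an assumption of this sketch.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled response relative to its own group (no critic needed)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std; eps guards against std = 0
    return [(r - mean) / (std + eps) for r in rewards]


# Four responses sampled for the same prompt, scored by a verifier (1 = correct).
rewards = [1.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)
print([round(a, 3) for a in adv])  # [1.0, -1.0, -1.0, 1.0]

# If every response gets the same reward, there is no learning signal for this prompt.
print(group_relative_advantages([1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0]
```

The group itself plays the role of the value network: a response is "good" only if it beats its siblings sampled from the same prompt.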

The throughline: understand the reward-model + KL-anchored-policy core and every one of these is a recognizable variation. Most future methods are likely to fit this frame too.

Inference Relevance

Only the policy is served. Alignment lives in the weights — and in the system prompt.

As an inference engineer, here is what alignment means concretely for your work:

What you actually serve

At inference time, you serve only the policy π_θ. The reference model π_ref is needed only during training, to compute the KL penalty (PPO) or the log-ratios (DPO); at inference, it is gone. The reward model r_φ is similarly training-only. So an aligned 7B model served in production is exactly one 7B model. No overhead from the training apparatus.

How alignment shapes generation

Alignment is baked into the weights through gradient updates on the preference pairs. Every generation from the aligned policy reflects learned preferences: which formats to use, how to hedge uncertainty, when to refuse, how to be concise. These behaviors are not enforced at inference — they emerge from the model's probability distribution over tokens. You cannot "turn them off" via decoding without significant effort (jailbreaks exist but are a cat-and-mouse game).

The system prompt as a soft amplifier

The system prompt is your primary lever. The alignment training teaches a space of behaviors; the system prompt steers the model to a region of that space. A well-written system prompt can encourage more concise responses, different tones, or more conservative / liberal handling of edge cases. But it cannot override the fundamental alignment — it is steering within the learned distribution, not outside it. This is why writing system prompts requires understanding what the model was aligned on.

Decoding and alignment interaction

  • Temperature. Lower temperature sharpens the distribution — makes the policy behave more like argmax over its learned preferences. Often improves alignment reliability at the cost of diversity.
  • Top-p / Top-k. Truncating the tail of the distribution removes low-probability tokens before sampling. For aligned models this often improves helpfulness by preventing drift to rare, poorly-aligned tokens.
  • Repetition penalty. Can interact with length-aligned models (those trained to avoid verbosity) unpredictably — sometimes amplifying the reward-hacking artifacts from imperfect alignment.
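The first two knobs above can be sketched on a four-token vocabulary. The logits are hypothetical, and this nucleus filter is a plain reference version for illustration, not the implementation of any serving framework.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature (max-subtracted for numerical stability)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most-probable tokens whose mass reaches top_p; renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    out = [0.0] * len(probs)
    for i in kept:
        out[i] = probs[i] / mass
    return out


logits = [2.0, 1.0, 0.0, -3.0]                   # hypothetical next-token logits
p_warm = softmax_with_temperature(logits, 1.0)
p_cold = softmax_with_temperature(logits, 0.5)   # sharper: more mass on the top token
p_nucleus = top_p_filter(p_warm, top_p=0.9)      # tail tokens get exactly zero probability

print(p_cold[0] > p_warm[0], p_nucleus[3])  # True 0.0
```

Both knobs operate purely on the policy's output distribution at decode time; neither changes what the weights learned during alignment.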

Serving the reference model at training time

If you are building an alignment pipeline (e.g. with TRL or OpenRLHF), you will need to run the reference model alongside the policy. For DPO, the reference is frozen and the preference pairs are fixed, so its forward passes can be batched with the policy's, or its log-probabilities can be computed once and cached. For RLHF, the reference model must score every freshly sampled response, which adds a second forward pass per response on top of generation.

The DPO implicit reward β · log(π_θ(y|x) / π_ref(y|x)) can also be computed at inference time to score candidate responses without a separately trained reward model. One use is as a reranking signal: generate K responses, score each with the implicit reward, and return the highest-scoring one. The catch is that the score needs log-probabilities from π_ref, so this is the one inference-time setup in which the reference model must be kept available; it adds no new trained parameters, but it is not free to serve.
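Reranking with the implicit reward can be sketched as follows. The candidates and their sequence-level log-probabilities are hypothetical; in a real system each pair of numbers would come from scoring the same text under the policy and under the reference model.

```python
def implicit_reward(policy_logp, ref_logp, beta):
    """beta * log(pi_theta(y|x) / pi_ref(y|x)) from sequence-level log-probs."""
    return beta * (policy_logp - ref_logp)


def rerank(candidates, beta=0.1):
    """candidates: list of (text, policy_logp, ref_logp). Returns the best by implicit reward."""
    return max(candidates, key=lambda c: implicit_reward(c[1], c[2], beta))


candidates = [
    ("answer A", -12.0, -11.0),  # policy likes it less than the reference does
    ("answer B", -15.0, -19.0),  # policy likes it much more than the reference does
    ("answer C", -9.0, -9.5),    # most likely under the policy, but only slightly favored
]
best = rerank(candidates)
print(best[0])  # answer B
```

The winner is not the most likely response under the policy ("answer C") but the one alignment training boosted the most relative to the reference, which is the quantity the implicit reward measures.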

Exercise

Seven exercises in the notebook.

Companion notebook: day-14-alignment.ipynb. All exercises run on CPU in seconds.

  1. Bradley-Terry reward loss. Implement bt_loss(r_chosen, r_rejected) and verify: as the margin grows, the loss falls; at margin 0 the loss is ln 2; at margin 3 it is near 0.
  2. Train a toy reward model. On synthetic "shorter is better" preferences, train a scalar reward head and measure preference accuracy on held-out pairs. Plot the training loss curve.
  3. Implement the DPO loss. Write dpo_loss(pol_chosen_logp, pol_rejected_logp, ref_chosen_logp, ref_rejected_logp, beta) from scratch using log-ratios. Verify it equals ln 2 when policy equals reference.
  4. DPO on toy preferences. Initialize a tiny toy policy at the same distribution as a frozen reference. Run DPO on synthetic preference pairs. Show that the chosen-vs-rejected log-prob margin grows over training steps.
  5. KL anchoring and the β sweep. Sweep β over [0.05, 0.1, 0.5, 1.0]. Show that small β allows larger drift from the reference and larger margin, while large β stays close but learns slowly. Plot drift vs margin.
  6. Per-token KL penalty. Compute the per-token KL between two distributions and show it sums to the sequence-level KL. Verify numerically.
  7. Reward hacking demo. Build a flawed reward (rewards length). Show that maximizing reward without a KL anchor degrades the policy to pathological outputs. Add the KL anchor and show it stays near the reference.
Self-Check

Ten questions before moving on.

Close the page and answer from memory. If you cannot, re-read the relevant section.

  1. Why is next-token log-likelihood not a sufficient training objective for alignment? Give a concrete example where high likelihood and high preference diverge.
  2. What are the three (or four) stages of the canonical alignment pipeline? What does each stage produce?
  3. Why are pairwise comparisons better training data than absolute quality ratings? Name two practical reasons.
  4. Write the Bradley-Terry reward-model loss and explain every term. What does it optimize?
  5. In RLHF, identify: the policy, the reference model, the reward model, and the critic. What does each do? Which are frozen?
  6. What is reward hacking? Give two realistic failure modes. How does the KL penalty mitigate each?
  7. Derive, in one sentence, why DPO can skip the explicit reward model. What mathematical fact makes this possible?
  8. Write the DPO loss and explain the role of β. What is the "implicit reward" in a DPO-trained policy?
  9. How many models does RLHF/PPO require in memory? How many does DPO require? What are they?
  10. At inference time: does the reference model need to be served? What is the implicit reward and how could you use it for reranking?
Week 2 Wrap-Up

You can now train an LLM — and understand every model you'll ever serve.

What you've covered

  • The pre-training objective, data pipeline, and scaling laws (C ≈ 6ND, Chinchilla) — Days 8–9.
  • A tiny GPT trained end-to-end, from forward pass to gradient — Day 9 capstone.
  • A production training loop: mixed precision, gradient clipping, accumulation, schedules, MFU — Day 10.
  • Distributed training: DP, FSDP/ZeRO, tensor and pipeline parallelism — Day 11.
  • Modern architecture: RoPE, RMSNorm, SwiGLU, GQA/MQA, MoE — Day 12.
  • Fine-tuning: SFT, LoRA, QLoRA, and how adapters are served — Day 13.
  • Alignment: preference data, reward modeling (Bradley-Terry), RLHF/PPO, and DPO — Day 14.

What's next — Week 3

We pivot from training to inference and hardware — the heart of this course. Prefill versus decode, GPU architecture and the roofline model, your first CUDA kernel, Apple Silicon and MLX, the KV cache in depth, and FlashAttention. Everything you built in Weeks 1–2 becomes the thing you now make fast.

If you want to consolidate, the best exercise is to take your Day 9 GPT, upgrade it with Day 12's components (RoPE + RMSNorm + SwiGLU + GQA), fine-tune it with a Day 13 LoRA adapter, and align it on toy preferences with a Day 14 DPO loss. End to end, your own hands, every line. That is the entire training stack — and you own it.

"The model does exactly what you rewarded — not what you wanted. Alignment is the careful art of making those two things the same. DPO showed that you don't need a separate reward model to do it."

Day 14 · Week 2 wrap
Further Reading

Go deeper.

The alignment canon.

Paper · 2022

Ouyang et al. — InstructGPT

The three-stage SFT → RM → PPO pipeline, at scale. The blueprint for modern assistant models.

Paper · 2023

Rafailov et al. — DPO

"Your Language Model is Secretly a Reward Model." The closed-form derivation and its implications.

Paper · 2017

Christiano et al. — Deep RL from Human Preferences

The original preference-based RL paper. The conceptual foundation that InstructGPT scaled.

Paper · 2017

Schulman et al. — PPO

Proximal Policy Optimization. The RL algorithm behind RLHF — understand the clip objective and the value network.

Paper · 2022

Bai et al. — Constitutional AI

RLAIF: alignment with AI feedback against a written constitution. Scales preference data, makes values explicit.

Paper · 2024

Shao et al. — DeepSeekMath (GRPO)

Group Relative Policy Optimization: removes the separate critic, uses group-sampled baselines. Key to reasoning models.

Library · HF

huggingface/trl

Production SFT, reward modeling, PPO, DPO, KTO, GRPO trainers. The fastest path from theory to working code.

Book · Lambert

Nathan Lambert — The RLHF Book

A free, current, book-length treatment of the whole alignment stack from preference data to production.

Paper · 2025

DeepSeek-R1

Large-scale reasoning model trained with RL (GRPO) on verifiable rewards — the frontier of alignment meeting inference engineering.
