A capable pretrained + fine-tuned model still isn't necessarily helpful, honest, or harmless — because next-token likelihood says nothing about which response a human would prefer. Alignment is the final training stage that teaches a model to distinguish a good answer from a merely fluent one. Today you build intuition for reward modeling and the Bradley-Terry objective, understand why RLHF with PPO works and why it's brutally complex, then derive DPO — the elegant shortcut that collapses a four-model training rig into a single supervised loss. This is the Week 2 capstone.
After pretraining, a language model can generate plausible text about almost any topic. After supervised fine-tuning (Day 13), it can follow instructions and produce well-formatted responses. What it cannot do yet is distinguish a response that is truly helpful from one that is merely fluent, or distinguish a response that is honest from one that sounds confident but is wrong. The objective it was trained on — maximize next-token log-likelihood on human-curated demonstrations — gives it no direct signal about this.
Consider two responses to "What is the capital of Australia?" Both might be grammatically perfect. One says "Canberra." The other says "Sydney — it is the largest and most famous city." The second is wrong, but a next-token model trained on scraped text might assign it equal or higher probability because "Sydney" appears near "Australia" far more often in natural text. Likelihood and preference are not the same thing. Alignment is the engineering discipline that closes that gap.
You will rarely run alignment yourself — it is expensive, slow, and usually done once by the model's creator. But you must understand what the model you serve has been through:
A pretrained model learns the distribution of text on the internet. That distribution includes helpful answers, wrong answers, spam, sycophancy, toxic content, and everything in between. The model has no internal notion that one of these is better than another — it only knows which continuations are probable. SFT narrows the distribution to the format of good responses, but it still does not teach the model to rank candidate outputs.
The alignment problem, stated precisely: given a prompt x, there are many valid continuations. We want the model's generation policy to prefer the continuations a thoughtful human would prefer, a target Anthropic summarized as helpful, honest, and harmless (HHH). We need a training signal for this, and raw log-likelihood on demonstrations does not provide it: demonstrations show what a good answer looks like, but not which of two candidate answers is better.
A natural first instinct is to simply collect more high-quality demonstrations and fine-tune on them. This works partially — it is what SFT does. But it fails to scale for two reasons. First, writing a gold-standard response is expensive and slow; asking a human to compare two candidate responses is much faster and surprisingly consistent. Second, comparison is easier to do correctly: humans disagree on what an ideal answer looks like, but they agree much more on which of two given answers is better. Pairwise comparisons give denser signal per human-hour.
The canonical alignment recipe, introduced at scale by InstructGPT (Ouyang et al. 2022), has three stages: supervised fine-tuning, reward modeling, and RL against the reward model with PPO. You already built the first on Day 13.
Asking a human for an absolute quality score on a scale of 1–10 is noisy and poorly calibrated — one annotator's 7 is another's 5. Asking them to compare two responses — "is A better than B?" — is far more reliable and faster. So preference datasets are collections of triples: a prompt x, a chosen response y_w (the winner, "w" for "winning"), and a rejected response y_l (the loser, "l" for "losing").
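As a minimal sketch of what one record in such a dataset might look like, the snippet below stores a triple and builds it from a single "is A better than B?" judgment. The class name `PreferencePair` and the helper `from_comparison` are hypothetical names chosen for illustration, not part of any particular dataset format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference triple (hypothetical layout): prompt x, chosen y_w, rejected y_l."""
    prompt: str
    chosen: str    # y_w: the response the labeler preferred
    rejected: str  # y_l: the response the labeler ranked lower


def from_comparison(prompt, response_a, response_b, a_is_better):
    """Turn one pairwise judgment into a (prompt, chosen, rejected) triple."""
    if a_is_better:
        return PreferencePair(prompt, response_a, response_b)
    return PreferencePair(prompt, response_b, response_a)


pair = from_comparison(
    "What is the capital of Australia?",
    "Sydney, the largest and most famous city.",
    "Canberra.",
    a_is_better=False,
)
print(pair.chosen, "|", pair.rejected)
```

Note that the labeler never assigns a score here. The only information stored is which of the two responses won.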
In practice, labelers are given rubrics that operationalize helpfulness, harmlessness, and honesty. Responses are compared on criteria like accuracy, instruction-following, appropriate length, and safety. Multiple annotators often rate the same pair, and inter-annotator agreement is tracked. Some datasets (Anthropic HH, OpenAI WebGPT comparisons, UltraFeedback) are public. The quality and coverage of this preference data is the single biggest lever on final model quality — more than the specific algorithm used in stage 3.
Increasingly, AI models (strong frontier LLMs) serve as labelers — this is RLAIF (Reinforcement Learning from AI Feedback). This scales data collection by orders of magnitude but introduces a circularity risk: you are using one model's preferences to train another. Careful calibration and human spot-checking are essential.
The reward model r_φ(x, y) is a neural network that takes a (prompt, response) pair and outputs a single scalar — the "quality" of that response to that prompt. It is usually the SFT model with its language-modeling head (the unembedding matrix over vocabulary) replaced by a single linear projection to a scalar. This gives it a head start: it already understands language and instruction-following before it learns to rank.
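The head swap can be sketched without any framework. The example below assumes the scalar head reads the hidden state of the final token, which is a common choice but is not stated in the text above. The names `scalar_reward`, `w`, and `b` are illustrative, and the hidden states are made-up numbers standing in for the backbone's output.

```python
def scalar_reward(hidden_states, w, b=0.0):
    """Project the final token's hidden state to a single scalar reward.

    hidden_states: one vector (list of floats) per token, as produced by the backbone.
    w, b: weights and bias of the scalar head that replaces the vocabulary projection.
    Assumption: the reward is read from the last token position.
    """
    last = hidden_states[-1]
    if len(last) != len(w):
        raise ValueError("hidden size and head size must match")
    return sum(h * wi for h, wi in zip(last, w)) + b


# Toy input: 3 tokens, hidden size 4 (values are invented for illustration).
hidden = [
    [0.1, 0.0, -0.2, 0.3],
    [0.5, 0.1, 0.0, -0.1],
    [1.0, -1.0, 0.5, 2.0],
]
w = [0.5, 0.5, 1.0, 0.25]
r = scalar_reward(hidden, w)
print(r)
```

The language-modeling head would map the same vector to one logit per vocabulary entry. Here it maps to exactly one number.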
The Bradley-Terry model is a classical statistical model for pairwise comparisons. Its key assertion: the probability that y_w beats y_l in a comparison depends only on the difference of their underlying "strength" scores. Applied to rewards:
The reward-model loss maximizes this probability over all training triples — equivalently, minimizes the negative log-probability:
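In symbols, the Bradley-Terry probability is P(y_w beats y_l | x) = σ(r_w − r_l), and the per-pair loss is −log σ(r_w − r_l), where r_w and r_l are the reward model's scores for the chosen and rejected responses. A minimal sketch in plain Python follows. The function names match the lesson's exercise, but this is one possible implementation rather than the notebook's reference solution.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bt_prob(r_chosen, r_rejected):
    """Bradley-Terry probability that the chosen response wins."""
    return sigmoid(r_chosen - r_rejected)


def bt_loss(r_chosen, r_rejected):
    """-log sigmoid(margin), written as a stable softplus of the negative margin."""
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


for margin in (-2.0, 0.0, 1.0, 3.0):
    print(margin, round(bt_prob(margin, 0.0), 4), round(bt_loss(margin, 0.0), 4))
```

Only the difference r_w − r_l enters the loss, so adding the same constant to every reward leaves it unchanged. This is why reward-model scores are meaningful only relative to each other.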
Figure: the reward model scores the chosen and rejected responses, giving r_w and r_l. The Bradley-Terry loss minimizes −log σ(r_w − r_l), pushing the chosen score higher and the rejected score lower on every training step.
After training, the reward model is a fully automatable proxy for human judgment. Feed it any (prompt, response) pair and get back a quality scalar in milliseconds, far faster than asking a human. This is what makes stage 3 possible: you can generate millions of responses from the evolving policy and score them without a human in the loop. The reward model's fidelity to human preferences is the bottleneck for alignment quality, which is why preference data curation is so carefully managed at frontier labs.
The reward model sees the same text as the language model but predicts one number instead of a distribution over tokens. This architectural simplicity is the point — it collapses the complexity of preference into a single comparable score. The famous Goodhart's Law applies immediately: the reward model is only a proxy for human preference. Any model trained to maximize a proxy will eventually find edge cases where high proxy score ≠ high actual quality. This is the origin of reward hacking.
With a reward model trained, stage 3 of RLHF frames text generation as a reinforcement learning problem. The language model is the policy π_θ. Generating a response is taking a sequence of actions — one token at a time — each drawn from the policy's probability distribution. The episode ends when the response is complete; the reward model then scores the entire response and returns a scalar reward. We want to optimize θ so the expected reward increases.
Using raw RL (REINFORCE or vanilla policy gradient) on language models is disastrously unstable — the policy wanders into incoherence chasing marginal reward improvements. Proximal Policy Optimization (PPO) addresses this by clipping large policy updates so no single gradient step moves too far. But for LLMs there is a second, crucial stabilizer: the KL penalty:
The KL term is computed per token and summed over the sequence, and it enters the reward as a penalty: it is scaled by β and subtracted. In practice the total reward for a complete response is:
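A common way to write this shaped reward is R(x, y) = r_φ(x, y) − β · Σ_t [log π_θ(y_t | x, y_<t) − log π_ref(y_t | x, y_<t)], where the sum of log-ratios over the sampled tokens is a single-sample estimate of the sequence-level KL. The sketch below assumes that form. It also assumes the reward-model score is credited on the final token, a common convention that is not specified in the text. All numbers are invented.

```python
def kl_shaped_reward(rm_score, policy_logps, ref_logps, beta):
    """Total reward: RM score minus beta times the summed per-token log-ratio."""
    if len(policy_logps) != len(ref_logps):
        raise ValueError("need one log-prob per token from each model")
    kl_estimate = sum(p - q for p, q in zip(policy_logps, ref_logps))
    return rm_score - beta * kl_estimate


def per_token_rewards(rm_score, policy_logps, ref_logps, beta):
    """Per-step view: a KL penalty at every token, RM score added at the last one.

    Assumption: the scalar RM score is credited to the final token.
    """
    rewards = [-beta * (p - q) for p, q in zip(policy_logps, ref_logps)]
    rewards[-1] += rm_score
    return rewards


# Toy 3-token response (log-probs are made up for illustration).
policy_logps = [-1.0, -0.5, -2.0]
ref_logps = [-1.5, -0.5, -2.5]
total = kl_shaped_reward(2.0, policy_logps, ref_logps, beta=0.1)
steps = per_token_rewards(2.0, policy_logps, ref_logps, beta=0.1)
print(total, steps)
```

When the policy assigns the same log-probabilities as the reference, the penalty vanishes and the total reward is just the reward-model score.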
Without the KL penalty, the policy will reward-hack. The reward model is an imperfect proxy — it has blind spots, biases, and systematic errors introduced by the preference labelers. A relentless optimizer like PPO discovers those blind spots and exploits them. Classic reward-hacking failure modes:
The KL penalty confines the policy to a neighborhood of the SFT reference. Within that neighborhood, real improvements are possible (higher reward without dramatic drift). Outside it, the policy is penalized — which prevents it from exploiting reward-model blind spots that exist far from the sensible SFT distribution.
RLHF with PPO is operationally brutal. Four models must fit in GPU memory simultaneously: the policy π_θ (being trained), the frozen reference π_ref, the frozen reward model r_φ, and a separate critic / value network V_ψ used to estimate advantage baselines for PPO. For a 7B-parameter model, that is roughly 4× the base weight footprint, before counting gradients and optimizer state for the two networks being trained (policy and critic). The RL loop is notoriously hyperparameter-sensitive: the KL coefficient β, PPO clip ratio, learning rate schedule, and rollout length all interact in nonlinear ways. Training instabilities, reward hacking, and mode collapse are common failure modes. This is the most finicky part of the entire LLM stack.
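A back-of-envelope calculation makes the memory pressure concrete. The sketch below rests on assumptions that are not in the text: all four models are the same size, weights are stored in 16-bit precision (2 bytes per parameter), and Adam-style optimizer state costs about 12 bytes per trained parameter. Activations, gradients, and KV caches are ignored, so real usage is higher.

```python
def weights_gb(n_params, n_models, bytes_per_param=2):
    """Memory for model weights alone, in GB (assumes 16-bit weights by default)."""
    return n_models * n_params * bytes_per_param / 1e9


def optimizer_state_gb(n_params, n_trained, bytes_per_param=12):
    """Optimizer state for the trained models only.

    Assumption: ~12 bytes/param (fp32 master weights plus two Adam moments).
    """
    return n_trained * n_params * bytes_per_param / 1e9


N = 7e9  # a 7B-parameter model

# PPO: policy, reference, reward model, critic; policy and critic are trained.
ppo_weights = weights_gb(N, n_models=4)
ppo_optimizer = optimizer_state_gb(N, n_trained=2)

# DPO: policy and frozen reference; only the policy is trained.
dpo_weights = weights_gb(N, n_models=2)
dpo_optimizer = optimizer_state_gb(N, n_trained=1)

print(f"PPO: {ppo_weights:.0f} GB weights + {ppo_optimizer:.0f} GB optimizer state")
print(f"DPO: {dpo_weights:.0f} GB weights + {dpo_optimizer:.0f} GB optimizer state")
```

Under these assumptions the weights alone come to about 56 GB for the PPO rig, and the optimizer state for the two trained networks is larger still.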
Direct Preference Optimization (DPO, Rafailov et al. 2023) is the result that reshaped alignment. The key insight: the RLHF objective — maximize reward under a KL constraint — has a closed-form optimal policy. Given a reward function r and reference model π_ref, the policy that maximizes the constrained objective is:
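The closed form referred to here is π*(y | x) = π_ref(y | x) · exp(r(x, y) / β) / Z(x), where Z(x) normalizes over all responses. The sketch below checks this numerically on a toy problem with three candidate responses. The reference probabilities and rewards are invented for illustration, and the function names are hypothetical.

```python
import math


def optimal_policy(ref_probs, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), normalized over a finite set."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref_probs, rewards)]
    z = sum(weights)  # the partition function Z(x)
    return [w / z for w in weights]


def objective(policy, ref_probs, rewards, beta):
    """KL-regularized objective: E_pi[r] - beta * KL(pi || pi_ref)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_probs) if p > 0)
    return expected_reward - beta * kl


ref = [0.5, 0.3, 0.2]      # pi_ref over three candidate responses (made up)
rewards = [0.0, 1.0, 2.0]  # r(x, y) for each candidate (made up)
beta = 0.5

pi_star = optimal_policy(ref, rewards, beta)
print([round(p, 4) for p in pi_star])
print(objective(pi_star, ref, rewards, beta), objective(ref, ref, rewards, beta))
```

Rearranging the closed form gives r(x, y) = β · log(π*(y | x) / π_ref(y | x)) + β · log Z(x). The Z(x) term is the same for every response to a given prompt, so it cancels when two responses are compared. That cancellation is what lets DPO drop the explicit reward model.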
Plain reading: the DPO loss pushes the policy to assign relatively higher probability (relative to the reference) to the chosen response and relatively lower probability to the rejected response. The reference model provides the baseline. The β parameter plays the same KL-anchoring role as in PPO, but it is now baked directly into the loss rather than added as a separate term. The "implicit reward" of a DPO-trained policy is β · log(π_θ(y|x) / π_ref(y|x)).
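Written out, the DPO loss for one preference pair is −log σ(β · [log(π_θ(y_w|x) / π_ref(y_w|x)) − log(π_θ(y_l|x) / π_ref(y_l|x))]). This is the Bradley-Terry loss with the implicit reward substituted for the reward model's score. A minimal sketch follows, using the argument names from the lesson's exercise. It assumes each input is a sequence-level log-probability (the sum of token log-probs), and it is one possible implementation rather than the notebook's reference solution.

```python
import math


def dpo_loss(pol_chosen_logp, pol_rejected_logp, ref_chosen_logp, ref_rejected_logp, beta):
    """DPO loss for a single (chosen, rejected) pair, from sequence log-probs."""
    chosen_logratio = pol_chosen_logp - ref_chosen_logp        # log pi_theta/pi_ref on y_w
    rejected_logratio = pol_rejected_logp - ref_rejected_logp  # log pi_theta/pi_ref on y_l
    margin = beta * (chosen_logratio - rejected_logratio)      # implicit-reward margin
    # -log sigmoid(margin), written as a numerically stable softplus
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: the margin is 0 and the loss is ln 2.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0, beta=0.1))

# Policy has shifted mass toward the chosen response: the loss drops below ln 2.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0, beta=0.1))
```

No reward model appears anywhere in the function. The only inputs are log-probabilities from the policy and the frozen reference.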
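DPO needs only three things: the policy being trained, a frozen reference model, and a fixed dataset of preference pairs. It does not need a separate reward model, a critic, or on-policy sampling.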
The practical difference is enormous. DPO is stable, debuggable, and fast to iterate. For most teams — running models in the 7B–70B range, with hundreds of thousands of preference pairs — DPO or one of its close variants is the default. RLHF/PPO is still used at the frontier where its extra control (and ability to use a separately trained reward model for filtering) is worth the pain. But DPO democratized alignment the way LoRA democratized fine-tuning.
| Dimension | RLHF + PPO | DPO |
|---|---|---|
| Reward model needed? | Yes — separate, explicitly trained | No — implicit in log-ratio |
| RL loop required? | Yes — on-policy PPO | No — supervised loss |
| On-policy sampling? | Yes — throughout training | No — fixed preference pairs |
| Models in memory | 4 (policy, ref, RM, critic) | 2 (policy, frozen ref) |
| Training stability | Finicky — many hyperparameters | Stable — like fine-tuning |
| Implementation complexity | High | Low |
| Output quality (frontier) | Competitive, max control | Slightly below PPO at scale |
| Typical adopters | OpenAI, Anthropic, DeepMind | Most open-weight models |
Alignment is one of the most active research areas in ML. The following variations appear constantly in papers, blog posts, and model cards. The core pattern is always the same: a preference signal (from humans, AI, or verifiers) and a training algorithm that uses it to push the policy toward preferred behavior. The axes of variation are where the signal comes from and how you optimize against it.
The throughline: understand the reward-model + KL-anchored-policy core and every one of these is a recognizable variation. The taxonomy of future methods will also fit this frame.
As an inference engineer, here is what alignment means concretely for your work:
At inference time, you serve only the policy π_θ. The reference model π_ref is needed only to compute the KL penalty during training — at inference, it is gone. The reward model r_φ is similarly training-only. So a 7B-aligned model served in production is exactly one 7B model. No overhead from the training apparatus.
Alignment is baked into the weights through gradient updates on the preference pairs. Every generation from the aligned policy reflects learned preferences: which formats to use, how to hedge uncertainty, when to refuse, how to be concise. These behaviors are not enforced at inference — they emerge from the model's probability distribution over tokens. You cannot "turn them off" via decoding without significant effort (jailbreaks exist but are a cat-and-mouse game).
The system prompt is your primary lever. The alignment training teaches a space of behaviors; the system prompt steers the model to a region of that space. A well-written system prompt can encourage more concise responses, different tones, or more conservative / liberal handling of edge cases. But it cannot override the fundamental alignment — it is steering within the learned distribution, not outside it. This is why writing system prompts requires understanding what the model was aligned on.
If you are building an alignment pipeline (e.g. with TRL or OpenRLHF), you will need to run the reference model alongside the policy. For DPO, the reference is frozen and the preference pairs are fixed, so its log-probabilities can be computed in the same batch as the policy's or precomputed once before training starts. For RLHF, the reference model must score every freshly sampled response, which adds a second forward pass per rollout on top of generation.
The DPO implicit reward β · log(π_θ(y|x) / π_ref(y|x)) can be computed at inference time to score candidate responses without a separate reward model. One use is reranking: generate K responses, score each with the implicit reward, and return the highest-scoring one. The catch is that the score needs log-probabilities from π_ref as well as π_θ, so the reference model must stay loaded for scoring. You avoid training a reward model, but you do not avoid serving a second set of weights.
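A minimal reranking sketch, assuming you already have the sequence log-probability of each candidate under both the policy and the reference model. The candidate texts and log-probs below are invented, and `implicit_reward` and `rerank` are hypothetical helper names.

```python
def implicit_reward(policy_logp, ref_logp, beta):
    """DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logp - ref_logp)


def rerank(candidates, beta=0.1):
    """Return the candidate with the highest implicit reward (best-of-K)."""
    return max(
        candidates,
        key=lambda c: implicit_reward(c["policy_logp"], c["ref_logp"], beta),
    )


# K = 3 candidates for one prompt; all numbers are made up for illustration.
candidates = [
    {"text": "Sydney.", "policy_logp": -6.0, "ref_logp": -4.0},
    {"text": "Canberra.", "policy_logp": -3.0, "ref_logp": -5.0},
    {"text": "Melbourne.", "policy_logp": -7.0, "ref_logp": -7.0},
]
best = rerank(candidates)
print(best["text"])
```

For a fixed prompt, β only rescales the scores, so it does not change which candidate wins. The winner is whichever response the aligned policy has upweighted most relative to the reference.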
Companion notebook: day-14-alignment.ipynb. All exercises run on CPU in seconds.
Implement bt_loss(r_chosen, r_rejected) and verify: as the margin grows, the loss falls; at margin 0 the loss is ln 2; at margin 3 it is near 0 (about 0.049).
Implement dpo_loss(pol_chosen_logp, pol_rejected_logp, ref_chosen_logp, ref_rejected_logp, beta) from scratch using log-ratios. Verify it equals ln 2 when policy equals reference.
Close the page and answer from memory. If you cannot, re-read the relevant section.
C ≈ 6ND, Chinchilla), Days 8–9.
We pivot from training to inference and hardware, the heart of this course. Prefill versus decode, GPU architecture and the roofline model, your first CUDA kernel, Apple Silicon and MLX, the KV cache in depth, and FlashAttention. Everything you built in Weeks 1–2 becomes the thing you now make fast.
If you want to consolidate, the best exercise is to take your Day 9 GPT, upgrade it with Day 12's components (RoPE + RMSNorm + SwiGLU + GQA), fine-tune it with a Day 13 LoRA adapter, and align it on toy preferences with a Day 14 DPO loss. End to end, your own hands, every line. That is the entire training stack — and you own it.
"The model does exactly what you rewarded — not what you wanted. Alignment is the careful art of making those two things the same. DPO showed that you don't need a separate reward model to do it."
The alignment canon.
The three-stage SFT → RM → PPO pipeline, at scale. The blueprint for modern assistant models.
"Your Language Model is Secretly a Reward Model." The closed-form derivation and its implications.
The original preference-based RL paper. The conceptual foundation that InstructGPT scaled.
Proximal Policy Optimization. The RL algorithm behind RLHF; understand the clip objective and the value network.
RLAIF: alignment with AI feedback against a written constitution. Scales preference data, makes values explicit.
Group Relative Policy Optimization: removes the separate critic, uses group-sampled baselines. Key to reasoning models.
Production SFT, reward modeling, PPO, DPO, KTO, GRPO trainers. The fastest path from theory to working code.
A free, current, book-length treatment of the whole alignment stack from preference data to production.
Large-scale reasoning model trained with RL (GRPO) on verifiable rewards, the frontier of alignment meeting inference engineering.