Day 14 — Alignment: RLHF, PPO, DPO · LLM Inference Engineer Curriculum

Why This Lesson

SFT teaches the model what an answer looks like. Alignment teaches it which answer is better.

The gap between capability and preference

After pretraining, a language model can generate plausible text about almost any topic. After supervised fine-tuning (Day 13), it can follow instructions and produce well-formatted responses. What it cannot do yet is distinguish a response that is truly helpful from one that is merely fluent, or distinguish a response that is honest from one that sounds confident but is wrong. The objective it was trained on — maximize next-token log-likelihood on human-curated demonstrations — gives it no direct signal about this.

Consider two responses to "What is the capital of Australia?" Both might be grammatically perfect. One says "Canberra." The other says "Sydney — it is the largest and most famous city." The second is wrong, but a next-token model trained on scraped text might assign it equal or higher probability because "Sydney" appears near "Australia" far more often in natural text. Likelihood and preference are not the same thing. Alignment is the engineering discipline that closes that gap.

Why this matters for inference engineers

You will rarely run alignment yourself — it is expensive, slow, and usually done once by the model's creator. But you must understand what the model you serve has been through:

An aligned model carries a learned notion of preference that shapes every generation, including refusals, hedges, and format choices.
The system prompt is the primary lever to activate or redirect that preference. Understanding alignment helps you write better system prompts.
Behavioral guardrails are a product of alignment, not magic. They can be incomplete, inconsistent, or miscalibrated — knowing why helps you debug generation quality issues.
The reference model is needed only at training time. At inference, you serve just the policy. Understanding this clarifies what "the model" is and why swapping system prompts doesn't change the weights.

Learning objectives

Articulate why next-token likelihood is insufficient for alignment and why pairwise preferences are a better training signal.
Lay out the three-stage pipeline: pretrain → SFT → preference optimization.
Explain preference data collection: what a (prompt, chosen, rejected) triple looks like and why comparisons beat ratings.
Implement and explain the Bradley-Terry reward-model objective.
Describe RLHF with PPO: the KL penalty, reward hacking, and the four-model training rig.
Derive the intuition behind DPO and write its loss with a plain reading.
Contrast RLHF/PPO with DPO on complexity, stability, compute, and memory.
Place RLAIF, Constitutional AI, GRPO, and DPO variants (IPO, KTO) as variations on the same theme.
Connect alignment to inference: what the served model is, and how alignment affects decoding.

The Alignment Problem

Pretrained models are capable but uncalibrated. Likelihood ≠ preference.

A pretrained model learns the distribution of text on the internet. That distribution includes helpful answers, wrong answers, spam, sycophancy, toxic content, and everything in between. The model has no internal notion that one of these is better than another — it only knows which continuations are probable. SFT narrows the distribution to the format of good responses, but it still does not teach the model to rank candidate outputs.

The alignment problem, stated precisely: given a prompt x, there are many valid continuations. We want the model's generation policy to prefer the continuations a thoughtful human would prefer, a goal Anthropic summarized as helpful, honest, and harmless (HHH). We need a training signal for this, and raw log-likelihood on demonstrations doesn't provide it, because demonstrations only teach "what", not "which".

Why not just label more demonstrations?

A natural first instinct is to simply collect more high-quality demonstrations and fine-tune on them. This works partially — it is what SFT does. But it fails to scale for two reasons. First, writing a gold-standard response is expensive and slow; asking a human to compare two candidate responses is much faster and surprisingly consistent. Second, comparison is easier to do correctly: humans disagree on what an ideal answer looks like, but they agree much more on which of two given answers is better. Pairwise comparisons give denser signal per human-hour.

The alignment gap. A pretrained + SFT model is fluent but has no notion of which response is better. Alignment trains the model to prefer responses that score high on human preferences, while keeping it from drifting into incoherence (the KL anchor).

The Pipeline

Three stages: pretrain, imitate, then learn what's preferred.

The canonical alignment recipe, introduced at scale by InstructGPT (Ouyang et al. 2022), is grouped here into three stages, with reward modeling and policy optimization counted together as the third. You already built the first two.

The full pipeline drawn as four stages (0–3): pretraining, SFT, reward modeling, and policy optimization. Classic RLHF trains an explicit reward model, then optimizes with PPO. DPO (dashed gold arrow) fuses stages 2 and 3 into a single supervised loss on preference pairs: no separate reward model, no RL loop.

Pretraining (Days 8–9). Train on trillions of tokens with next-token prediction. Builds world knowledge, language modeling, and zero-shot capabilities. The model at this stage is powerful but raw.
Supervised fine-tuning / SFT (Day 13). Fine-tune on a curated set of (instruction, ideal response) demonstrations, often 10k–100k pairs. Teaches the model the format and style of helpful responses. Still no notion of ranking.
Reward modeling + policy optimization (today). Collect human (or AI) pairwise comparisons. Train a reward model on them. Then update the language model to maximize that reward — either with PPO (RLHF) or directly with DPO. The output of this stage is the model you serve.

Preference Data

Humans rank pairs. The dataset is (prompt, chosen, rejected) triples.

Asking a human for an absolute quality score on a scale of 1–10 is noisy and poorly calibrated — one annotator's 7 is another's 5. Asking them to compare two responses — "is A better than B?" — is far more reliable and faster. So preference datasets are collections of triples: a prompt x, a chosen response y_w (the winner, "w" for "winning"), and a rejected response y_l (the loser, "l" for "losing").
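As a minimal sketch, one such triple can be held in a small record. The class name, field names, and example strings below are illustrative assumptions made up for this lesson, not the schema of any particular public dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One preference record: a prompt plus a winning and a losing response."""
    prompt: str
    chosen: str    # y_w, the response the annotator preferred
    rejected: str  # y_l, the response the annotator did not prefer


def to_training_texts(pair: PreferencePair) -> tuple[str, str]:
    """Concatenate the prompt with each response.

    A reward model or a DPO loss scores these two full sequences separately.
    """
    return pair.prompt + pair.chosen, pair.prompt + pair.rejected


# Hypothetical example, invented for illustration.
example = PreferencePair(
    prompt="In one sentence, what is the capital of Australia?\n",
    chosen="The capital of Australia is Canberra.",
    rejected="Sydney is the largest city. It is also famous. Many people visit it.",
)
chosen_text, rejected_text = to_training_texts(example)
```

Real datasets usually add metadata (annotator id, agreement, rubric category), but the three fields above are the part every preference-learning method consumes.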

A preference data triple. The annotator sees both responses and marks one as preferred. Notice the instruction says "one sentence" — the rejected response violates it. This kind of violation is easy for humans to judge even when they couldn't write a perfect answer themselves.

In practice, labelers are given rubrics that operationalize helpfulness, harmlessness, and honesty. Responses are compared on criteria like accuracy, instruction-following, appropriate length, and safety. Multiple annotators often rate the same pair, and inter-annotator agreement is tracked. Some datasets (Anthropic HH, OpenAI WebGPT comparisons, UltraFeedback) are public. The quality and coverage of this preference data is the single biggest lever on final model quality — more than the specific algorithm used in stage 3.

Increasingly, AI models (strong frontier LLMs) serve as labelers — this is RLAIF (Reinforcement Learning from AI Feedback). This scales data collection by orders of magnitude but introduces a circularity risk: you are using one model's preferences to train another. Careful calibration and human spot-checking are essential.

Reward Modeling

Train a model to score responses. The Bradley-Terry objective.

The reward model r_φ(x, y) is a neural network that takes a (prompt, response) pair and outputs a single scalar — the "quality" of that response to that prompt. It is usually the SFT model with its language-modeling head (the unembedding matrix over vocabulary) replaced by a single linear projection to a scalar. This gives it a head start: it already understands language and instruction-following before it learns to rank.

The Bradley-Terry preference model

The Bradley-Terry model is a classical statistical model for pairwise comparisons. Its key assertion: the probability that y_w beats y_l in a comparison depends only on the difference of their underlying "strength" scores. Applied to rewards:

P(y_w ≻ y_l | x) = σ( r(x, y_w) − r(x, y_l) )

where σ(z) = 1 / (1 + e^{−z}) is the logistic sigmoid.

Worked example:

r(x, y_w) = 2.0, r(x, y_l) = 0.5
difference = 1.5
P(y_w wins) = σ(1.5) = 1 / (1 + e^{−1.5}) ≈ 0.818

So the reward model is about 82% confident the chosen response is better. If the margin is 0, it is 50%: indifferent. If the margin is negative, it prefers the "rejected" one.

The reward-model loss maximizes this probability over all training triples — equivalently, minimizes the negative log-probability:

L_RM(φ) = − E_{(x, y_w, y_l) ~ D}[ log σ( r_φ(x, y_w) − r_φ(x, y_l) ) ]

This is binary cross-entropy applied to the reward difference. The loss pushes r(x, y_w) − r(x, y_l) to be large and positive for every training triple.
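The Bradley-Terry probability and the per-pair loss above can be checked numerically with a few lines of standard-library Python. This is a sketch with illustrative function names; it works on plain floats rather than model outputs.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def bt_win_probability(r_chosen: float, r_rejected: float) -> float:
    """P(chosen beats rejected) under Bradley-Terry: sigmoid of the reward margin."""
    return sigmoid(r_chosen - r_rejected)


def bt_pair_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(margin), written as a numerically stable softplus(-margin)."""
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


p = bt_win_probability(2.0, 0.5)       # the worked example: margin 1.5
loss_at_zero = bt_pair_loss(0.0, 0.0)  # indifferent model: loss is ln 2
loss_at_three = bt_pair_loss(3.0, 0.0) # confident and correct: loss near 0
```

The loss only ever sees the difference of the two scores, so adding a constant to every reward changes nothing. Reward-model outputs are therefore meaningful only relative to each other for the same prompt.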

Reward model forward pass. The same SFT-backbone transformer processes (prompt + chosen) and (prompt + rejected) separately, outputting scalars r_w and r_l. The Bradley-Terry loss minimizes −log σ(r_w − r_l), pushing the chosen score higher and the rejected score lower on every training step.
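To make the "push chosen up, push rejected down" dynamic concrete, here is a toy sketch that replaces the transformer backbone with a two-feature linear scorer and trains it by hand-written gradient steps. Everything in it is an assumption for illustration: the features (normalized length plus an irrelevant noise value), the synthetic "shorter is better" labeling rule, and the learning rate and epoch count.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def reward(w, feats):
    """Linear stand-in for a reward model: r = w . features."""
    return sum(wi * fi for wi, fi in zip(w, feats))


def make_pairs(n, rng):
    """Synthetic pairs of feature vectors: [normalized length, irrelevant noise].

    The shorter response is always labeled as chosen.
    """
    pairs = []
    for _ in range(n):
        a = [rng.uniform(0.0, 1.0), rng.uniform(-1.0, 1.0)]
        b = [rng.uniform(0.0, 1.0), rng.uniform(-1.0, 1.0)]
        chosen, rejected = (a, b) if a[0] < b[0] else (b, a)
        pairs.append((chosen, rejected))
    return pairs


def train(pairs, lr=0.5, epochs=50):
    w = [0.0, 0.0]
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(w, chosen) - reward(w, rejected)
            # Gradient of -log sigmoid(margin) w.r.t. w is
            # -(1 - sigmoid(margin)) * (chosen - rejected); step against it.
            coeff = 1.0 - sigmoid(margin)
            for i in range(len(w)):
                w[i] += lr * coeff * (chosen[i] - rejected[i])
    return w


def accuracy(w, pairs):
    """Fraction of pairs where the chosen response gets the higher score."""
    return sum(reward(w, c) > reward(w, r) for c, r in pairs) / len(pairs)


rng = random.Random(0)
train_pairs, test_pairs = make_pairs(200, rng), make_pairs(200, rng)
w = train(train_pairs)
print("weights:", w, "held-out accuracy:", accuracy(w, test_pairs))
```

The update is largest when the model is wrong or unsure (sigmoid of the margin is small) and fades as the margin grows, which is the same behavior the full reward model shows at scale.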

After training, the reward model is a fully automatable proxy for human judgment. Feed it any (prompt, response) pair and get back a quality scalar in milliseconds — far faster than asking a human. This is what makes stage 3 possible: you can generate millions of responses from the evolving policy and score them without a human in the loop. The reward model's fidelity to human preferences is the bottleneck for alignment quality, which is why preference data curation is so carefully managed at frontier labs.

The reward model sees the same text as the language model but predicts one number instead of a distribution over tokens. This architectural simplicity is the point — it collapses the complexity of preference into a single comparable score. The famous Goodhart's Law applies immediately: the reward model is only a proxy for human preference. Any model trained to maximize a proxy will eventually find edge cases where high proxy score ≠ high actual quality. This is the origin of reward hacking.

RLHF with PPO

Treat generation as RL: the LLM is a policy, the reward model is the environment.

With a reward model trained, stage 3 of RLHF frames text generation as a reinforcement learning problem. The language model is the policy π_θ. Generating a response is taking a sequence of actions — one token at a time — each drawn from the policy's probability distribution. The episode ends when the response is complete; the reward model then scores the entire response and returns a scalar reward. We want to optimize θ so the expected reward increases.

The PPO objective with KL penalty

Plain policy-gradient methods (REINFORCE, vanilla policy gradient) tend to be unstable on language models without extra variance reduction and regularization: the policy can wander into incoherence chasing marginal reward improvements. Proximal Policy Optimization (PPO) addresses this by clipping large policy updates so no single gradient step moves too far. For LLMs there is a second, crucial stabilizer, the KL penalty:

Maximize J(θ) = E_{x ~ D, y ~ π_θ(·|x)}[ r_φ(x, y) ] − β · KL( π_θ(y|x) ‖ π_ref(y|x) )

π_θ = the policy we are training (the LLM)
π_ref = the frozen SFT model (the reference)
r_φ = the trained reward model
β = KL coefficient (typically 0.01–0.1)
KL(P‖Q) = Σ_y P(y) log( P(y) / Q(y) ) [per-token, summed over the sequence]

Worked example: r_φ(x, y) = 3.2, KL = 2.1, β = 0.1

J = 3.2 − 0.1 · 2.1 = 3.2 − 0.21 = 2.99

The policy gains from high reward but is penalized for drifting from the reference.

The KL term is estimated per token from the sampled response, summed over the sequence, and subtracted from the reward-model score. In practice the total reward for a complete response is:

R(x, y) = r_φ(x, y) − β · Σ_{t=1}^{T} log[ π_θ(y_t | x, y_{<t}) / π_ref(y_t | x, y_{<t}) ]
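The formula above can be sketched directly from per-token log-probabilities. The function names and the example numbers are assumptions chosen so the log-ratios sum to 2.1, matching the earlier worked example. The second function shows one common arrangement in PPO-style implementations, where the KL penalty is spread across tokens and the reward-model score lands on the final token; treat that layout as an assumption rather than the only option.

```python
def kl_shaped_reward(rm_score, policy_logps, ref_logps, beta):
    """R(x, y) = r_phi(x, y) - beta * sum_t [log pi_theta(y_t) - log pi_ref(y_t)]."""
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy and reference must score the same tokens")
    log_ratio_sum = sum(p - q for p, q in zip(policy_logps, ref_logps))
    return rm_score - beta * log_ratio_sum


def per_token_rewards(rm_score, policy_logps, ref_logps, beta):
    """Per-token view: a KL penalty at every token, the RM score added at the end."""
    rewards = [-beta * (p - q) for p, q in zip(policy_logps, ref_logps)]
    rewards[-1] += rm_score
    return rewards


# Hypothetical log-probs for a 3-token response; log-ratios are 0.5, 0.7, 0.9.
policy_logps = [-1.0, -0.5, -2.0]
ref_logps = [-1.5, -1.2, -2.9]

total = kl_shaped_reward(3.2, policy_logps, ref_logps, beta=0.1)
token_rewards = per_token_rewards(3.2, policy_logps, ref_logps, beta=0.1)
```

Note that an individual token's log-ratio can be negative (the policy is less confident than the reference on that token). Only in expectation over samples from the policy does the sum become a true, non-negative KL divergence.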

Why the KL term is non-negotiable

Without the KL penalty, the policy will reward-hack. The reward model is an imperfect proxy — it has blind spots, biases, and systematic errors introduced by the preference labelers. A relentless optimizer like PPO discovers those blind spots and exploits them. Classic reward-hacking failure modes:

Length bias. If the reward model has a latent preference for longer responses (because humans often perceive longer as more thorough), the policy learns to pad responses with content-free repetition.
Sycophancy. If human labelers preferred responses that agreed with their (sometimes wrong) prior beliefs, the policy learns to tell users what they want to hear.
Safe non-answers. Refusals are always "safe" in the sense that they are unlikely to contain factual errors. If refusal is rewarded over an uncertain answer, the policy learns to refuse too broadly.

The KL penalty confines the policy to a neighborhood of the SFT reference. Within that neighborhood, real improvements are possible (higher reward without dramatic drift). Outside it, the policy is penalized — which prevents it from exploiting reward-model blind spots that exist far from the sensible SFT distribution.
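A tiny numeric sketch of this trade-off, under loudly stated assumptions: there are only three possible responses, the flawed reward is simply length divided by 1000, and the reference and candidate policies are hand-picked probability vectors. It evaluates the KL-penalized objective for a policy that stays near the reference and for one that has collapsed onto the longest, padded response.

```python
import math


def expected_reward(policy, rewards):
    return sum(p * r for p, r in zip(policy, rewards))


def kl_divergence(p, q):
    """KL(P || Q) for two categorical distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def objective(policy, ref, rewards, beta):
    """J = E_policy[reward] - beta * KL(policy || ref)."""
    return expected_reward(policy, rewards) - beta * kl_divergence(policy, ref)


# Three hypothetical responses: concise, longer, padded with repetition.
lengths = [10, 40, 200]
rewards = [n / 1000 for n in lengths]  # flawed reward: longer always scores higher

ref = [0.60, 0.35, 0.05]        # SFT reference mostly answers concisely
near_ref = [0.50, 0.40, 0.10]   # modest shift away from the reference
collapsed = [0.01, 0.01, 0.98]  # reward-hacked: almost always the padded response

for beta in (0.0, 0.1):
    print(beta,
          objective(near_ref, ref, rewards, beta),
          objective(collapsed, ref, rewards, beta))
```

With beta at 0 the collapsed policy scores higher; with beta at 0.1 the near-reference policy wins. Where the crossover sits depends on the scale of the reward relative to beta, which is one reason beta needs tuning per reward model.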

The RLHF/PPO training loop. The policy generates responses on-policy; the reward model scores them; the PPO update uses the reward minus a KL penalty relative to the frozen reference. A separate critic (value) network estimates baselines. All four models must fit in GPU memory simultaneously — the primary reason RLHF is expensive.
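The clipping inside the PPO update can be sketched for a single action (token). This follows the clipped surrogate objective from Schulman et al. (2017) as commonly stated; the function name is illustrative and the clip range of 0.2 is an assumed, commonly used default rather than a value given in this lesson.

```python
import math


def ppo_clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate for one action: min(ratio * A, clip(ratio) * A).

    ratio = pi_new(a) / pi_old(a), computed from log-probs for stability.
    The objective is maximized; the min makes it a pessimistic bound.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# Good action, policy already moved far toward it: gain is capped at 1.2 * A.
capped_gain = ppo_clipped_term(math.log(1.5), 0.0, advantage=1.0)

# Bad action, policy moved toward it anyway: the full penalty is kept.
full_penalty = ppo_clipped_term(math.log(1.5), 0.0, advantage=-1.0)

# Bad action, policy already moved far away: no extra credit beyond 0.8 * A.
capped_relief = ppo_clipped_term(math.log(0.5), 0.0, advantage=-1.0)
```

Once the ratio leaves the clip range in the direction the advantage favors, the term stops changing with the policy, so its gradient vanishes and a single batch cannot drag the policy arbitrarily far.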

Operational complexity

RLHF with PPO is operationally brutal. Four models must fit in GPU memory simultaneously: the policy π_θ (being trained), the frozen reference π_ref, the frozen reward model r_φ, and a separate critic / value network V_ψ used to estimate advantage baselines for PPO. For a 7B-parameter model, that is roughly 4× the weight memory of a single model, before counting gradients and optimizer state for the two trainable networks (policy and critic). The RL loop is notoriously hyperparameter-sensitive: KL coefficient β, PPO clip ratio, learning rate schedule, and rollout length all interact in nonlinear ways. Training instabilities, reward hacking, and mode collapse are common failure modes. This is the most finicky part of the entire LLM stack.
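A back-of-the-envelope sketch of that footprint. The byte counts are assumptions, not measurements: 2 bytes per parameter for bf16 weights and gradients, and 12 bytes per parameter of Adam-style optimizer state (fp32 master weights plus two fp32 moments) for each trainable model. All four models are assumed to be the same size, and activations, rollout KV cache, and framework overhead are ignored.

```python
def weights_only_gb(n_params, n_models, weight_bytes=2):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_models * n_params * weight_bytes / 1e9


def training_footprint_gb(n_params, n_trainable, n_frozen,
                          weight_bytes=2, grad_bytes=2, optim_bytes=12):
    """Weights + gradients + optimizer state for trainable models,
    weights only for frozen ones. Byte counts are assumptions (see lead-in)."""
    per_trainable = n_params * (weight_bytes + grad_bytes + optim_bytes)
    per_frozen = n_params * weight_bytes
    return (n_trainable * per_trainable + n_frozen * per_frozen) / 1e9


N = 7e9  # a 7B-parameter model

rlhf_weights = weights_only_gb(N, n_models=4)
# Policy and critic are trained; reference and reward model are frozen.
rlhf_total = training_footprint_gb(N, n_trainable=2, n_frozen=2)

print(f"RLHF/PPO weights only: {rlhf_weights:.0f} GB")
print(f"RLHF/PPO with grads and optimizer state: {rlhf_total:.0f} GB")
```

Under these assumptions the two trainable networks dominate the total, which is why sharding, offloading, or adapter-based training is common in practice.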

DPO

Skip the reward model and the RL loop. Optimize preferences directly.

Direct Preference Optimization (DPO, Rafailov et al. 2023) is the result that reshaped alignment. The key insight: the RLHF objective — maximize reward under a KL constraint — has a closed-form optimal policy. Given a reward function r and reference model π_ref, the policy that maximizes the constrained objective is:

π*(y|x) ∝ π_ref(y|x) · exp( r(x,y) / β )

This is the Boltzmann/softmax distribution over responses, weighted by reward. The key: we can rearrange this to express r in terms of π and π_ref:

r(x,y) = β · log[ π*(y|x) / π_ref(y|x) ] + β · log Z(x)

where Z(x) is a partition function (the normalizing constant, same for y_w and y_l). Substituting into the Bradley-Terry loss, Z(x) cancels, which gives:

L_DPO(θ) = − E_{(x, y_w, y_l) ~ D}[ log σ( β · ( log[ π_θ(y_w|x) / π_ref(y_w|x) ] − log[ π_θ(y_l|x) / π_ref(y_l|x) ] ) ) ]

log[ π_θ(y_w|x) / π_ref(y_w|x) ] = log-ratio of policy vs ref on CHOSEN
log[ π_θ(y_l|x) / π_ref(y_l|x) ] = log-ratio of policy vs ref on REJECTED
β = temperature / KL anchor strength
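The cancellation of Z(x) in the derivation can be verified numerically on a toy problem. The sketch below assumes a prompt with only three possible responses and made-up reference probabilities and rewards; it builds the closed-form optimal policy, then recovers reward differences from log-ratios alone.

```python
import math


def optimal_policy(ref_probs, rewards, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z, over a finite set of responses."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)  # the partition function Z(x)
    return [w / z for w in weights], z


def implicit_reward(policy_prob, ref_prob, beta):
    """beta * log(pi(y) / pi_ref(y)): the reward up to a per-prompt constant."""
    return beta * math.log(policy_prob / ref_prob)


# Made-up toy values for one prompt with three candidate responses.
ref = [0.5, 0.3, 0.2]
rewards = [1.0, 2.0, 0.5]
beta = 0.5

pi_star, z = optimal_policy(ref, rewards, beta)
implicit = [implicit_reward(p, q, beta) for p, q in zip(pi_star, ref)]

# Every implicit reward is off from the true reward by the same constant, beta * log Z.
offsets = [r - ir for r, ir in zip(rewards, implicit)]

# So differences between responses match exactly, which is all Bradley-Terry needs.
true_margin = rewards[1] - rewards[0]
implicit_margin = implicit[1] - implicit[0]
```

In a real language model the set of responses is far too large to enumerate, so Z(x) can never be computed. The derivation matters precisely because the pairwise loss never needs it.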

Plain reading: the DPO loss pushes the policy to assign relatively higher probability (relative to the reference) to the chosen response and relatively lower probability to the rejected response. The reference model provides the baseline. The β parameter plays the same KL-anchoring role as in PPO — it is now baked directly into the loss rather than added as a separate term. The "implicit reward" of a DPO-trained policy is β · log(π_θ / π_ref).
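The plain reading translates into a short per-pair function. This is a sketch on plain floats with an illustrative function name; each log-probability argument is assumed to be the sum of token log-probabilities for the whole response under the given model.

```python
import math


def dpo_pair_loss(pol_chosen_logp, pol_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) triple.

    Inputs are sequence log-probs (sums over response tokens).
    """
    chosen_logratio = pol_chosen_logp - ref_chosen_logp
    rejected_logratio = pol_rejected_logp - ref_rejected_logp
    z = beta * (chosen_logratio - rejected_logratio)
    # -log sigmoid(z), computed as a numerically stable softplus(-z)
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: both log-ratios are 0, loss is ln 2.
loss_at_init = dpo_pair_loss(-20.0, -25.0, -20.0, -25.0)

# Policy raised the chosen response and lowered the rejected one: loss drops.
loss_improved = dpo_pair_loss(-18.0, -28.0, -20.0, -25.0)

# Policy moved the wrong way: loss rises above ln 2.
loss_worse = dpo_pair_loss(-23.0, -22.0, -20.0, -25.0)
```

Notice that the reference's own preference between the two responses (here it already favors the chosen one, -20 versus -25) does not lower the loss. Only movement relative to the reference counts.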

What DPO eliminates

DPO needs:

No separate reward model. The reward is implicit in the policy ratio. Stage 2 of the RLHF pipeline disappears entirely.
No on-policy sampling. The preference pairs are fixed training data, just like SFT. No need to generate responses during training.
No RL algorithm. The loss is a standard supervised (cross-entropy-like) loss. Stable, predictable, easy to implement.
Two models in memory (policy + frozen reference), not four.

RLHF/PPO versus DPO. RLHF requires four sequential steps including a separate reward model and on-policy sampling. DPO collapses this to a single supervised loss on the same preference pairs — the reward model is implicit in the log-probability ratio.

The practical difference is enormous. DPO is stable, debuggable, and fast to iterate. For most teams — running models in the 7B–70B range, with hundreds of thousands of preference pairs — DPO or one of its close variants is the default. RLHF/PPO is still used at the frontier where its extra control (and ability to use a separately trained reward model for filtering) is worth the pain. But DPO democratized alignment the way LoRA democratized fine-tuning.

Dimension	RLHF + PPO	DPO
Reward model needed?	Yes — separate, explicitly trained	No — implicit in log-ratio
RL loop required?	Yes — on-policy PPO	No — supervised loss
On-policy sampling?	Yes — throughout training	No — fixed preference pairs
Models in memory	4 (policy, ref, RM, critic)	2 (policy, frozen ref)
Training stability	Finicky — many hyperparameters	Stable — like fine-tuning
Implementation complexity	High	Low
Output quality (frontier)	Competitive, max control	Slightly below PPO at scale
Typical adopters	OpenAI, Anthropic, DeepMind	Most open-weight models

The Wider Family

RLAIF, GRPO, Constitutional AI, KTO — the same principle, different angles.

Alignment is one of the most active research areas in ML. The following variations appear constantly in papers, blog posts, and model cards. The core pattern is always the same: a preference signal (from humans, AI, or verifiers) and a training algorithm that uses it to push the policy toward preferred behavior. The axes of variation are where the signal comes from and how you optimize against it.

Variations in the feedback source

RLAIF (Reinforcement Learning from AI Feedback). Replace expensive human preference labels with judgments from a strong LLM, often prompted with a written rubric or set of principles. Scales preference data collection by orders of magnitude; most modern post-training mixes human and AI feedback. The risk is feedback laundering: the AI labeler's biases become the policy's biases.
Constitutional AI (Anthropic, Bai et al. 2022). A structured form of RLAIF where the AI critiques and revises its own outputs against a written set of principles (a "constitution"), then generates preference data from the revised outputs. The constitution makes the value system explicit and auditable — a key advantage over opaque human labeling.
RL with verifiable rewards. For tasks with ground-truth answers — math, code, formal logic — you can reward the model directly for correctness without a trained reward model. This is the basis of reasoning-model training (DeepSeek-R1, o1-style). GRPO (below) is the typical optimization algorithm here.

Variations in the optimization algorithm

GRPO (Group Relative Policy Optimization, Shao et al. 2024). A PPO variant that drops the separate critic/value network. Instead of computing a value baseline from a separate model, GRPO samples a group of responses to the same prompt and estimates the advantage of each response relative to the group mean. This cuts memory from 4 models to 3 and is more stable. Used heavily in DeepSeek-R1 and other reasoning models.
IPO (Identity Preference Optimization). A DPO variant designed to resist overfitting to the preference pairs. When DPO is trained for many steps, the policy can keep widening the margin on the specific chosen/rejected pairs in the dataset; IPO replaces the log-sigmoid loss with a squared loss that pulls the log-ratio margin toward a fixed target, which acts as a regularizer.
KTO (Kahneman-Tversky Optimization). A DPO variant that relaxes the requirement for paired comparisons. KTO can train on unpaired good/bad labels — "response A was good" without needing a corresponding bad response. Named after Kahneman-Tversky prospect theory because it models the asymmetry in how humans weight gains vs losses.
ORPO (Odds Ratio Preference Optimization). Folds preference optimization directly into SFT by modifying the standard language modeling loss. No need for a separate stage 3 at all — you run SFT and preference optimization in a single training run.
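Of the optimizer variants above, GRPO's group-relative baseline is the easiest to sketch. The function below centers each reward on the group mean, as described, and also divides by the group standard deviation, which is how the normalization is commonly presented; the function name, the epsilon, and the use of the population standard deviation are assumptions of this sketch.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each sampled response relative to its own group.

    rewards: scalar rewards for several responses to the SAME prompt.
    No critic network is involved; the group itself is the baseline.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled answers to one math prompt, scored by a verifier (1 = correct).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# If every sample gets the same reward, there is nothing to learn from the group.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

The second case is worth noticing: prompts that the policy always solves or always fails contribute zero advantage, so the useful training signal comes from prompts of intermediate difficulty.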

The throughline: understand the reward-model + KL-anchored-policy core and every one of these is a recognizable variation. Most future methods will likely fit this frame too.

Inference Relevance

Only the policy is served. Alignment lives in the weights — and in the system prompt.

As an inference engineer, here is what alignment means concretely for your work:

What you actually serve

At inference time, you serve only the policy π_θ. The reference model π_ref is needed only to compute the KL penalty during training — at inference, it is gone. The reward model r_φ is similarly training-only. So a 7B-aligned model served in production is exactly one 7B model. No overhead from the training apparatus.

How alignment shapes generation

Alignment is baked into the weights through gradient updates on the preference pairs. Every generation from the aligned policy reflects learned preferences: which formats to use, how to hedge uncertainty, when to refuse, how to be concise. These behaviors are not enforced at inference — they emerge from the model's probability distribution over tokens. You cannot "turn them off" via decoding without significant effort (jailbreaks exist but are a cat-and-mouse game).

The system prompt as a soft amplifier

The system prompt is your primary lever. The alignment training teaches a space of behaviors; the system prompt steers the model to a region of that space. A well-written system prompt can encourage more concise responses, different tones, or more conservative / liberal handling of edge cases. But it cannot override the fundamental alignment — it is steering within the learned distribution, not outside it. This is why writing system prompts requires understanding what the model was aligned on.

Decoding and alignment interaction

Temperature. Lower temperature sharpens the distribution — makes the policy behave more like argmax over its learned preferences. Often improves alignment reliability at the cost of diversity.
Top-p / Top-k. Truncating the tail of the distribution removes low-probability continuations. For aligned models this often improves helpfulness by preventing drift to rare, poorly-aligned tokens.
Repetition penalty. Can interact with length-aligned models (those trained to avoid verbosity) unpredictably — sometimes amplifying the reward-hacking artifacts from imperfect alignment.
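The first two knobs above, temperature and top-p, can be sketched on a single next-token distribution. The logits are made up, and the function names are illustrative; production samplers apply the same idea to vocabulary-sized tensors.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most-likely tokens whose mass reaches top_p,
    zero out the rest, and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered


def sample(probs, rng):
    """Draw one token index from a probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical next-token logits
base = softmax_with_temperature(logits, temperature=1.0)
sharp = softmax_with_temperature(logits, temperature=0.5)
nucleus = top_p_filter(base, top_p=0.9)
token = sample(nucleus, random.Random(0))
```

Lower temperature concentrates mass on the top token, and top-p removes the tail entirely, so a token excluded by the filter can never be sampled no matter how many draws are made.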

Serving the reference model at training time

If you are building an alignment pipeline (e.g. with TRL or OpenRLHF), you will need to run the reference model alongside the policy. For DPO, the reference is frozen and the preference pairs are fixed, so its forward passes can be batched with the policy's, or its log-probabilities can be computed once up front and cached. For RLHF, the reference model must score each freshly sampled response, which adds a forward pass per rollout on top of generation.

The DPO implicit reward β · log(π_θ(y|x) / π_ref(y|x)) can be computed at inference time to score candidate responses without a separately trained reward model. It can serve as a reranking signal: generate K responses, score each with the implicit reward, return the highest-scoring one. The catch is that the score needs log-probabilities from both the policy and the reference, so the reference model must be loaded and run alongside the policy. This is the one case where π_ref comes back at serving time, and it costs a second model's worth of memory plus an extra scoring pass per candidate.
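A minimal sketch of that reranking step. It assumes the sequence log-probabilities of each candidate have already been computed under both the policy and the reference model; the candidate texts, the numbers, and the function names are all hypothetical.

```python
def implicit_reward(policy_logp, ref_logp, beta=0.1):
    """beta * log(pi_theta(y|x) / pi_ref(y|x)), from sequence log-probs."""
    return beta * (policy_logp - ref_logp)


def rerank(candidates, beta=0.1):
    """Return the candidate with the highest implicit reward.

    candidates: list of (text, policy_logp, ref_logp) tuples. Obtaining
    ref_logp requires a forward pass through the reference model.
    """
    return max(candidates, key=lambda c: implicit_reward(c[1], c[2], beta))


# Hypothetical scores for K = 3 sampled responses to one prompt.
candidates = [
    ("response A", -12.0, -15.0),  # policy likes it much more than the reference
    ("response B", -10.0, -10.5),  # most likely under the policy, but barely shifted
    ("response C", -20.0, -18.0),  # policy likes it less than the reference
]
best_text, _, _ = rerank(candidates)
```

The winner here is not the response with the highest policy log-probability. The implicit reward favors the response whose probability alignment training raised the most relative to the reference, which is a different question from which response is most likely.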

Exercise

Seven exercises in the notebook.

Companion notebook: day-14-alignment.ipynb. All exercises run on CPU in seconds.

Bradley-Terry reward loss. Implement bt_loss(r_chosen, r_rejected) and verify: as the margin grows, the loss falls; at margin 0 the loss is ln 2; at margin 3 it is near 0.
Train a toy reward model. On synthetic "shorter is better" preferences, train a scalar reward head and measure preference accuracy on held-out pairs. Plot the training loss curve.
Implement the DPO loss. Write dpo_loss(pol_chosen_logp, pol_rejected_logp, ref_chosen_logp, ref_rejected_logp, beta) from scratch using log-ratios. Verify it equals ln 2 when policy equals reference.
DPO on toy preferences. Initialize a tiny toy policy at the same distribution as a frozen reference. Run DPO on synthetic preference pairs. Show that the chosen-vs-rejected log-prob margin grows over training steps.
KL anchoring and the β sweep. Sweep β over [0.05, 0.1, 0.5, 1.0]. Show that small β allows larger drift from the reference and larger margin, while large β stays close but learns slowly. Plot drift vs margin.
Per-token KL penalty. Compute the per-token KL between two distributions and show it sums to the sequence-level KL. Verify numerically.
Reward hacking demo. Build a flawed reward (rewards length). Show that maximizing reward without a KL anchor degrades the policy to pathological outputs. Add the KL anchor and show it stays near the reference.

Self-Check

Ten questions before moving on.

Close the page and answer from memory. If you cannot, re-read the relevant section.

Why is next-token log-likelihood not a sufficient training objective for alignment? Give a concrete example where high likelihood and high preference diverge.
What are the three (or four) stages of the canonical alignment pipeline? What does each stage produce?
Why are pairwise comparisons better training data than absolute quality ratings? Name two practical reasons.
Write the Bradley-Terry reward-model loss and explain every term. What does it optimize?
In RLHF, identify: the policy, the reference model, the reward model, and the critic. What does each do? Which are frozen?
What is reward hacking? Give two realistic failure modes. How does the KL penalty mitigate each?
Derive, in one sentence, why DPO can skip the explicit reward model. What mathematical fact makes this possible?
Write the DPO loss and explain the role of β. What is the "implicit reward" in a DPO-trained policy?
How many models does RLHF/PPO require in memory? How many does DPO require? What are they?
At inference time: does the reference model need to be served? What is the implicit reward and how could you use it for reranking?

Week 2 Wrap-Up

You can now train an LLM — and understand every model you'll ever serve.

What you've covered

The pre-training objective, data pipeline, and scaling laws (C ≈ 6ND, Chinchilla) — Days 8–9.
A tiny GPT trained end-to-end, from forward pass to gradient — Day 9 capstone.
A production training loop: mixed precision, gradient clipping, accumulation, schedules, MFU — Day 10.
Distributed training: DP, FSDP/ZeRO, tensor and pipeline parallelism — Day 11.
Modern architecture: RoPE, RMSNorm, SwiGLU, GQA/MQA, MoE — Day 12.
Fine-tuning: SFT, LoRA, QLoRA, and how adapters are served — Day 13.
Alignment: preference data, reward modeling (Bradley-Terry), RLHF/PPO, and DPO — Day 14.

What's next — Week 3

We pivot from training to inference and hardware — the heart of this course. Prefill versus decode, GPU architecture and the roofline model, your first CUDA kernel, Apple Silicon and MLX, the KV cache in depth, and FlashAttention. Everything you built in Weeks 1–2 becomes the thing you now make fast.

If you want to consolidate, the best exercise is to take your Day 9 GPT, upgrade it with Day 12's components (RoPE + RMSNorm + SwiGLU + GQA), fine-tune it with a Day 13 LoRA adapter, and align it on toy preferences with a Day 14 DPO loss. End to end, your own hands, every line. That is the entire training stack — and you own it.

Go deeper.

The alignment canon.

Paper · 2022

Ouyang et al. — InstructGPT

The three-stage SFT → RM → PPO pipeline, at scale. The blueprint for modern assistant models.


Paper · 2023

Rafailov et al. — DPO

"Your Language Model is Secretly a Reward Model." The closed-form derivation and its implications.


Paper · 2017

Christiano et al. — Deep RL from Human Preferences

The original preference-based RL paper. The conceptual foundation that InstructGPT scaled.


Paper · 2017

Schulman et al. — PPO

Proximal Policy Optimization. The RL algorithm behind RLHF — understand the clip objective and the value network.


Paper · 2022

Bai et al. — Constitutional AI

RLAIF: alignment with AI feedback against a written constitution. Scales preference data, makes values explicit.


Paper · 2024

Shao et al. — DeepSeekMath (GRPO)

Group Relative Policy Optimization: removes the separate critic, uses group-sampled baselines. Key to reasoning models.


Library · HF

huggingface/trl

Production SFT, reward modeling, PPO, DPO, KTO, GRPO trainers. The fastest path from theory to working code.


Book · Lambert

Nathan Lambert — The RLHF Book

A free, current, book-length treatment of the whole alignment stack from preference data to production.


Paper · 2025

DeepSeek-R1

Large-scale reasoning model trained with RL (GRPO) on verifiable rewards — the frontier of alignment meeting inference engineering.
