The post-training pass that turns a parrot into an assistant.
Reading time: 10-12 min · Audio narration available · Prerequisites: 02 · Source: Track A · Gemini
§ 1
What this lesson covers.
This module is one of 42 in the curriculum. Below is the canonical interactive lesson — tabs, cards, and diagrams from the source repo, rendered inside the course shell. An audio narration runs alongside it - the sticky player at the top of the page plays the full Module 03 clip.
§ 2
The lesson itself.
A base model is not an assistant — it just completes text
After pretraining, a model knows how to predict the next token but has no concept of "being helpful" or "avoiding harm." Ask it a question and it may continue with more questions. RLHF (Reinforcement Learning from Human Feedback) bridges this gap — teaching the model what humans actually want from it.
Stage 1: SFT → Stage 2: Reward model → Stage 3: RL (PPO / GRPO) → Aligned model
Stage 2 — The reward model
A neural network that predicts human preference
Human annotators are shown pairs of responses (A vs B) to the same prompt and pick which they prefer. A reward model is trained on (prompt, chosen, rejected) triplets to predict which response a human would prefer — assigning a scalar score. Typically a transformer with a linear head on top.
Bradley-Terry formulation
The probability model behind preference learning
If response y_w is preferred over y_l, the probability of that preference is modelled as σ(r(x,y_w) − r(x,y_l)), where σ is sigmoid. The loss is: −log σ(r_w − r_l). This elegant formula says: push the reward of the chosen response higher than the rejected one. No absolute score needed — only relative preference.
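A minimal sketch of the Bradley-Terry loss for a single preference pair, assuming the reward model has already produced scalar scores. The function names are illustrative, not taken from any library.

```python
import math

def preference_probability(r_chosen: float, r_rejected: float) -> float:
    """P(chosen preferred over rejected) = sigmoid(r_w - r_l)."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """-log sigmoid(r_w - r_l), computed as a numerically stable softplus."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)) = max(-m, 0) + log1p(exp(-|m|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

Only the difference between the two scores enters the loss, so adding the same constant to both rewards leaves it unchanged. That is the sense in which no absolute score is needed.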
On-policy vs off-policy alignment
Two fundamentally different approaches to using preference data
On-policy (PPO, GRPO): The model generates new responses during training, the reward model scores them, and the RL algorithm updates the model's weights. This is expensive, because generating responses at every step is slow, but it tends to reach a higher performance ceiling because the model always trains on its own current behaviour.
Off-policy (DPO, ORPO): Training uses pre-collected preference data without generating new responses during training. Simpler, faster, more stable — but the model never explores beyond the fixed dataset. Less adaptive than on-policy methods for complex reasoning tasks.
Proximal Policy Optimization
The dominant on-policy RL algorithm for LLM alignment
PPO (Schulman et al. 2017) became the backbone of ChatGPT's alignment process (InstructGPT). In the LLM setting, the "policy" is the LLM, "actions" are tokens, "states" are the generation so far, and the "reward" comes from the reward model at the end of each complete response.
Sub-topics: PPO-KL · PPO-Clip · Advantage (GAE) · Critic / value model · Reward hacking
Four models required during PPO training
Policy π_θ: the LLM being trained (weights updated)
Reference π_ref: frozen SFT model (KL anchor)
Reward R_φ: frozen reward model (scores full responses)
Critic V_γ: value function (weights updated)
Policy + Critic = 2 LLM copies trained simultaneously → huge memory cost. GRPO removes the critic, and with it one of those two trained copies.
PPO combined objective
Three terms:
1. Clipped surrogate: maximise advantage while staying close to the old policy (clip at ε=0.2).
2. KL penalty: subtract β×KL(π_θ ‖ π_ref) to prevent reward hacking (β=0.02–0.1).
3. Value loss: train the critic to predict future rewards accurately.
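The three terms can be sketched in plain Python over a small batch of token-level samples. This is an illustration under assumptions, not a production implementation: `TokenSample`, `value_coef`, and the pre-computed `kl_to_ref` argument are hypothetical names introduced here, and β = 0.05 is simply a value inside the range quoted above.

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class TokenSample:
    logp_new: float   # log-prob of the token under the current policy
    logp_old: float   # log-prob under the policy that generated the rollout
    advantage: float  # advantage estimate (e.g. from GAE)
    value: float      # critic's prediction for this state
    ret: float        # observed return the critic should match

def clipped_surrogate(s: TokenSample, eps: float = 0.2) -> float:
    """Term 1: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(s.logp_new - s.logp_old)
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * s.advantage, clipped * s.advantage)

def ppo_loss(batch: List[TokenSample], kl_to_ref: float,
             beta: float = 0.05, value_coef: float = 0.5,
             eps: float = 0.2) -> float:
    """Scalar loss to minimise: -(surrogate - beta * KL) + value_coef * value loss."""
    n = len(batch)
    surrogate = sum(clipped_surrogate(s, eps) for s in batch) / n
    # Term 3: mean squared error between critic predictions and returns
    value_loss = sum((s.value - s.ret) ** 2 for s in batch) / n
    # Term 2: the KL penalty is subtracted from the quantity being maximised
    return -(surrogate - beta * kl_to_ref) + value_coef * value_loss
```

The clip caps the gain from a positive advantage once the ratio exceeds 1 + ε and from a negative advantage once the ratio falls below 1 − ε. In both cases the policy gets no extra credit for moving further from the old policy.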
PPO limitations
Memory: 4 models on GPU simultaneously, 80 GB+ for a 7B model.
Stability: balancing policy and critic training is notoriously tricky.
Speed: completions must be generated on-policy before each update.
Complexity: many hyperparameters must be tuned simultaneously.
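A rough back-of-the-envelope estimate shows why the memory figure is a lower bound. The byte counts are assumptions made for this sketch, not numbers from the lesson: 2 bytes per parameter for bf16 weights, plus 14 extra bytes per parameter for each trained model (2 for gradients, 4 for an fp32 master copy, 8 for two Adam moments). It also assumes all four models are the same size and ignores activations and the KV cache.

```python
def ppo_memory_gb(n_params: float, weight_bytes: int = 2,
                  train_extra_bytes: int = 14) -> dict:
    """Estimate GPU memory for PPO's four models (assumed equal in size)."""
    # Policy, reference, reward model, and critic all hold weights
    weights_only = 4 * n_params * weight_bytes
    # Only policy and critic carry gradients and optimizer state
    training_state = 2 * n_params * train_extra_bytes
    return {
        "weights_gb": weights_only / 1e9,
        "total_gb": (weights_only + training_state) / 1e9,
    }

estimate = ppo_memory_gb(7e9)
```

Under these assumptions a 7B setup needs about 56 GB for weights alone and about 252 GB once training state is included, which is why PPO at this scale is usually sharded across several GPUs.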
Best-of-N (Rejection Sampling)
The simplest alignment strategy — generate many, keep the best
Generate N responses from the model for each prompt, score all N with the reward model, keep only the highest-scoring one as training data, then fine-tune the model on these "best" responses via SFT. No RL required. Llama 2's alignment used four rounds of rejection sampling before any RL. Simple and effective for moderate alignment goals, but computationally wasteful — most generations are discarded.
Model → N responses (r₁, r₂, …, r_N) → Reward model → Best response → SFT on best
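The pipeline above can be sketched in a few lines. The `generate` and `reward` callables stand in for the model's sampler and the reward model; both are hypothetical placeholders supplied by the caller, not real library APIs.

```python
from typing import Callable, List, Tuple

Generator = Callable[[str], str]        # prompt -> one sampled response
Scorer = Callable[[str, str], float]    # (prompt, response) -> scalar reward

def best_of_n(prompt: str, generate: Generator, reward: Scorer,
              n: int = 4) -> Tuple[str, float]:
    """Sample n responses, score each, and return the top one with its score."""
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]

def build_sft_dataset(prompts: List[str], generate: Generator,
                      reward: Scorer, n: int = 4) -> List[Tuple[str, str]]:
    """Keep only (prompt, best response) pairs; the other n - 1 samples are discarded."""
    return [(p, best_of_n(p, generate, reward, n)[0]) for p in prompts]
```

The resulting pairs feed an ordinary SFT run. The cost is visible in the code: every training example requires n generations and n reward-model calls.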
Direct Preference Optimization (DPO)
Skip the reward model entirely — optimise directly on preference pairs
DPO (Rafailov et al., Stanford 2023) made a key mathematical observation: the KL-regularised RLHF objective has a closed-form optimal policy, so the policy can be optimised directly on preference data with no separate reward model. By rearranging the RLHF objective, DPO shows the reward is implicitly encoded in the log-ratio of probabilities between the policy and a reference model.
y_w = chosen response · y_l = rejected response · β controls deviation from reference · No reward model · No RL · SFT-style training
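With the symbols in this legend, the DPO loss for one pair is −log σ(β·[log π_θ(y_w|x) − log π_ref(y_w|x)] − β·[log π_θ(y_l|x) − log π_ref(y_l|x)]). The sketch below computes it from summed sequence log-probabilities. The function name is illustrative, and β = 0.1 is an assumed default rather than a value from the lesson.

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each response
    reward_w = beta * (logp_w - ref_logp_w)
    reward_l = beta * (logp_l - ref_logp_l)
    margin = reward_w - reward_l
    # -log(sigmoid(margin)), written as a numerically stable softplus
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

This is the Bradley-Terry loss with the reward model's score replaced by the implicit reward. When the policy equals the reference, both implicit rewards are zero and the loss is log 2.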
DPO advantages
No reward model to train. No RL instability. Trains like standard supervised fine-tuning. Memory efficient — only 2 models needed (policy + frozen reference). More stable and reproducible than PPO. Widely used in open-source models.
DPO limitations
Off-policy: trains on fixed preference data and never explores new responses. Tends to struggle on tasks requiring long, complex reasoning chains. Quality depends entirely on the preference dataset. Generally weaker than on-policy RL for hard reasoning tasks like math and code.
Direct Preference Optimization: Your Language Model is Secretly a Reward Model — Rafailov et al., Stanford 2023 (arXiv:2305.18290). Closed-form RLHF without a separate reward model.
Training language models to follow instructions (InstructGPT) — Ouyang et al., OpenAI 2022 (arXiv:2203.02155). The original RLHF paper for chat models.
Reasoning LLMs — a new training paradigm
Teaching models to "think before they answer"
Standard RLHF teaches models to be helpful and safe. Reasoning training, known as Reinforcement Learning with Verifiable Rewards (RLVR), teaches models to solve hard problems by generating long chain-of-thought reasoning traces. The key difference: rewards come from verifiable answers (math results, code tests) rather than from a learned preference model. These rewards are typically binary and objective, and they are much harder to hack than a learned reward model, although weak checkers or incomplete test suites can still be exploited.
What changes from standard RLHF
Reward source: verifiable ground truth. The math answer is correct or not; the code compiles and passes tests or not.
Format rewards: additional rewards for using the correct thinking format (e.g., reasoning inside <think> tags).
Emergent behaviours: models spontaneously develop self-correction, "aha moments," extended deliberation, and backtracking.
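A minimal sketch of a rule-based reward that combines an accuracy check with a format check. The regular expression, the string-equality comparison, and the `format_weight` of 0.1 are assumptions made for illustration; real pipelines use more careful answer extraction and normalisation.

```python
import re

# Expect: <think> reasoning </think> followed by the final answer
THINK_PATTERN = re.compile(r"^<think>(.+?)</think>\s*(.+)$", re.DOTALL)

def format_reward(completion: str) -> float:
    """1.0 if the completion wraps its reasoning in <think> tags, else 0.0."""
    return 1.0 if THINK_PATTERN.match(completion.strip()) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the final answer equals the ground truth exactly, else 0.0."""
    match = THINK_PATTERN.match(completion.strip())
    answer = match.group(2) if match else completion
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0

def total_reward(completion: str, ground_truth: str,
                 format_weight: float = 0.1) -> float:
    """Binary accuracy reward plus a small bonus for the thinking format."""
    return (accuracy_reward(completion, ground_truth)
            + format_weight * format_reward(completion))
```

No learned model appears anywhere in this function, which is the point of the approach: the reward is a deterministic check against ground truth.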
DeepSeek-R1-Zero — the surprising finding
Skip SFT entirely — RL directly on base model
Starting from a base model (DeepSeek-V3-Base) with no SFT phase, pure GRPO training on verifiable rewards produced emergent reasoning. The model learned to allocate more thinking time to harder problems, self-evaluate, and backtrack, all without being shown any chain-of-thought examples.
DeepSeek-R1 four-stage training pipeline
Stage 1: Cold-start SFT (few thousand CoT) → Stage 2: GRPO RL (verifiable rewards) → Stage 3: Rejection sampling (600K reasoning data) → Stage 4: Final GRPO (helpfulness + safety) → DeepSeek-R1
RLVR objective (maximised during training; the loss is its negative)
J(θ) = E [ R(x,y) ] − β · KL(π_θ ‖ π_ref)
R(x,y) = verifiable reward (1 if correct, 0 if wrong) · β prevents drift from reference policy · In GRPO: advantage normalises rewards within group
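In practice J(θ) is estimated from sampled responses. The sketch below assumes the responses were drawn from π_θ, so that the mean of log π_θ(y|x) − log π_ref(y|x) over the samples is a Monte Carlo estimate of the KL term. The function name is illustrative, and β = 0.04 mirrors the KL coefficient listed in the DeepSeekMath hyperparameters.

```python
from typing import Sequence

def rlvr_objective(rewards: Sequence[float],
                   logp_policy: Sequence[float],
                   logp_ref: Sequence[float],
                   beta: float = 0.04) -> float:
    """Monte Carlo estimate of J = E[R] - beta * KL(pi_theta || pi_ref)."""
    n = len(rewards)
    mean_reward = sum(rewards) / n
    # KL(pi_theta || pi_ref) = E_{y ~ pi_theta}[log pi_theta(y) - log pi_ref(y)]
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_ref)) / n
    return mean_reward - beta * kl_estimate
```

For example, three correct answers out of four give a mean reward of 0.75, and an estimated KL of 1.0 nat lowers the objective to 0.71.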
DeepSeekMath: Pushing the Limits of Mathematical Reasoning — Shao et al. 2024 (arXiv:2402.03300). Introduced GRPO and showed that RL applied on top of an instruction-tuned model improves mathematical reasoning.
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL — DeepSeek-AI 2025 (arXiv:2501.12948). R1-Zero: pure RL on base model. Open weights. Competitive with o1 on reasoning benchmarks.
Group Relative Policy Optimization
PPO without the critic — DeepSeek's key innovation
GRPO (Shao et al. 2024) eliminates the critic (value function) from PPO by estimating the baseline from the average reward of a group of responses generated for the same prompt. This removes one entire model from PPO's 4-model setup, with reported savings of 40–60% VRAM and up to 18× lower cost in DeepSeek's experiments.
A value function must predict future reward from intermediate text — notoriously hard. GRPO sidesteps this by observing multiple complete trajectories for the same prompt. The group average reward is a Monte Carlo estimate of expected reward — simpler, more direct, needing no separate network. Key: G ≥ 16, typically 64.
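The group baseline can be sketched as follows: each response's advantage is its reward minus the group mean, divided by the group standard deviation. Using the population standard deviation and a small `eps` guard are choices made for this sketch; implementations differ on both.

```python
import statistics
from typing import List, Sequence

def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalise rewards within one prompt's group of G sampled responses."""
    mean = statistics.fmean(rewards)   # Monte Carlo estimate of expected reward
    std = statistics.pstdev(rewards)   # spread of rewards within the group
    # eps avoids division by zero when every response gets the same reward
    return [(r - mean) / (std + eps) for r in rewards]
```

If every response in the group receives the same reward, all advantages are zero and the prompt contributes no gradient. This is one reason a reasonably large G helps: it raises the chance that a group contains both successes and failures.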
DeepSeekMath hyperparameters
G = 64 outputs per prompt · Batch = 1,024 · Learning rate = 1e-6 · KL coefficient = 0.04 · Max sequence length = 1,024 tokens · Single policy update per exploration stage · Training data: GSM8K + MATH (chain-of-thought format).
Evaluating reasoning models
How we measure whether a model can actually think
Reasoning benchmarks test multi-step problem solving, not pattern matching. Key metrics: Pass@k (does at least one of k attempts succeed?), majority voting (most common answer from N samples), and exact match on verifiable answers. Frontier models have saturated many older benchmarks — the field continuously moves to harder problems.
GSM8K · MATH · AIME · HumanEval · MMLU / MMLU-Pro · GPQA Diamond · LiveCodeBench · BIG-Bench Hard
Evaluation metrics for reasoning models
Pass@k
Generate k independent responses. Pass if at least one is correct. Pass@1 = accuracy. Pass@100 = ceiling of the model's potential. DeepSeek-R1 is evaluated with Pass@1 averaged over 32 samples on AIME.
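When n ≥ k samples are available per problem, Pass@k is usually computed with the unbiased estimator 1 − C(n−c, k) / C(n, k), where c is the number of correct samples. This estimator comes from the HumanEval paper (Chen et al. 2021) and is not part of the lesson text. The sketch below implements it with the standard library.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate from n samples of which c are correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k wrong samples: every size-k subset contains a correct one
        return 1.0
    # 1 - P(all k drawn samples are wrong)
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k = 1 this reduces to c / n, which is exactly "Pass@1 averaged over n samples": for instance, 8 correct out of 32 gives 0.25.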
Majority voting
Generate N responses (e.g. 32), return the most common answer. Much more reliable than a single sample. Often boosts AIME scores by 10–20 percentage points over Pass@1 alone.
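Majority voting is short enough to sketch directly. The function below assumes final answers have already been extracted as strings, and only trims surrounding whitespace before counting.

```python
from collections import Counter
from typing import Sequence

def majority_vote(answers: Sequence[str]) -> str:
    """Return the most common answer among N sampled responses."""
    normalised = [a.strip() for a in answers]
    # On a tie, Counter.most_common keeps the answer that appeared first
    return Counter(normalised).most_common(1)[0][0]
```

Voting only helps when answers can be compared for equality, which is why it is used on benchmarks with short final answers such as AIME.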
Exact match
Compare model answer to ground truth exactly. Used for math (numeric answers) and multiple-choice. For code, unit tests replace exact match — the generated code must run and produce correct outputs for all test cases.
Benchmark contamination
A critical risk: if training data contains benchmark questions, scores are inflated. A 2023 study found removing contaminated examples dropped GSM8K scores by up to 13% for some models. Always check contamination disclosures in model papers.
Frontier GSM8K (saturated): ~97%
DeepSeek-R1 on AIME 2024 (Pass@1): ~80%
Frontier GPQA Diamond: ~90%