What changes when the reward signal comes from groups of completions.
Reading time: 10 min · Audio narration available · Prerequisites: 03 · Source: Track A · Gemini
§ 1
What this lesson covers.
This module is one of 42 in the curriculum. Below is the canonical interactive lesson (tabs, cards, and diagrams from the source repo), rendered inside the course shell. An audio narration runs alongside it: the sticky player at the top of the page plays the full Module 09 clip.
If you prefer to read first and play with the demos after, the interactive lesson sits below this section. If you'd rather hear it narrated while you scroll, hit play on the sticky audio bar at the top — or just let it autoplay.
§ 2
The lesson itself.
Interactive lesson · ported from Gemini track
TRL & GRPO — Post-Training the Modern Way
Transformer Reinforcement Learning · Hugging Face v1.0.0 · April 2026
Layers 10 – 12
Full-stack post-training
TRL turns a raw pre-trained LLM into an aligned reasoning agent
TRL (Transformer Reinforcement Learning) is Hugging Face's production-grade library for every stage of post-training, from supervised fine-tuning through offline preference optimisation to highly scalable online GRPO. Every trainer is a thin subclass of transformers.Trainer: familiar API, production performance.
huggingface/trl — v1.0.0, April 2026. Apache 2.0 · Python 99.9%
The modern alignment pipeline
Three stages transform a raw pre-trained LLM into an aligned reasoning agent
Most modern assistant models go through some version of this pipeline, and TRL covers all three stages with production-grade trainers. The key insight: alignment is not a single step. It is a staged progression from imitation (SFT) to preference learning (DPO or a reward model) to reinforcement optimisation (GRPO/PPO).
Infographic 2 — The modern alignment pipeline (redrawn)
Pre-trained LLM (raw base model)
↓
Phase 1 — SFTTrainer
Supervised Fine-Tuning: teaches the model to follow instructions using (prompt, completion) pairs. Cross-entropy loss on completions only. Establishes the baseline capability needed before any RL.
Offline RL: DPOTrainer
Direct Preference Optimisation: no separate reward model. Trains on static (chosen, rejected) pairs. Simpler and more stable, but it cannot explore beyond the fixed dataset.
Online RL: RewardTrainer
Trains a preference model (Bradley-Terry) to distinguish good from bad outputs, which yields a numerical reward signal.
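The two preference-based objectives named above can be written in a few lines. This is a minimal sketch for a single preference pair, assuming scalar rewards and summed sequence log-probabilities as inputs; the function names and the default beta=0.1 are illustrative choices, not TRL's API.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Reward-model loss for one pair: -log sigmoid(r_chosen - r_rejected)."""
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair, using policy and frozen-reference log-probs.

    beta=0.1 is an assumed illustrative value, not taken from the lesson.
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


if __name__ == "__main__":
    # The loss falls as the chosen output is scored further above the rejected one.
    for gap in (0.0, 1.0, 3.0):
        print(gap, round(bradley_terry_loss(gap, 0.0), 4))
```

Both losses share the same shape: a log-sigmoid of a margin. The reward model learns the margin directly, while DPO derives it from how far the policy has moved from the reference on each completion.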
Infographic 10 — Framework relationship: breadth vs depth
| Framework | SFT | Reward Modeling | Online RL (PPO/GRPO) | Offline RL (DPO/KTO) | Focus |
|---|---|---|---|---|---|
| TRL | ✓ SFTTrainer | ✓ RewardTrainer | ✓ PPOTrainer / GRPOTrainer | ✓ DPOTrainer / ORPOTrainer | Full pipeline coverage |
| Unsloth RL | ✓ (patched) | not covered | ✓ GRPOTrainer only | not covered | Speed + memory for GRPO |
Key insight: Unsloth does NOT replace TRL. It is a localised speed and memory patch deployed specifically within TRL's GRPOTrainer ecosystem. They are complementary — TRL provides the API, Unsloth patches the kernel.
TRL Modern Alignment Pipeline — Hugging Face 2025. SFT → Reward → GRPO path is now the dominant approach for open-weight reasoning models.
Group Relative Policy Optimisation
GRPO eliminates the value model from PPO; combined with Unsloth's patches, VRAM drops by up to 90%
GRPO generates a group of completions for each prompt, scores them with a verifier, then uses the group's own average reward as the baseline, so no separate value model is needed. This single change removes PPO's biggest memory bottleneck while keeping an on-policy RL signal.
Infographic 6 — GRPO: the AI reasoning engine — 5 steps
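1. Group sampling
Generate G outputs per prompt (G = 8–64). Each is a different "reasoning path" for the same prompt. More paths give a more stable advantage estimate.
2. Score each output
Run each output through a verifier (math check, code compiler, rubric) to get rewards r₁…r_G. In the simplest case the reward is binary: 1 = correct, 0 = wrong. No learned reward model is needed.
3. Group advantage
Â_i = (r_i − mean(r)) / std(r), where r_i is the reward of output i. This is z-score standardisation within the group: above-average outputs are reinforced, below-average outputs are suppressed. The group is the baseline. If every output receives the same reward, std(r) = 0 and the group carries no learning signal, so implementations typically add a small constant to the denominator.
4. Clipped update
Objective (maximised): J = min(r_t·Â, clip(r_t, 1−ε, 1+ε)·Â) − β·KL(π_θ‖π_ref). Here r_t is the token-level probability ratio between the current and the sampling policy, not the reward r_i from step 2. Clip ratio ε = 0.2 and KL coefficient β = 0.04 are typical settings; the KL term limits drift from the reference policy, which curbs reward hacking.
5. No value model
PPO needs a third model to predict future reward. GRPO uses group statistics instead, which removes that model's VRAM overhead (40–90% in the figures this lesson cites) and makes single-GPU training feasible.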
Infographic 5 — RLVR & patience is all you need
Prompt: "2+2"
| Sampled completion | Reward | Outcome |
|---|---|---|
| X | 0 | wrong |
| cat | 0 | wrong |
| X | 0 | wrong |
| 4 | 1 | correct; RL reinforces this path |
| X | 0 | wrong |
| # | 0 | wrong |
The Concept
If the probability of the correct answer is > 0, a model sampled often enough will eventually produce it by chance, even before any RL training ("Patience is All You Need"). RL accelerates this: it reinforces the rare correct samples and uses the failures (0s) to shift the output distribution away from wrong answers.
⚠ The Constraint: If the initial probability is 0, RL will never work, because no sampled completion ever earns a reward. This is why GRPO is applied to SFT models, not raw base models. The SFT phase makes it far more likely that correct answers start with a non-negligible probability.
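The "patience" argument is simple arithmetic. As a rough sketch, assume each completion is an independent draw that is correct with probability p (an idealisation; the function names are illustrative). The chance that a group of G samples contains at least one correct answer is 1 − (1 − p)^G, and a group where every reward is identical gives GRPO nothing to learn from.

```python
def p_at_least_one_correct(p: float, g: int) -> float:
    """Probability that G independent samples contain at least one correct answer."""
    return 1.0 - (1.0 - p) ** g


def p_no_learning_signal(p: float, g: int) -> float:
    """Probability that all G binary rewards are equal (all wrong or all right).

    In that case every group-relative advantage is zero.
    """
    return (1.0 - p) ** g + p ** g


if __name__ == "__main__":
    for p in (0.0, 0.01, 0.2):
        for g in (8, 64):
            print(f"p={p:<5} G={g:<3} "
                  f"hit={p_at_least_one_correct(p, g):.3f} "
                  f"no_signal={p_no_learning_signal(p, g):.3f}")
```

With p = 0 the hit probability stays at zero for any G, which is the constraint stated above. With a small but non-zero p, larger groups make it much more likely that a prompt contributes a usable signal.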
DeepSeekMath: Pushing the Limits of Mathematical Reasoning — Shao et al. 2024 (arXiv:2402.03300). Introduced GRPO. DeepSeek-R1 extended this to full reasoning model training.
PPO vs GRPO
PPO is the heavyweight predecessor — GRPO is the lightweight successor
Both are on-policy RL algorithms that keep the model close to a reference policy via KL regularisation. The critical difference: PPO needs a separate value model to estimate future rewards. GRPO eliminates this by using group statistics as the baseline instead.
Infographic 7/9 — PPO architecture vs GRPO architecture
PPO: heavyweight predecessor
3 models required simultaneously
Generating policy π_θ: current model, weights updated
Reference policy π_ref: frozen SFT model, KL anchor
Value model V_γ: expected-return estimator, weights updated
Stability through proximal clipping: the ratio is clipped to 1±ε, which prevents excessively large weight updates.
KL divergence safety anchor: the β coefficient keeps the new model close to the reference policy.
VRAM: ≥ 510 GB at 8B / 20K context (the figure the comparison below attributes to standard TRL + Flash Attention 2)
GRPO: lightweight successor
2 models, value model eliminated
Generating policy π_θ: current model, weights updated
Reference policy π_ref: frozen SFT model, KL anchor
Value model: replaced by group statistics
Group relative advantage: Â_i = (r_i − μ_group) / σ_group
This serves the same baseline role as a value model, but it is estimated from the group itself. No extra parameters.
VRAM: 54 GB at 8B / 20K context (with Unsloth)
VRAM comparison (8B model, 20K context window): 510 GB with standard TRL + Flash Attention 2 → 54 GB with TRL + Unsloth (patched GRPO), a VRAM reduction of about 90%.
PPO — what the value model does
The value model V(s) predicts total future reward from each token position. This lets PPO compute advantage estimates at every step. Training the value model alongside the policy adds a second trainable network of comparable size, which substantially increases VRAM usage. It also adds instability, because the value model must be trained at the same time as the policy it is trying to evaluate.
GRPO — why group statistics work
Instead of predicting future reward, GRPO observes multiple complete trajectories for the same prompt. The group average reward is a Monte Carlo estimate of the expected reward: simpler, more direct, and needing no separate network. Rule of thumb: G ≥ 8, typically 16–64. More samples give more stable advantage estimates.
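Why more samples stabilise the baseline can be made concrete. As a rough sketch, assume binary rewards that are independent and correct with probability p (an idealisation; the function name is illustrative). The group mean then has standard error sqrt(p(1 − p) / G), so quadrupling G halves the noise in the baseline.

```python
import math


def baseline_standard_error(p: float, g: int) -> float:
    """Standard error of the mean of G independent 0/1 rewards with success rate p."""
    if g <= 0:
        raise ValueError("group size must be positive")
    return math.sqrt(p * (1.0 - p) / g)


if __name__ == "__main__":
    # Assumed success rate of 0.5, the noisiest case for a binary reward.
    for g in (4, 8, 16, 64):
        print(g, round(baseline_standard_error(0.5, g), 4))
```

The returns diminish quickly: going from 16 to 64 samples halves the noise again but quadruples the generation cost, which is one way to read the 16–64 range given above.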
Proximal Policy Optimization Algorithms — Schulman et al., OpenAI 2017 (arXiv:1707.06347). PPO — the algorithm that trained InstructGPT / ChatGPT.
Designing reward functions & verifiers
The secret to robust GRPO is a rubric — not a single reward
The biggest practical lesson from training reasoning models: a single "correct/incorrect" reward is too sparse. The model gets no signal unless the final answer is right. A rubric, meaning a stack of smaller verifiable rewards, gives partial credit for intermediate behaviours such as correct formatting, so each completion carries richer feedback and training tends to be more stable.
Verifier
Checks correctness and returns True or False; it assigns no numerical score. Example: executing Python code and checking that it raises no errors.
Reward Function
Converts verification into a numerical score. Can apply penalties for style, formatting, length, or verbosity. Multiple reward functions can be stacked (rubric).
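The split between the two roles can be sketched as follows: a verifier that returns only True or False, and a reward function that turns that verdict into a score and applies a length penalty. The function names, the score values, and the 400-character threshold are illustrative assumptions. Running model-generated code with exec is unsafe outside a sandbox, so treat this as a toy.

```python
def verify_python_runs(source: str) -> bool:
    """Verifier: True if the code compiles and executes without raising.

    WARNING: exec on untrusted output needs a real sandbox in practice.
    """
    try:
        code = compile(source, "<completion>", "exec")
        exec(code, {"__name__": "__verifier__"})
    except (Exception, SystemExit):
        return False
    return True


def code_reward(source: str, max_chars: int = 400) -> float:
    """Reward function: numeric score built on top of the verifier.

    The +1 / -1 scores and the 0.5 verbosity penalty are assumed values.
    """
    score = 1.0 if verify_python_runs(source) else -1.0
    if len(source) > max_chars:
        score -= 0.5  # penalise overly long completions
    return score


if __name__ == "__main__":
    print(code_reward("total = sum(range(10))"))
    print(code_reward("1 / 0"))
```

Keeping the verifier free of scoring logic means the same check can feed several reward functions with different penalties.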
Prompt: "What is 2 + 2?"
| Reward component | Condition | Score |
|---|---|---|
| Reward 1: Format | Number detected in response | +1 |
| Reward 1: Format | No number in response | −1 |
| Reward 2: Accuracy | Response matches "4" | +3 |
| Reward 2: Accuracy | Response is incorrect | −3 |
Total reward range: −4 to +4 · Stacked rewards create complex behaviour
Key insight: the trick to robust GRPO is defining a rubric, a list of smaller verifiable rewards, rather than a single all-consuming reward. Each component rewards a specific desirable behaviour (format, brevity, accuracy, chain-of-thought), and they stack to create nuanced overall feedback.
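The "What is 2 + 2?" rubric can be written as two small reward functions that are summed. The sketch below is shaped like the (completions, **kwargs) → list of floats convention that TRL's GRPOTrainer accepts for custom reward functions; treat that signature as an assumption and check the TRL documentation. Reading "matches 4" as "the last number in the response equals 4" is also an assumption made for illustration.

```python
import re


def format_reward(completions, **kwargs):
    """Reward 1: +1 if the response contains any digit, else -1."""
    return [1.0 if re.search(r"\d", c) else -1.0 for c in completions]


def accuracy_reward(completions, answer="4", **kwargs):
    """Reward 2: +3 if the last number in the response equals the answer, else -3."""
    rewards = []
    for c in completions:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", c)
        rewards.append(3.0 if numbers and numbers[-1] == answer else -3.0)
    return rewards


def rubric_reward(completions, **kwargs):
    """Stack the components: the total reward is the per-completion sum."""
    parts = [
        format_reward(completions, **kwargs),
        accuracy_reward(completions, **kwargs),
    ]
    return [sum(scores) for scores in zip(*parts)]


if __name__ == "__main__":
    samples = ["2 + 2 = 4", "I think it is 5", "cat"]
    print(rubric_reward(samples))
```

A response with a wrong number scores −2 rather than −4, so the model is nudged toward the right format before it is reliably accurate. That partial credit is what the rubric adds over a single binary reward.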
Reward component types: accuracy, format, length, code execution, chain-of-thought.
DeepSeek-R1, DeepSeek-AI 2025 (arXiv:2501.12948). Uses rule-based accuracy and format (<think> tags) rewards, and adds a language-consistency reward in the R1 training stage. Stacking several verifiable rewards in this way tends to give more stable GRPO training than a single binary reward.
Unsloth RL — efficiency breakthrough
From dual A100s to a single 8 GB consumer GPU — same model, same task
Unsloth makes three changes to TRL's GRPOTrainer. It patches the kernels to avoid materialising massive logit tensors. It implements Standby Mode KV-Cache, which releases vLLM's KV cache during training to reclaim memory. And it adds ultra-long-context optimisations: dynamic flattened sequence chunking plus offloading of log-softmax activations. The reported result is a VRAM reduction of about 90% on 8B models.
Infographic 8/9 — TRL + Unsloth efficiency stack
90% Memory Reduction
Unsloth patches GRPO kernels to avoid materialising massive logit tensors. 510 GB → 54 GB for an 8B model at 20K context.
Standby KV-Cache
Implements dynamic GPU memory management — releases vLLM's KV cache during training steps to reclaim memory, restores before next rollout.
380K Context Window
Ultra-long reasoning traces: dynamic flattened sequence chunking + offloading log-softmax activations. Up to 7× longer than standard TRL.
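A back-of-the-envelope calculation shows why the full logit tensor is the first thing to avoid. The numbers here are assumptions for illustration only: 8 completions per step, 20,000 tokens each, a vocabulary of 128,256 entries (typical of a Llama-3-class tokenizer), and 2 bytes per value (bf16). The helper name is hypothetical.

```python
def logits_gib(num_sequences: int, seq_len: int, vocab_size: int,
               bytes_per_value: int = 2) -> float:
    """Size in GiB of a dense [sequences, tokens, vocab] logit tensor."""
    total_bytes = num_sequences * seq_len * vocab_size * bytes_per_value
    return total_bytes / 2**30


if __name__ == "__main__":
    # Assumed workload: 8 completions x 20K tokens, 128,256-entry vocabulary, bf16.
    full = logits_gib(8, 20_000, 128_256)
    # Processing 1,024 tokens at a time keeps only one chunk of logits alive.
    chunked = logits_gib(8, 1_024, 128_256)
    print(f"full tensor: {full:.1f} GiB, one chunk: {chunked:.1f} GiB")
```

Under these assumptions the dense tensor alone is in the tens of GiB before any gradients or activations are counted, while a chunk-at-a-time computation of the per-token log-probabilities needs only a small fraction of that at its peak.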
vLLM rollout generation: up to 86× faster completions
Standard TRL: 1× baseline
TRL + vLLM: up to 86× faster during the online RL rollout phase
Desktop-level training — what this enables
Before Unsloth, GRPO training of an 8B reasoning model required institutional compute. With Unsloth's patches, the claim is that a researcher with a consumer GPU such as an RTX 3070 can fine-tune an 8B model with GRPO on math or coding tasks. This democratises reasoning-model training in the same way QLoRA democratised SFT in 2023. R1-style reasoning behaviour is reported to start emerging on a gaming PC after ~100 training steps, although that is far from matching DeepSeek-R1 itself.
Unsloth RL: Transforming LLMs Into Reasoning Models With GRPO — Unsloth AI 2025. 90% VRAM reduction on GRPOTrainer. 380K context window. Desktop-level reasoning model training.
TRL: Transformer Reinforcement Learning — von Werra et al., Hugging Face 2020. v1.0.0 released April 2026. Apache 2.0.