TRL + GRPO — AI Learning Course

§ 01

What is TRL?

Full-stack post-training

TRL turns a raw pre-trained LLM into an aligned reasoning agent

TRL (Transformer Reinforcement Learning) is Hugging Face's production-grade library for every stage of post-training, from supervised fine-tuning through offline preference optimisation to scalable online GRPO. Every trainer is a thin subclass of transformers.Trainer, so the API stays familiar.

Infographic 1 — TRL architecture & dependency stack

Minimal wrapper

Every trainer is a thin subclass of transformers.Trainer

Config-first

Hyperparameters in CLI-serialisable @dataclass configs

Memory-efficient

Native PEFT/LoRA, gradient checkpointing, padding-free modes

Agentic

Tool-calling and OpenEnv integrations built-in
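The config-first idea can be illustrated with nothing but the standard library. The sketch below is not TRL's own parser; ToyGRPOConfig, its field defaults, and parse_config are hypothetical names chosen for illustration. It only shows why keeping hyperparameters in a @dataclass makes them easy to expose as command-line flags.

```python
import argparse
from dataclasses import dataclass, fields


@dataclass
class ToyGRPOConfig:
    """Hypothetical config; field names and defaults are illustrative only."""
    learning_rate: float = 1e-6
    num_generations: int = 8
    output_dir: str = "outputs"


def parse_config(config_cls, argv):
    """Build one --flag per dataclass field, then instantiate the config."""
    parser = argparse.ArgumentParser()
    for f in fields(config_cls):
        # Each field's annotation doubles as the argparse type converter.
        parser.add_argument(f"--{f.name}", type=f.type, default=f.default)
    return config_cls(**vars(parser.parse_args(argv)))


cfg = parse_config(ToyGRPOConfig, ["--num_generations", "16"])
print(cfg)
```

Because the dataclass is the single source of truth, the same object can be constructed in a notebook, from a script's arguments, or from a saved file without duplicating defaults.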

§ 02

Alignment pipeline

The modern alignment pipeline

Three stages transform a raw pre-trained LLM into an aligned reasoning agent

Every frontier assistant model goes through this pipeline. TRL covers all three stages with production-grade trainers. The key insight: alignment is not a single step — it is a staged progression from imitation (SFT) to preference learning (DPO/reward model) to reinforcement optimisation (GRPO/PPO).

Infographic 2 — The modern alignment pipeline (redrawn)

Infographic 10 — Framework relationship: breadth vs depth

§ 03

GRPO deep dive

Group Relative Policy Optimisation

GRPO eliminates the value model from PPO — cutting VRAM by roughly 25–50%

GRPO generates a group of completions for each prompt, scores them with a verifier, then uses the group's own average reward as the baseline — no separate value model needed. Removing the critic saves roughly 25–50% VRAM; the "up to 90%" figures you see come from Unsloth's whole-stack optimisation on top of GRPO, not from the algorithm itself. GRPO variants (DAPO, Dr. GRPO, GSPO) now ship in mainstream trainers, and vLLM-backed rollouts inside GRPOTrainer are what make online RL tractable — at lab scale, verl and OpenRLHF are the heavy-duty alternatives.
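The group baseline described above can be sketched in a few lines. This is a minimal illustration, not TRL's implementation: the function name, the eps value, and the choice of population standard deviation are assumptions, and real trainers differ in these details. Setting scale_by_std=False gives the mean-only baseline that some variants prefer.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4, scale_by_std=True):
    """Score each completion relative to its own group (hypothetical helper).

    rewards: verifier scores for all completions of ONE prompt.
    """
    baseline = mean(rewards)                 # the group average replaces the critic
    centred = [r - baseline for r in rewards]
    if not scale_by_std:
        return centred
    spread = pstdev(rewards)                 # assumption: population std
    return [c / (spread + eps) for c in centred]


# Eight completions for one prompt, two of which the verifier accepted.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Completions that beat the group average receive a positive advantage and the rest a negative one. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal.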

Infographic 6 — GRPO: the AI reasoning engine — 5 steps

Infographic 5 — RLVR & patience is all you need

The Concept

If the probability of the correct answer is greater than 0, a model sampled often enough will eventually produce it, even without training ("Patience is All You Need"). RL accelerates this by also learning from failed attempts (reward 0), shifting the output distribution away from wrong answers.
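The "patience" argument is simple arithmetic: if each independent sample is correct with probability p, the chance of at least one success in n samples is 1 - (1 - p)^n. The sketch below assumes independent samples and a fixed p; the function names are hypothetical.

```python
import math


def prob_at_least_one_correct(p, n):
    """Chance that n independent samples contain at least one correct answer."""
    return 1.0 - (1.0 - p) ** n


def samples_needed(p, target=0.99):
    """Smallest n for which the success chance reaches `target`."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - p))


for p in (0.5, 0.1, 0.01):
    n = samples_needed(p)
    print(f"p={p}: {n} samples for a 99% chance of one correct answer")
```

The required sample count grows quickly as p shrinks, which is why pure sampling is impractical for hard problems and why a training signal that raises p matters.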

§ 04

PPO vs GRPO

PPO is the heavyweight predecessor — GRPO is the lightweight successor

Both are on-policy RL algorithms that keep the model close to a reference policy via KL regularisation. The critical difference: PPO needs a separate value model to estimate future rewards. GRPO eliminates this by using group statistics as the baseline instead.

Infographic 7/9 — PPO architecture vs GRPO architecture

VRAM comparison — 8B model, 20K context window

PPO — what the value model does

The value model V(s) predicts total future reward from each token position. This lets PPO compute per-token advantage estimates. The cost is a second trainable network, typically about the size of the policy, which substantially raises VRAM usage and adds instability: the critic has to be trained at the same time as the policy it is trying to evaluate.
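To make the role of V(s) concrete, here is a sketch of generalised advantage estimation (GAE), the estimator PPO implementations commonly use. The text above does not name a specific estimator, so treat GAE, the function name, and the default gamma and lam values as assumptions.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Per-token advantages from critic predictions (illustrative sketch).

    rewards: reward received after each token (often 0 until the last one).
    values:  critic predictions V(s_t), with one extra trailing entry for the
             state after the final token (0.0 when the episode has ended).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: how much better this step went than the critic expected.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Three tokens, reward only at the end, critic predicting 0.5 everywhere.
print(gae_advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.0], lam=1.0))
```

Every entry of values comes from the critic, so the quality of these advantages depends on how well that second network is trained. GRPO avoids this dependency entirely.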

GRPO — why group statistics work

Instead of predicting future reward, GRPO observes multiple complete trajectories for the same prompt. The group average reward is a Monte Carlo estimate of expected reward — simpler, more direct, and needing no separate network. TRL's default is num_generations=8; DeepSeekMath used 64. More samples = more stable advantage estimates.
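A rough way to see why larger groups help, assuming a binary verifier whose reward is 1 with probability p and 0 otherwise: the noise in the group-mean baseline shrinks with the square root of the group size, and so does the chance that a whole group earns identical rewards. Both helper names below are hypothetical.

```python
import math


def baseline_standard_error(p_correct, num_generations):
    """Standard error of the group-mean reward for a 0/1 verifier."""
    return math.sqrt(p_correct * (1.0 - p_correct) / num_generations)


def prob_no_signal(p_correct, num_generations):
    """Chance that all completions get the same reward (all advantages zero)."""
    return p_correct ** num_generations + (1.0 - p_correct) ** num_generations


for g in (8, 64):
    se = baseline_standard_error(0.05, g)
    dead = prob_no_signal(0.05, g)
    print(f"G={g}: baseline std error {se:.3f}, no-signal chance {dead:.2f}")
```

Going from 8 to 64 generations cuts the baseline noise by a factor of sqrt(8), at eight times the rollout cost per prompt. That trade-off is what the num_generations setting controls.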

§ 05

Reward design

Designing reward functions & verifiers

The secret to robust GRPO is a rubric — not a single reward

The biggest practical lesson from training reasoning models: a single "correct/incorrect" reward is too sparse. The model gets no signal until it reaches the final answer. A rubric — a stack of smaller verifiable rewards — gives the model richer feedback at every step and leads to dramatically more stable training.
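A rubric can be sketched as a stack of small reward functions. The functions below follow the convention TRL's GRPOTrainer uses for custom reward functions (completions plus dataset columns as keyword arguments, one float returned per completion), assuming plain-string completions. The tag format, the weights, the `answer` column name, and all function names are hypothetical choices for illustration.

```python
import re

# Hypothetical output format: <think>...</think><answer>...</answer>
THINK_ANSWER = re.compile(
    r"^<think>.+?</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)


def format_reward(completions, **kwargs):
    """Partial credit for following the reasoning format at all."""
    return [0.5 if THINK_ANSWER.match(c.strip()) else 0.0 for c in completions]


def correctness_reward(completions, answer, **kwargs):
    """Full credit when the extracted answer matches the reference exactly."""
    scores = []
    for text, gold in zip(completions, answer):
        m = re.search(r"<answer>(.+?)</answer>", text, re.DOTALL)
        ok = m is not None and m.group(1).strip() == str(gold).strip()
        scores.append(1.0 if ok else 0.0)
    return scores


def brevity_reward(completions, max_chars=2000, **kwargs):
    """Small penalty for completions that run past a length budget."""
    return [0.0 if len(c) <= max_chars else -0.25 for c in completions]


def rubric_reward(completions, answer):
    """Sum the rubric components into one score per completion."""
    parts = [
        format_reward(completions),
        correctness_reward(completions, answer),
        brevity_reward(completions),
    ]
    return [sum(scores) for scores in zip(*parts)]


completions = [
    "<think>2+2</think><answer>4</answer>",   # right format, right answer
    "4",                                       # no format at all
    "<think>hm</think><answer>5</answer>",    # right format, wrong answer
]
scores = rubric_reward(completions, answer=["4", "4", "4"])
print(scores)
```

The third completion still earns the format reward despite its wrong answer, so the model gets a learning signal before it ever solves the task. In practice each component can also be passed to the trainer as a separate reward function, which makes each one easier to track during training.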

Infographic 4 — Designing reward functions & verifiers

Reward component types

§ 06

Unsloth & efficiency

Unsloth RL — efficiency breakthrough

From dual A100s to a single 8 GB consumer GPU — same model, same task

Unsloth patches TRL's GRPOTrainer kernels to avoid materialising massive logit tensors, adds a Standby mode that releases vLLM's KV cache during training to reclaim memory, and applies ultra-long-context optimisations (dynamic flattened sequence chunking plus offloading of log-softmax activations). The reported result: up to 90% VRAM reduction on 8B models.
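A back-of-the-envelope calculation shows why the logit tensor is worth avoiding. The numbers are assumptions for illustration: a 20K-token sequence (matching the context length used in the VRAM comparison), a vocabulary of 128,256 tokens (typical of Llama-3-family 8B models), and fp32 logits. The sketch does not reproduce Unsloth's kernels; it only compares the size of a full logit tensor with that of a single chunk.

```python
def logits_gib(seq_len, vocab_size, bytes_per_value=4, batch_size=1):
    """Memory needed to hold a [batch, seq_len, vocab] logit tensor, in GiB."""
    return batch_size * seq_len * vocab_size * bytes_per_value / 2**30


SEQ_LEN = 20_000        # assumption: 20K-token context
VOCAB_SIZE = 128_256    # assumption: Llama-3-style vocabulary
CHUNK_TOKENS = 1_024    # hypothetical chunk size

full = logits_gib(SEQ_LEN, VOCAB_SIZE)
chunk = logits_gib(CHUNK_TOKENS, VOCAB_SIZE)
print(f"full logits: {full:.2f} GiB per sequence")
print(f"one {CHUNK_TOKENS}-token chunk: {chunk:.2f} GiB")
```

Under these assumptions a single sequence's logits need more memory than an 8 GB card has in total. Processing the sequence chunk by chunk and keeping only the per-token log-probabilities is what makes that step fit.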

Infographic 8/9 — TRL + Unsloth efficiency stack

Desktop-level training — what this enables

Before Unsloth, GRPO training an 8B reasoning model required institutional compute. After Unsloth: a researcher with an RTX 3070 can fine-tune an 8B model with GRPO on math or coding tasks. This democratises reasoning model training the same way QLoRA democratised SFT in 2023. In ~100 training steps on a single consumer GPU you can observe real GRPO gains — the model adopting the reasoning format and improving on the verifier — but R1-class capability takes far more compute.

§ 07

TRL and the GRPO algorithm