Demo · Module 16 · Interactive
Two sliders, two reward signals. Watch DPO chase the gap.
Preference pair
"How do I fix this bug in my Python code?"
CHOSEN · r_c (preferred response)
The error suggests the variable is undefined. Add a check before line 42.
REJECTED · r_r (worse response)
I am not able to help with code questions.
Reward margin (r_c − r_r)
2.00
DPO loss = −log σ(β·margin)
0.13
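The readout above (margin 2.00, loss 0.13) is consistent with β = 1. The panel does not state β, so that value is an inference, and `dpo_loss` is an illustrative name rather than anything from the demo's source. A minimal Python sketch of the displayed formula:

```python
import math

def dpo_loss(margin: float, beta: float = 1.0) -> float:
    """Return -log(sigmoid(beta * margin)).

    Written as softplus(-z) = max(-z, 0) + log1p(exp(-|z|)) so that
    large positive or negative margins do not overflow.
    """
    z = beta * margin
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# beta = 1.0 is an assumption inferred from the readout, not stated by the demo.
readout = dpo_loss(2.00, beta=1.0)
print(f"margin=2.00 beta=1.0 loss={readout:.2f}")

# The same margin under a smaller beta gives a noticeably larger loss.
small_beta = dpo_loss(2.00, beta=0.1)
print(f"margin=2.00 beta=0.1 loss={small_beta:.2f}")
```

With a β in the typical 0.1 to 0.5 range, the same slider positions would show a larger loss than the panel does, because β scales the margin before it enters the sigmoid.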
What DPO replaces. Classic RLHF uses PPO: train a separate reward model on preference data, then optimize the policy to maximize that reward, with a KL penalty against a reference model. Two models, two training loops, a lot of moving parts. DPO (Direct Preference Optimization) shows that you can skip the separate reward model entirely and optimize the policy directly on preference pairs. The loss is

L = −log σ(β · (r_c − r_r))

where r is the policy's implicit log-probability ratio against the reference model, r = log π(y|x) − log π_ref(y|x). The reference model is still needed, but it stays frozen.

What you're seeing: as the chosen response's reward rises above the rejected response's, the loss falls along the curve −log σ(β · margin). That curve is a softplus, not a sigmoid: it is steepest when the margin is negative, that is, when the policy ranks the pair the wrong way round. The gradient magnitude with respect to the margin is β · σ(−β · margin), which approaches β for strongly negative margins, equals β/2 at a margin of zero, and shrinks toward zero as the margin grows. When the model is already confident (β · margin > 4), the loss plateaus near zero (below 0.02) and there's little signal left. Beta (β) controls how much the model is allowed to drift from the reference: low β lets the policy move freely, high β keeps it close. Typical β values are 0.1 to 0.5.
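To make the explanation above concrete, here is a minimal Python sketch that starts from sequence log-probabilities, forms the implicit log-ratios, and evaluates the loss and its gradient magnitude with respect to the margin. The log-probability values are hypothetical, picked so the margin comes out at 2.0; the function names are illustrative, and β = 1 is an assumption chosen to match the demo readout.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def log_ratio(policy_logp: float, ref_logp: float) -> float:
    """Implicit reward r = log pi(y|x) - log pi_ref(y|x)."""
    return policy_logp - ref_logp

def loss_and_grad_mag(r_c: float, r_r: float, beta: float) -> tuple[float, float]:
    """Return (-log sigmoid(beta * margin), |dL/d margin|)."""
    z = beta * (r_c - r_r)
    loss = max(-z, 0.0) + math.log1p(math.exp(-abs(z)))  # softplus(-z)
    grad_mag = beta * sigmoid(-z)                        # shrinks as margin grows
    return loss, grad_mag

# Hypothetical summed token log-probs for one preference pair.
r_c = log_ratio(policy_logp=-12.0, ref_logp=-13.5)  # chosen:   +1.5
r_r = log_ratio(policy_logp=-20.0, ref_logp=-19.5)  # rejected: -0.5

beta = 1.0  # assumed; matches the demo's margin 2.00 -> loss 0.13
loss, grad_mag = loss_and_grad_mag(r_c, r_r, beta)
print(f"margin={r_c - r_r:.2f} loss={loss:.4f} grad_mag={grad_mag:.4f}")

# Sweep the margin to see where the learning signal is strongest.
grads = {}
for margin in (-4.0, 0.0, 4.0):
    _, grads[margin] = loss_and_grad_mag(margin, 0.0, beta)
    print(f"margin={margin:+.1f} grad_mag={grads[margin]:.4f}")
```

The sweep shows the gradient magnitude falling steadily as the margin moves from negative to positive: a pair the policy ranks wrongly contributes the most, and a pair it already ranks confidently contributes almost nothing.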