Demo · Module 16 · Interactive

Two sliders set the reward signals. Watch DPO chase the gap.

L = -log σ(β·(r_chosen - r_rejected))
Loss drops as chosen rises above rejected
Plateaus when the model is already confident
Preference pair
"How do I fix this bug in my Python code?"
CHOSEN · r_c (preferred response)
The error suggests the variable is undefined. Add a check before line 42.
REJECTED · r_r (worse response)
I am not able to help with code questions.
Reward for CHOSEN response
implicit reward from policy log-prob
r_c = +1.5
Reward for REJECTED response
implicit reward from policy log-prob
r_r = -0.5
Reward margin (r_c − r_r)
2.00
DPO loss = −log σ(β·margin)
0.13
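The readout above can be reproduced by hand. The sketch below is a minimal check, not the demo's actual code: it assumes β = 1 (the demo does not state its β, but that value matches the displayed 0.13), and the function names are illustrative.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function σ(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss(r_chosen: float, r_rejected: float, beta: float = 1.0) -> float:
    """DPO loss for one preference pair: -log σ(β · (r_c - r_r)).

    beta defaults to 1.0, which is an assumption made to match the readout.
    """
    margin = r_chosen - r_rejected
    return -math.log(sigmoid(beta * margin))


# Slider values shown in the demo.
r_c, r_r = 1.5, -0.5
margin = r_c - r_r
loss = dpo_loss(r_c, r_r)

print(f"margin = {margin:.2f}")
print(f"loss   = {loss:.2f}")
```

With these slider values the margin is 2.00 and the loss rounds to 0.13, matching the panel.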
What DPO replaces. Classic RLHF uses PPO: train a separate reward model on preference data, then optimize the policy to maximize that reward, with a KL penalty against a reference model. That means two models, two training loops, and a lot of moving parts. DPO (Direct Preference Optimization) shows that you can skip the separate reward model and optimize the policy directly on preference pairs. The loss is L = -log σ(β · (r_c - r_r)), where r is the policy's implicit reward: the log-probability ratio of the policy against the reference model for that response.

What you're seeing: as the chosen response's reward rises above the rejected response's, the loss falls along a softplus-shaped curve (the negative log of a sigmoid). The slope with respect to the margin has magnitude β · σ(-β · margin): it approaches β when the margin is strongly negative (the model prefers the rejected response), equals β/2 at a margin of zero, and shrinks toward zero as the margin grows. The learning signal is therefore strongest on pairs the model ranks wrongly or is unsure about. When the model is already confident (β · margin > 4, where the loss is below 0.02), the loss plateaus near zero and there's little signal left.

Beta (β) controls how much the model is allowed to drift from the reference: low β lets the policy move freely, high β keeps it close. Typical β values are 0.1 to 0.5.
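To make the pieces concrete, here is a minimal sketch of how the implicit rewards and the loss could be computed from sequence log-probabilities. The log-probability values are made up for illustration (chosen so the rewards match the slider defaults), and the function names are hypothetical; a real training loop would sum token log-probs from the policy and reference models and run on batches. The sketch also evaluates the derivative of the loss with respect to the margin, so the slope can be checked numerically at different margins.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def implicit_reward(logp_policy: float, logp_ref: float) -> float:
    """Log-probability ratio of the policy against the reference model."""
    return logp_policy - logp_ref


def dpo_loss(margin: float, beta: float) -> float:
    """-log σ(β · margin), written as softplus(-β · margin) for stability."""
    return softplus(-beta * margin)


def dpo_slope(margin: float, beta: float) -> float:
    """d(loss)/d(margin) = -β · σ(-β · margin)."""
    return -beta / (1.0 + math.exp(beta * margin))


# Hypothetical sequence log-probs (illustrative numbers, not real model output).
r_c = implicit_reward(logp_policy=-12.0, logp_ref=-13.5)  # chosen response
r_r = implicit_reward(logp_policy=-20.5, logp_ref=-20.0)  # rejected response
margin = r_c - r_r

# Same margin, different β: a larger β scales the margin inside the sigmoid.
for beta in (0.1, 0.5, 1.0):
    print(f"beta={beta:<4} loss={dpo_loss(margin, beta):.4f}")

# Slope of the loss at several margins, with β = 1.
for m in (-6.0, 0.0, 2.0, 6.0):
    print(f"margin={m:+.1f} slope={dpo_slope(m, 1.0):+.4f}")
```

Writing the loss as a softplus avoids taking the log of a sigmoid that has underflowed to zero when the margin is strongly negative.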