Demo · Module 16 · Interactive

Two sliders, two reward signals. Watch DPO chase the gap.

L = −log σ(β·(r_c − r_r))
Loss drops as chosen rises above rejected
Plateaus when the model is already confident

Preference pair

"How do I fix this bug in my Python code?"

CHOSEN · r_c (preferred response)

The error suggests the variable is undefined. Add a check before line 42.

REJECTED · r_r (worse response)

I am not able to help with code questions.

Reward for CHOSEN response

implicit reward from policy log-prob

r_c = +1.5

Reward for REJECTED response

implicit reward from policy log-prob

r_r = -0.5

Reward margin (r_c − r_r)

2.00

DPO loss = −log σ(β·margin)

0.13
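
To check the readout above by hand, the sketch below recomputes the loss from the two slider values. It assumes β = 1, which is an inference from the displayed numbers (−log σ(2.00) ≈ 0.13) rather than something this section states. The function name `dpo_loss` is illustrative.

```python
import math

def dpo_loss(r_chosen: float, r_rejected: float, beta: float = 1.0) -> float:
    """Return -log sigmoid(beta * (r_chosen - r_rejected)).

    Uses the identity -log sigmoid(z) = log(1 + exp(-z)) and branches on the
    sign of z so that exp() never overflows.
    """
    z = beta * (r_chosen - r_rejected)
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# Slider values from the demo; beta = 1 is an assumption (see above).
r_c, r_r = 1.5, -0.5
loss = dpo_loss(r_c, r_r, beta=1.0)
print(f"margin = {r_c - r_r:.2f}, loss = {loss:.2f}")  # margin = 2.00, loss = 0.13
```

At a margin of zero the same function returns log 2 ≈ 0.69, the loss of a model with no preference between the two responses.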

What DPO replaces. Classic RLHF trains a separate reward model on preference data, then uses PPO to optimize the policy to maximize that reward, with a KL penalty against a reference model. Two models, two training loops, a lot of moving parts. DPO (Direct Preference Optimization) shows that you can skip the explicit reward model and optimize the policy directly on preference pairs. The loss is L = −log σ(β · (r_c − r_r)), where r is the policy's log-probability ratio against the reference model, r = log π_θ(y|x) − log π_ref(y|x), which acts as an implicit reward.

What you're seeing: as the chosen response's reward rises above the rejected response's, the loss falls along a softplus-shaped curve, log(1 + e^(−β·margin)), not a sigmoid. The slope with respect to the margin is −β·σ(−β·margin). It is steepest when the margin is negative (the model currently prefers the rejected response), equals −β/2 at a margin of zero, and shrinks toward zero as the margin grows. So the model learns fastest on pairs it gets wrong or is unsure about. When the model is already confident (β·margin > 4), the loss plateaus near zero (below 0.02) and there's little signal left. Beta (β) controls how far the model is allowed to drift from the reference: low β lets the policy move freely, high β keeps it close. Typical β values are 0.1 to 0.5.
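
A minimal sketch of where the slider values could come from and how the learning signal changes with the margin. The sequence log-probabilities are made-up numbers chosen to reproduce r_c = +1.5 and r_r = −0.5, and the function names are hypothetical. A real trainer would obtain these log-probs by summing token log-probabilities under the policy and reference models. The gradient used here is the derivative of −log σ(β·margin) with respect to the margin, which is −β·σ(−β·margin).

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def implicit_reward(logp_policy: float, logp_ref: float) -> float:
    """Log-probability ratio of policy vs. reference for one response."""
    return logp_policy - logp_ref

def dpo_loss_and_grad(margin: float, beta: float):
    """Return (loss, dloss/dmargin) for loss = -log sigmoid(beta * margin)."""
    z = beta * margin
    # Stable softplus(-z) = log(1 + exp(-z)).
    loss = max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
    grad = -beta * sigmoid(-z)
    return loss, grad

# Illustrative (made-up) sequence log-probs, not taken from any real model.
r_c = implicit_reward(logp_policy=-12.0, logp_ref=-13.5)  # +1.5
r_r = implicit_reward(logp_policy=-9.5, logp_ref=-9.0)    # -0.5
margin = r_c - r_r                                        # 2.0

# Same margin, different beta: smaller beta gives a flatter, higher loss.
for beta in (0.1, 0.5, 1.0):
    loss, grad = dpo_loss_and_grad(margin, beta)
    print(f"beta={beta:<4} loss={loss:.3f} dL/dmargin={grad:.3f}")

# Same beta, different margins: the gradient is largest for negative margins.
for m in (-4.0, 0.0, 4.0):
    _, grad = dpo_loss_and_grad(m, beta=1.0)
    print(f"margin={m:+.1f} |dL/dmargin|={abs(grad):.3f}")
```

With β = 1 the gradient magnitude is about 0.98 at a margin of −4, exactly 0.5 at zero, and about 0.02 at +4, which is why confident pairs contribute almost nothing to the update.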