Module 16 · Training · 10 min

DPO, KTO, ORPO: the post-PPO landscape.

Why preference learning ate the alignment world.

Reading time: 10 min · Audio: narration available · Prerequisites: 03 · Source: Track A · Gemini
§ 1

What this lesson covers.

This module is one of 42 in the curriculum. Below is the canonical interactive lesson (tabs, cards, and diagrams from the source repo), rendered inside the course shell. An audio narration runs alongside it: the sticky player at the top of the page plays the full Module 16 clip.

If you prefer to read first and play with the demos after, the interactive lesson sits below this section. If you'd rather hear it narrated while you scroll, hit play on the sticky audio bar at the top — or just let it autoplay.

§ 2

The lesson itself.


Layer 11 — RLHF & Alignment

The Taste Tester & The Chef

Course File: L11
The "Taste Tester" Analogy
How do you stop an AI from being rude?
When an AI finishes its "Base Training" (pretraining on a huge amount of internet text), it becomes like an extremely talented Master Chef who can cook anything, including poison, dirt, or burnt toast! If you ask a raw base model a question, it might insult you or just continue your text with internet spam instead of answering.

To fix this, scientists use RLHF (Reinforcement Learning from Human Feedback). They hire a "Taste Tester." The Chef prepares two meals, and the Taste Tester decides which one is better. Over time, the Chef learns what humans tend to prefer and becomes far less likely to cook the poison.
Interactive: Be the Taste Tester!
You are the Human Feedback node. The AI will give you two responses. Click the one that is more helpful and safe, then watch how your feedback lowers the AI's toxicity stat.
Sample prompt (round 1 of 5): "Hey, how do I unlock someone's car if I lost the keys?" Meters shown: Toxicity, Helpfulness.
The Reward Model
Automating the human
Humans are slow and expensive, so companies don't have them judge every response during training. Instead, they use the human clicks to train a Reward Model: a separate "critic" AI (often a smaller one) that learns to predict which response a human would prefer.

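Once the Reward Model is trained, it can act as the Taste Tester automatically, scoring responses far faster than any human could. The main AI is then tuned against those scores with a reinforcement learning algorithm, classically PPO.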
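To make "training a critic from clicks" concrete, here is a minimal sketch, assuming a toy linear reward model over two made-up features. The feature names, data, learning rate, and helper names are all hypothetical and not from the lesson. It uses the common pairwise (Bradley-Terry style) loss, which is small when the chosen response scores higher than the rejected one.

```python
import math

def pairwise_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form."""
    margin = r_chosen - r_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def score(weights, features):
    """Toy linear reward: a weighted sum of the response's features."""
    return sum(w * f for w, f in zip(weights, features))

def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Fit weights so each chosen response outscores its rejected partner."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = score(weights, diff)
            # d(loss)/d(margin) = -(1 - sigmoid(margin)); step the opposite way.
            push = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            weights = [w + lr * push * d for w, d in zip(weights, diff)]
    return weights

# Hypothetical features per response: [helpfulness_hint, rudeness_hint].
# Each pair is (features of the clicked response, features of the other one).
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.3, 0.9]),
]
weights = train_reward_model(pairs, n_features=2)
print("learned weights:", weights)
```

After training on these toy clicks, the helpfulness weight ends up positive and the rudeness weight negative, so the critic can rank new responses without a human in the loop. A real reward model is a neural network over text, not a two-feature linear model.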
DPO
Direct Preference Optimization
In 2023, researchers introduced a way to skip the separate "Reward Model" entirely. DPO (Direct Preference Optimization) is a mathematical shortcut that takes the human's "Choice A over Choice B" and trains the AI on it directly, using a simple classification-style loss instead of a reinforcement learning loop.

It is typically more stable and needs less compute than PPO-based RLHF, and it has become one of the most widely used methods for aligning AI models.
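The shortcut can be written as a single loss function. The sketch below follows the standard DPO formulation, which compares the model being trained against a frozen reference copy and scales the difference by a strength parameter beta. Neither the reference model nor beta is mentioned in the lesson above, and the function name, argument names, and example log-probabilities are illustrative assumptions.

```python
import math

def log_sigmoid(x):
    """log(sigmoid(x)), written in a numerically stable form."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given summed log-probabilities.

    The implicit reward of a response is beta times how much more likely
    the policy makes it than the frozen reference model does.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -log_sigmoid(chosen_reward - rejected_reward)

# Before any training the policy equals the reference, so the margin is zero.
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Made-up numbers: the policy now favors the chosen answer more than the
# reference does, and the rejected answer less.
later = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(f"loss at start: {start:.4f}, after improvement: {later:.4f}")
```

The loss starts at ln 2 (about 0.693) when the policy matches the reference and falls as the policy shifts probability toward the chosen response. In practice the log-probabilities come from running both models over each response, which is omitted here.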
§ DEMO

Try it: DPO loss.

Two reward sliders, sigmoid loss curve, live entropy. Watch loss drop as the chosen response's reward rises above the rejected response's.
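If the demo is not available, the slider behavior can be approximated offline. This is a rough sketch, assuming the demo's curve is the sigmoid loss on the gap between the two rewards. The function name and slider values are hypothetical, and the demo's "live entropy" readout is not reproduced.

```python
import math

def slider_loss(chosen_reward, rejected_reward):
    """-log(sigmoid(chosen - rejected)), written in a numerically stable form."""
    margin = chosen_reward - rejected_reward
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hold the rejected slider fixed and sweep the chosen slider upward.
rejected = 0.0
rows = [(chosen, slider_loss(chosen, rejected))
        for chosen in (-2.0, -1.0, 0.0, 1.0, 2.0)]

for chosen, loss in rows:
    print(f"chosen={chosen:+.1f}  rejected={rejected:+.1f}  loss={loss:.3f}")
```

With both sliders equal the loss is ln 2 (about 0.693), and it shrinks toward zero as the chosen reward pulls ahead.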

DPO Loss · interactive
