What this lesson covers.
This module is one of 42 in the curriculum. Below is the canonical interactive lesson (tabs, cards, and diagrams from the source repo) rendered inside the course shell. An audio narration runs alongside it: the sticky player at the top of the page plays the full Module 16 clip.
The lesson itself.
Layer 11 — RLHF & Alignment
The Taste Tester & The Chef
To fix this, researchers use RLHF (Reinforcement Learning from Human Feedback). They hire a "Taste Tester": the Chef prepares two meals, and the Taste Tester picks the better one. Over many such comparisons, the Chef learns which meals humans prefer and learns to refuse to cook the poison.
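Human Taste Testers are slow, so their choices are used to train a Reward Model: a second AI that predicts which meal a human would pick. Once the Reward Model is trained, it can act as the Taste Tester 10,000 times a second, rapidly teaching the main AI how to behave.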
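One common way to train a Reward Model from Taste Tester choices is a pairwise (Bradley-Terry) preference loss: the model is nudged until the preferred meal scores higher than the rejected one. The sketch below is a toy illustration, not the lesson's own code. The two features (tastiness, toxicity), the linear reward, and the names `train_reward_model` and `reward` are all assumptions made up for this example.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reward(w, features):
    """Linear reward: a weighted sum of the meal's features (toy assumption)."""
    return sum(wi * fi for wi, fi in zip(w, features))


def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    """Fit weights so each chosen meal scores higher than its rejected meal.

    pairs: list of (chosen_features, rejected_features) from the Taste Tester.
    Minimizes -log(sigmoid(reward(chosen) - reward(rejected))) by gradient descent.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Gradient of the loss w.r.t. w is -(1 - sigmoid(margin)) * diff,
            # so we step in the +diff direction, scaled by how wrong we are.
            step = lr * (1.0 - sigmoid(margin))
            w = [wi + step * di for wi, di in zip(w, diff)]
    return w


# Hypothetical meals described as [tastiness, toxicity].
pairs = [
    ([0.9, 0.0], [0.8, 1.0]),  # tester prefers the non-toxic meal
    ([0.7, 0.0], [0.2, 0.0]),  # tester prefers the tastier meal
    ([0.5, 0.1], [0.9, 0.9]),  # safe-but-plain beats tasty-but-toxic
]
weights = train_reward_model(pairs, n_features=2)

safe_meal, poison_meal = [0.6, 0.0], [0.95, 1.0]
print(reward(weights, safe_meal) > reward(weights, poison_meal))
```

Once trained, `reward` can score any new meal instantly, which is the sense in which the Reward Model stands in for the human Taste Tester.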
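A newer method, DPO (Direct Preference Optimization), skips the separate Reward Model and trains the main AI directly on the Taste Tester's choices. It is typically more stable, needs less compute, and is now widely used for aligning AI models.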
Try it: DPO loss.
The demo has two reward sliders, a sigmoid loss curve, and a live entropy readout. Watch the loss drop as the chosen response's reward rises above the rejected response's.
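The relationship between the two rewards and the loss can be sketched in a few lines. This is a minimal illustration assuming the standard DPO form, loss = -log(sigmoid(chosen reward - rejected reward)), where each reward is beta times the gap between the policy's and a frozen reference model's log-probability of the response. The function names and the default `beta=0.1` are assumptions for this example, not taken from the demo's source.

```python
import math


def implicit_reward(policy_logp: float, ref_logp: float, beta: float = 0.1) -> float:
    """DPO's implicit reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logp - ref_logp)


def dpo_loss(chosen_reward: float, rejected_reward: float) -> float:
    """-log(sigmoid(chosen - rejected)), written to avoid overflow."""
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) == log(1 + exp(-m)); pick the branch where exp() stays small.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Move the "chosen" slider from below to above the "rejected" slider.
for chosen in (-2.0, 0.0, 2.0):
    print(chosen, round(dpo_loss(chosen, rejected_reward=0.0), 4))
```

With both rewards equal, the loss is log 2 (about 0.693). It shrinks toward 0 as the chosen reward pulls ahead and grows roughly linearly as the rejected reward pulls ahead, which is the shape of the sigmoid loss curve.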