§ 1
What this lesson covers.
This module is one of 42 in the curriculum. Below is the canonical interactive lesson — tabs, cards, and diagrams from the source repo, rendered inside the course shell. There is no audio narration for this module - it ships as text + interactive lesson only.
If you prefer to read first and play with the demos after, the interactive lesson sits below this section.
Interactive lesson · ported from Gemini track
THE PROBLEM · Text prompts can't describe "this exact face"
Text-to-image works when you can describe what you want in words ("a Victorian-era portrait of a woman with red hair"). But when you want this specific person or this specific product generated in new scenes, text isn't enough. Earlier solutions like DreamBooth required fine-tuning the entire model per subject (hours of compute, GBs of weights). IP-Adapter lets you do it with one reference image and no per-subject training.
IP-ADAPTER · Decoupled cross-attention for image prompts
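Ye et al. (Tencent, 2023) introduced IP-Adapter. The trick: add a parallel cross-attention path alongside the existing text cross-attention. When you provide a reference image, it is embedded by a frozen CLIP image encoder and fed through this second path, which reuses the text path's queries but has its own key and value projections. The two attention outputs are summed, with a scale factor on the image path that controls how strongly the reference steers the result. Only the adapter weights are trained: the new key/value projections plus a small projection layer for the image embedding (about 22M parameters for SD1.5, ~50MB checkpoint). The base model stays frozen, so the same adapter works on top of checkpoints fine-tuned from that base model and gives zero-shot subject preservation.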
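The decoupled cross-attention idea can be sketched in a few lines of plain Python. This is a toy illustration, not the authors' implementation: the helper names (`matmul`, `attention`, `decoupled_cross_attention`), the weight dictionary keys, and the `scale` parameter name are all assumptions made for the example, and the matrices are tiny hand-written lists rather than real model tensors.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def decoupled_cross_attention(latent, text_tokens, image_tokens, w, scale=1.0):
    """Sum a text cross-attention path and a parallel image path.

    Both paths share the query projection w["q"]. The image path has its
    own key/value projections (w["k_img"], w["v_img"]); in this sketch
    those stand in for the adapter weights that would be trained.
    """
    q = matmul(latent, w["q"])
    text_out = attention(q, matmul(text_tokens, w["k_txt"]),
                         matmul(text_tokens, w["v_txt"]))
    image_out = attention(q, matmul(image_tokens, w["k_img"]),
                          matmul(image_tokens, w["v_img"]))
    return [[t + scale * i for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(text_out, image_out)]

# Toy inputs (hypothetical values): identity projections, one latent query,
# two text tokens, and one image token.
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = {name: identity for name in ("q", "k_txt", "v_txt", "k_img", "v_img")}
latent = [[1.0, 0.0]]
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_tokens = [[0.5, 0.5]]

text_only = decoupled_cross_attention(latent, text_tokens, image_tokens, weights, scale=0.0)
combined = decoupled_cross_attention(latent, text_tokens, image_tokens, weights, scale=1.0)
```

Setting `scale` to 0 recovers the text-only output, which is why the adapter can be bolted onto a frozen model without changing its behaviour when no reference image is in play.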
MULTI-REFERENCE · More than one reference image: subject-driven generation
Modern systems take this further. Provide N reference images of the same subject from different angles; the model averages or attends across them, producing far more consistent identity preservation. HiDream-O1 has native multi-reference support (up to 10 reference images). The technique generalizes beyond faces: products, mascots, characters, art styles, brand assets.
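The "averages or attends across them" step can be illustrated with two simple pooling strategies. This is a minimal sketch under stated assumptions: the function names are hypothetical, the embeddings are short hand-written vectors rather than real CLIP outputs, and it does not describe how any particular system (including HiDream-O1) combines its references.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def average_reference_embeddings(embeddings):
    """Option 1 (average): pool N reference embeddings into one vector."""
    if not embeddings:
        raise ValueError("need at least one reference embedding")
    unit = [l2_normalize(e) for e in embeddings]
    dim = len(unit[0])
    if any(len(e) != dim for e in unit):
        raise ValueError("all embeddings must have the same dimension")
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return l2_normalize(mean)

def stack_reference_tokens(token_sets):
    """Option 2 (attend): concatenate the tokens of every reference so a
    cross-attention layer can weight them individually."""
    return [token for tokens in token_sets for token in tokens]

# Hypothetical embeddings for two views of the same subject.
pooled = average_reference_embeddings([[1.0, 0.0], [0.0, 1.0]])
stacked = stack_reference_tokens([[[1.0, 0.0]], [[0.0, 1.0], [0.5, 0.5]]])
```

Averaging keeps the conditioning size fixed regardless of N, while stacking lets attention pick the most relevant view per query at the cost of a longer token sequence.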
WHEN TO REACH FOR IT · IP-Adapter vs DreamBooth vs LoRA: the choice tree
IP-Adapter: one or a few reference images, need it working in 5 minutes, no training.
LoRA fine-tune: 10-50 reference images, want best quality, willing to train for an hour.
Full DreamBooth: 100+ images, production-grade identity preservation across thousands of generations, willing to spend GPU-hours.
Combine all three for the deepest fidelity: LoRA for style, IP-Adapter for identity, DreamBooth for a flagship persona.
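The choice tree above can be written as a small function. This is only a sketch of the rule of thumb: the function name `choose_method` and its parameters are hypothetical, and the handling of counts the text does not cover (fewer than 10 images, or 51 to 99 images) is an assumption labelled in the comments.

```python
def choose_method(num_reference_images, can_train=True):
    """Pick a subject-preservation technique from the lesson's rule of thumb."""
    if num_reference_images < 1:
        raise ValueError("need at least one reference image")
    # No training budget: IP-Adapter is the only option that needs none.
    if not can_train:
        return "IP-Adapter"
    # Assumption: "one or a few" is read as fewer than 10 images.
    if num_reference_images < 10:
        return "IP-Adapter"
    # Assumption: the 51-99 range, which the text leaves open, falls to LoRA.
    if num_reference_images < 100:
        return "LoRA fine-tune"
    return "Full DreamBooth"

recommendation = choose_method(30)
```

The thresholds are deliberately coarse; quality requirements and available GPU time would shift them in practice.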