Module f-ipadapter · Image · 8 min

IP-Adapter and personalization.

Subject-driven generation, identity preservation across new scenes.

Reading time: 8 min · Audio: none · Prerequisites: 14 · Source: Track A · Gemini
§ 1

What this lesson covers.

This module is one of 42 in the curriculum. Below is the canonical interactive lesson (tabs, cards, and diagrams from the source repo), rendered inside the course shell. There is no audio narration for this module; it ships as text plus the interactive lesson only.

If you prefer to read first and play with the demos after, the interactive lesson sits below this section.

§ 2

The lesson itself.

Interactive lesson · ported from Gemini track
Image · Conditioning

IP-Adapter & Subject-Driven Personalization

Identity preservation across new scenes · image-prompted generation

THE PROBLEM

Text prompts can't describe "this exact face"

Text-to-image works when you can describe what you want in words ("a Victorian-era portrait of a woman with red hair"). But when you want this specific person or this specific product generated in new scenes, text isn't enough. Earlier solutions like DreamBooth required fine-tuning the entire model per subject (hours of compute, GBs of weights). IP-Adapter lets you do it with one reference image and no per-subject training: the adapter itself is trained once, ahead of time, and then reused for any subject.
IP-ADAPTER

Decoupled cross-attention for image prompts

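Ye et al. (Tencent, 2023) introduced IP-Adapter. The trick: add a parallel cross-attention path alongside the existing text cross-attention. When you provide a reference image, a CLIP image encoder embeds it, and the resulting image tokens go through this second path, which has its own key and value projections but shares the query with the text path. The two attention outputs are summed, with a scale factor on the image branch that controls how strongly the reference steers the result. The base model stays frozen; only the new adapter weights are trained (about 22M parameters for SD1.5, a ~50MB checkpoint). Once trained, the adapter works on top of checkpoints derived from the same base model, giving subject preservation with no per-subject fine-tuning.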
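The decoupled cross-attention described above can be sketched in dependency-free Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it is single-head, uses plain lists instead of tensors, and the names (`decoupled_cross_attention`, `w_k_img`, `w_v_img`, `scale`) are made up for this example.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention for a single head."""
    d = len(q[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)

def decoupled_cross_attention(x, text_tokens, image_tokens,
                              w_q, w_k_text, w_v_text,
                              w_k_img, w_v_img, scale=1.0):
    """Text path plus a parallel image path, summed.

    x            : latent features, one row per spatial position
    text_tokens  : text-encoder output, one row per token
    image_tokens : image-encoder output, one row per token
    w_k_img/w_v_img are the new (trainable) projections; the rest
    stand in for the frozen base-model weights.
    """
    q = matmul(x, w_q)  # one query, shared by both paths
    text_out = attention(q, matmul(text_tokens, w_k_text),
                         matmul(text_tokens, w_v_text))
    image_out = attention(q, matmul(image_tokens, w_k_img),
                          matmul(image_tokens, w_v_img))
    # Sum the two paths; `scale` weights the image branch.
    return [[t + scale * i for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(text_out, image_out)]
```

Setting `scale=0.0` recovers the original text-only behaviour, which is why the adapter can be bolted onto a frozen model without disturbing it; raising `scale` lets the reference image pull harder against the text prompt.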
MULTI-REFERENCE

More than one reference image — subject-driven generation

Modern systems take this further. Provide N reference images of the same subject from different angles; the model averages or attends across them, which typically preserves identity more consistently than a single reference can. HiDream-O1 has native multi-reference support (up to 10 reference images). The technique generalizes beyond faces: products, mascots, characters, art styles, brand assets.
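The two strategies named above, averaging versus attending across references, differ in what the image branch gets to see. A minimal sketch, assuming each reference image has already been encoded into a list of token vectors; the helper names are hypothetical and not taken from any particular system:

```python
def average_references(per_image_tokens):
    """Average position-by-position across references.

    N images in, one token sequence out. Assumes every reference
    yields the same number of tokens with the same dimension.
    """
    n = len(per_image_tokens)
    return [
        [sum(vals) / n for vals in zip(*tokens_at_pos)]
        for tokens_at_pos in zip(*per_image_tokens)
    ]

def concatenate_references(per_image_tokens):
    """Keep every token so cross-attention can attend across all references."""
    return [tok for tokens in per_image_tokens for tok in tokens]

# Two reference images, two tokens each, two dimensions per token.
refs = [
    [[1.0, 0.0], [2.0, 2.0]],
    [[0.0, 1.0], [4.0, 0.0]],
]
averaged = average_references(refs)          # 2 tokens
concatenated = concatenate_references(refs)  # 4 tokens
```

Averaging keeps the image-branch sequence length fixed no matter how many references you supply, but it can wash out view-specific detail. Concatenation keeps that detail available to attention at the cost of a sequence that grows with N.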
WHEN TO REACH FOR IT

IP-Adapter vs DreamBooth vs LoRA — the choice tree

IP-Adapter: one or a few reference images, need it working in 5 minutes, no training.

LoRA fine-tune: 10-50 reference images, want best quality, willing to train for an hour.

Full DreamBooth: 100+ images, production-grade identity preservation across thousands of generations, willing to spend GPU-hours.

Combine all three for the deepest fidelity: LoRA for style, IP-Adapter for identity, DreamBooth for a flagship persona.
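The choice tree above can be written down as a small helper. This is only a sketch of the rule of thumb: the function name and the exact cut-offs between the stated ranges (for example, what to do with 60 images) are assumptions made for illustration, not part of the lesson.

```python
def choose_method(num_reference_images, training_hours):
    """Pick a personalization approach from image count and training budget.

    Thresholds follow the lesson's ranges; the gaps between them
    (5-9 and 51-99 images) are filled in by assumption.
    """
    if num_reference_images < 1:
        raise ValueError("need at least one reference image")
    # No time to train, or too few images to train on: use the adapter.
    if training_hours < 1 or num_reference_images < 10:
        return "ip-adapter"
    # Large image set and a multi-hour GPU budget: full fine-tune.
    if num_reference_images >= 100 and training_hours > 1:
        return "dreambooth"
    # Otherwise a LoRA fine-tune is the middle ground.
    return "lora"
```

For example, `choose_method(1, 0)` returns `"ip-adapter"`, `choose_method(30, 1)` returns `"lora"`, and `choose_method(200, 8)` returns `"dreambooth"`.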