§ 1
What this lesson covers.
This module is one of 42 in the curriculum. Below is the canonical interactive lesson — tabs, cards, and diagrams from the source repo, rendered inside the course shell. There is no audio narration for this module - it ships as text + interactive lesson only.
If you prefer to read first and play with the demos after, the interactive lesson sits below this section.
Interactive lesson · ported from Gemini track
FROM CASCADES TO UNIFIED
The bygone era of three-model pipelines
Stable Diffusion 1.x and 2.x chained three separate models: a CLIP text encoder to embed the prompt, a U-Net to denoise in latent space, and a VAE decoder to turn latents into pixels. Each model was trained separately. SDXL added a second, larger text encoder (OpenCLIP ViT-bigG) for more capacity. The boundaries between these stages were a source of bugs and an obstacle to scaling. Newer architectures such as FLUX, ERNIE-Image, and HiDream-O1 progressively collapse the pipeline toward a single transformer.
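The three-stage flow can be traced as a shape-only sketch. Nothing below runs a real model: `Shape`, `encode_text`, `denoise`, `decode`, and `generate` are hypothetical stand-ins, and the numbers assume the Stable Diffusion 1.x configuration (77 prompt tokens of width 768, 4 latent channels, 8x spatial downsampling) with an illustrative 50-step schedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """Stand-in for a tensor: only the dimensions are tracked."""
    dims: tuple


def encode_text(prompt: str) -> Shape:
    # Stage 1: text encoder. SD 1.x pads or truncates the prompt to
    # 77 tokens, each embedded in 768 dimensions.
    return Shape((77, 768))


def denoise(latent: Shape, text_emb: Shape, steps: int) -> Shape:
    # Stage 2: the U-Net is called once per step, conditioned on the
    # text embedding. The latent keeps its shape throughout.
    for _ in range(steps):
        latent = Shape(latent.dims)  # placeholder for one U-Net call
    return latent


def decode(latent: Shape) -> Shape:
    # Stage 3: the VAE decoder upsamples 8x and maps 4 latent
    # channels to 3 RGB channels.
    _, height, width = latent.dims
    return Shape((3, height * 8, width * 8))


def generate(prompt: str, height: int = 512, width: int = 512, steps: int = 50) -> Shape:
    text_emb = encode_text(prompt)
    latent = Shape((4, height // 8, width // 8))  # starts as pure noise
    latent = denoise(latent, text_emb, steps)
    return decode(latent)


image = generate("a lighthouse at dusk")
print(image.dims)  # (3, 512, 512)
```

Each function boundary here corresponds to a separately trained model, which is where mismatched shapes, scaling factors, or tokenizers tend to surface.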
FLUX
MMDiT: double-stream diffusion transformer
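Black Forest Labs' FLUX (2024) uses a Multimodal Diffusion Transformer (MMDiT). In its double-stream blocks, text and image tokens keep separate weights but attend to each other through joint attention over the concatenated sequence, rather than through dedicated cross-attention layers. Prompts are still embedded by external T5 and CLIP encoders, and a VAE still maps between pixels and latents, but the denoiser itself is one large transformer. FLUX.1 schnell is step-distilled so that sampling takes about 4 steps instead of about 50. FLUX.2 (2025) scales further. Weights are open for the schnell and dev variants; the pro variant is API-only.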
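A minimal sketch of how two streams can share one attention pass, as MMDiT-style double-stream blocks are generally described: each modality projects its tokens with its own Q/K/V weights, then attention runs once over the concatenated sequence. This is a single-head toy in pure Python with random weights; it omits modulation, output projections, MLPs, and positional encoding, and every function name is an assumption made for illustration, not FLUX's code.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def make_stream_weights(dim, rng):
    """Q/K/V projections owned by one stream (text or image)."""
    return {name: random_matrix(dim, dim, rng) for name in ("q", "k", "v")}


def joint_attention(text_tokens, image_tokens, text_w, image_w):
    # Each stream projects its own tokens with its own weights;
    # list "+" concatenates the two sequences row-wise.
    q = matmul(text_tokens, text_w["q"]) + matmul(image_tokens, image_w["q"])
    k = matmul(text_tokens, text_w["k"]) + matmul(image_tokens, image_w["k"])
    v = matmul(text_tokens, text_w["v"]) + matmul(image_tokens, image_w["v"])
    # One attention pass over the joint sequence: every token can
    # attend to every text token and every image token.
    scale = 1.0 / math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s * scale for s in row]) for row in scores]
    mixed = matmul(weights, v)
    # Split back so each stream continues with its own parameters.
    n_text = len(text_tokens)
    return mixed[:n_text], mixed[n_text:], weights


rng = random.Random(0)
dim = 4
text = random_matrix(3, dim, rng)   # 3 text tokens
image = random_matrix(5, dim, rng)  # 5 image tokens
text_out, image_out, weights = joint_attention(
    text, image, make_stream_weights(dim, rng), make_stream_weights(dim, rng)
)
print(len(text_out), len(image_out), len(weights), len(weights[0]))  # 3 5 8 8
```

The 8 x 8 weight matrix is the point: image rows carry nonzero weight on text columns and vice versa, so the streams exchange information without a separate cross-attention layer.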
ERNIE-IMAGE
Single-stream DiT with built-in prompt enhancer
Baidu's ERNIE-Image (2025) is an 8B-parameter single-stream diffusion transformer, released openly under Apache-2.0. A Turbo variant is distilled to 8 inference steps via DMD + RL. The model ships with a Reasoning-Driven Prompt Agent that rewrites a casual prompt into a structured, layout-aware version before passing it to the diffusion model. It is strong on long-text rendering (LongText-Bench 0.97+), which makes technical labels legible in-image.
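To show where a prompt-rewriting stage sits relative to the diffusion model, here is a toy two-stage wrapper. Everything in it is hypothetical: `StructuredPrompt`, `toy_enhancer`, `fake_diffusion`, and the "Scene / Label" format are invented for illustration and are not ERNIE-Image's actual agent, schema, or API. The real agent is a learned model; the stand-in below only pulls quoted strings out as labels.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StructuredPrompt:
    """Hypothetical container for an enhanced prompt."""
    scene: str
    text_labels: List[str] = field(default_factory=list)  # strings to render in-image

    def render(self) -> str:
        lines = [f"Scene: {self.scene}"]
        for number, label in enumerate(self.text_labels, start=1):
            lines.append(f'Label {number}: "{label}"')
        return "\n".join(lines)


def toy_enhancer(casual_prompt: str) -> StructuredPrompt:
    # Rule-based stand-in for the rewriting agent: treat every
    # double-quoted span as text that must appear in the image.
    labels = re.findall(r'"([^"]+)"', casual_prompt)
    return StructuredPrompt(scene=casual_prompt, text_labels=labels)


def generate(
    casual_prompt: str,
    enhancer: Callable[[str], StructuredPrompt],
    diffusion_model: Callable[[str], str],
) -> str:
    # The enhancer runs first; the diffusion model only ever sees
    # the structured rewrite, never the casual prompt.
    structured = enhancer(casual_prompt)
    return diffusion_model(structured.render())


def fake_diffusion(prompt: str) -> str:
    # Placeholder for the image model: reports what it was given.
    return f"<image conditioned on {len(prompt.splitlines())} prompt lines>"


result = generate('a pump diagram labeled "inlet" and "outlet"', toy_enhancer, fake_diffusion)
print(result)  # <image conditioned on 3 prompt lines>
```

Passing the enhancer as a callable keeps the two stages swappable, so the same diffusion call can be tested with and without rewriting.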
HIDREAM-O1
Pixel-level unified transformer: no VAE, no separate text encoder
HiDream.ai's O1 (2026-05-08) goes the furthest: a single Pixel-level Unified Transformer (UiT) natively encodes raw pixels, text, and task-specific conditions in one shared token space, with no external VAE and no separate text encoder. It has 8B parameters, is MIT-licensed, and generates natively at 2048×2048. One model supports text-to-image, instruction editing, and multi-reference subject personalization. At the time of writing it ranked #8 on the Artificial Analysis Text-to-Image Arena.
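One common way to place raw pixels in a token sequence without a VAE is to cut the image into fixed-size patches and flatten each patch into a token. The sketch below shows that idea only; the patch size of 16, the modality tags, and the function names are assumptions for illustration, since the lesson does not state how O1 tokenizes pixels.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flat patch tokens, row-major."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    token.extend(image[y][x])  # append this pixel's channels
            tokens.append(token)
    return tokens


def unpatchify(tokens, height, width, patch, channels=3):
    """Inverse of patchify: the mapping is lossless, unlike a VAE."""
    image = [[None] * width for _ in range(height)]
    patches_per_row = width // patch
    for index, token in enumerate(tokens):
        top = (index // patches_per_row) * patch
        left = (index % patches_per_row) * patch
        for offset in range(patch * patch):
            y, x = top + offset // patch, left + offset % patch
            start = offset * channels
            image[y][x] = token[start:start + channels]
    return image


def build_sequence(text_tokens, pixel_tokens):
    # One shared sequence; a real model would first project both
    # modalities to a common width. Tags here are illustrative.
    return [("text", t) for t in text_tokens] + [("pixel", p) for p in pixel_tokens]


# Tiny 4 x 4 RGB image whose pixels encode their own coordinates.
image = [[[y, x, 0] for x in range(4)] for y in range(4)]
pixel_tokens = patchify(image, patch=2)
print(len(pixel_tokens), len(pixel_tokens[0]))  # 4 12

sequence = build_sequence([[101], [102]], pixel_tokens)
print(len(sequence))  # 6

# Sequence length at the native resolution, under the assumed patch size.
print((2048 // 16) ** 2)  # 16384
```

The round trip through `unpatchify` returns the original pixels exactly, which is the practical difference from a latent pipeline: no learned decoder sits between the transformer's output and the image.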