Module f-modern · Image · 8 min

FLUX, ERNIE, HiDream.

Modern unified architectures - what changed when the VAE and text encoder went away.

Reading time: 8 min · Audio: none · Prerequisites: 14 · Source: Track A · Gemini
§ 1

What this lesson covers.

This module is one of 42 in the curriculum. Below is the canonical interactive lesson — tabs, cards, and diagrams from the source repo, rendered inside the course shell. There is no audio narration for this module - it ships as text + interactive lesson only.

If you prefer to read first and play with the demos after, the interactive lesson sits below this section.

§ 2

The lesson itself.

Interactive lesson · ported from Gemini track
Image · Architecture

Modern Unified Image Architectures

FLUX · ERNIE-Image · HiDream-O1 — what changed when the VAE and text encoder went away

FROM CASCADES TO UNIFIED

The bygone era of three-model pipelines

Stable Diffusion 1.x and 2.x chained three separate models: a CLIP text encoder to embed the prompt, a U-Net to denoise in latent space, and a VAE decoder to turn latents into pixels. Each model was trained separately. SDXL added a second, larger text encoder (OpenCLIP ViT-bigG) alongside CLIP ViT-L for more capacity. The boundaries between these stages were a source of bugs and an obstacle to scaling. FLUX, ERNIE-Image, and HiDream-O1 fold progressively more of that pipeline into a single transformer, with HiDream-O1 going the furthest.
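
To make the stage boundaries concrete, here is a minimal shape-bookkeeping sketch of the Stable Diffusion 1.x pipeline. The constants are the published SD 1.x values (8x VAE downsampling per side, 4 latent channels, 77 CLIP tokens of width 768). The class and function names are made up for illustration and do not correspond to any library API.

```python
from dataclasses import dataclass

# Published Stable Diffusion 1.x values (not tied to any library API).
VAE_DOWNSAMPLE = 8     # each spatial side shrinks 8x inside the VAE
LATENT_CHANNELS = 4    # channels of the latent that the U-Net denoises
PROMPT_TOKENS = 77     # CLIP context length
PROMPT_WIDTH = 768     # CLIP ViT-L/14 embedding width


@dataclass(frozen=True)
class StageShapes:
    """Hypothetical container for the tensors handed between stages."""
    prompt_embedding: tuple  # text encoder -> U-Net
    latent: tuple            # U-Net -> VAE decoder
    image: tuple             # VAE decoder -> user


def stage_shapes(height: int, width: int) -> StageShapes:
    """Return the shape crossing each stage boundary for one RGB image."""
    if height % VAE_DOWNSAMPLE or width % VAE_DOWNSAMPLE:
        raise ValueError("image sides must be multiples of 8")
    return StageShapes(
        prompt_embedding=(PROMPT_TOKENS, PROMPT_WIDTH),
        latent=(LATENT_CHANNELS, height // VAE_DOWNSAMPLE, width // VAE_DOWNSAMPLE),
        image=(3, height, width),
    )


def element_count(shape: tuple) -> int:
    """Multiply out the sides of a shape."""
    total = 1
    for side in shape:
        total *= side
    return total


shapes = stage_shapes(512, 512)
compression = element_count(shapes.image) / element_count(shapes.latent)
print(shapes.latent, compression)
```

For a 512×512 image the U-Net works on a 4×64×64 latent, 48 times fewer values than the RGB image. Each of the three shapes is an interface that two separately trained models must agree on, which is where the pipeline's integration bugs tend to live.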
FLUX

MMDiT — double-stream diffusion transformer

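Black Forest Labs' FLUX (2024) builds on the Multimodal Diffusion Transformer (MMDiT) design. In its double-stream blocks, text and image tokens keep separate weights but are concatenated for a joint attention step, so each stream attends to the other without dedicated cross-attention layers. The text encoders are still T5 and CLIP, and the model still denoises in a VAE latent space, but everything in between is one big transformer. FLUX.1 schnell is a step-distilled variant that samples in about 4 steps instead of roughly 50. FLUX.2 (2025) scales further. Weights are open for the dev and schnell variants; the pro variant is API-only.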
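
The double-stream idea can be sketched in a few lines of plain Python: each modality gets its own query, key, and value projections, and attention then runs once over the concatenated sequence. This is a toy single-head illustration with random weights and no normalization, residuals, or modulation. It is not FLUX's implementation, and every name in it is hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x d) matrix by a (d x m) matrix, both nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_weights(dim, rng):
    """A (dim x dim) projection with small random entries (learned in practice)."""
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def joint_attention(text_tokens, image_tokens, dim, seed=0):
    """Single-head joint attention with separate per-stream projections."""
    rng = random.Random(seed)
    # Each stream owns its own Q/K/V weights.
    text_w = {name: random_weights(dim, rng) for name in "qkv"}
    image_w = {name: random_weights(dim, rng) for name in "qkv"}

    # Project per stream, then concatenate along the sequence axis.
    q = matmul(text_tokens, text_w["q"]) + matmul(image_tokens, image_w["q"])
    k = matmul(text_tokens, text_w["k"]) + matmul(image_tokens, image_w["k"])
    v = matmul(text_tokens, text_w["v"]) + matmul(image_tokens, image_w["v"])

    # One attention pass: every token sees both text and image tokens.
    scale = 1.0 / math.sqrt(dim)
    weights = [
        softmax([scale * sum(a * b for a, b in zip(qi, kj)) for kj in k])
        for qi in q
    ]
    mixed = matmul(weights, v)

    # Split the result back into the two streams.
    n_text = len(text_tokens)
    return mixed[:n_text], mixed[n_text:], weights


text = [[0.1 * (i + j) for j in range(4)] for i in range(3)]
image = [[0.05 * (i - j) for j in range(4)] for i in range(5)]
text_out, image_out, weights = joint_attention(text, image, dim=4)
print(len(text_out), len(image_out), len(weights[0]))
```

With 3 text tokens and 5 image tokens, every row of `weights` has 8 entries: each token attends over both modalities in a single pass, which is what replaces the explicit cross-attention of older U-Net designs.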
ERNIE-IMAGE

Single-stream DiT with built-in prompt enhancer

Baidu's ERNIE-Image (2025) is an 8B single-stream diffusion transformer released under Apache-2.0. It includes a Turbo variant distilled to 8 inference steps via DMD + RL. It also ships a Reasoning-Driven Prompt Agent that rewrites a casual prompt into a structured, layout-aware version before passing it to the diffusion model. It is strong on long-text rendering (LongText-Bench 0.97+), which is what makes technical labels legible in-image.
HIDREAM-O1

Pixel-level unified transformer — no VAE, no separate text encoder

HiDream.ai's O1 (2026-05-08) goes the furthest: a single Pixel-level Unified Transformer (UiT) that natively encodes raw pixels, text, and task-specific conditions in one shared token space. It has no external VAE and no disjoint text encoder. The model has 8B parameters, ships under the MIT license, and generates natively at 2048×2048. It supports text-to-image, instruction editing, and multi-reference subject personalization in one model. At the time of writing it ranks #8 on the Artificial Analysis Text-to-Image Arena.
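
Working on raw pixels means the image is cut directly into patches, with no VAE latent in between, and the patch size then sets the sequence length. The sketch below shows generic patchification and the resulting token counts at 2048×2048. The patch sizes 16 and 32 are assumptions chosen for illustration, since the lesson does not state the patch size HiDream-O1 uses. The function names are hypothetical.

```python
def patch_token_count(height, width, patch_size):
    """Number of non-overlapping square patches covering the image."""
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    return (height // patch_size) * (width // patch_size)


def patchify(image, patch_size):
    """Flatten an H x W x C nested-list image into one vector per patch."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of the patch size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            # Row-major walk over the pixels of this patch.
            for y in range(top, top + patch_size):
                for x in range(left, left + patch_size):
                    patch.extend(image[y][x])
            tokens.append(patch)
    return tokens


# Tiny 4 x 4 RGB image whose pixels record their own (y, x) position.
image = [[[y, x, 0] for x in range(4)] for y in range(4)]
tokens = patchify(image, 2)
print(len(tokens), len(tokens[0]))

# Sequence length at the native resolution for two assumed patch sizes.
for assumed_patch in (16, 32):
    print(assumed_patch, patch_token_count(2048, 2048, assumed_patch))
```

The tiny image yields 4 tokens of 12 values each. At 2048×2048, a patch size of 16 gives 16,384 tokens and a patch size of 32 gives 4,096, so the choice of patch size largely decides how expensive attention is when no VAE is there to shrink the image first.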