Module 24 · Image · 8 min

Latent space and control.

Walking the latent space. ControlNet conditioning.

Reading time: 8 min · Audio narration available · Prerequisites: 05, 14 · Source: Track A · Gemini
§ 1

What this lesson covers.

This module is one of 42 in the curriculum. Below is the canonical interactive lesson (tabs, cards, and diagrams from the source repo), rendered inside the course shell. An audio narration runs alongside it: the sticky player at the top of the page plays the full Module 24 clip.

If you prefer to read first and play with the demos after, the interactive lesson sits below this section. If you'd rather hear it narrated while you scroll, hit play on the sticky audio bar at the top — or just let it autoplay.

§ 2

The lesson itself.

Interactive lesson · ported from Gemini track

Layer 15 — Latent Space Explorer

Moving through the mathematics of imagination

The VAE Zip File
What is a Latent Space?
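Instead of operating on raw pixels, which are expensive to process, Stable Diffusion uses a Variational Autoencoder (VAE) to compress each image by a factor of 8 along each spatial dimension. In Stable Diffusion v1, for example, a 512×512 RGB image becomes a 64×64 latent with 4 channels. But it doesn't compress images the way a JPEG does.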
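To make the compression concrete, here is a minimal sketch of the shape arithmetic. It assumes the Stable Diffusion v1 layout (8x spatial downsampling, 4 latent channels, 3 image channels); the function names are illustrative and not part of any library.

```python
# Shape arithmetic for a VAE encoder, assuming the Stable Diffusion v1 layout:
# 8x downsampling per spatial dimension and 4 latent channels.

def latent_shape(height, width, factor=8, latent_channels=4):
    """Return the (channels, height, width) of the latent for a given image size."""
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (latent_channels, height // factor, width // factor)


def compression_ratio(height, width, image_channels=3, factor=8, latent_channels=4):
    """Ratio of pixel values in the image to values in the latent."""
    c, h, w = latent_shape(height, width, factor, latent_channels)
    return (image_channels * height * width) / (c * h * w)


shape = latent_shape(512, 512)        # (4, 64, 64)
ratio = compression_ratio(512, 512)   # 48.0
print(shape, ratio)
```

So "8x" refers to each side of the image; counted in raw values, the latent is about 48 times smaller than the RGB input under these assumptions.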

It compresses them into a continuous Latent Space. In this mathematical coordinate system, similar concepts tend to group together. If you move from the "Dog" coordinate towards the "Cat" coordinate, the decoded images change smoothly, passing through plausible in-between creatures (fox-like ones, say) rather than visual glitches. In that sense the space itself is meaningful.
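Walking from one coordinate to another can be sketched as plain linear interpolation. The two latent vectors below are hypothetical 2D stand-ins for "Dog" and "Cat"; in a real model each point on the path would be passed to the VAE decoder, which is not shown here.

```python
import numpy as np


def lerp(z_a, z_b, t):
    """Linear interpolation: t=0 gives z_a, t=1 gives z_b."""
    return (1.0 - t) * z_a + t * z_b


def walk(z_a, z_b, steps):
    """Evenly spaced latent points from z_a to z_b, endpoints included."""
    return [lerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, steps)]


# Hypothetical coordinates, chosen only for illustration.
z_dog = np.array([-1.0, 0.5])
z_cat = np.array([1.0, -0.5])

path = walk(z_dog, z_cat, steps=5)
for z in path:
    print(z)  # each z would be decoded into one frame of the walk
```

For high-dimensional Gaussian latents, spherical interpolation is often used instead of a straight line; the linear version is shown because it is the simplest instance of the idea.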
Interactive: Drag to Explore
Imagine a tiny two-dimensional Latent Space. The X-axis represents texture (Fluffy vs Scaly) and the Y-axis represents scale (Small vs Large). Drag your mouse around the grid to watch the VAE decoder generate the corresponding image.
Grid axes: X runs from Fluffy / Soft to Scaly / Hard; Y runs from Small Scale to Large Scale.
VAE Decoder Output at z = [0.00, 0.00]: 🐶
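The toy grid can be mimicked with a hand-written "decoder" that turns a 2D point into a description. Everything here is an assumption made for illustration: the [-1, 1] coordinate range, the axis directions, and the describe function itself. A real VAE decoder is a neural network that outputs pixels, not labels.

```python
def clamp(value, low=-1.0, high=1.0):
    """Keep a coordinate inside the assumed grid range."""
    return max(low, min(high, value))


def describe(z):
    """Map a 2D latent point to toy attributes.

    Assumed convention: x = -1 is fully fluffy, x = +1 is fully scaly;
    y = -1 is smallest, y = +1 is largest.
    """
    x, y = clamp(z[0]), clamp(z[1])
    scaliness = (x + 1.0) / 2.0   # 0.0 = fluffy, 1.0 = scaly
    size = (y + 1.0) / 2.0        # 0.0 = small, 1.0 = large
    texture = "fluffy" if scaliness < 0.5 else "scaly"
    scale = "small" if size < 0.5 else "large"
    return {"scaliness": scaliness, "size": size, "label": f"{scale}, {texture}"}


print(describe([-1.0, 1.0]))   # top-left corner of the assumed grid
print(describe([0.9, -0.8]))   # near the bottom-right corner
```

The point of the sketch is that nearby coordinates give nearby attribute values, which is what makes dragging across the grid feel continuous.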
Structural Rules
ControlNet
The standard latent space is great for random generation, but we often want a specific composition. ControlNet adds spatial conditioning: you pass a stick-figure pose, a depth map, or a similar guide image as an extra input, and the diffusion process is strongly steered toward generating an image that follows those structural boundaries.
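One way to picture how the extra input is attached is the "zero convolution" idea from the ControlNet paper: a trainable copy of a network block receives the condition, and its output is added back through 1x1 convolutions initialized to zero, so an untrained ControlNet leaves the base model unchanged. The sketch below is a heavily simplified NumPy version under stated assumptions: a single block, a tanh layer standing in for a real U-Net block, and a condition already shaped like the feature map. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def conv1x1(x, weight, bias):
    """1x1 convolution. x: (C_in, H, W), weight: (C_out, C_in), bias: (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]


def block(x, weight, bias):
    """Stand-in for one network block (a real one would be far larger)."""
    return np.tanh(conv1x1(x, weight, bias))


C, H, W = 4, 8, 8
w_base, b_base = rng.normal(size=(C, C)), rng.normal(size=C)  # frozen weights
w_copy, b_copy = w_base.copy(), b_base.copy()                 # trainable copy
zeros_w, zeros_b = np.zeros((C, C)), np.zeros(C)              # zero-conv init


def controlled_block(x, cond, w_in, b_in, w_out, b_out):
    """Frozen block output plus a control residual gated by two 1x1 convs."""
    base = block(x, w_base, b_base)
    hint = conv1x1(cond, w_in, b_in)             # condition enters here
    control = block(x + hint, w_copy, b_copy)    # trainable copy sees x + hint
    return base + conv1x1(control, w_out, b_out)


x = rng.normal(size=(C, H, W))
depth_hint = rng.random(size=(C, H, W))  # stand-in for an encoded depth map

base_out = block(x, w_base, b_base)
untrained = controlled_block(x, depth_hint, zeros_w, zeros_b, zeros_w, zeros_b)

# Pretend training has moved the zero convs away from zero.
w_trained = 0.1 * np.eye(C)
trained = controlled_block(x, depth_hint, w_trained, zeros_b, w_trained, zeros_b)

print(np.allclose(untrained, base_out))  # control has no effect before training
print(np.allclose(trained, base_out))    # control changes the output afterwards
```

The design choice worth noticing is that the base weights are never touched: the control signal only adds a residual, which is why the same base model can be paired with different conditioning types.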
Next-Gen Diffusion
Flow Matching & Consistency Models
Standard diffusion takes 20-50 steps of iterative denoising to produce an image. Newer approaches shorten that path. Flow Matching trains the model to move from noise to image along paths that are close to straight lines rather than curves, so a few large steps stay accurate. Consistency Models learn to jump from any point on the denoising path directly to its endpoint. Both allow high-quality generation in roughly 1 to 4 steps, which makes near-real-time generation practical.
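Why straight paths need fewer steps can be checked numerically. The sketch below integrates two hand-written 2D velocity fields with the Euler method: one moves along a straight line from a "noise" point to an "image" point, the other along a quarter circle between the same points. Both fields are toy assumptions; a real model learns its velocity field with a neural network, and the quarter circle is only a stand-in for a curved trajectory.

```python
import numpy as np

noise = np.array([1.0, 0.0])   # start point at t = 0 (toy "noise")
image = np.array([0.0, 1.0])   # end point at t = 1 (toy "image")


def euler(x0, velocity, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)
    return x


def straight_velocity(x, t):
    """Constant velocity: the path x_t = (1 - t) * noise + t * image."""
    return image - noise


def curved_velocity(x, t):
    """Rotation field whose exact path is a quarter circle from noise to image."""
    return (np.pi / 2) * np.array([-x[1], x[0]])


def error(velocity, steps):
    """Distance between the Euler result and the true endpoint."""
    return float(np.linalg.norm(euler(noise, velocity, steps) - image))


for steps in (1, 4, 50):
    print(steps, error(straight_velocity, steps), error(curved_velocity, steps))
```

On the straight path a single Euler step already lands on the endpoint, while the curved path needs many small steps to get close. That is the intuition behind few-step sampling, not a model of any particular sampler.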
§ DEMO

Try it: latent + ControlNet.

Drag through the 2D latent space, then toggle between four ControlNet conditioning modes (canny, depth, pose, segment) on the same latent point.
