Module 14 · Image · 10 min

VAEs and the
diffusion process.

VAE compression, Stable Diffusion denoising, CLIP text conditioning.

Reading time: 10 min · Audio: narration available · Prerequisites: 05, 23 · Source: Track A · Gemini
§ 1

What this lesson covers.

This module is one of 42 in the curriculum. Below is the canonical interactive lesson (tabs, cards, and diagrams from the source repo) rendered inside the course shell. An audio narration runs alongside it: the sticky player at the top of the page plays the full Module 14 clip.

If you prefer to read first and play with the demos after, the interactive lesson sits below this section. If you'd rather hear it narrated while you scroll, hit play on the sticky audio bar at the top — or just let it autoplay.

§ 2

The lesson itself.


Layer 15 — VAE & Stable Diffusion

Unblurring the Static to Generate Art

Course File: L15
The "TV Static" Analogy
How do you teach an AI to draw?
If you tell an AI "draw a cat", it has no hands and no imagination. But what if you take a picture of a cat, and slowly add "static" (like a broken old TV screen) until the cat is completely gone?

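As you do this, you force the AI to watch. The AI's only job is to guess what the picture looked like one step before you added the static. In practice, it is usually trained to predict the static that was added, which amounts to the same thing. It learns to "un-smudge" the noise. A model trained this way is called a Diffusion Model.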
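The "add static, then guess it back" idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the lesson's own code: the 1000-step count, the linear noise schedule, and the names `add_noise` and `recover_image` are all assumptions chosen for the example, and a perfect noise guess stands in for a trained network.

```python
import numpy as np

T = 1000                               # assumed number of noise steps
betas = np.linspace(1e-4, 0.02, T)     # assumed linear noise schedule
alpha_bars = np.cumprod(1.0 - betas)   # share of the original signal left after each step


def add_noise(x0, t, rng):
    """Jump straight to noise level t (1..T) by mixing the picture with Gaussian static."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bars[t - 1]
    xt = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return xt, eps


def recover_image(xt, eps_pred, t):
    """Undo the mix above, given a guess eps_pred of the static that was added."""
    a = alpha_bars[t - 1]
    return (xt - np.sqrt(1.0 - a) * eps_pred) / np.sqrt(a)


rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))    # stand-in for a tiny "cat picture"
xt, eps = add_noise(x0, t=500, rng=rng)     # half-way to pure static
x0_hat = recover_image(xt, eps, t=500)      # a perfect noise guess gives the picture back
```

In a real system the perfect guess `eps` would be replaced by a neural network's prediction, and the un-smudging would happen over many small steps rather than one jump.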
Interactive: The Denoising Process
1. Drag the slider to the right to completely destroy the picture until it's just pure Random Noise.
2. Then, click "Reverse Diffusion (AI Un-smudge)" to watch the AI mathematically peel away the noise and generate a brand new image from scratch!
Slider range: Clear Picture (Noise: 0%) to Pure Static
Why it's so fast
VAE (The Zip File)
A 4K image (3840 × 2160) has over 8 million pixels. Running many un-smudging passes over all 8 million pixels is slow and memory-hungry!

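Enter the Variational Autoencoder (VAE). It acts like a ZIP file compressor, except that it is lossy: it shrinks the giant image down into a much smaller mathematical representation that lives in a "Latent Space". The AI adds and removes static on this tiny latent instead of the huge image, and the VAE's decoder turns the finished latent back into full-size pixels. In Stable Diffusion, for example, a 512×512 image becomes a 64×64 latent with 4 channels. This makes generating images much faster, fast enough to run on consumer graphics cards.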
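To get a feel for how much smaller the latent is, here is a rough back-of-the-envelope sketch. The 8× downsampling per side and the 4 latent channels are assumptions taken from Stable Diffusion v1; other models use different values. The helper names `latent_shape` and `compression_ratio` are hypothetical.

```python
def latent_shape(height, width, downsample=8, channels=4):
    """Latent shape as (channels, height, width); the defaults are assumed SD v1 values."""
    if height % downsample or width % downsample:
        raise ValueError("height and width must be multiples of the downsample factor")
    return (channels, height // downsample, width // downsample)


def compression_ratio(height, width, rgb_channels=3, **kwargs):
    """How many times fewer numbers the latent holds than the RGB image."""
    c, h, w = latent_shape(height, width, **kwargs)
    return (height * width * rgb_channels) / (c * h * w)


pixels_4k = 3840 * 2160                   # 8,294,400 pixels in a 4K frame
shape_512 = latent_shape(512, 512)        # (4, 64, 64)
ratio_512 = compression_ratio(512, 512)   # 48.0
```

Under these assumptions the un-smudging network handles about 48 times fewer numbers per step than it would on raw pixels.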
Controlling the Art
Text Prompts (CLIP)
How does the AI know it should un-smudge the static into a Dog instead of a Cat?

We use an extra system called CLIP, which was trained to connect text to images. When you type "A cyberpunk dog", CLIP's text encoder translates those words into a list of numbers (an embedding) that works like a mathematical compass. At every step, the un-smudging network is fed this embedding, so it follows the CLIP compass, shifting its brushstrokes so the final image matches your words!
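One common way to make the un-smudging "follow the compass" is classifier-free guidance, which Stable Diffusion pipelines typically use; the lesson itself does not name the technique, so treat this as an illustrative assumption. The network guesses the static twice, once with an empty prompt and once with the prompt's embedding, and the difference between the two guesses is exaggerated. The function name `guided_noise`, the toy vectors, and the scale of 7.5 are placeholders, not values from the lesson.

```python
import numpy as np


def guided_noise(eps_uncond, eps_text, guidance_scale=7.5):
    """Move the noise guess from the prompt-free estimate toward the prompt-aware one."""
    return eps_uncond + guidance_scale * (eps_text - eps_uncond)


# Toy stand-ins for two denoiser outputs on the same noisy latent
eps_uncond = np.array([0.0, 0.0])   # guess made with an empty prompt
eps_text = np.array([1.0, 0.5])     # guess made with the prompt embedding

no_steer = guided_noise(eps_uncond, eps_text, guidance_scale=0.0)   # ignores the prompt
plain = guided_noise(eps_uncond, eps_text, guidance_scale=1.0)      # equals eps_text
strong = guided_noise(eps_uncond, eps_text, guidance_scale=7.5)     # exaggerates the prompt direction
```

A larger scale makes the image stick more closely to the prompt, usually at the cost of variety.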
§ DEMO

Try it: diffusion denoise.

Drag the T-slider from pure noise (T=1000) down to a coherent image (T=0). Try different target prompts.

[Interactive demo: Diffusion Denoise]