Module f-diff-math · Image · 8 min

Diffusion math, slowly.

Forward + reverse process, score matching, why noise schedules matter.

Reading time: 8 min · Audio: - · Prerequisites: 14 · Source: Track A · Gemini
§ 1

What this lesson covers.

This module is one of 42 in the curriculum. Below is the canonical interactive lesson (tabs, cards, and diagrams from the source repo), rendered inside the course shell. There is no audio narration for this module; it ships as text plus the interactive lesson only.

If you prefer to read first and play with the demos afterwards, the interactive lesson sits below this section.

§ 2

The lesson itself.

Interactive lesson · ported from Gemini track
Image · Generation

FORWARD PROCESS

Noising a clean image, one step at a time

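The forward process q(x_t | x_{t-1}) = N(x_t; √(1-β_t) x_{t-1}, β_t I) adds a small amount of Gaussian noise at each timestep t, where β_t is the noise variance set by the schedule. Because every step is Gaussian, x_t can also be sampled directly from the clean image x_0: q(x_t | x_0) = N(x_t; √ᾱ_t x_0, (1-ᾱ_t) I), where ᾱ_t is the product of (1-β_s) for s = 1..t. After T = 1000 steps with a properly chosen noise schedule, ᾱ_T is close to zero and the image is practically indistinguishable from pure Gaussian noise. The forward process has no learnable parameters: it is a fixed corruption schedule that the model never has to predict.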
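The forward process described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's reference code: the function names (`linear_betas`, `forward_step`, `forward_jump`) and the 8×8 random array standing in for an image are assumptions made for the example, and the schedule endpoints are the linear DDPM values quoted later in this lesson.

```python
import numpy as np

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (endpoints as quoted for the original DDPM)."""
    return np.linspace(beta_start, beta_end, T)

def forward_step(x_prev, beta_t, rng):
    """One step of q(x_t | x_{t-1}): shrink the signal, add Gaussian noise."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def forward_jump(x0, t, alpha_bar, rng):
    """Sample x_t directly from x_0 via the closed form
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    Here t is a 0-based index into alpha_bar."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

betas = linear_betas()
alpha_bar = np.cumprod(1.0 - betas)  # fraction of signal variance left after each step

rng = np.random.default_rng(0)
x0 = rng.uniform(-1.0, 1.0, size=(8, 8))  # hypothetical stand-in for an image in [-1, 1]

# Step-by-step noising: apply q(x_t | x_{t-1}) T times.
x = x0.copy()
for beta_t in betas:
    x = forward_step(x, beta_t, rng)

print("signal scale left after T steps:", np.sqrt(alpha_bar[-1]))
```

With this schedule the remaining signal scale √ᾱ_T comes out well under 1% of the original, which is what "indistinguishable from noise" means in practice. In training code, `forward_jump` is the form usually used, since it produces x_t for any t in one shot instead of looping.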
REVERSE PROCESS

The model only has to learn one thing: predict the noise

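The reverse process p_θ(x_{t-1} | x_t) is what the U-Net learns. At each timestep, given the noisy image x_t and the timestep t, the network ε_θ predicts the noise ε that was added. Sampling removes a scaled fraction of that prediction (and, in the original DDPM sampler, adds back a small amount of fresh noise) to get a slightly less noisy image x_{t-1}, then repeats this for all T steps. The simplified training loss is L = E[ ||ε - ε_θ(x_t, t)||² ]: the mean squared error between the true noise and the predicted noise, averaged over images, timesteps, and noise draws.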
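As a rough sketch of how the loss and the sampling loop fit together, the example below uses a DDPM-style ancestral sampler with the per-step noise variance set to β_t. Everything here is illustrative: `toy_eps_model` is a hypothetical stand-in for the trained U-Net (it is the exact optimal predictor only for data drawn from N(0, I)), and the function names are assumptions, not an API from the source.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)    # linear schedule, endpoints from this lesson
alpha_bar = np.cumprod(1.0 - betas)   # cumulative signal fraction

def toy_eps_model(x_t, t):
    """Hypothetical stand-in for the U-Net. For data drawn from N(0, I),
    the best possible noise estimate is sqrt(1 - alpha_bar_t) * x_t."""
    return np.sqrt(1.0 - alpha_bar[t]) * x_t

def noise_prediction_loss(eps_model, x0, t, rng):
    """MSE between the true noise and the predicted noise at one timestep."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)

def reverse_step(eps_model, x_t, t, rng):
    """One ancestral step x_t -> x_{t-1}, with t a 0-based index."""
    eps_hat = eps_model(x_t, t)
    # Remove the scaled noise estimate, then undo the forward shrinkage.
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps_hat) / np.sqrt(1.0 - betas[t])
    if t == 0:
        return mean  # no fresh noise on the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)

def sample(eps_model, shape, rng):
    """Start from pure noise and apply the reverse step T times."""
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        x = reverse_step(eps_model, x, t, rng)
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal(10_000)  # toy "dataset": scalars drawn from N(0, 1)
loss = noise_prediction_loss(toy_eps_model, x0, t=500, rng=rng)
samples = sample(toy_eps_model, shape=(10_000,), rng=rng)
print("loss at t=500:", loss)
print("sample std:", samples.std())
```

Because the toy predictor is optimal for N(0, 1) data, the generated samples should again look like draws from N(0, 1). With a real image model, `toy_eps_model` would be replaced by a network trained to minimize `noise_prediction_loss` over random timesteps.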
SCORE MATCHING

Why noise prediction is equivalent to learning the data distribution

Song et al. (2021) showed that predicting noise is mathematically equivalent to estimating the score function ∇_x log p(x), the gradient of the log-probability density with respect to the input. More precisely, at each noise level the predicted noise equals the score of the noised data distribution up to a negative scale factor that depends on the noise level. Score matching is a well-studied technique going back to Hyvärinen (2005). Diffusion models are score-based models in disguise, and this connection unifies DDPM with the earlier NCSN family of models.
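The equivalence can be checked numerically in a case where the score is known exactly. The sketch below assumes toy 1-D data x_0 ~ N(0, 2²) and a single noise level ᾱ_t = 0.5 (both arbitrary choices for illustration), fits the simplest possible noise predictor by least squares, and converts it to a score estimate using score ≈ -ε_hat / √(1-ᾱ_t).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
data_std = 2.0      # toy data: x_0 ~ N(0, data_std^2), an assumption for this check
alpha_bar_t = 0.5   # one arbitrary noise level

# Noise the data with the closed-form forward process.
x0 = data_std * rng.standard_normal(n)
eps = rng.standard_normal(n)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# "Train" a linear noise predictor eps_hat = w * x_t by least squares.
# For Gaussian data the optimal predictor really is linear in x_t.
w = np.sum(eps * x_t) / np.sum(x_t ** 2)

# Convert the noise prediction to a score estimate:
# score(x_t) ~= -eps_hat(x_t) / sqrt(1 - alpha_bar_t), so its slope in x_t is:
score_slope_from_noise = -w / np.sqrt(1.0 - alpha_bar_t)

# Exact score of the noised distribution N(0, var_t): d/dx log q(x) = -x / var_t.
var_t = alpha_bar_t * data_std ** 2 + (1.0 - alpha_bar_t)
score_slope_exact = -1.0 / var_t

print("score slope from noise predictor:", score_slope_from_noise)
print("exact score slope:               ", score_slope_exact)
```

The two slopes should agree up to Monte Carlo error, which is the sense in which a noise predictor is a rescaled score estimator for the noised distribution at that noise level.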
NOISE SCHEDULE

Linear vs cosine vs sigmoid — the choice that affects everything

The schedule β_t controls how fast information gets destroyed. The original DDPM paper used a linear schedule from β_1 = 0.0001 to β_T = 0.02. iDDPM (Nichol & Dhariwal 2021) showed that this linear schedule destroys information faster than necessary, leaving the late timesteps almost pure noise, and that a cosine schedule, which lets the signal decay more gradually, leads to better samples. Modern systems use a range of schedules from both the variance-preserving (DDPM-style) and variance-exploding (NCSN-style) families. Practical takeaway: the schedule choice can change FID by several points.
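To make the difference concrete, the sketch below compares how much signal (ᾱ_t) survives under the two schedules. The cosine form follows the definition in Nichol & Dhariwal, ᾱ_t = f(t)/f(0) with f(t) = cos²(((t/T + s)/(1 + s)) · π/2) and offset s = 0.008, with β_t clipped at 0.999; the function names are assumptions made for this example.

```python
import numpy as np

def linear_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction under the linear DDPM schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def cosine_alpha_bar(T=1000, s=0.008):
    """Cumulative signal fraction under the cosine schedule:
    alpha_bar_t = f(t) / f(0), f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2)."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return (f / f[0])[1:]  # drop t = 0, keep t = 1..T

def betas_from_alpha_bar(alpha_bar, max_beta=0.999):
    """Recover per-step betas; clipping avoids a degenerate final step."""
    prev = np.concatenate(([1.0], alpha_bar[:-1]))
    return np.clip(1.0 - alpha_bar / prev, 0.0, max_beta)

lin_ab = linear_alpha_bar()
cos_ab = cosine_alpha_bar()
cos_betas = betas_from_alpha_bar(cos_ab)

# Compare the remaining signal fraction at a quarter, half, and three quarters
# of the way through the forward process.
for t in (250, 500, 750):
    print(f"t={t}: linear alpha_bar={lin_ab[t - 1]:.4f}, cosine alpha_bar={cos_ab[t - 1]:.4f}")
```

At each of these checkpoints the cosine schedule retains more signal than the linear one, so fewer timesteps are spent on inputs that are already almost pure noise.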
§ PAPERS

Further reading.

The canonical references for this module.
