§ 1
What this lesson covers.
This module is one of 42 in the curriculum. Below is the canonical interactive lesson (tabs, cards, and diagrams from the source repo), rendered inside the course shell. There is no audio narration for this module; it ships as text plus the interactive lesson only.
If you prefer to read first and play with the demos after, the interactive lesson sits below this section.
Interactive lesson · ported from Gemini track
THE NEW DIMENSION
Video adds time, and time breaks naive extrapolations
If diffusion can generate an image, can it generate 24 of them and call it video? No. Each frame would be coherent on its own but completely independent of the next, and the result is flickering chaos. Video diffusion has to enforce temporal coherence: a person's face, clothing, and lighting must stay consistent across frames, and motion must follow physics.
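To make the flicker argument concrete, here is a minimal sketch in plain Python. It is not a diffusion model: `make_frames` and `flicker` are hypothetical helpers, frames are flat lists of random pixel values, and the "coherent" case is assumed to be one shared base image plus small per-frame jitter. It only illustrates how frame-to-frame difference separates independent sampling from temporally consistent output.

```python
import random

def make_frames(num_frames, num_pixels, shared, seed=0):
    """Return a list of frames, each a flat list of pixel values.

    shared=False: every frame is sampled independently
                  (naive per-frame image generation).
    shared=True:  every frame is one base image plus small jitter
                  (a crude stand-in for temporally coherent generation).
    """
    rng = random.Random(seed)
    if not shared:
        return [[rng.random() for _ in range(num_pixels)]
                for _ in range(num_frames)]
    base = [rng.random() for _ in range(num_pixels)]
    return [[p + rng.uniform(-0.01, 0.01) for p in base]
            for _ in range(num_frames)]

def flicker(frames):
    """Mean absolute pixel difference between consecutive frames."""
    total, count = 0.0, 0
    for prev, cur in zip(frames, frames[1:]):
        total += sum(abs(a - b) for a, b in zip(prev, cur))
        count += len(cur)
    return total / count

independent = flicker(make_frames(24, 1024, shared=False))
coherent = flicker(make_frames(24, 1024, shared=True))
print(f"independent frames: {independent:.3f}")
print(f"coherent frames:    {coherent:.3f}")
```

Under these assumptions the independent frames differ by about a third of the full pixel range on average, while the shared-base frames differ by well under one percent. A real video model has to reach the second regime while still allowing deliberate motion.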
3D CONVOLUTIONS & TEMPORAL ATTENTION
Two architectural choices for fusing across time
Early video diffusion (VDM, Imagen Video) used 3D convolutions that span (height, width, time) jointly. Modern systems lean on temporal attention: for each spatial position, attend across the time axis. This is more flexible, but its cost grows quadratically with frame count. Practical systems use sparse temporal attention (each frame attends only to a few nearby frames) to control cost.
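The cost difference between dense and sparse temporal attention can be sketched for a single spatial position. This is an illustration only: `temporal_attention` is a hypothetical helper, it uses the raw frame features as queries, keys, and values (no learned projections), and the sliding window is just one assumed form of sparsity.

```python
import math

def temporal_attention(tokens, window=None):
    """Self-attention along the time axis for ONE spatial position.

    tokens: list of T feature vectors, one per frame.
    window: None  -> each frame attends to all T frames (T*T scores).
            int w -> frame t attends only to frames s with |t - s| <= w.
    Returns (outputs, number_of_attention_scores_computed).
    """
    T, d = len(tokens), len(tokens[0])
    outputs, num_scores = [], 0
    for t in range(T):
        lo = 0 if window is None else max(0, t - window)
        hi = T if window is None else min(T, t + window + 1)
        # Scaled dot-product scores against the visible frames only.
        scores = [
            sum(q * k for q, k in zip(tokens[t], tokens[s])) / math.sqrt(d)
            for s in range(lo, hi)
        ]
        num_scores += len(scores)
        # Numerically stable softmax over the visible frames.
        m = max(scores)
        weights = [math.exp(x - m) for x in scores]
        z = sum(weights)
        # Weighted average of the visible frames' features.
        outputs.append([
            sum(w / z * tokens[s][j] for w, s in zip(weights, range(lo, hi)))
            for j in range(d)
        ])
    return outputs, num_scores

# 16 frames, 2 features per frame at this spatial position.
frames = [[math.sin(0.3 * t), math.cos(0.3 * t)] for t in range(16)]
_, dense_cost = temporal_attention(frames)
_, sparse_cost = temporal_attention(frames, window=2)
print(dense_cost, sparse_cost)  # 256 74
```

With 16 frames the dense version computes 16 × 16 = 256 scores per spatial position, while a window of 2 frames on either side needs 74. The dense count grows with the square of the frame count; the windowed count grows roughly linearly.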
CASCADE PIPELINE
Most video models are 2-3 stage cascades
A typical video diffusion stack: (1) A base model generates 16-32 low-resolution frames. (2) Temporal super-resolution increases the frame count (for example, 16 → 60 frames). (3) Spatial super-resolution upscales each frame (for example, 256×256 → 1024×1024). Each stage has its own model. Because the base model only ever works at low resolution and low frame count, this cascade is what makes high-resolution long-form video feasible on a budget.
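The shape bookkeeping of the three stages can be traced with a toy pipeline. Everything here is an assumption for illustration: `temporal_upsample` uses linear interpolation and `spatial_upsample` uses nearest-neighbor copying as stand-ins for the learned super-resolution models, and the frames are 8×8 rather than 256×256 so the example runs instantly. The 16 → 60 frame count and the 4× spatial factor mirror the numbers above.

```python
def temporal_upsample(frames, target_count):
    """Stage 2 stand-in: linearly interpolate between neighboring frames.

    Requires target_count >= 2.
    """
    n = len(frames)
    out = []
    for i in range(target_count):
        pos = i * (n - 1) / (target_count - 1)  # position on the source timeline
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        a = pos - lo                            # blend weight toward frame hi
        out.append([
            [(1 - a) * p + a * q for p, q in zip(row_lo, row_hi)]
            for row_lo, row_hi in zip(frames[lo], frames[hi])
        ])
    return out

def spatial_upsample(frames, factor):
    """Stage 3 stand-in: repeat every pixel `factor` times in both directions."""
    return [
        [[px for px in row for _ in range(factor)]
         for row in frame for _ in range(factor)]
        for frame in frames
    ]

def shape(video):
    """(frames, height, width) of a nested-list video."""
    return (len(video), len(video[0]), len(video[0][0]))

# Stage 1 stand-in: 16 tiny frames, each filled with its own frame index.
base = [[[float(t)] * 8 for _ in range(8)] for t in range(16)]
more_frames = temporal_upsample(base, 60)
final = spatial_upsample(more_frames, 4)

print(shape(base))         # (16, 8, 8)
print(shape(more_frames))  # (60, 8, 8)
print(shape(final))        # (60, 32, 32)
```

At the sizes quoted in the text, the base stage covers 16 × 256 × 256 ≈ 1.05 million pixel positions, while the final video has 60 × 1024 × 1024 ≈ 62.9 million, so the model that decides content and motion handles about one sixtieth of the final volume.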
PHYSICS & CONSISTENCY
What makes Sora different from Gen-1
OpenAI's Sora technical report (2024) frames video generation as world simulation. The model learns implicit physics from large-scale training (objects don't teleport, gravity applies, occluded things reappear correctly). This is what separates 2024+ video models from the earlier era: not just temporal smoothness, but physical plausibility. The cost is enormous compute budgets and proprietary data.