AI Learning Course
08 · 1 Foundation · 4 Tracks
Text · Image · Video · Voice · All built on the same transformer math
A ground-up curriculum on transformer-based AI, structured by output modality rather than module number. One Common Foundation covers representation, math, architecture, training, and inference: the layers that apply to every transformer regardless of what it generates. Four modality tracks then specialize: Text, Image, Video, Voice. The same attention mechanism powers GPT, Stable Diffusion, Sora, and ElevenLabs, and seeing them side by side is the curriculum's organizing idea. The material was authored twice in parallel, once with each of two AI co-authors (Gemini + Antigravity, Sonnet 4.6), then folded into a single canonical narrative. 125 audio narrations, two interactive demos, one pre-rendered 3D embedding video. No login. No paywall.
Open the course →
§1 Common Foundation · the five layers every modality builds on
- 1A Representation — tokens, embeddings, the prequel methods (3 modules).
- 1B Math underneath — softmax, cross-entropy, why next-token prediction works (2 modules).
- 1C Architecture — attention, Q/K/V, the transformer block, encoder-decoder (6 modules).
- 1D Training — pretraining, optimizations, RLHF, DPO/KTO/ORPO, PEFT/LoRA (6 modules).
- 1E Inference — the decoding loop, the highway, offline engines, the HF registry (4 modules).
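For a concrete taste of what layers 1B and 1C cover, here is a minimal sketch of softmax, cross-entropy, and single-query scaled dot-product attention in plain Python. It is an illustration only, not code taken from the course: the function names (softmax, cross_entropy, attend) and the toy numbers are assumptions chosen for this example, and real systems use batched tensor libraries rather than Python lists.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(probs[target_index])

def attend(query, keys, values):
    """Single-query scaled dot-product attention over lists of floats."""
    scale = math.sqrt(len(query))
    # One similarity score per key: dot(query, key) / sqrt(d)
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Output is the weighted average of the value vectors
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return mixed, weights

# Toy next-token step: three candidate tokens, the first one is correct.
probs = softmax([2.0, 1.0, 0.1])
loss = cross_entropy(probs, target_index=0)

# Toy attention step: the query matches the first key more closely,
# so the output leans toward the first value vector.
output, weights = attend(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
```

The same three pieces recur in every track: the modality changes what the tokens represent, while the scoring, normalizing, and mixing steps stay the same.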
§2-§5 Four modality tracks · current depth + future slots visible
- §2 Text — RAG · chat UI · latent space. Inherits all of §1. Most complete track.
- §3 Image — Vision Transformers, cross-attention, VAE+Diffusion, ControlNet. 4 lessons + 5 future slots.
- §4 Video — AI film-studio pioneers (Sora, Runway, Luma, Kling, Higgsfield). 1 lesson + 4 future slots.
- §5 Voice — Speech synthesis and recognition (ElevenLabs, Whisper). 1 lesson + 4 future slots.
What I learned shipping the course
- Two AI co-authors covering the same material in parallel surfaces what each tool understands and what it papers over — the diff is the lesson.
- The original module numbering hid the modality structure — module 04 ("vision-speech") is actually pure Vision Transformers; module 26 ("multimodal-film") covers both video AND speech and had to be split into 26v + 26s.
- A common foundation that all modality tracks inherit beats duplicating the basics per track — mirrors how the systems actually work.
- Visible "Coming soon" slots in thin tracks (Video, Voice) set honest expectations and show where future content lands — better than hiding the asymmetry.
- Three.js scrollytelling looks great in a demo but harms reading flow in a real lesson, so the revamp drops the 3D shell and keeps the canonical content.
- A curriculum survives best when anchored on principles rather than specific products: Llama 3 and FLUX.2 examples will date; the residual stream and the diffusion process won't.