The modules are ordered pedagogically, not numerically: representation → math → architecture → training → inference. Finish all five strips and you have every concept needed to read any modern transformer paper, in any modality. After that, pick a track.
BPE, byte-level BPE, sentencepiece. The cost trade-offs that show up at scale.
Token bytes, positional encoding, the semantic word map, scale.
Pre-transformer methods. The prequel that explains why context matters.
Encoder, decoder, encoder-decoder. Why GPT, BERT, and T5 look different.
What Q, K, V actually are. Why scaled dot-product, why softmax, why heads.
The mechanics of the matrices that turn embeddings into queries, keys, values.
The Bahdanau-style variant that scaled dot-product replaced.
One transformer block, end to end. The assembly line a token moves through.
A return to the encoder-decoder pattern with everything since attention as context.
Terabytes of raw text become tokens become a loss curve.
Mixed precision, gradient accumulation, gradient checkpointing, ZeRO.
The post-training pass that turns a parrot into an assistant.
What changes when reward signal comes from groups of completions.
Why preference learning ate the alignment world. Where it's still rough.
Why LoRA is cheap. What QLoRA adds. When full fine-tunes earn their cost.
KV caches, batching, speculative decoding, paged attention.
The full pipeline as one continuous picture. Where every previous module fits.
Ollama, llama.cpp, vLLM, SGLang. The exit ramp from cloud-only.
Cloning a 40 GB model. Model cards, safetensors, what to read first.
Each track inherits the foundation strips above. Cross-listed lessons (orange-bordered) appear in two tracks because the same lesson serves both. Future-module slots are visible-but-disabled cards showing where new lessons would fit.
Retrieval-augmented generation. Semantic search, hybrid retrieval, GraphRAG.
Streaming. Tool-call surfacing. Where chat-as-product diverges from chat-as-debug-console.
Drag through a small embedding space. Watch concepts cluster, watch arithmetic on vectors.
ViT: 196 patches per 224x224 photo, 2D positional embeddings, CNN-vs-transformer.
How text and vision streams talk — CLIP, BLIP, any-to-any native models.
VAE compression, denoising process, CLIP text conditioning.
Conditioning generation with structural priors. Pose, depth, edges, segmentation.
Forward + reverse process, score matching, why noise schedules matter.
Modern unified architectures — what changed when the VAE and text encoder went away.
Subject-driven generation, identity preservation across new scenes.
Treating .json workflows as version-controlled artifacts. Plus interactive graph demo.
Locking a visual identity (paper grain, charcoal, signal-orange) across all your generations.
Sora, Runway Gen-3, Luma, Kling, Higgsfield — the platforms generating cinematic video from text.
Temporal coherence, frame consistency, why static-image tricks don't transfer.
What's inside a modern video diffusion model — shot composition, motion control.
Keeping a character's face, costume, and lighting consistent across hundreds of frames.
Lip-sync, beat-cut, the patterns underneath the AI Music Idols pipeline.
ElevenLabs, OpenAI Whisper — cloning resonance, cadence, breath artifacts.
Whisper architecture, alignment timestamps, language detection.
edge-tts, ElevenLabs, voice cloning — what each tier costs in latency and quality.
Single-shot vs few-shot cloning, where artifacts come from, ethical guardrails.
Sub-300ms first-syllable latency. Includes timeline demo.
Sixteen interactive single-page demos plus one pre-rendered 3D video. Open them standalone or follow them from the lessons that embed them.
Drag through 2D semantic space. Neighbors light up, clusters activate, vectors mode shows arrows from origin.
Three heads + one average. Click a token in the ribbon to make it the query; orange arcs trace where attention flows.
Type text. Watch BPE-flavored tokenization happen live. Compare bytes / chars / words / tokens.
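As a rough illustration of why the four counts differ, the sketch below compares bytes, characters, and whitespace-separated words for one string. The sample text is made up, and the token count is left out on purpose because it depends on which trained tokenizer you load.

```python
# Hypothetical sample text; the accented letters are written as escapes
# so the source file's own encoding cannot change the result.
text = "na\u00efve caf\u00e9"

num_bytes = len(text.encode("utf-8"))  # UTF-8 bytes: accented letters take 2 each
num_chars = len(text)                  # Unicode code points
num_words = len(text.split())          # whitespace-separated words

print(num_bytes, num_chars, num_words)  # 12 10 2
```

A byte-level BPE tokenizer starts from the byte view and merges frequent pairs, so its token count usually lands somewhere between the word count and the byte count.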
Drag from 0 (greedy) to 3 (chaos). Watch the next-token distribution flatten or sharpen. Sample to see what gets picked.
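The slider's effect can be sketched as a softmax over logits divided by the temperature. This is a minimal illustration, not the demo's code: the logits are invented, and treating temperature 0 as "put all probability on the top token" is an assumption about how the greedy end of the slider behaves.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return next-token probabilities for the given logits at a temperature.

    A temperature of 0 is treated as greedy decoding (assumption):
    all probability mass goes to the highest-scoring token.
    """
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens
for t in (0, 0.5, 1.0, 3.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Low temperatures sharpen the distribution toward the top token; high temperatures flatten it toward uniform.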
Step through the stations of one block in order: norm, attention, residual, norm, FFN, residual.
Drag the T-slider from pure noise to coherent image. Four target patterns; auto-animate plays a full 50-step denoise.
A 224x224 image becomes 196 patches becomes 197 tokens. Hover patches to see their position in the stream.
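The arithmetic behind those numbers can be checked in a few lines. The sketch assumes a 16x16 patch size and a single prepended class token, which is the combination that yields 196 and 197 for a 224x224 input; the function name is hypothetical.

```python
def vit_token_count(image_size, patch_size, add_cls_token=True):
    """Return (num_patches, num_tokens) for a square image cut into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size       # patches along one edge
    num_patches = per_side * per_side         # non-overlapping grid
    num_tokens = num_patches + (1 if add_cls_token else 0)
    return num_patches, num_tokens

# Assumed settings: 16x16 patches, one class token.
patches, tokens = vit_token_count(224, 16)
print(patches, tokens)  # 196 197
```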
Three tabs: decoder-only (GPT), encoder-only (BERT), encoder-decoder (T5). See what each architecture's data flow actually looks like.
Step through decoding token by token. K and V fill up. Cache vs no-cache compute bars show why this is worth GBs of VRAM.
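A back-of-the-envelope sketch of both sides of that trade: how many per-token K/V projections are computed with and without a cache, and how much memory the cache itself takes. The model shape below (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a made-up configuration for illustration, not a specific model.

```python
def kv_projection_counts(num_steps):
    """Count per-token K/V projections over a decode of num_steps tokens.

    Without a cache, step t recomputes K and V for all t tokens seen so far.
    With a cache, each step only projects the newest token.
    """
    without_cache = sum(range(1, num_steps + 1))  # 1 + 2 + ... + n
    with_cache = num_steps
    return without_cache, with_cache

def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value):
    """Memory for one sequence's cache: K and V, per layer, per head, per position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

print(kv_projection_counts(1000))  # (500500, 1000)

# Hypothetical model shape, 8192-token context, 2 bytes per value.
size = kv_cache_bytes(32, 8, 128, 8192, 2)
print(size / 2**30, "GiB")  # 1.0 GiB
```

The cache turns quadratic recomputation into linear work, and the price is memory that grows with context length and batch size.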
Drag r from 1 to 256. Trainable param count scales linearly with r, so the slider spans a 256x range. Switch between Llama-family 8B/13B/70B presets.
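For a sense of the numbers, here is a minimal sketch of LoRA's trainable parameter count for a single weight matrix. It assumes the standard low-rank pair (one r x d_in matrix and one d_out x r matrix); the 4096x4096 projection is a hypothetical size, not taken from any preset.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight at rank r."""
    return r * (d_in + d_out)  # A is r x d_in, B is d_out x r

d = 4096               # hypothetical square projection
full = d * d           # parameters in the frozen full-rank weight
for r in (1, 8, 64, 256):
    p = lora_params(d, d, r)
    print(r, p, round(full / p, 1))  # rank, LoRA params, reduction vs. full
```

The count grows linearly in r, so rank 256 trains 256 times as many parameters as rank 1, while still being a fraction of the full matrix.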
One query, four stages: embed, retrieve, augment, generate. Auto-play runs the whole pipeline.
One token embedding becomes three vectors via three learned matrices. See Q, K, V computed live for any input token.
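In code, the three projections are just three matrix-vector products applied to the same embedding. The sketch below uses a tiny 4-dimensional embedding and hand-written 2x4 matrices; all of the values are invented for illustration, whereas a real model learns them during training.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Hypothetical token embedding and projection matrices (made-up values).
embedding = [1.0, 0.5, -1.0, 2.0]
W_q = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
W_k = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
W_v = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]

q = matvec(W_q, embedding)  # what this token is looking for
k = matvec(W_k, embedding)  # what this token offers to be matched on
v = matvec(W_v, embedding)  # what this token contributes if attended to
print(q, k, v)  # [1.0, 0.5] [-1.0, 2.0] [0.75, 0.5]
```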
Token-by-token streaming, tool calls with results, throughput counter, adjustable speed. The chat UX patterns you know.
Two reward sliders, sigmoid loss curve. Watch loss drop as chosen response's reward rises above rejected.
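The curve can be sketched as a pairwise preference loss: the negative log-sigmoid of the reward margin. That this is the exact function the demo plots is an assumption; it is the usual form in preference-learning objectives, and the function name here is hypothetical.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Return -log(sigmoid(reward_chosen - reward_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite to avoid overflow in exp(-margin).
    return -margin + math.log1p(math.exp(margin))

for chosen, rejected in ((0.0, 2.0), (1.0, 1.0), (2.0, 0.0), (5.0, 0.0)):
    print(chosen, rejected, round(preference_loss(chosen, rejected), 4))
```

The loss equals log 2 when the two rewards tie and falls toward zero as the chosen reward pulls ahead of the rejected one.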
Caption tokens cross-attend to image patches. Hover any word or patch to see grounding light up the matrix.
Drag through 2D latent space. Toggle between 4 ControlNet conditioning modes on the same latent point.
Pre-rendered 3D walk through a real embedding cluster, from the Sonnet track. Plays inline with controls.