What this lesson covers.
This module is one of 42 in the curriculum. Below is the canonical interactive lesson — tabs, cards, and diagrams from the source repo, rendered inside the course shell. An audio narration runs alongside it - the sticky player at the top of the page plays the full Module 04 clip.
If you prefer to read first and play with the demos after, the interactive lesson sits below this section. If you'd rather hear it narrated while you scroll, hit play on the sticky audio bar at the top — or just let it autoplay.
The lesson itself.
tokens
patches
frames
vectors
attention × N
Engine
Swin Transformer (2021) — Microsoft's sliding-window hierarchal variant, crushing Object Detection ceilings.
Masked Autoencoders (MAE) — 2022's revelation. Physically blacking out 75% of image patches and forcing the algorithm to hallucinate the voids for zero-shot mastery.
ViT-22B (2023) — Google's sheer 22 Billion parameter vision supercomputer.
Image
Text
Features
(Random Dropout)
Map
Spectrogram
Loop ×N
Auto-regressor
Transcibes…"
| Sensory Modality | Hardware Filter Step | Flagship Models | Engine Output Type |
|---|---|---|---|
| Pure Text | BPE Tokenization → Dict Lookup | GPT, LLaMA, Claude | Auto-Regressive Generative |
| Photographic | Square Grid Splitting → Arrays | ViT, CLIP, Swin | Deep Dimensional Matching |
| Acoustic Audio | Log-Mel Convolution → Frame Arrays | Whisper, Wav2Vec 2.0 | Explicit Transcription |
> MAE — He et al., Meta 2022 (arXiv:2111.06377)
> CLIP — Radford et al., OpenAI 2021 (arXiv:2103.00020)
> Whisper — Radford et al., OpenAI 2022 (arXiv:2212.04356)
Try it: vit patcher.
Watch a 224x224 image get sliced into 196 patches and become a token stream. Hover any patch to see its position in the sequence.
Further reading.
The canonical references for this module. External links open in a new tab.
- An Image is Worth 16x16 Words (ViT) — Dosovitskiy et al. 2020
What to read next.
Use the pager below to move sequentially through the curriculum, or jump to any module from the course index. Each track has a "Prereq: ↑ foundation" callout so you can backfill anything that wasn't clear.