Narration · Module 04
Vision Transformers
Module 04 · Image · 10-12 min

Vision Transformers
are tokens too.

ViT: 196 patches per photo, positional embeddings for the patch grid, CNN-vs-transformer comparison.

Reading time: 10-12 min · Audio: narration available · Prerequisites: 12 · Source: Track A · Gemini
§ 1

What this lesson covers.

This module is one of 42 in the curriculum. Below is the canonical interactive lesson — tabs, cards, and diagrams from the source repo, rendered inside the course shell. An audio narration runs alongside it - the sticky player at the top of the page plays the full Module 04 clip.

If you prefer to read first and play with the demos after, the interactive lesson sits below this section. If you'd rather hear it narrated while you scroll, hit play on the sticky audio bar at the top — or just let it autoplay.

§ 2

The lesson itself.

Interactive lesson · ported from Gemini track
Slide 1 of 10
Transformers beyond text
The exact same core architecture that reads and writes can also see and listen.
The Big Idea
One universal architecture
You have already learned how transformers process text by converting tokens into embedding matrices. The same mechanism works on images and audio. The only hurdle is converting pixels and acoustic pressure samples into sequences of vectors. Once that conversion is done, the attention layers treat every modality the same way.
Text (tokens) · Image (patches) · Audio (frames) → Embedding (vectors) → Transformer (attention × N) → Output
ViT: Vision Transformer (2020)
CLIP: Contrastive Language-Image Pre-training (2021)
Whisper: speech transcription (2022)
Slide 2 of 10
Turning a photograph into a sentence
An image is worth 16×16 words — Dosovitskiy et al., Google Brain 2020
The Core Problem
Images are 2D Grids
A transformer requires a flat sequence of tokens. The central idea of ViT: treat a photograph like a sentence by slicing it into small square patches and feeding those patches in as 'tokens'.
The Geometric Slicing
196 Patches per Photo
A standard 224×224 image cut into 16×16 patches yields (224 / 16)² = 14² = 196 patches. Each RGB patch is flattened into a vector of 16 × 16 × 3 = 768 numbers and then passed through a learned linear projection before it reaches the encoder.
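The slicing step can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function name `patchify` and the random stand-in image are assumptions made for the example.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, C) -> (gh, gw, patch, patch, C) -> (gh*gw, patch*patch*C)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

# Stand-in for a real 224x224 RGB photo (assumption for the example).
image = np.random.default_rng(0).random((224, 224, 3))
patches = patchify(image)
print(patches.shape)  # (196, 768)
```

Patches come out in raster order (left to right, top to bottom), so row 0 is the top-left 16×16 block and row 195 is the bottom-right one.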
Visualizing the Patch Pipeline
224×224 image → Flatten → 768 array → Transformer engine
An Image is Worth 16×16 Words (ViT), Dosovitskiy et al. The original Vision Transformer paper, showing that transformers can match or outperform CNNs when pre-trained at sufficient scale.
Slide 3 of 10
The [CLS] Summary Token
Extracting a global summary from a grid of 196 patches
Classification Mechanics
Extracting a Singular Conclusion
Before the encoder runs, a learnable `[CLS]` token is prepended to the sequence of 196 patch embeddings, giving 197 tokens. The transformer blocks then output one vector per token. Because self-attention lets `[CLS]` attend to all 196 patches in every layer, its final vector acts as a summary of the whole image, and a classification head applied to that vector predicts the label, for example "Dog".
Attention Funneling
[CLS] · P₁ · P₂ · P₃ · … · P₁₉₆
[CLS] attends to all patches in every layer.
Positional Embeddings for a 2D Grid
Self-attention has no built-in notion of where a patch sits in the image, so a position embedding is added to every patch embedding before the first layer. The original ViT uses learned 1D position embeddings indexed in raster order; the paper reports that 2D-aware variants gave no significant improvement.
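A minimal sketch of how the encoder input could be assembled, assuming learned position embeddings with one vector per sequence index (the choice made in the original ViT paper). All weights here are random stand-ins for parameters that would be learned, and the names `W_proj`, `cls_token`, and `pos_embed` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, patch_dim, d_model = 196, 768, 768

patches = rng.normal(size=(num_patches, patch_dim))              # stand-in for flattened patches
W_proj = rng.normal(scale=0.02, size=(patch_dim, d_model))       # learned in a real model
cls_token = rng.normal(scale=0.02, size=(1, d_model))            # learned in a real model
pos_embed = rng.normal(scale=0.02, size=(num_patches + 1, d_model))  # learned in a real model

tokens = patches @ W_proj                      # (196, 768) patch embeddings
tokens = np.concatenate([cls_token, tokens])   # (197, 768), [CLS] sits at index 0
encoder_input = tokens + pos_embed             # position information is added, not concatenated
print(encoder_input.shape)  # (197, 768)
```

After the final transformer block, the vector at index 0 is the one a classification head would read.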
Slide 4 of 10
ViT vs CNNs
Two architectural approaches to the same picture
CNN — Local First
Sliding Geometry Window
A CNN restricts its field of view, sliding a small window (for example 3×3) across the grid. It has to build up from local edges and textures before it can combine information globally. This built-in locality makes CNNs very data-efficient, but relating objects on opposite sides of a picture takes many layers.
ViT — Global Dominance
Simultaneous Field Calculation
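ViT drops these locality constraints. The very first attention layer can relate a patch in the top-left corner to a patch in the bottom-right corner in a single step. The trade-off is that, with less structure built in, ViT typically needs more training data to reach the same accuracy.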
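To put a number on the "local first" constraint, here is a small worked calculation. It assumes plain stride-1 convolutions with no pooling or dilation, which is a simplification: real CNNs use strides and pooling to grow their receptive field much faster. The helper name `conv_layers_to_span` is made up for the example.

```python
import math

def conv_layers_to_span(extent, kernel=3):
    """Stacked stride-1 convolutions needed before one output unit can see
    `extent` pixels in a row. The receptive field after n layers is
    1 + n * (kernel - 1)."""
    return math.ceil((extent - 1) / (kernel - 1))

# Layers of 3x3 convolution needed to connect opposite edges of a 224-pixel-wide image.
print(conv_layers_to_span(224))  # 112
# Self-attention, by contrast, connects any two patches in 1 layer.
```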
Modern ViT Family Tree
DeiT (2021): data-efficient ViTs trained on ImageNet alone, using distillation from a CNN teacher, so they do not depend on the very large private datasets Google used for the original ViT.

Swin Transformer (2021): Microsoft's hierarchical shifted-window variant, which set state-of-the-art object detection results when it was released.

Masked Autoencoders (MAE, 2022): masks 75% of image patches and trains the model to reconstruct the missing pixels. This is a self-supervised pre-training task; the encoder is then fine-tuned for downstream tasks.

ViT-22B (2023): Google's 22-billion-parameter Vision Transformer.
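The MAE masking step from the list above can be sketched as follows. This is a simplified illustration assuming uniform random masking over a single image; the function name `random_masking` and its return values are choices made for the example, not the paper's code.

```python
import numpy as np

def random_masking(tokens, mask_ratio=0.75, rng=None):
    """Keep a random subset of patch tokens; the rest are hidden from the encoder."""
    rng = rng or np.random.default_rng()
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])      # indices of visible patches, in original order
    mask = np.ones(n, dtype=bool)          # True = masked (to be reconstructed)
    mask[keep_idx] = False
    return tokens[keep_idx], keep_idx, mask

tokens = np.random.default_rng(0).normal(size=(196, 768))  # stand-in patch embeddings
visible, keep_idx, mask = random_masking(tokens, rng=np.random.default_rng(1))
print(visible.shape, int(mask.sum()))  # (49, 768) 147
```

Only the 49 visible patches go through the encoder, which is a large part of why this style of pre-training is cheap; a lightweight decoder then predicts the 147 hidden ones.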
196: patches for a 224 px image
22B: parameters in ViT-22B
75%: MAE masking ratio
Slide 5 of 10
CLIP: Vision meets Language
Contrastive pre-training aligns images with the text that describes them.
Breakthrough Algorithm
Contrastive Mapping Alignment
CLIP trains two separate encoders: a text transformer and an image encoder. Given a batch of (image, caption) pairs, such as a photo of a dog with the caption "A fluffy dog", the contrastive objective pulls the embeddings of matching pairs together and pushes apart the embeddings of pairs that do not match.
Image → ViT encoder · Text → GPT-style encoder · both meet in a dot-product similarity matrix
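The contrastive objective can be written as a symmetric cross-entropy over the image-text similarity matrix. The sketch below is an illustration under stated assumptions: embeddings are random stand-ins for encoder outputs, the temperature is fixed at 0.07 (in CLIP it is a learned parameter), and the function names are made up for the example.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is assumed to be a matching pair."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature        # (batch, batch) cosine similarities, scaled
    diag = np.arange(logits.shape[0])         # correct pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # each image picks its caption
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # each caption picks its image
    return (loss_i2t + loss_t2i) / 2

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 32))
aligned = clip_loss(image_emb, image_emb + 0.01 * rng.normal(size=(8, 32)))
shuffled = clip_loss(image_emb, np.roll(image_emb, 1, axis=0))
print(aligned < shuffled)  # True
```

Matching pairs give a low loss; shuffling the captions so that every pair is wrong makes the loss much larger, which is the signal that drives the two encoders into a shared space.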
Generative Engine Core
Because CLIP places text and images in a shared embedding space, text-to-image generators such as DALL-E 2 and Stable Diffusion use CLIP encoders to condition the diffusion process on the prompt.
Zero-Shot Classification
Show CLIP a photo it has never seen and compare it against text prompts for 1,000 class names. It assigns the photo to the class whose text embedding is most similar. On ImageNet, zero-shot CLIP matches the accuracy of the original supervised ResNet-50 without using any of ImageNet's labelled training examples.
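Zero-shot classification then reduces to a nearest-neighbour lookup in the shared embedding space. In this sketch the embeddings are random stand-ins; in practice they would come from CLIP's image and text encoders, and `zero_shot_classify` is a hypothetical helper, not part of any library.

```python
import numpy as np

def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Pick the class whose text embedding has the highest cosine similarity to the image."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = txt @ img                                   # one cosine similarity per class
    return class_names[int(np.argmax(sims))], sims

rng = np.random.default_rng(0)
class_names = ["cat", "car", "dog"]
class_text_embs = rng.normal(size=(3, 512))            # stand-ins for encoded class prompts
image_emb = class_text_embs[2] + 0.1 * rng.normal(size=512)  # a "photo" near the dog prompt

label, sims = zero_shot_classify(image_emb, class_text_embs, class_names)
print(label)  # dog
```

No classifier weights are trained here: adding a new class only requires encoding one more text prompt.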
Slide 6 of 10
Understanding Speech Transcription
Turning pressure waves into transformer-ready arrays
Audio Matrix Theory
16,000 Numbers per Second
Speech models typically work with audio sampled at 16 kHz, so a 30-second excerpt contains 480,000 samples. Feeding nearly half a million values directly into standard $O(N^2)$ transformer attention would require an attention matrix with roughly $2.3 \times 10^{11}$ entries, far more than fits in GPU memory.
1. Raw waveform: 480k samples
2. Log-Mel spectrogram: 80 bins × 3k frames (tractable)
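A quick back-of-the-envelope comparison of the two sequence lengths. The figures assume one attention matrix stored in 32-bit floats for a single head in a single layer; the helper name is made up for the example.

```python
def attention_matrix_gib(seq_len, bytes_per_value=4):
    """Memory for one seq_len x seq_len attention matrix, in GiB."""
    return seq_len ** 2 * bytes_per_value / 2 ** 30

raw_samples = 16_000 * 30        # 16 kHz for 30 seconds
print(raw_samples)               # 480000

raw_gib = attention_matrix_gib(raw_samples)   # raw waveform as the sequence
frame_gib = attention_matrix_gib(3_000)       # spectrogram frames as the sequence
print(f"{raw_gib:.0f} GiB vs {frame_gib:.3f} GiB")
```

Shortening the sequence by a factor of 160 shrinks the attention matrix by a factor of 160² = 25,600, which is what turns hundreds of gibibytes into a few tens of mebibytes.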
Spectrogram Compression
A spectrogram plots frequency magnitude over time. Applying a Mel filter bank (a frequency scale that approximates how human hearing perceives pitch) and then taking the logarithm compresses the raw waveform into a compact 2D 'image' that can be processed much like a picture.
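A minimal NumPy sketch of a log-Mel spectrogram follows. The window of 400 samples (25 ms), hop of 160 samples (10 ms), and 80 Mel bins are assumed values chosen to match the "80 bins, about 100 frames per second" picture above; the Mel formula is the common HTK-style one, and production libraries differ in padding and normalization details.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced evenly on the Mel scale between 0 Hz and sr/2."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for i in range(n_mels):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=80):
    n_frames = 1 + (len(wave) - n_fft) // hop           # no padding, for simplicity
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(n_fft)               # windowed, overlapping frames
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2     # (n_frames, n_fft // 2 + 1)
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T    # (n_frames, n_mels)
    return np.log10(np.maximum(mel, 1e-10)).T            # (n_mels, n_frames)

t = np.arange(16000) / 16000                             # one second of audio
wave = np.sin(2 * np.pi * 440 * t)                       # a 440 Hz test tone
spec = log_mel_spectrogram(wave)
print(spec.shape)  # (80, 98)
```

One second of audio (16,000 samples) becomes 98 columns of 80 values each, so a 30-second clip lands near the 3,000 frames quoted above.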
Slide 7 of 10
Wav2Vec 2.0: Self-Supervised Speech
Training a speech model on unlabelled audio
The BERT Parallel
Predicting Masked Audio Spans
Just as BERT masks words, wav2vec 2.0 masks spans of the latent speech features produced by its CNN feature encoder. A transformer then uses the surrounding audio as context to identify the correct quantized representation for each masked span from among a set of distractors (a contrastive task, not a direct reconstruction of the waveform). Pre-training uses unlabelled speech: 960 hours of LibriSpeech, or about 53,000 hours of LibriVox audio in the largest setup.
Raw wave → CNN feature encoder → Latent features (random span masking) → Transformer encoder → Contrastive prediction of masked spans
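Span masking over a sequence of latent frames can be sketched as below. This is a simplification: each frame independently becomes a span start with a fixed probability, whereas the paper samples a fixed proportion of start indices. The default values for `start_prob` and `span` are illustrative assumptions.

```python
import numpy as np

def span_mask(n_frames, start_prob=0.065, span=10, rng=None):
    """Boolean mask over latent frames; True marks frames hidden from the transformer."""
    rng = rng or np.random.default_rng()
    starts = np.flatnonzero(rng.random(n_frames) < start_prob)
    mask = np.zeros(n_frames, dtype=bool)
    for s in starts:
        mask[s:s + span] = True     # spans may overlap and are clipped at the end
    return mask

mask = span_mask(1000, rng=np.random.default_rng(0))
print(mask.shape, round(float(mask.mean()), 2))
```

Masking contiguous spans rather than isolated frames matters for audio: neighbouring frames are so similar that a single hidden frame could be filled in by simple interpolation.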
Extreme Efficiency
After pre-training on about 53,000 hours of unlabelled audio, the model can be fine-tuned with as little as 10 minutes of transcribed speech and still reach usable accuracy (the paper reports 4.8/8.2 WER on the LibriSpeech clean/other test sets).
Microsoft WavLM Integration
Later models such as Microsoft's WavLM mix noise and overlapping speakers into the input during masked pre-training, so that denoising is learned as part of the pre-training objective itself.
Slide 8 of 10
The Whisper Revolution
Trading self-supervision for large-scale weakly supervised training
Brute Force Domination
680,000 Labelled Hours
OpenAI took a different route from wav2vec 2.0. Whisper is trained on 680,000 hours of (audio, transcript) pairs collected from the internet, using a standard encoder-decoder transformer with cross-attention. A single model handles language identification, transcription, and translation into English, with the task selected by special tokens given to the decoder.
Encoder-Decoder Handoff
Log-Mel spectrogram → Encoder (×N layers) → Cross-attention → Decoder (autoregressive) → "The AI transcribes…"
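The handoff between encoder and decoder is cross-attention: queries come from the text being written, keys and values from the encoded audio. Below is a single-head sketch with random stand-in weights; the sizes (1,500 audio positions, 5 text tokens, width 64) are illustrative assumptions, not Whisper's actual configuration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    """Each text position looks up the audio positions most relevant to it."""
    q = decoder_states @ Wq                              # queries from the decoder (text side)
    k = encoder_states @ Wk                              # keys from the encoder (audio side)
    v = encoder_states @ Wv                              # values from the encoder (audio side)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (n_text, n_audio)
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 64
encoder_states = rng.normal(size=(1500, d))   # stand-in for encoded audio frames
decoder_states = rng.normal(size=(5, d))      # stand-in for 5 text tokens generated so far
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_states, Wq, Wk, Wv)
print(context.shape, weights.shape)  # (5, 64) (5, 1500)
```

Each row of `weights` sums to 1 and describes where in the audio one text token is "listening".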
Architectural Hallucinations
Whisper's main weakness comes from its decoder, which is an autoregressive language model conditioned on audio. When the input contains long silence or heavy noise, the decoder may still produce fluent, plausible text that was never spoken. In other words, it hallucinates prose over empty audio.
Slide 9 of 10
Modern Multimodal LLMs
Reading text, vision, and audio within a single pipeline
The Great Unification Phase
Everything Becomes Tokens
Multimodal models such as Gemini 1.5 and GPT-4o (and, for text and images, Claude 3.5 Sonnet) move away from separate models per modality. Once every image is a sequence of patch embeddings and every audio clip is a sequence of frame embeddings, all of them can be projected to a common width, placed side by side with the text tokens, and fed to the same transformer.
Text tokens · Image tokens · Audio matrix → GPT-4 Omni engine → Live speech · Syntax data
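The "one sequence for everything" idea can be sketched generically as follows. This is not a description of how any named product works internally: the shared width of 512, the projection matrices, and the modality IDs are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512                                     # shared width (assumed)

text_tokens = rng.normal(size=(12, d_model))      # 12 text token embeddings
image_patches = rng.normal(size=(196, 768))       # 196 flattened image patches
audio_frames = rng.normal(size=(150, 80))         # 150 log-Mel frames

# Per-modality projections into the shared width (learned in a real model).
W_image = rng.normal(scale=0.02, size=(768, d_model))
W_audio = rng.normal(scale=0.02, size=(80, d_model))

sequence = np.concatenate([
    text_tokens,
    image_patches @ W_image,
    audio_frames @ W_audio,
])
# Optional bookkeeping: which modality each position came from (0=text, 1=image, 2=audio).
modality_ids = np.concatenate([np.full(12, 0), np.full(196, 1), np.full(150, 2)])

print(sequence.shape)  # (358, 512)
```

From the transformer's point of view this is just a sequence of 358 vectors; self-attention can relate a word to a patch or a patch to an audio frame without any special machinery.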
Native Reaction Latency
By replacing the cascaded pipeline (speech-to-text model → text model → text-to-speech model) with a single model trained end to end on audio, GPT-4o can respond to spoken input in as little as 232 ms (about 320 ms on average, according to OpenAI). That is close to human conversational response time and makes natural interruptions possible.
Slide 10 of 10
Summary: The Cross-Modality Ecosystem
Turning physical signals into token sequences
| Modality | Preprocessing step | Flagship models | Output type |
|---|---|---|---|
| Text | BPE tokenization → dictionary lookup | GPT, LLaMA, Claude | Autoregressive generation |
| Images | Square grid splitting → patch arrays | ViT, CLIP, Swin | Classification / embedding matching |
| Audio | Log-Mel spectrogram → frame arrays | Whisper, Wav2Vec 2.0 | Transcription |
Research Directives
> ViT: Dosovitskiy et al., Google 2020 (arXiv:2010.11929)
> MAE: He et al., Meta 2022 (arXiv:2111.06377)
> CLIP: Radford et al., OpenAI 2021 (arXiv:2103.00020)
> Whisper: Radford et al., OpenAI 2022 (arXiv:2212.04356)
§ DEMO

Try it: ViT patcher.

Watch a 224×224 image get sliced into 196 patches and become a token stream. Hover any patch to see its position in the sequence.

