Demo · Module 04 · Interactive

An image is just
more tokens.

Slice 224×224 image into 14×14 = 196 patches
Each patch becomes a token in the residual stream
[CLS] token prepended for classification
INPUT · 224 × 224
PATCHED · 14 × 14 = 196 patches
TOKEN STREAM · what goes into the transformer · 197 tokens
Image
Patch size
What ViT does. The Vision Transformer (Dosovitskiy et al., 2020) takes the Transformer encoder and changes one thing: instead of word tokens, it eats image patches. A 224×224 image is sliced into a 14×14 grid of 196 non-overlapping 16×16 patches. Each patch (16·16·3 = 768 values: 256 pixels × 3 color channels) is flattened and projected to a d-dimensional vector by a learned linear layer. A special [CLS] token is prepended (197 tokens total), and learned positional embeddings are added so the model knows which patch came from which part of the image. (The original paper uses 1D embeddings indexed by patch order and reports no significant gain from 2D-aware ones.) From that point on, it's the same stack of blocks as BERT or GPT: multi-head attention, residuals, FFN, the works, with attention left unmasked as in BERT, so every patch can attend to every other. The [CLS] token's final-layer representation is fed to a classification head.

Hover the patches above to see them light up; the corresponding patch chip in the token stream highlights too. The slider on the right lets you try different patch sizes: smaller patches mean more tokens (and a longer attention sequence, whose cost grows quadratically with the token count), but each token captures less local context.
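The pipeline described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function names (patchify, embed, num_tokens) are hypothetical, d_model = 64 is an arbitrary choice for the embedding width d, and the "learned" parameters are random stand-ins because nothing is trained here.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C), with patches
    ordered row by row (left to right, then top to bottom).
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must divide evenly into patches"
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # -> (grid_row, grid_col, patch_row, patch_col, channel)
    return x.reshape(gh * gw, patch * patch * c)

def embed(image, patch, d_model, rng):
    """Build the token sequence: [CLS] + projected patches, plus position embeddings."""
    patches = patchify(image, patch)  # (N, patch * patch * C)
    n, patch_dim = patches.shape
    # Random stand-ins for parameters that would be learned during training.
    w_proj = rng.normal(scale=0.02, size=(patch_dim, d_model))
    cls_token = rng.normal(scale=0.02, size=(1, d_model))
    pos_embed = rng.normal(scale=0.02, size=(n + 1, d_model))
    tokens = np.concatenate([cls_token, patches @ w_proj], axis=0)  # (N + 1, d_model)
    return tokens + pos_embed

def num_tokens(image_size, patch):
    """Sequence length for a square image: one token per patch, plus [CLS]."""
    return (image_size // patch) ** 2 + 1

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))  # placeholder image with values in [0, 1)
tokens = embed(image, patch=16, d_model=64, rng=rng)
print(tokens.shape)  # (197, 64)

# The patch-size trade-off: halving the patch side roughly quadruples the tokens.
for p in (8, 16, 32):
    print(p, num_tokens(224, p))
```

With 16×16 patches the sequence has 197 tokens; at 8×8 it grows to 785, and at 32×32 it shrinks to 50. Since self-attention compares every token with every other, the size of the attention matrix scales with the square of those counts.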