Demo · Module 04 · Interactive
An image is just more tokens.
INPUT · 224 × 224
→
PATCHED · 14 × 14 = 196 patches
TOKEN STREAM · what goes into the transformer · 197 tokens
Image
Patch size
What ViT does. The Vision Transformer (Dosovitskiy et al., 2020) takes the original Transformer architecture and changes one thing: instead of word tokens, it eats image patches. A 224×224 image is sliced into 14×14 = 196 non-overlapping 16×16 patches. Each patch (16·16·3 = 768 values: 256 pixels × 3 color channels) is flattened and projected to a d-dimensional vector via a learned linear layer. A special [CLS] token is prepended (197 tokens total), and learned positional embeddings are added so the model knows where in the image each patch came from. (The original paper uses 1D position embeddings and reports no significant gain from 2D-aware ones.) From that point on, it's the same transformer machinery as GPT or BERT (a bidirectional encoder, as in BERT): multi-head attention, residuals, FFN, the works. The [CLS] token's final-layer representation is fed to a classification head.

Hover the patches above to see them light up; the corresponding patch chip in the stream below highlights too. The slider on the right lets you try different patch sizes: smaller patches mean more tokens (and a longer attention sequence), but each token covers a smaller region of the image.
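The patch-to-token step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation: the function names `patchify` and `embed`, the embedding width `d = 64`, and the randomly initialized `w_proj`, `cls_token`, and `pos_embed` are all assumptions standing in for parameters that a real model would learn.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns shape (num_patches, patch * patch * C), with patches ordered
    left to right, top to bottom.
    """
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    gh, gw = h // patch, w // patch  # patch grid height and width
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c) -> (gh*gw, patch*patch*c)
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def embed(image, patch, w_proj, cls_token, pos_embed):
    """Turn an image into the token sequence fed to the transformer."""
    patches = patchify(image, patch)                     # (N, patch*patch*C)
    tokens = patches @ w_proj                            # linear projection -> (N, d)
    tokens = np.concatenate([cls_token[None, :], tokens], axis=0)  # prepend [CLS] -> (N+1, d)
    return tokens + pos_embed                            # add one position vector per token

rng = np.random.default_rng(0)
d = 64                                   # hypothetical embedding width (assumption)
patch = 16
image = rng.random((224, 224, 3))        # stand-in for a real RGB image
n = (224 // patch) ** 2                  # 14 * 14 = 196 patches

# Random values stand in for learned parameters (assumption).
w_proj = rng.normal(size=(patch * patch * 3, d)) * 0.02
cls_token = np.zeros(d)
pos_embed = rng.normal(size=(n + 1, d)) * 0.02

tokens = embed(image, patch, w_proj, cls_token, pos_embed)
print(patchify(image, patch).shape)  # (196, 768)
print(tokens.shape)                  # (197, 64)
```

Calling `patchify(image, 8)` instead yields 28 × 28 = 784 patches of 192 values each, so the sequence grows to 785 tokens with [CLS]. Since self-attention compares every token with every other token, that roughly fourfold increase in sequence length costs about sixteen times as much in the attention layers.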