Demo · Module 23 · Interactive

Where text meets image: cross-attention.

Caption tokens query · image patches answer
Hover a word to see which patches it attends to
Hover a patch to see which words light up

Caption

IMAGE · 8 × 8 patches · click a preset to swap

CROSS-ATTENTION · caption (rows) × patches (cols) · hover any cell

Hover the matrix to see which caption word attends to which image patch. Brighter cell = stronger attention weight.

What cross-attention does in vision-language models. Self-attention has all tokens in one stream attending to each other. Cross-attention has one stream attend to another: here, caption tokens (as queries) look at image patches (as keys and values). Each caption word produces a Q vector; each image patch produces a K and a V vector. The scaled dot product Q·Kᵀ, normalized with a softmax over the patches, tells you which patches that word "grounds" to: for a word like "cat" you'd see most of the attention weight concentrate on the patches where the cat actually is. Those weights then mix the patch V vectors into the word's output. This mechanism is the connective tissue behind BLIP-2 (whose Q-Former sits between vision and language, with learned query tokens cross-attending to image features), Flamingo (which interleaves cross-attention layers into a frozen LLM), and image-conditioned generators like Stable Diffusion (where the U-Net's latent features, as queries, cross-attend to the text embeddings at every denoising step). CLIP itself, the open-source dual encoder widely used for retrieval, is the exception: it encodes image and text separately and compares their pooled embeddings, with no cross-attention between the two streams.
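
To make the Q, K, V roles concrete, here is a minimal single-head sketch in NumPy. It is an illustration under stated assumptions, not the code behind this demo: the function name `cross_attention`, the embedding sizes, and the random projection matrices are all hypothetical, and with untrained random weights the resulting heatmap carries no meaning. It only shows the shapes and the order of operations for a 5-word caption over an 8 × 8 grid of patches.

```python
import numpy as np

def cross_attention(text_emb, patch_emb, W_q, W_k, W_v):
    """Single-head cross-attention: caption tokens query image patches."""
    Q = text_emb @ W_q    # (n_words, d_head), one query per caption word
    K = patch_emb @ W_k   # (n_patches, d_head), one key per image patch
    V = patch_emb @ W_v   # (n_patches, d_head), one value per image patch

    # Scaled dot product: one row per word, one column per patch.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])

    # Softmax over patches (max subtracted for numerical stability).
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Each word's output is an attention-weighted mix of patch values.
    return weights @ V, weights

# Hypothetical sizes: 5 caption words, 8 x 8 = 64 patches.
rng = np.random.default_rng(0)
n_words, n_patches, d_model, d_head = 5, 64, 32, 16

# Stand-ins for real encoder outputs and learned projections (assumptions).
text_emb = rng.normal(size=(n_words, d_model))
patch_emb = rng.normal(size=(n_patches, d_model))
W_q = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
W_k = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)
W_v = rng.normal(size=(d_model, d_head)) / np.sqrt(d_model)

out, weights = cross_attention(text_emb, patch_emb, W_q, W_k, W_v)

# Row i of `weights` is what hovering word i would highlight;
# reshaped to the patch grid it becomes an 8 x 8 heatmap.
heatmap = weights[2].reshape(8, 8)
```

The `weights` array corresponds to the caption (rows) × patches (cols) matrix: each row sums to 1, so a brighter cell means a larger share of that word's attention.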