Cross-Attention
Module 23 · Image · 10 min

Text and vision,
connected.

How text and vision streams talk to each other: CLIP, BLIP, and any-to-any native models.

Reading time: 10 min
Audio: narration available
Prerequisites: 12, 04
Source: Track A · Gemini
§ 1

What this lesson covers.

This module is one of 42 in the curriculum. Below is the canonical interactive lesson (tabs, cards, and diagrams from the source repo), rendered inside the course shell. An audio narration runs alongside it: the sticky player at the top of the page plays the full Module 23 clip.

If you prefer to read first and play with the demos after, the interactive lesson sits below this section. If you'd rather hear it narrated while you scroll, hit play on the sticky audio bar at the top — or just let it autoplay.

§ 2

The lesson itself.


Layer 14 — Cross-Attention Fusion

Teaching the LLM to 'See' with Image Patches

Interactive Module L14
The Multimodal Bridge
How Text and Vision Connect
Language models operate on 1D sequences of tokens. To process an image, a Vision Transformer (ViT) first slices it into a grid of fixed-size "patches". Each patch is flattened and linearly projected into a dense vector, much like a word embedding, so the image becomes a sequence of patch tokens.
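The patch step can be sketched in a few lines of plain Python. This is a minimal illustration, not a real ViT: the function names `patchify` and `embed`, the toy 4x4 grayscale image, and the random `projection` matrix are all assumptions made for the example, and real models also add position embeddings and learn the projection during training.

```python
import random

def patchify(image, patch):
    """Split an H x W grayscale image (a list of rows) into flattened
    patches, scanning left to right, then top to bottom."""
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            flat = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(flat)
    return patches

def embed(patches, projection):
    """Linearly project each flattened patch to a d_model-sized vector.
    `projection` has shape (d_model, patch_dim): one row per output feature."""
    return [[sum(p * w for p, w in zip(flat, row)) for row in projection]
            for flat in patches]

random.seed(0)
# Hypothetical toy image: 4x4 pixels with values in [0, 1].
image = [[(r * 4 + c) / 15 for c in range(4)] for r in range(4)]
patches = patchify(image, patch=2)      # 4 patches of 4 pixels each

d_model = 3                             # assumed embedding width
projection = [[random.gauss(0, 0.1) for _ in range(4)] for _ in range(d_model)]
patch_tokens = embed(patches, projection)

print(len(patch_tokens), len(patch_tokens[0]))  # number of tokens, token width
```

The result is a short sequence of vectors, one per patch, which is the form a transformer can attend over.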

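Instead of mixing these patch vectors directly into the text stream, some architectures use Cross-Attention: Flamingo does this for images, and Whisper uses the same mechanism for audio. As the LLM predicts the next word, its hidden state is projected into a Query that is compared against Keys computed from the image patches. The resulting attention weights select a weighted mix of the matching visual Values for that decoding step.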
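A single cross-attention head can be sketched as follows. This is a simplified, single-head version in plain Python under stated assumptions: the names `cross_attention`, `w_q`, `w_k`, and `w_v` are illustrative, the projections are identity matrices rather than learned weights, and the toy vectors are made up. It is not taken from Flamingo or Whisper.

```python
import math

def softmax(xs):
    m = max(xs)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def cross_attention(text_states, patch_tokens, w_q, w_k, w_v):
    q = matmul(text_states, w_q)    # queries come from the text stream
    k = matmul(patch_tokens, w_k)   # keys come from the image patches
    v = matmul(patch_tokens, w_v)   # values come from the image patches
    scale = math.sqrt(len(q[0]))
    weights = [softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
               for qi in q]
    return matmul(weights, v), weights

identity = [[1.0, 0.0], [0.0, 1.0]]                   # stand-in for learned projections
text_states = [[4.0, 0.0]]                            # one decoder position
patch_tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # three image patches

output, weights = cross_attention(text_states, patch_tokens,
                                  identity, identity, identity)
print(weights[0])   # attention over the three patches; the first gets the most weight
print(output[0])    # weighted mix of the patch values
```

Note that the output has one row per text position, not per patch: the text stream stays the same length and only borrows information from the image.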
Interactive: The Cross-Attention Engine
Click "Feed Image & Ask Question". Watch as the image is tokenised into patches, and the LLM explicitly "looks" at specific visual regions to gather evidence for answering the prompt: "What animal is this?"
[Interactive demo. Vision Encoder (ViT): a loaded 120x120px image shown as four patches. Text Decoder (LLM): prompt tokens "What", "animal", "is", "this?", followed by a "[Thinking...]" state.]
Early Fusion
Any-to-Any Native Models
The visualization above shows cross-attention (the mechanism Flamingo uses for images and Whisper uses for audio). Models such as GPT-4o and Gemini 1.5 are instead described as natively multimodal, an approach often called early fusion: audio, vision, and text tokens are mixed into a single stream from Layer 0, and the transformer applies self-attention across all modalities at once. This is what makes low-latency, real-time video and audio conversations practical.
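The contrast with cross-attention can be sketched by concatenating the modalities into one sequence and running ordinary self-attention over it. This is a toy illustration only: the internals of GPT-4o and Gemini are not public in detail, the helper `self_attention_weights` is a hypothetical name, and the sketch omits learned projections, position information, and causal masking.

```python
import math

def softmax(xs):
    m = max(xs)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention_weights(tokens):
    """Scaled dot-product attention weights over ONE sequence.
    Every token acts as both query and key (no projections, no mask)."""
    scale = math.sqrt(len(tokens[0]))
    return [softmax([sum(a * b for a, b in zip(q, k)) / scale for k in tokens])
            for q in tokens]

# Made-up embeddings that already share one width (d_model = 2).
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[0.5, 0.5], [1.0, 1.0], [0.0, 2.0]]

# Early fusion: one stream, modalities side by side.
fused = image_tokens + text_tokens
weights = self_attention_weights(fused)

print(len(weights), len(weights[0]))  # square matrix over all fused tokens
```

Every row of `weights` spans both image and text positions, so image tokens can attend to text and text tokens can attend to image within the same layer, rather than through a separate cross-attention block.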
The CLIP Vector
Semantic Alignment
How does the LLM know an image patch of a "dog" relates to the text token "dog"? During pretraining, systems like CLIP process hundreds of millions to billions of image-caption pairs using a contrastive loss. The loss pulls each image embedding toward the embedding of its own caption and pushes it away from the other captions in the batch, so the Vision Encoder and the Text Encoder learn to map matching concepts to nearby points in a shared latent space.
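A symmetric contrastive loss of this kind can be sketched in plain Python. This is a minimal sketch, not CLIP's actual training code: the function names are illustrative, the two-dimensional embeddings are made up, and the fixed `temperature=0.07` is an assumption for the example (CLIP treats its temperature as a learned parameter).

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Cosine similarity between every image and every caption in the batch.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])   # captions match
shuffled = contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])  # captions swapped

print(aligned < shuffled)
```

When each image sits next to its own caption the loss is close to zero; when the captions are swapped it is large, which is the training signal that pushes the two encoders toward a shared space.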
§ DEMO

Try it: cross-attention clip.

Caption tokens (rows) cross-attend to image patches (cols). Hover any word or patch to see grounding light up across the matrix.

§ PAPERS

Further reading.

The canonical references for this module. External links open in a new tab.

§ NEXT

What to read next.

Use the pager below to move sequentially through the curriculum, or jump to any module from the course index. Each track has a "Prereq: ↑ foundation" callout so you can backfill anything that wasn't clear.