Demo · Module 12 · Interactive

Three heads. One sentence. Watch them disagree.

Causal mask · softmax(QKᵀ/√dₖ)
Click any token to make it the query
Bottom row averages all heads

Sentence

Query flow · click a token to spotlight it · no query selected

HEAD 0

Looks back.

Syntactic head. Heavy weight on the immediately preceding token. A pattern that attention heads in the early layers of trained models often learn.
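A minimal sketch of how a "looks back" row could be produced. The demo's real scoring code is not shown on this page, so the helper names and the `bias` value of 4.0 are assumptions chosen for illustration only.

```python
import math

def causal_softmax_row(scores, query_idx):
    """Softmax over keys 0..query_idx; keys after the query get weight 0."""
    visible = scores[: query_idx + 1]
    m = max(visible)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in visible]
    total = sum(exps)
    return [e / total for e in exps] + [0.0] * (len(scores) - query_idx - 1)

def previous_token_row(n_tokens, query_idx, bias=4.0):
    """Hypothetical head 0: a flat score of 0 plus a bonus on key query_idx - 1."""
    scores = [0.0] * n_tokens
    if query_idx > 0:
        scores[query_idx - 1] = bias
    return causal_softmax_row(scores, query_idx)

# Query at position 3 in a 5-token sentence: most weight lands on position 2.
row = previous_token_row(n_tokens=5, query_idx=3)
print([round(w, 3) for w in row])
```

The first token has nothing to look back at, so under this sketch it puts all of its weight on itself.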

peak 0.0

entropy 0.0

HEAD 1

Looks across.

Semantic head. Pulls weight onto tokens sharing letters — a stand-in for real semantic similarity. Shows what "soft matching" looks like.
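One plausible way to score "tokens sharing letters" is Jaccard overlap between the two tokens' letter sets. This is an assumption for illustration: the page does not say which overlap measure the demo uses, and `letter_overlap` is a hypothetical name.

```python
def letter_overlap(a, b):
    """Jaccard similarity of the letter sets of two tokens (case-insensitive)."""
    sa, sb = set(a.lower()), set(b.lower())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Raw (pre-softmax) scores for the query "cart" against earlier tokens.
tokens = ["the", "cat", "sat", "cart"]
query = tokens[3]
scores = [letter_overlap(query, key) for key in tokens]
print([round(s, 2) for s in scores])
```

Because the scores are graded rather than all-or-nothing, a softmax over them spreads weight across partially matching tokens, which is the "soft matching" the card describes.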

peak 0.0

entropy 0.0

HEAD 2

Looks home.

Positional head. Always pays attention to the start of the sentence. Real attention heads often learn an anchor like this for "topic state".
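A small sketch of what a sentence-start bias does under a causal softmax. It assumes a hypothetical score of `bias` on key 0 and 0 on every other visible key, which gives a closed form for the anchor's weight; the demo's actual numbers are not given here.

```python
import math

def anchor_weight(query_idx, bias=3.0):
    """Weight on key 0 when key 0 scores `bias` and the other
    query_idx visible keys each score 0 (softmax over query_idx + 1 keys)."""
    return math.exp(bias) / (math.exp(bias) + query_idx)

weights = [anchor_weight(i) for i in range(6)]
print([round(w, 3) for w in weights])
```

Under these assumptions the anchor always wins, but its share shrinks as the query moves later in the sentence, because more zero-scored keys compete for the remaining weight.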

peak 0.0

entropy 0.0

AVERAGE OF ALL 3 HEADS

What the block actually feeds forward.

In a real transformer, all heads' outputs get concatenated and projected. This averaged matrix is a stylized version of "what attention contributed at this layer."
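The averaging step can be sketched as an element-wise mean of the per-head weight matrices. The three small matrices below are made-up values for illustration, not output from the demo, and `average_heads` is a hypothetical helper.

```python
def average_heads(heads):
    """Element-wise mean of equally sized attention-weight matrices."""
    n_heads = len(heads)
    n = len(heads[0])
    return [
        [sum(h[i][j] for h in heads) / n_heads for j in range(n)]
        for i in range(n)
    ]

# Illustrative causal (lower-triangular) weights for a 3-token sentence.
head0 = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.05, 0.9, 0.05]]
head1 = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.2, 0.3, 0.5]]
head2 = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0], [0.7, 0.1, 0.2]]

avg = average_heads([head0, head1, head2])
for row in avg:
    print([round(w, 3) for w in row])
```

Each averaged row still sums to 1 and the masked cells stay at 0, so the result can be read like any single head's matrix. A real block concatenates the heads' output vectors and applies a learned projection instead of averaging the weights.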

Hover any cell — rows are queries (the token doing the asking), columns are keys (the available tokens). Upper triangle is masked (model can't see the future).
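For reference, a plain-Python sketch of causally masked softmax(Q · Kᵀ / √dₖ). Rows index queries and columns index keys, matching the layout described above. The toy vectors are invented, and a real model would get Q and K from learned projections of token embeddings.

```python
import math

def causal_attention(Q, K):
    """Return the weight matrix: row i is query i's distribution over keys 0..i."""
    d_k = len(K[0])
    weights = []
    for i, q in enumerate(Q):
        # Scaled dot products against visible keys only (the causal mask).
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
            for k in K[: i + 1]
        ]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        # Future keys (upper triangle) get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(K) - i - 1))
    return weights

Q = K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = causal_attention(Q, K)
for row in W:
    print([round(w, 3) for w in row])
```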

Color scale 0.0 to 1.0 · brighter = more attention

3 heads · 1 average · computed simultaneously

What you're seeing. Real attention weights come from softmax(Q · Kᵀ / √dₖ) over learned projections of token embeddings. This demo replaces the learned projections with three hand-tuned biases (previous-token, letter-overlap, sentence-start) so you can compare them side by side.

Click any token in the ribbon at the top: it becomes the "query", and the top-3 keys it attends to in each head light up. The matrices below show every query × key pair at once, and the highlighted query's row lights up in all three heads.

In a real LLM with 32 layers and 32 heads per layer, 32 × 32 = 1,024 of these matrices are computed per forward pass, and the model learns which patterns are useful from gradient signal alone, without anyone telling it "head 0 should look back."
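The top-3 spotlight can be sketched as a ranking over the keys visible to the selected query. The function name `top_keys` and the example rows are hypothetical, and how the demo breaks ties is not specified.

```python
def top_keys(row, query_idx, k=3):
    """Indices of the k highest-weighted keys among positions 0..query_idx."""
    visible = range(query_idx + 1)
    ranked = sorted(visible, key=lambda j: row[j], reverse=True)
    return ranked[:k]

# One illustrative attention row per head for the query at position 3.
rows = {
    "head 0": [0.02, 0.03, 0.90, 0.05, 0.0],
    "head 1": [0.10, 0.60, 0.05, 0.25, 0.0],
    "head 2": [0.85, 0.04, 0.05, 0.06, 0.0],
}
spotlight = {name: top_keys(row, query_idx=3) for name, row in rows.items()}
print(spotlight)
```

Queries near the start of the sentence see fewer than three keys, so this sketch returns a shorter list for them.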