HEAD 0
Looks back.
Syntactic head. Heavy weight on the immediately preceding token. A pattern that many attention heads in early layers actually learn.
Semantic head. Pulls weight onto tokens sharing letters — a stand-in for real semantic similarity. Shows what "soft matching" looks like.
Positional head. Always pays attention to the start of the sentence. Real attention heads often learn an anchor like this for "topic state".
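The three hand-tuned heads described above can be sketched roughly as follows. This is a minimal illustration, not the demo's actual code: the bias strengths (4.0, 1.5), the example sentence, the function names, and the absence of a causal mask are all assumptions made for the sketch.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical bias functions: each returns a raw score for (query q, key k).
def previous_token_bias(tokens, q, k):
    # Large score for the key immediately before the query.
    return 4.0 if k == q - 1 else 0.0

def letter_overlap_bias(tokens, q, k):
    # Score grows with the number of distinct letters the two tokens share.
    if k == q:
        return 0.0
    shared = set(tokens[q].lower()) & set(tokens[k].lower())
    return 1.5 * len(shared)

def sentence_start_bias(tokens, q, k):
    # Constant pull toward position 0, regardless of the query.
    return 4.0 if k == 0 else 0.0

def attention_matrix(tokens, bias):
    """Row q holds the softmax-normalized weights query q puts on every key."""
    n = len(tokens)
    return [softmax([bias(tokens, q, k) for k in range(n)]) for q in range(n)]

# Assumed example sentence; any token list works.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
heads = {
    "previous-token": attention_matrix(tokens, previous_token_bias),
    "letter-overlap": attention_matrix(tokens, letter_overlap_bias),
    "sentence-start": attention_matrix(tokens, sentence_start_bias),
}
```

Each row of each matrix sums to 1, so the three heads can be compared cell by cell for the same query.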
In a real transformer, all heads' outputs get concatenated and projected. This averaged matrix is a stylized version of "what attention contributed at this layer."
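To make the difference concrete, here is a small sketch contrasting the two operations: averaging the attention matrices (what this page displays) versus concatenating per-head outputs and applying an output projection (what a transformer layer does). The matrices, value vectors, and the projection `w_o` are made-up toy numbers, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def average_heads(mats):
    """Element-wise mean of several attention matrices (the stylized view)."""
    n = len(mats[0])
    return [[sum(m[q][k] for m in mats) / len(mats) for k in range(n)]
            for q in range(n)]

def concat_and_project(head_outputs, w_o):
    """Concatenate per-head outputs along the feature axis, then project."""
    n = len(head_outputs[0])
    concat = [sum((out[i] for out in head_outputs), []) for i in range(n)]
    return matmul(concat, w_o)

# Toy setup: 2 tokens, 2 heads, head dimension 1 (all values are assumptions).
a0 = [[1.0, 0.0], [1.0, 0.0]]   # head 0 always attends to position 0
a1 = [[0.5, 0.5], [0.0, 1.0]]   # head 1 spreads or self-attends
v0 = [[1.0], [2.0]]             # per-head value vectors
v1 = [[3.0], [4.0]]
w_o = [[1.0, 0.0], [0.0, 2.0]]  # stand-in for a learned output projection

averaged = average_heads([a0, a1])
layer_out = concat_and_project([matmul(a0, v0), matmul(a1, v1)], w_o)
```

The averaged matrix is still a valid attention pattern (rows sum to 1), but it is only a summary: the real layer output also depends on each head's values and on the projection weights.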
Attention weights are softmax(Q · Kᵀ / √dₖ), where Q and K are learned projections of the token embeddings. This demo fakes the projections with three hand-tuned biases (previous-token, letter-overlap, sentence-start) so you can compare them side by side.
Click any token in the ribbon at the top: it becomes the "query", and the top-3 keys it attends to (per head) light up. The matrices below show every query × key pair simultaneously, and the row for the highlighted query lights up in all three heads.
In a real LLM with 32 layers and 32 heads per layer, 32 × 32 = 1,024 of these matrices are computed per forward pass, and the model learns which patterns are useful from gradient signal alone, without anyone telling it "head 0 should look back."
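For comparison with the hand-tuned biases, the real computation can be sketched in a few lines. This is a minimal, dependency-free illustration: the `Q` and `K` values are arbitrary toy numbers standing in for learned projections, and `top_keys` is a hypothetical helper mirroring the "top-3 keys" highlight.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(Q, K):
    """softmax(Q · Kᵀ / √dₖ), one row of weights per query."""
    scale = math.sqrt(len(K[0]))  # dₖ is the key dimension
    return [
        softmax([sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K])
        for q in Q
    ]

def top_keys(row, n=3):
    """Indices of the n keys with the largest weight for one query row."""
    return sorted(range(len(row)), key=lambda k: row[k], reverse=True)[:n]

# Toy queries and keys (assumed values; a real model learns these projections).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
K = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0], [-1.0, 0.0]]

weights = attention_weights(Q, K)
highlighted = top_keys(weights[0])  # keys lit up when token 0 is the query

# Matrix count for the configuration mentioned in the text.
layers, heads_per_layer = 32, 32
matrices_per_forward_pass = layers * heads_per_layer
```

Dividing by √dₖ keeps the dot products from growing with the key dimension, which would otherwise push the softmax toward near one-hot rows.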