Demo · Module 22 · Interactive

One embedding becomes three things.

Token x · multiplied by three learned matrices
W_Q → query · W_K → key · W_V → value
Per head: each projection is d × d/h

Token embedding x · "cat"

d = 64

three weight matrices ↓

W_Q

64 × 64

"What am I asking?"

W_K

64 × 64

"What do I offer?"

W_V

64 × 64

"What do I deliver?"

Query Q = x · W_Q

Asks every other token: do you have what I need?

Key K = x · W_K

Answers other tokens' queries: here's what I have.

Value V = x · W_V

Carries the actual information that flows through attention.
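As a rough illustration of the three cards above, here is a minimal NumPy sketch of projecting one embedding three ways. The random values are placeholders (a trained model would supply the real embedding and weights), and the 1/sqrt(d) scaling on the random matrices is only an assumption to keep the numbers small.

```python
import numpy as np

d = 64  # embedding width used in the demo
rng = np.random.default_rng(0)

# Placeholder for the embedding of "cat"; a real model would look this up.
x = rng.standard_normal(d)

# Random stand-ins for the three learned d x d weight matrices.
W_Q = rng.standard_normal((d, d)) / np.sqrt(d)
W_K = rng.standard_normal((d, d)) / np.sqrt(d)
W_V = rng.standard_normal((d, d)) / np.sqrt(d)

# Same token, three projections.
q = x @ W_Q  # query: what this token is asking for
k = x @ W_K  # key: what this token offers
v = x @ W_V  # value: the information it carries

print(q.shape, k.shape, v.shape)  # (64,) (64,) (64,)
```

Because the three matrices differ, q, k, and v are three different vectors even though they all come from the same x.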

What the projections actually do. Every token's embedding x (a vector of length d) is multiplied by three learned weight matrices to produce three roles: Q (Query) is what this token is asking for, K (Key) is what this token offers in response to other tokens' queries, and V (Value) is the information that flows when an attention match happens. Same token, three projections. The math is just Q = x · W_Q, K = x · W_K, V = x · W_V, where each W matrix is d × d (or d × d/h per head, for h heads).

Why three? Because what a token asks for in a search is structurally different from what it provides to others, and different again from the payload it carries forward. Splitting them into three matrices gives the model the expressive room to learn those three roles separately.

In multi-head attention, the same x is projected through 8 to 128 different sets of (W_Q, W_K, W_V), one set per head, and each head learns to attend to different patterns. The head outputs are then concatenated and passed through one more matrix, W_O (the output projection), back to the residual stream.
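The multi-head flow described here can be sketched in NumPy as follows. This is an illustration under stated assumptions, not a reference implementation: the head count h = 8, the sequence length n = 5, and the random weights are all placeholders, and the scoring step uses standard scaled dot-product attention with a softmax, which the text above does not spell out.

```python
import numpy as np

n, d, h = 5, 64, 8   # tokens, model width, heads (illustrative values)
d_head = d // h      # each per-head projection is d x d/h
rng = np.random.default_rng(0)

X = rng.standard_normal((n, d))  # placeholder embeddings, one row per token

# One (W_Q, W_K, W_V) set per head, stacked along the first axis.
W_Q = rng.standard_normal((h, d, d_head)) / np.sqrt(d)
W_K = rng.standard_normal((h, d, d_head)) / np.sqrt(d)
W_V = rng.standard_normal((h, d, d_head)) / np.sqrt(d)
W_O = rng.standard_normal((d, d)) / np.sqrt(d)  # output projection

# Project every token through every head: result is (h, n, d_head).
Q = np.einsum("nd,hdk->hnk", X, W_Q)
K = np.einsum("nd,hdk->hnk", X, W_K)
V = np.einsum("nd,hdk->hnk", X, W_V)

# Assumed scoring step: scaled dot product, then softmax over keys.
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (h, n, n)
scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)

heads = weights @ V                                   # (h, n, d_head)

# Concatenate the heads per token, then apply W_O.
concat = heads.transpose(1, 0, 2).reshape(n, h * d_head)  # (n, d)
out = concat @ W_O                                        # (n, d)

print(out.shape)  # (5, 64)
```

With d = 64 and h = 8, each head works in an 8-dimensional subspace, and concatenating the 8 heads restores the full width of 64 before W_O maps the result back to the residual stream.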