Module 12 · Architecture · 20 min

Attention from first principles.

What Q, K, V actually are. Why scaled dot-product, why softmax, why heads.

Reading time: 20 min · Audio narration available · Prerequisites: 05 · Source: Track A · Gemini
§ 1

What this lesson covers.

This module is one of 42 in the curriculum. Below is the canonical interactive lesson (tabs, cards, and diagrams from the source repo), rendered inside the course shell. An audio narration runs alongside it: the sticky player at the top of the page plays the full Module 12 clip.

If you prefer to read first and play with the demos after, the interactive lesson sits below this section. If you'd rather hear it narrated while you scroll, hit play on the sticky audio bar at the top — or just let it autoplay.

§ 2

The lesson itself.


Layer 3 — The Attention Mechanism

The magic that lets AI "read" context like a human

Course File: L3
The "Highlight Reel" Analogy
How do you solve a mystery? You connect the clues.
Think about reading a mystery novel. When you read the word "bank", how do you know if it's a river bank or a money bank? You don't look at the word in isolation—you scan the surrounding sentences for clues. If you see the words "water" or "fishing" nearby, you mentally highlight those words to give "bank" its meaning.

This is the core idea behind an AI's Attention Mechanism. When the AI processes a sentence, it doesn't just read left-to-right. Every word acts like a detective, looking at every other word in the text and assigning each one an "attention score" that says how important a clue it is.
Interactive: Be the Detective
Hover over any word below to see what it "pays attention to". Thicker, brighter lines mean a higher Attention Score. Notice how context words heavily influence the main subjects!
The bear waded into the river bank to catch fish.
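To make the "attention score" idea concrete, here is a minimal sketch in plain Python. The raw scores for the word "bank" are made up for illustration (a real model computes them from learned vectors); the only real machinery is the softmax function, which turns raw scores into positive weights that add up to 1.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["The", "bear", "waded", "into", "the", "river", "bank", "to", "catch", "fish"]

# Hypothetical raw relevance scores for the query word "bank".
# These numbers are invented for this example, not taken from a trained model.
raw_scores = [0.1, 1.5, 1.0, 0.2, 0.1, 3.0, 0.5, 0.1, 0.8, 2.0]

weights = softmax(raw_scores)

# Rank the words by how much attention "bank" pays to them.
ranked = sorted(zip(tokens, weights), key=lambda pair: pair[1], reverse=True)
for token, weight in ranked[:3]:
    print(f"{token:>6}: {weight:.2f}")
```

With these invented scores, "river" gets the largest share of the attention, followed by "fish" and "bear". That ranking plays the role of the thick, bright lines in the demo.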
Queries, Keys, and Values
The Three Questions
Under the hood, every word asks and answers three things:

1. Query (What am I looking for?)
The word "bank" says: "I need a clue to tell me what kind of bank I am."

2. Key (What do I contain?)
The word "river" says: "I am related to water and nature."

3. Value (What is my actual meaning?)
The better the Query of "bank" matches the Key of "river", the more of the Value of "river" gets blended into "bank". After this step, the vector for "bank" mathematically contains some of the concept of water!
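The three questions above can be sketched as scaled dot-product attention for a single word. This is a toy under stated assumptions: the 2-D vectors are hypothetical (axis 0 standing for "water/nature", axis 1 for "money"), and the function name `attend` is ours. The one part taken from the 2017 paper is the recipe itself: dot the Query with each Key, divide by the square root of the vector size so the scores don't grow too large, apply softmax, then take a weighted mix of the Values.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    # Query-Key match: dot product, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Blend the Values according to the attention weights.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output

# Hypothetical toy vectors, invented for illustration.
bank_query = [2.0, 0.0]                            # "I need a clue about water vs. money"
keys = {"river": [3.0, 0.0], "to": [0.0, 0.5]}     # what each word advertises
values = {"river": [1.0, 0.0], "to": [0.0, 0.2]}   # what each word contributes

weights, new_bank = attend(bank_query, list(keys.values()), list(values.values()))
print(dict(zip(keys, weights)))
print(new_bank)
```

Because the Query for "bank" lines up with the Key for "river", almost all of the weight lands on "river", and the new vector for "bank" ends up pointing mostly along the "water/nature" axis.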
Why it changed the world
Parallel Reading
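Before the Transformer architecture arrived in 2017 (in a famous Google paper called Attention Is All You Need), leading language models read text one word at a time, like a cassette tape. Attention itself had appeared a few years earlier as an add-on to those one-word-at-a-time models; the 2017 paper showed that attention alone could do the job.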

Attention lets the AI look at everything all at once. Word 1 looks at Word 100 on the exact same step that Word 100 looks at Word 1. This parallel processing is the secret behind why modern LLMs can be trained so quickly on millions of books using giant graphics cards (GPUs).
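A small sketch of why this parallelizes, using hypothetical 2-D word vectors: each row of the attention table depends only on that word's own Query and the shared list of Keys, never on another row's result. That means the rows can be computed in any order, or all at once, which is what a GPU does with a single matrix multiplication.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_matrix(queries, keys):
    """One row of attention weights per query; no row depends on another row."""
    d_k = len(keys[0])
    return [
        softmax([sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys])
        for query in queries
    ]

# Hypothetical word vectors, invented for illustration.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

full = attention_matrix(vectors, vectors)

# Compute the rows in reverse order, then flip them back:
# the result is identical, so order (and therefore timing) does not matter.
reordered = attention_matrix(vectors[::-1], vectors)[::-1]
same = (full == reordered)
print(same)
```

A one-word-at-a-time model cannot be reordered like this, because each step needs the result of the step before it.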
§ DEMO

Try it: attention heatmap.

Type a sentence. Click any token to make it the query. Three attention heads and their average are computed simultaneously.
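Here is a rough sketch of what "three heads plus one average" could look like in code. It is an assumption about the idea, not the demo's actual implementation: the 6-D embeddings are invented, and each head simply looks at its own 2-D slice of them, whereas a real model gives each head its own learned projections.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_weights(query, keys):
    """Attention weights for one head: scaled dot products, then softmax."""
    d_k = len(query)
    return softmax([sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys])

# Hypothetical 6-D embeddings, invented for illustration.
embeddings = {
    "bear":  [1.0, 0.0, 0.5, 0.5, 0.0, 1.0],
    "river": [0.0, 1.0, 1.0, 0.0, 1.0, 0.0],
    "bank":  [0.0, 1.0, 0.5, 0.5, 1.0, 1.0],
}
tokens = list(embeddings)
num_heads, head_dim = 3, 2

def slice_for(head, vector):
    """Simplification: head h sees dimensions [h*head_dim, (h+1)*head_dim)."""
    return vector[head * head_dim:(head + 1) * head_dim]

query_token = "bank"
per_head = []
for h in range(num_heads):
    q = slice_for(h, embeddings[query_token])
    ks = [slice_for(h, embeddings[t]) for t in tokens]
    per_head.append(head_weights(q, ks))

# The "average" row: mean attention weight per token across the three heads.
average = [sum(column) / num_heads for column in zip(*per_head)]

for h, row in enumerate(per_head):
    print(f"head {h}:", [round(w, 2) for w in row])
print("average:", [round(w, 2) for w in average])
```

Each head can rank the same words differently because it looks at different features, which is the point of having several heads instead of one.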
