Module 13 · Architecture · 12-15 min

One transformer block,
end to end.

Attention, residuals, normalization, FFN.

Reading time: 12-15 min · Audio: narration available · Prerequisites: 12 · Source: Track A · Gemini
§ 1

What this lesson covers.

This module is one of 42 in the curriculum. Below is the canonical interactive lesson (tabs, cards, and diagrams from the source repo) rendered inside the course shell. An audio narration runs alongside it: the sticky player at the top of the page plays the full Module 13 clip.

If you prefer to read first and play with the demos after, the interactive lesson sits below this section. If you'd rather hear it narrated while you scroll, hit play on the sticky audio bar at the top — or just let it autoplay.

§ 2

The lesson itself.


Layer 4 — Transformer Block Internals

The Factory Assembly Line: making sure parts fit together perfectly

Course File: L4
The "Assembly Line" Analogy
How Data Gets Processed
Think of a Transformer Block like a car manufacturing plant. Your starting word (the "token") is just a raw piece of metal coming in.

The Attention Mechanism is the manager deciding which other words this token should connect with. But it doesn't stop there! The token also passes through the Layer Normalization (LayerNorm) quality control station, which keeps its values in a stable range so the math doesn't explode. Finally, it hits the Feed-Forward Network (FFN), which acts like independent factory workers, each handling one token on its own and cementing facts and logic into it.
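To make the "manager" concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. It is an illustration under assumptions, not the lesson's own code: the names `self_attention`, `w_q`, `w_k`, `w_v`, the tiny sizes, and the random weights are all made up for the example, and there is no masking or multi-head splitting.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the row max first so exp() cannot overflow.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over x of shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # scores[i, j]: how strongly token i wants to connect with token j.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted mix of the value vectors.
    return weights @ v, weights

# Hypothetical toy input: 3 tokens, 8 numbers per token.
rng = np.random.default_rng(0)
seq_len, d_model = 3, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(weights.shape)  # (3, 3): one row of connection strengths per token
```

Each row of `weights` sums to 1, so every token spreads a fixed budget of "attention" across the tokens in the sequence.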
Interactive: The Factory Floor
Click "Send Token" to watch a word get processed through a single Transformer block. Notice the dashed Residual Connections: they act like safety bypass tracks, adding each station's input back onto its output so early information isn't lost if the machines mess up!
[Interactive diagram: ATTENTION (connects words) → LAYER NORM (stabilizes math) → FEED-FORWARD (stores facts); example token: "cat"]
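The bypass-track idea can be sketched in a few lines. This is a simplified illustration, assuming the Pre-Norm wiring (normalize, run the sublayer, add the result back to the input); `transformer_block` and the `broken` sublayer are hypothetical names, and the LayerNorm here omits the learned scale and shift.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Simplified LayerNorm: per-token zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def transformer_block(x, attention, ffn):
    """Pre-Norm wiring: each sublayer sees a normalized copy of x,
    and its output is added back onto the untouched x (the residual)."""
    x = x + attention(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x

# Hypothetical "broken machine": a sublayer that contributes nothing.
broken = lambda h: np.zeros_like(h)

x = np.random.default_rng(0).normal(size=(3, 8))  # 3 tokens, 8 numbers each
y = transformer_block(x, attention=broken, ffn=broken)
print(np.allclose(y, x))  # True: the token survives via the bypass tracks
```

Because each sublayer only adds to the running value of `x`, a sublayer that outputs nothing useful leaves the token unchanged instead of destroying it.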
Quality Control
Layer Normalization
As the AI calculates millions of numbers, some values can get way too big (like a speaker volume exploding) or incredibly small (fading out).

LayerNorm works like a reset button. Every time the data passes through it, each token's values are shifted so their average is zero and rescaled to a standard spread (learned scale and shift parameters then fine-tune the result), which keeps the network stable across 96 or more consecutive layers! The lesson depicts Pre-Norm, the modern standard where normalization happens before each sublayer rather than after it.
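Here is a small NumPy sketch of that reset, assuming the standard LayerNorm recipe (subtract the per-token mean, divide by the per-token standard deviation, then apply a learned scale and shift). The names `gamma` and `beta` and the "too loud" input values are invented for the example.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token (row) to zero mean and unit variance,
    then apply the learned per-feature scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)  # eps avoids division by zero
    return gamma * x_hat + beta

d_model = 8
rng = np.random.default_rng(0)
# Hypothetical "exploding volume": values far from zero with a huge spread.
x = rng.normal(loc=50.0, scale=200.0, size=(3, d_model))

# gamma=1 and beta=0 are the usual starting values before training.
y = layer_norm(x, gamma=np.ones(d_model), beta=np.zeros(d_model))
print(np.abs(y.mean(axis=-1)).max())  # very close to 0
print(y.std(axis=-1))                 # very close to 1 for every token
```

Note that the statistics are computed per token, across its features, so one token's values never influence how another token is normalized.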
The Fact Database
Feed-Forward Network
While Attention handles relationships between words, the FFN is essentially the AI's encyclopedic memory. It operates on every word individually.

Modern models use a gated SwiGLU activation to process these facts:
SwiGLU(x) = (Swish(xW₁) ⊗ xW₃)W₂, where Swish(z) = z·σ(z)
Interpretability research suggests this is where much of the AI's factual knowledge is stored. If the query was "capital of France", the FFN is the part of the brain that fires up and says, "Aha! I know this mathematical pattern maps to 'Paris'."
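As a rough sketch, here is a SwiGLU feed-forward layer in NumPy, assuming the commonly used gated form: one projection passes through Swish(z) = z·σ(z) and acts as a gate, a second projection carries the values, and their elementwise product is projected back down. The weight names `w1`, `w3`, `w2`, the layer sizes, and the random weights are assumptions for illustration only.

```python
import numpy as np

def swish(z):
    # Swish (also called SiLU): z * sigmoid(z).
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, w1, w3, w2):
    """Gated feed-forward layer applied to each token (row) of x."""
    gate = swish(x @ w1)        # how much of each hidden feature to let through
    value = x @ w3              # the candidate hidden features
    return (gate * value) @ w2  # elementwise gating, then project back down

# Hypothetical toy sizes: 3 tokens, model width 8, hidden width 16.
rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 3, 8, 16
x = rng.normal(size=(seq_len, d_model))
w1 = rng.normal(size=(d_model, d_hidden)) * 0.1
w3 = rng.normal(size=(d_model, d_hidden)) * 0.1
w2 = rng.normal(size=(d_hidden, d_model)) * 0.1

y = swiglu_ffn(x, w1, w3, w2)
print(y.shape)  # (3, 8): same shape as the input

# The FFN handles every token individually: processing the first
# token alone gives the same result as processing it in the batch.
print(np.allclose(swiglu_ffn(x[:1], w1, w3, w2), y[:1]))  # True
```

The last check is the "independent factory workers" idea in code: unlike attention, nothing in this layer mixes information between tokens.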
§ DEMO

Try it: transformer block animation.

Step through the 7 stations of one transformer block, residuals included. Hit play for the full journey.

§ NEXT

What to read next.

Use the pager below to move sequentially through the curriculum, or jump to any module from the course index. Each track has a "Prereq: ↑ foundation" callout so you can backfill anything that wasn't clear.