Demo · Module 13 · Interactive
Watch a token move through
one transformer block.
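[Diagram: one transformer block. in (x, embedding) → stage 1: LayerNorm (stabilize) → stage 2: Attention (QKᵀ/√d) → add (+) → stage 3: LayerNorm (stabilize) → stage 4: FFN (d→4d→d) → add (+) → out (x′, next layer). Two RESIDUAL paths skip around stages 1-2 and stages 3-4 and feed the two add steps.]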
This is the residual stream entering the block: a vector x of dimension d (typically 4096 for an 8B model). It carries everything the earlier layers have written about this token and its context. The block reads from the stream and adds to it, but never overwrites it: each sublayer's output is summed with that sublayer's input, and the sum is passed forward. That additive path is what "residual" means.
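The read-then-add pattern described above can be sketched in NumPy. This is a minimal illustration, not the demo's own code, and it makes several simplifying assumptions: a single attention head, no biases, no learned LayerNorm scale or shift, a ReLU in the FFN, a causal mask, and small random weights. All function and parameter names (`layer_norm`, `attention`, `ffn`, `block`, `init_params`) are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    # (Learned scale and shift are omitted in this sketch.)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(h, Wq, Wk, Wv, Wo):
    # Single-head causal self-attention (assumption: one head, so the
    # per-head dimension equals d).
    T, d = h.shape
    q, k, v = h @ Wq, h @ Wk, h @ Wv
    scores = q @ k.T / np.sqrt(d)
    # Mask out future positions so token t only attends to tokens <= t.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v @ Wo

def ffn(h, W1, W2):
    # Expand d -> 4d, apply a nonlinearity (ReLU here), project 4d -> d.
    return np.maximum(h @ W1, 0.0) @ W2

def block(x, p):
    # Stages 1-2 plus the first add: normalize, attend, add back to the stream.
    x = x + attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"], p["Wo"])
    # Stages 3-4 plus the second add: normalize, FFN, add back to the stream.
    x = x + ffn(layer_norm(x), p["W1"], p["W2"])
    return x

def init_params(d, rng):
    # Small random weights, purely for illustration.
    shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "Wo": (d, d),
              "W1": (d, 4 * d), "W2": (4 * d, d)}
    return {name: rng.normal(0.0, 0.02, size=shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
T, d = 5, 8                      # 5 tokens, toy hidden size
x = rng.normal(size=(T, d))      # residual stream entering the block
params = init_params(d, rng)
x_out = block(x, params)         # residual stream leaving the block
print(x_out.shape)               # (5, 8)
```

Note that `block` only ever writes `x = x + ...`: if every weight were zero, the output would equal the input exactly, which is one way to see that the stream passes through untouched except for what each sublayer adds.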
Press Next → or hit Play to watch a token's journey through the block.
What you just watched. Within a block the stations run one after another, but a real transformer applies each station to every token in the context at once, in parallel on the GPU. The block typically has 12-128 attention heads working in parallel, each computing its own QKᵀ/√d attention pattern, where d in that formula is the per-head dimension. The FFN expands the hidden dimension by 4× (so 4096 → 16384 → 4096). That intermediate layer is where most of the block's parameters live: with a 4× expansion the FFN holds about 8d² weights against attention's 4d², so roughly 2/3 of each block's weights are in the FFN, not attention. The two residual connections are what let gradients flow through dozens of stacked blocks during training; without them, very deep transformers are far harder to train. Stack this block 32-96 times and you have a GPT-scale architecture. The final block's output goes through one more LayerNorm and a linear projection back to vocabulary size, then a softmax that yields the next-token distribution.
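The roughly 2/3 figure can be checked with a few lines of arithmetic. The sketch below assumes a standard block with four d×d attention matrices (query, key, value, output) and a two-matrix FFN with 4× expansion. It ignores biases, LayerNorm parameters, and embeddings, and it does not cover variants such as grouped-query attention or gated FFNs, which shift the ratio. The function name `block_param_counts` is hypothetical.

```python
def block_param_counts(d, expansion=4):
    # Attention: Wq, Wk, Wv, Wo, each d x d (biases ignored).
    attn = 4 * d * d
    # FFN: up-projection d x (expansion*d) and down-projection (expansion*d) x d.
    ffn = 2 * d * (expansion * d)
    return attn, ffn

d = 4096
attn, ffn = block_param_counts(d)
print(attn)                  # 67108864
print(ffn)                   # 134217728
print(ffn / (attn + ffn))    # FFN share of the block's weights, about 2/3
```

Because both counts scale with d², the FFN share under these assumptions is 8/(8+4) = 2/3 regardless of the hidden size.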