Demo · Module 07 · Interactive

Three transformer architectures.
One toggle to tell them apart.

Encoder-only · Decoder-only · Encoder-Decoder
Animated data flow per architecture
Click each tab to switch

Why three architectures. The original 2017 Transformer was encoder-decoder, designed for machine translation. After that, the field bifurcated. Decoder-only (GPT-style) ate the language-modeling world because next-token prediction generalizes — you can train it on any text and get a model that does everything via in-context learning. Encoder-only (BERT-style) is what you reach for when you need a representation rather than a generation — embeddings for search, classification heads on top, NER, etc. The trick is the bidirectional attention: every token sees every other token. Encoder-decoder (T5-style) is still the right shape when input and output are structurally different — translation, summarization, semantic parsing. Cross-attention is what connects the two halves: the decoder reads the encoder's understanding while generating its own token stream.

Three transformer architectures.One toggle to tell them apart.

Three transformer architectures.
One toggle to tell them apart.