Reading time: 10 min · Audio narration available · Prerequisites: None · Source: Track A · Gemini
§ 1
What this lesson covers.
This module is one of 42 in the curriculum. Below is the canonical interactive lesson — tabs, cards, and diagrams from the source repo, rendered inside the course shell. An audio narration runs alongside it - the sticky player at the top of the page plays the full Module 07 clip.
If you prefer to read first and play with the demos after, the interactive lesson sits below this section. If you'd rather hear it narrated while you scroll, hit play on the sticky audio bar at the top — or just let it autoplay.
§ 2
The lesson itself.
Interactive lesson · ported from Gemini track
Transformer Architecture Families
Encoder-only · Encoder-decoder · Decoder-only — the three design choices that define every modern LLM
Layer 5
The original 2017 design
The encoder-decoder transformer — built for translation
The original transformer (Vaswani et al. 2017) was designed for machine translation: take a sentence in German, understand it fully (encoder), then generate an English sentence word-by-word (decoder). This required two separate stacks of transformer blocks: the encoder with full bidirectional attention, and the decoder with causal attention plus a cross-attention bridge to the encoder.
Encoder stack
Input tokens + positional embedding
↓
Multi-head self-attention (bidirectional: sees all tokens)
↓
Add & Norm
↓
Feed-forward network
↓
Add & Norm
↓ × N
Encoder output (rich context vectors)
→
Decoder stack
Output tokens (shifted right)
↓
Masked self-attention (causal: no future tokens)
↓
Add & Norm
↓
Cross-attention (Q from decoder, K+V from encoder)
↓
Feed-forward + Norm
↓ × N
Softmax output (next-token probabilities)
Cross-attention — the bridge
How the decoder reads the encoder's understanding
In cross-attention, the decoder generates Queries (Q) from its own tokens, while the Keys (K) and Values (V) come from the encoder's output. This lets the decoder ask "which part of the input is relevant right now?" at every generation step. T5, BART, and Whisper all use it; in Whisper, the encoder reads audio and the decoder generates the text transcript by cross-attending to the encoder output.
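The Q/K/V split described above can be sketched in a few lines of plain Python. This is a minimal single-head illustration, not the implementation used by T5, BART, or Whisper: the function names, the tiny 2-dimensional vectors, and the identity projection matrices are all assumptions chosen to keep the example readable.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    # Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def cross_attention(decoder_states, encoder_output, w_q, w_k, w_v):
    q = matmul(decoder_states, w_q)   # Queries come from the decoder
    k = matmul(encoder_output, w_k)   # Keys come from the encoder
    v = matmul(encoder_output, w_v)   # Values come from the encoder
    scale = math.sqrt(len(q[0]))      # scaled dot-product attention
    weights = []
    for q_row in q:
        scores = [sum(a * b for a, b in zip(q_row, k_row)) / scale for k_row in k]
        weights.append(softmax(scores))
    # Each decoder position receives a weighted mix of encoder values.
    return matmul(weights, v), weights

# Toy example: 3 encoder tokens, 2 decoder tokens, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
encoder_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_states = [[1.0, 0.0], [0.0, 1.0]]
context, weights = cross_attention(
    decoder_states, encoder_output, identity, identity, identity
)
```

Note that `weights` has one row per decoder token and one column per encoder token, which is exactly the "which part of the input is relevant right now?" question asked at each step.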
Attention Is All You Need — Vaswani et al., Google Brain 2017 (arXiv:1706.03762). The original transformer paper. Introduced the encoder-decoder architecture, multi-head attention, positional encoding, and scaled dot-product attention. 90K+ citations.
Encoder-only models
Understanding-first — BERT and the bidirectional family
Encoder-only models remove the decoder entirely. They use bidirectional attention — every token can attend to every other token in both directions simultaneously. This gives each token a deeply contextual representation that depends on the full input sentence. These models excel at understanding tasks: classification, named entity recognition, question answering (extractive), semantic similarity, and search.
Encoder-only (BERT-style)
[CLS] token₁ token₂ … tokenN [SEP]
↓
Bidirectional self-attention (no mask: sees left AND right context)
↓
Feed-forward + Norm
↓ × N layers
[CLS] output → classification
Token outputs → NER / QA
BERT pretraining objectives
Masked Language Modelling (MLM): Randomly select 15% of tokens, hide most of them behind [MASK], and predict the originals from context in both directions. Enables bidirectional understanding impossible with causal masking.
Next Sentence Prediction (NSP): Given two sentences A and B, predict whether B follows A in the original text. Teaches inter-sentence relationships. (Later shown to be less important than MLM.)
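A simplified sketch of the MLM masking step is shown here. It is an assumption-laden illustration rather than BERT's actual data pipeline: the `mask_tokens` helper is hypothetical, it works on whole words instead of WordPiece tokens, and it always substitutes [MASK], whereas the original BERT recipe replaces only 80% of the selected tokens with [MASK] (10% become a random token and 10% are left unchanged).

```python
import random

MASK, CLS, SEP = "[MASK]", "[CLS]", "[SEP]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels are None where no prediction is needed."""
    rng = random.Random(seed)
    # Special tokens are never selected for masking.
    candidates = [i for i, t in enumerate(tokens) if t not in (CLS, SEP)]
    n_to_mask = max(1, round(len(candidates) * mask_prob))
    chosen = set(rng.sample(candidates, n_to_mask))
    masked = [MASK if i in chosen else t for i, t in enumerate(tokens)]
    # The training target at each masked position is the original token.
    labels = [t if i in chosen else None for i, t in enumerate(tokens)]
    return masked, labels

words = ("the quick brown fox jumps over the lazy dog and then "
         "sleeps all day in the warm sun again").split()
tokens = [CLS] + words + [SEP]
masked, labels = mask_tokens(tokens)
```

Because the model sees unmasked tokens on both sides of each [MASK], the loss at those positions trains it to use left and right context together.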
Special tokens
[CLS]: prepended to every input. Its final output vector is used for classification tasks. ViT's class token borrows the same idea from BERT.
[MASK]: replaces masked tokens during MLM pretraining.
[SEP]: separates two segments (e.g. question vs passage in QA). Tells the model where one text ends and another begins.
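To make the role of the special tokens concrete, here is a small sketch of how a two-segment input might be packed. The `pack_pair` helper is hypothetical; the segment ids (0 for the first segment, 1 for the second) follow BERT's segment-embedding convention, which this lesson does not cover in detail, so treat that part as an added assumption.

```python
def pack_pair(segment_a, segment_b):
    """Pack two token lists as [CLS] A [SEP] B [SEP] with matching segment ids."""
    tokens = ["[CLS]"] + segment_a + ["[SEP]"] + segment_b + ["[SEP]"]
    # [CLS], segment A, and the first [SEP] belong to segment 0;
    # segment B and the final [SEP] belong to segment 1.
    segment_ids = [0] * (len(segment_a) + 2) + [1] * (len(segment_b) + 1)
    return tokens, segment_ids

question = ["who", "wrote", "it", "?"]
passage = ["vaswani", "et", "al", "."]
packed_tokens, packed_segments = pack_pair(question, passage)
```

For a classification task, the model's final output vector at position 0 (the [CLS] slot) would be passed to a small classifier head.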
Key encoder-only models
BERT-base (2018): 110M params, 12 layers, 12 heads. The model that popularised pretraining for NLP. Google Search uses BERT variants to understand queries.
RoBERTa (2019): BERT trained more carefully, with more data, longer training, larger batches, and no NSP. Consistently outperforms original BERT and became the standard baseline.
DeBERTa (2020): from Microsoft. Disentangled attention uses separate vectors for content and position. Often best-in-class on NLU benchmarks.
ModernBERT (2024): updated BERT architecture with flash attention, RoPE, and an 8,192-token context window. Trained on 2T tokens. Represents the current state of encoder-only models.
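The 110M figure quoted for BERT-base can be reproduced with simple arithmetic. The sketch below assumes hyperparameters that are not stated in this lesson: hidden size 768, feed-forward size 3072, a 30,522-entry vocabulary, 512 positions, and 2 segment types, as in the publicly released BERT-base (uncased) configuration. The function name is hypothetical.

```python
def bert_param_count(layers=12, hidden=768, ffn=3072,
                     vocab=30522, max_pos=512, type_vocab=2):
    layer_norm = 2 * hidden                       # scale + bias
    # Token, position, and segment embeddings, followed by one LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Q, K, V, and output projections, each with weights and biases.
    attention = 4 * (hidden * hidden + hidden)
    # Two linear layers: hidden -> ffn -> hidden.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    # Pooler applied to the [CLS] output.
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler

total_params = bert_param_count()
print(f"{total_params:,}")  # 109,482,240, i.e. roughly 110M
```

About 85M of the total sits in the 12 transformer layers; most of the rest is the token embedding table.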
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — Devlin et al., Google 2018 (arXiv:1810.04805). MLM + NSP. Foundation of the BERT family. Fine-tuning BERT on 11 NLP tasks set new state of the art across all of them in 2018.
Exploring the Limits of Transfer Learning (T5) — Raffel et al., Google 2019 (arXiv:1910.10683). Encoder-decoder, text-to-text framing for every NLP task.
Decoder-only models
Generation-first — GPT and the causal family
Decoder-only models remove the encoder entirely. They use only causal (masked) self-attention: each token can only attend to itself and the tokens that came before it. This makes them natural text generators: given a context, they predict the next token, then the next, building up a response one token at a time. Nearly all modern frontier assistant models use this design.
Decoder-only (GPT-style)
token₁ token₂ … tokenN
↓
Causal self-attention (upper-triangle mask: no future tokens)
↓
Feed-forward + Norm
↓ × N layers
Linear + Softmax (next-token probabilities)
↓
tokenN+1 (sampled)
The causal mask
The causal mask is a matrix with −∞ in every position above the diagonal, added to the attention scores before the softmax. After softmax, −∞ becomes exactly 0, so no token can attend to a later position and no information flows from future tokens to earlier ones. During training, all positions are computed simultaneously in one forward pass. At inference, each new token is generated one at a time, extending the context by 1 token per step.
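A minimal sketch of the masking step, in plain Python with no tensor library, is given here. The helper names are illustrative assumptions, and the all-zero score matrix stands in for real query-key dot products.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    # -inf strictly above the diagonal (future positions), 0 elsewhere.
    return [[NEG_INF if j > i else 0.0 for j in range(n)] for i in range(n)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) evaluates to exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(scores):
    # Add the mask to the raw scores, then normalise each row.
    mask = causal_mask(len(scores))
    return [
        softmax([s + m for s, m in zip(score_row, mask_row)])
        for score_row, mask_row in zip(scores, mask)
    ]

# With equal raw scores, each token spreads attention evenly over
# itself and the tokens before it, and gives 0 to everything after.
uniform_scores = [[0.0] * 3 for _ in range(3)]
attn = masked_attention_weights(uniform_scores)
# attn[0] == [1.0, 0.0, 0.0]
# attn[1] == [0.5, 0.5, 0.0]
```

Row i of the result is the attention distribution for token i, so the zeros sit in the upper triangle.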
The prompt IS the context
Unlike encoder-decoder models which have a separate encoding phase, decoder-only models simply prepend the prompt to the generation. The model attends back to the prompt tokens causally. This is why "system prompts" work: they are just tokens that appear before the user's input in the same sequence.
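The "one sequence" idea can be sketched as a greedy decoding loop. Everything here is a toy: `toy_next_token_logits` is a hypothetical stand-in for a trained model (it simply favours the previous token id plus one), and the integer token ids, vocabulary size, and end-of-sequence id are assumptions made for illustration.

```python
def toy_next_token_logits(context, vocab_size):
    # Hypothetical stand-in for a decoder-only model's forward pass.
    # A real model would attend causally over every token in `context`.
    favoured = (context[-1] + 1) % vocab_size
    return [1.0 if t == favoured else 0.0 for t in range(vocab_size)]

def generate(system_tokens, user_tokens, max_new_tokens, vocab_size=10, eos=0):
    # System prompt and user input are just tokens in the same sequence.
    context = list(system_tokens) + list(user_tokens)
    prompt_len = len(context)
    for _ in range(max_new_tokens):
        logits = toy_next_token_logits(context, vocab_size)
        next_token = max(range(vocab_size), key=logits.__getitem__)  # greedy pick
        context.append(next_token)  # the context grows by 1 token per step
        if next_token == eos:
            break
    return context[prompt_len:]

completion = generate(system_tokens=[7, 8], user_tokens=[5, 6], max_new_tokens=5)
# completion == [7, 8, 9, 0]
```

There is no separate encoding phase: the only thing that distinguishes the prompt from the completion is where the generation loop starts appending.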
Key decoder-only models
GPT-2 (2019): 1.5B params, 48 layers. OpenAI. One of the first models large enough to produce fluent multi-paragraph text. Initially withheld as "too dangerous" to release in full; now considered tiny.
GPT-3 (2020): 175B params. Few-shot learning emerged at this scale. The model that made people take LLMs seriously as general-purpose tools.
Llama 3 (2024): Meta. Open weights. 8B and 70B variants at launch, with 405B added in Llama 3.1. 15T training tokens, RoPE positional encoding, GQA. The dominant open-weights base model family.
Gemma 3 (2025): Google. Small efficient models (1B–27B). Designed for consumer hardware. Strong on instruction following at small scale.
Mistral / Mixtral: Mistral AI (France). Sliding window attention (SWA) and mixture of experts (MoE). Mixtral-8×7B routes each token through 2 of the 8 expert FFNs in each layer.
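The "2 of 8 experts" routing mentioned for Mixtral can be sketched as follows. This is a scalar toy under stated assumptions, not Mixtral's code: the experts are simple multiplier functions rather than feed-forward networks, the router logits are hard-coded instead of being computed from the token, and the function names are hypothetical. The softmax over just the two selected logits follows the gating described in the Mixtral paper.

```python
import math

def top2_route(router_logits):
    """Pick the 2 highest-scoring experts and softmax over just those two."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top2 = ranked[:2]
    m = max(router_logits[i] for i in top2)
    exps = [math.exp(router_logits[i] - m) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]

def moe_layer(x, router_logits, experts):
    # Only the two selected experts run; the other six are skipped entirely.
    return sum(weight * experts[idx](x)
               for idx, weight in top2_route(router_logits))

# Eight toy "experts": expert k multiplies its input by k.
experts = [lambda x, k=k: x * k for k in range(8)]
router_logits = [0.1, 2.0, -1.0, 2.0, 0.0, 0.3, -0.5, 0.2]

routes = top2_route(router_logits)        # experts 1 and 3, weight 0.5 each
moe_output = moe_layer(1.0, router_logits, experts)  # 0.5 * 1 + 0.5 * 3 = 2.0
```

The design point is that parameter count grows with the number of experts while per-token compute grows only with the number of experts actually selected.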
The convergence of the field on decoder-only
Why almost every frontier model is decoder-only
In 2018–2020, the field was divided: BERT-family for understanding, GPT-family for generation, T5-family for both. By 2023, virtually every major assistant model (GPT-4, Claude, Gemini, Llama, Mistral) was decoder-only. Here is why.
Property | Encoder-only (BERT) | Encoder-decoder (T5) | Decoder-only (GPT)
Can generate text | No (understanding tasks only) | Yes (encode then decode) | Yes (directly)
Training objective | MLM (masked tokens) | Denoising + seq2seq | Next-token prediction
Scales to any text | Limited (fine-tuning needs labels) | Moderate | Yes (any raw text)
Architecture complexity | Simple | Complex (2 stacks + cross-attn) | Simple (1 stack)
Works with RLHF | No (can't generate) | Yes but awkward | Natural fit
Inference efficiency | Fast (classification) | Slower (2 passes) | Fast (KV cache)
Long context support | Limited | Limited | Yes (RoPE, SWA, etc.)
Example models | BERT, RoBERTa, DeBERTa | T5, BART, FLAN-T5 | GPT-4, Llama 3, Claude, Gemini
The key insight
Decoder-only + RLHF = everything
The breakthrough was realising that a decoder-only model, after pretraining on next-token prediction, can be converted into a useful assistant through SFT and RLHF — without any architecture changes. The prompt becomes the "encoder input" — the model attends to it causally before generating. This simplicity means all the engineering effort can go into scaling one architecture rather than maintaining two. The field converged on decoder-only as a result.
Language Models are Unsupervised Multitask Learners (GPT-2) — Radford et al., OpenAI 2019. Showed that a decoder-only model trained on next-token prediction develops zero-shot capabilities across many tasks without any task-specific fine-tuning.
Language Models are Few-Shot Learners (GPT-3) — Brown et al., OpenAI 2020 (arXiv:2005.14165). Demonstrated that scale + decoder-only + in-context learning (few-shot prompting) could match or exceed task-specific fine-tuned models.
The model landscape
Key models from each architecture family
Every model below is a transformer at heart — the differences are in which components they use, what data they trained on, and what objectives they optimised for.
Encoder-only
BERT · Google 2018 · 110M · MLM · bidirectional
RoBERTa · Meta 2019 · 125M · improved BERT · no NSP
DeBERTa v3 · Microsoft 2021 · 86M · disentangled · SOTA NLU
ModernBERT · HuggingFace 2024 · 8K context · RoPE · 2T tokens
Encoder-decoder
T5 · Google 2019 · 60M–11B · text-to-text · relative pos
BART · Meta 2019 · 400M · denoising · summarisation
FLAN-T5 · Google 2022 · instruction tuned · 1800+ tasks
Whisper · OpenAI 2022 · 39M–1.55B · speech · 96 languages
Decoder-only
GPT-3 · OpenAI 2020 · 175B · few-shot · closed
GPT-4o · OpenAI 2024 · multimodal · omni · closed
Llama 3.3 · Meta 2024 · 70B · open weights · RoPE
Claude 3.7 · Anthropic 2025 · reasoning · 200K ctx · closed
Gemini 2.5 · Google 2025 · multimodal · 1M ctx · closed
Mistral 7B · Mistral AI 2023 · 7B · SWA · open
DeepSeek-R1 · DeepSeek 2025 · reasoning · GRPO · open
Qwen 2.5 · Alibaba 2024 · 0.5B–72B · multilingual · open
Gemma 3 · Google 2025 · 1B–27B · consumer HW · open
§ DEMO
Try it: architecture toggle.
Three tabs - decoder-only, encoder-only, encoder-decoder. Animated data flow per architecture; compare what GPT, BERT, and T5 actually look like.