Module 10 · Math underneath · 10 min

Softmax meets cross-entropy.

The loss function for language modeling, slowly. Where the temperature knob actually lives.

Reading time: 10 min · Audio: narration available · Prerequisites: None · Source: Track A · Gemini
§ 1

What this lesson covers.

This module is one of 42 in the curriculum. Below is the canonical interactive lesson (tabs, cards, and diagrams from the source repo), rendered inside the course shell. An audio narration runs alongside it: the sticky player at the top of the page plays the full Module 10 clip.

If you prefer to read first and play with the demos after, the interactive lesson sits below this section. If you'd rather hear it narrated while you scroll, hit play on the sticky audio bar at the top — or just let it autoplay.

§ 2

The lesson itself.

The big picture
The output head must answer one question: what word comes next?
After the transformer processes your input through N attention blocks, you have a vector of numbers (the hidden state). But you need a word, not a vector. Two mathematical tools bridge this gap: softmax converts raw scores into a proper probability distribution every time the model generates a token, and cross-entropy measures how wrong that distribution was during training. Every word you have ever seen from ChatGPT, Claude, or Llama was sampled from a softmax output by a model trained with cross-entropy.
The pipeline — step by step
Where each function lives in the LLM
1. Transformer blocks × N: attention + FFN, processes meaning
2. Linear projection: hidden_state (D) → logits (V), D ≈ 4,096 → V ≈ 128,000 raw scores
3. Softmax: probability distribution over all vocab words, 128,000 numbers that sum to exactly 1.0
4. Sample one token: next token chosen, one word generated
5. Cross-entropy (training only): compare to the true token, Loss = −log(probability assigned to correct word)
6. Gradient flows back, weights update, model improves
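The pipeline can be traced end to end on a toy scale. The sketch below is illustrative only: the vocabulary, the hidden state, and the `W_out` weights are made-up values, and the sizes (D = 3, V = 4) stand in for the thousands of dimensions and roughly 128,000 vocabulary entries mentioned above.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into probabilities (max subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy sizes: D = 3 hidden dimensions, V = 4 vocabulary words (all hypothetical).
vocab = ["sat", "ran", "slept", "mat"]
hidden_state = [0.5, -1.0, 2.0]
W_out = [                      # made-up V x D projection weights
    [1.0, 0.0, 1.0],
    [0.5, 0.5, 0.0],
    [0.0, 1.0, 0.5],
    [-1.0, 0.0, 0.2],
]

# Linear projection: one logit per vocabulary word.
logits = [sum(w * h for w, h in zip(row, hidden_state)) for row in W_out]
probs = softmax(logits)

# Inference: sample one token from the distribution.
rng = random.Random(0)
next_token = rng.choices(vocab, weights=probs, k=1)[0]

# Training: cross-entropy against the known correct token.
target = vocab.index("sat")
loss = -math.log(probs[target])
print(next_token, round(loss, 3))
```

The same `probs` list feeds both branches: sampling at inference, and the loss during training.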
Softmax
Softmax turns raw scores into a probability distribution
The model produces one raw score (logit) per vocabulary word, but raw scores can be any number: positive, negative, very large, very small. Softmax converts them to a clean probability distribution where every number is between 0 and 1, and all numbers sum to exactly 1. The formula: softmax(z)ᵢ = e^zᵢ / Σⱼ e^zⱼ, that is, exponentiate each logit and divide by the sum of the exponentials over the whole vocabulary.
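A minimal sketch of the softmax formula in plain Python. The four logits are hypothetical values for a four-word vocabulary, chosen only to show the behaviour.

```python
import math

def softmax(logits):
    """Return e^z_i / sum_j e^z_j for each logit z_i."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a four-word vocabulary.
logits = [2.0, 1.0, 0.1, -1.0]
probs = softmax(logits)

print([round(p, 3) for p in probs])
print(sum(probs))  # 1.0 up to floating-point rounding
```

Larger logits map to larger probabilities, and even the negative logit ends up with a small positive share.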
Why e^x — why not just normalize?
You could divide each score by the sum of the scores (a simple ratio), but that breaks as soon as some logits are negative or the sum is zero. The exponential avoids this because e^x is always positive, and it also amplifies differences. If one logit is 3 and another is 1, their ratio is 3:1, but e³:e¹ ≈ 20.1:2.7 ≈ 7.4:1. The exponential makes the most confident prediction stand out more sharply while, mathematically, never allowing any probability to reach exactly 0. Every word stays in the game, just with a very tiny probability.
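The arithmetic can be checked directly. This short sketch uses the logits 3 and 1 from the text, plus a made-up score list with negative entries to show where plain normalization stops working.

```python
import math

a, b = 3.0, 1.0
linear_ratio = a / b                     # plain ratio of the raw scores
exp_ratio = math.exp(a) / math.exp(b)    # equals e^(3 - 1) = e^2
print(linear_ratio, round(exp_ratio, 2))  # 3.0 7.39

# Hypothetical scores with negative entries: their sum is zero, so dividing
# each score by the sum is undefined, while the exponentials stay positive.
scores = [2.0, -1.0, -1.0]
print(sum(scores))  # 0.0
exp_scores = [math.exp(s) for s in scores]
exp_probs = [e / sum(exp_scores) for e in exp_scores]
print([round(p, 3) for p in exp_probs])
```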
What "probability distribution" means
A probability distribution is a list of non-negative numbers that add up to 1.0. If the model assigns probability 0.7 to "cat", that means: if you let this model pick a word thousands of times from this exact state, about 700 out of 1,000 times it would choose "cat". Softmax guarantees this mathematical property: the output is always a valid distribution, regardless of what the raw logits look like.
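The "700 out of 1,000" reading can be simulated. The sketch below assumes a three-word vocabulary with made-up probabilities and a fixed random seed. The exact counts vary with the seed, but "cat" should land near 700.

```python
import random
from collections import Counter

# Hypothetical distribution over a three-word vocabulary.
vocab = ["cat", "dog", "bird"]
probs = [0.7, 0.2, 0.1]

rng = random.Random(42)                         # fixed seed for repeatability
draws = rng.choices(vocab, weights=probs, k=1000)
counts = Counter(draws)
print(counts)
```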
Numerical stability trick — subtract the max
Real implementations use a small but critical fix
If a logit is very large (e.g. 1,000), computing e^1000 overflows in standard floating point (float32 already overflows just past e^88). Real implementations first subtract the maximum logit from all logits before exponentiating: softmax(z)ᵢ = e^(zᵢ−max(z)) / Σⱼ e^(zⱼ−max(z)). Mathematically this produces the same probabilities, because the common factor e^(−max(z)) cancels between numerator and denominator, but the largest exponent is now 0, so nothing overflows. This trick is standard in deep learning frameworks.
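A small sketch of the max-subtraction trick in plain Python. The logit values are illustrative. Python's `math.exp` raises `OverflowError` where array libraries would typically return infinity.

```python
import math

def stable_softmax(logits):
    """Softmax with the maximum logit subtracted before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # largest exponent is now 0
    total = sum(exps)
    return [e / total for e in exps]

big = [1000.0, 999.0, 998.0]
small = [2.0, 1.0, 0.0]       # same gaps between logits, shifted down

try:
    math.exp(1000.0)
except OverflowError:
    print("naive e^1000 overflows")

print(stable_softmax(big))
print(stable_softmax(big) == stable_softmax(small))  # True: only the gaps matter
```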
Temperature
Temperature controls how "confident" the distribution looks
Before applying softmax, every logit is divided by a temperature value T. At T=1 (default), softmax runs normally. At T<1, logits get amplified — the distribution sharpens into a spike and the model becomes very "decisive." At T>1, logits get compressed — the distribution flattens and the model becomes more random and "creative." This is the single number behind every "creativity" slider you've seen in AI tools.
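As a rough sketch, temperature is a single division applied before the softmax. The function name `softmax_with_temperature` and the logits are illustrative, not taken from any particular library.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide every logit by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]   # hypothetical logits
sharp = softmax_with_temperature(logits, 0.5)
base = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)

for name, dist in (("T=0.5", sharp), ("T=1.0", base), ("T=2.0", flat)):
    print(name, [round(p, 3) for p in dist])
```

The top probability shrinks as T grows, while the ranking of the tokens never changes.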
Low T — code generation
When writing code, you want predictable, correct syntax. Temperature 0.1–0.4 makes the model focus on the single most-likely next token. In code there is often exactly one right way to close a bracket or continue a function signature. Low temperature exploits this: the model acts almost deterministically.
High T — creative writing
When writing a poem or brainstorming, you want surprising word choices. Temperature around 0.8–1.2 keeps the distribution broad, and values above 1.0 flatten it beyond the model's default, so lower-probability (unexpected) words get a real chance of being selected. This is why "creative mode" in LLM tools tends to produce more vivid, unusual language: the higher temperature makes sampling reach into the long tail.
Top-p sampling (nucleus sampling)
A complement to temperature
Top-p sampling (Holtzman et al. 2020) takes the smallest set of tokens whose cumulative probability reaches p (e.g. p=0.9), renormalizes the probabilities within that set, then samples only from it. This adapts dynamically: when the model is confident, the nucleus is small (just a few words). When uncertain, the nucleus is large (many plausible words). This avoids both the rigidity of low temperature and the incoherence of high temperature, and in practice the two knobs are often combined rather than used as alternatives.
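A minimal sketch of nucleus selection over an already-computed probability list. The helper names `top_p_filter` and `sample_top_p` and both example distributions are hypothetical. Production implementations work on tensors, but the logic is the same.

```python
import random

def top_p_filter(probs, p=0.9):
    """Return {index: renormalized prob} for the smallest top-ranked set
    whose cumulative probability reaches p."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in ranked:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return {i: probs[i] / cumulative for i in nucleus}

def sample_top_p(probs, p=0.9, rng=random):
    """Sample one token index from the nucleus only."""
    nucleus = top_p_filter(probs, p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]

confident = [0.85, 0.10, 0.03, 0.02]          # one clear favourite
uncertain = [0.28, 0.24, 0.20, 0.16, 0.12]    # many plausible tokens

print(len(top_p_filter(confident)))  # 2
print(len(top_p_filter(uncertain)))  # 5
print(sample_top_p(confident, rng=random.Random(0)))
```

With p=0.9 the confident distribution keeps two tokens and the uncertain one keeps all five, which is the adaptive behaviour described above.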
Cross-entropy loss
Cross-entropy measures how wrong the model's probability was
During training, we know the correct next word. Cross-entropy asks: what probability did the model assign to that correct word? For a single token the formula is simply −log(p_correct), using the natural logarithm. If the model was very confident and correct (p≈1.0), the loss is near 0. If the model was confident but wrong (p≈0.0), the loss is very large. The gradient of this loss tells every weight in the network which direction to move to become more correct next time.
Worked example
Context: "The cat ___" · Correct next word: "sat"
Probability assigned to "sat" = 0.50
Cross-entropy loss = −log(0.50) = 0.693
Perplexity = 2.0
Range: perfect (p=1.0, loss=0) to terrible (p=0.01, loss=4.6)
Why −log(p)?
The negative log has exactly the right shape: when p=1.0, −log(1)=0 (no loss — perfect). When p=0.5, −log(0.5)=0.69 (moderate loss). When p=0.01, −log(0.01)=4.6 (high loss). And as p→0, the loss approaches infinity — the model is punished extremely harshly for being confident and wrong. This asymmetry is intentional: being certain and wrong is much worse than being uncertain.
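The three values quoted above can be reproduced in a few lines. The second function is a sketch of computing the same loss directly from logits with log-sum-exp, a common way to avoid forming tiny probabilities first. Both function names are illustrative.

```python
import math

def cross_entropy(p_correct):
    """Loss for one token: -ln(p_correct). Adding 0.0 turns -0.0 into 0.0."""
    return -math.log(p_correct) + 0.0

def cross_entropy_from_logits(logits, target):
    """Same loss from raw logits: log-sum-exp of all logits minus the target logit."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

for p in (1.0, 0.5, 0.01):
    print(p, round(cross_entropy(p), 3))   # 0.0, 0.693, 4.605
```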
Perplexity = exp(loss)
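Perplexity is the standard intrinsic metric for language model quality. It equals e raised to the average per-token cross-entropy loss. Intuitively, perplexity is "how many tokens was the model choosing between on average?" A perplexity of 10 means the model was as uncertain as if it were guessing among 10 equally likely tokens at each step. Lower is better, and a perfect model would have perplexity 1.0. Ballpark figures: GPT-2 on Wikipedia ~35, GPT-4 estimated ~5. Treat such numbers as indicative only, since perplexity depends on the tokenizer and the evaluation text and is not directly comparable across models that differ in either.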
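A short sketch of the definition, taking as input the probability the model assigned to each correct token. The probability lists are made up to match the two cases in the text.

```python
import math

def perplexity(token_probs):
    """exp of the average per-token cross-entropy (natural log)."""
    avg_loss = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_loss)

# A model that always gives the correct token probability 0.1 is as
# uncertain as a uniform guess among 10 tokens.
print(round(perplexity([0.1] * 5), 6))   # 10.0
print(round(perplexity([0.5]), 6))       # 2.0
```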
Training loop
Softmax + cross-entropy together form the training loop's feedback signal
During training, the model processes a sentence and must predict each next word. Softmax converts its raw scores to probabilities. Cross-entropy compares those probabilities to the actual correct words. The resulting loss flows backwards through every weight in the network, nudging each weight slightly so the model would make a better prediction next time. This loop repeats over billions of tokens, with the work spread across many GPUs (thousands for the largest models). Every ability the model has came from this loop.
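The feedback signal has a compact form: the gradient of the cross-entropy loss with respect to the logits is the softmax output minus the one-hot target. The sketch below applies one gradient step directly to the logits, which is a simplifying assumption. A real model would instead backpropagate this gradient into its weights. The logits, target index, and learning rate are made up.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grad(logits, target):
    """Cross-entropy loss and its gradient with respect to the logits."""
    probs = softmax(logits)
    loss = -math.log(probs[target])
    # d(loss)/d(z_i) = p_i - 1 for the correct token, p_i for every other token.
    grad = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    return loss, grad

logits = [1.0, 0.5, -0.5]   # hypothetical scores for three tokens
target = 1                  # index of the correct token
learning_rate = 0.5

loss_before, grad = loss_and_grad(logits, target)
updated = [z - learning_rate * g for z, g in zip(logits, grad)]
loss_after, _ = loss_and_grad(updated, target)
print(round(loss_before, 3), round(loss_after, 3))
```

The correct token's logit is pushed up and all others are pushed down, so the loss after the step is lower than before.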
SFT: train on completions only
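During supervised fine-tuning, cross-entropy loss is computed ONLY on the assistant's response tokens, not on the user's prompt. The model is taught to generate good answers, not to reproduce questions. In practice this is done by masking the prompt tokens' labels so they drop out of the loss. The train_on_responses_only helper (found in Unsloth, which is typically used together with TRL's SFTTrainer) does this, and TRL offers comparable completion-only options. Without this, the model spends part of its training signal on predicting the prompt it was given.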
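A minimal sketch of completion-only loss, assuming the common convention of marking ignored positions with the label −100 (the default `ignore_index` of PyTorch's cross-entropy loss). The function name, probabilities, and token ids are hypothetical.

```python
import math

IGNORE_INDEX = -100  # conventional "skip this position" label

def completion_only_loss(p_correct, labels):
    """Average -log(p) over positions whose label is not IGNORE_INDEX.

    p_correct[i] is the probability the model gave the true token at position i.
    """
    kept = [-math.log(p) for p, label in zip(p_correct, labels)
            if label != IGNORE_INDEX]
    if not kept:
        raise ValueError("no response tokens to train on")
    return sum(kept) / len(kept)

# Positions 0-2 are the user's prompt, positions 3-4 the assistant's response.
p_correct = [0.2, 0.1, 0.3, 0.6, 0.8]
labels = [IGNORE_INDEX, IGNORE_INDEX, IGNORE_INDEX, 17, 42]  # 17, 42: made-up token ids

print(round(completion_only_loss(p_correct, labels), 3))
```

Only the last two positions contribute, so poor predictions on the prompt tokens do not affect the loss.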
Loss curves — what good training looks like
During healthy training, the loss should trend downward over time, with some step-to-step noise. As a rough fine-tuning rule of thumb rather than a universal threshold, training loss dropping below about 0.2 can be a sign of overfitting: the model is memorising examples rather than learning generalisable patterns. Validation loss rising while training loss falls confirms overfitting. The ideal: both curves decrease together, converging to a low stable value.
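The "validation rises while training falls" signal is easy to check mechanically. The sketch below is one simple way to flag it. The function name and both loss curves are invented for illustration, and real monitoring would usually smooth the curves first.

```python
def first_overfit_epoch(train_losses, val_losses):
    """Return the first epoch where training loss fell but validation loss rose,
    or None if that never happens."""
    for epoch in range(1, len(train_losses)):
        train_fell = train_losses[epoch] < train_losses[epoch - 1]
        val_rose = val_losses[epoch] > val_losses[epoch - 1]
        if train_fell and val_rose:
            return epoch
    return None

# Hypothetical per-epoch losses.
train = [2.1, 1.6, 1.2, 0.9, 0.6, 0.4]
val_overfit = [2.2, 1.8, 1.5, 1.4, 1.5, 1.7]
val_healthy = [2.2, 1.8, 1.5, 1.3, 1.2, 1.1]

print(first_overfit_epoch(train, val_overfit))   # 4
print(first_overfit_epoch(train, val_healthy))   # None
```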
Cross-entropy loss is the backbone of the entire LLM field. Every pre-trained model from GPT-2 to Llama 3 was trained to minimise cross-entropy on next-token prediction. Every SFT fine-tune uses cross-entropy on completions. Even DPO and GRPO rely on token log-probabilities, which are just a step away from cross-entropy. Understanding this loss is understanding the foundation everything else is built on.
Softmax in attention
Softmax appears twice — once in attention, once in the output head
Most students learn about softmax only at the output layer. But it plays an equally critical role deep inside every transformer block — in the attention mechanism itself. The attention softmax and the output softmax solve the same problem (turn raw scores into a valid distribution) but serve completely different purposes.
Softmax in attention — Layer 3
Attention weights: how much to look at each past word
In attention, Q·Kᵀ / √d_k produces a score for every pair of tokens (how relevant is token j to token i?). These raw scores are then passed through softmax to become attention weights — a probability distribution over all positions. This tells the model: "when thinking about position i, spend 40% of your attention on position 3, 35% on position 7, etc." The weights sum to 1 per query position, just like the output probabilities sum to 1.
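A minimal single-query sketch of scaled dot-product attention in plain Python. The two-dimensional query, keys, and values are made-up numbers, and real implementations do this for all positions and heads at once with matrix operations.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """softmax(q . k_j / sqrt(d_k)) over all key positions j."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    return softmax(scores)

def attend(query, keys, values):
    """Weighted sum of the value vectors, using the attention weights."""
    weights = attention_weights(query, keys)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]

weights = attention_weights(query, keys)
context = attend(query, keys, values)
print([round(w, 3) for w in weights], [round(c, 3) for c in context])
```

The key most aligned with the query gets the largest weight, and the weights for this query sum to 1.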
Softmax in output head — Layer 6
Token probabilities: what word to generate next
At the output, the linear projection produces one logit per vocabulary word (~128,000 logits). Softmax converts these to token probabilities, the distribution you sample from to choose the next word. The same softmax is used during training (cross-entropy takes −log of the probability it assigns to the correct token) and at inference (temperature divides the logits before the softmax, then a token is sampled).
Side-by-side comparison
| Property | Attention softmax (Layer 3) | Output softmax (Layer 6) |
| --- | --- | --- |
| Input | Dot products Q·Kᵀ / √d_k, one per (query, key) pair | Linear projection logits, one per vocabulary word |
| Output size | Sequence_length weights per query position (e.g. 2,048) | Vocabulary size probabilities (e.g. 128,000) |
| What it means | How much attention to pay to each past position | Probability of each word being the correct next token |
| Used for | Weighted sum of Value vectors V, which contextualises the representation | Sampling the next token (inference) or computing loss (training) |
| Temperature | Not user-tunable, but the 1/√d_k scaling of the pre-softmax scores acts like a fixed temperature | Yes: logits divided by T before softmax to control sharpness |
| Cross-entropy? | No: attention weights are not compared to a target | Yes: −log(p_correct) is the training loss |
| Key formula | softmax(Q·Kᵀ / √d_k) · V | softmax(W_out · h) then sample |
The causal mask + softmax interaction
Why the mask uses −∞, not 0
In decoder-only models, tokens can only attend to past tokens, not future ones. The causal mask sets the scores of future positions to −∞ (negative infinity) before the attention softmax. When you compute e^(−∞), you get exactly 0, meaning those future positions receive zero attention weight after softmax. If you used 0 instead of −∞, e^0 = 1, so every future token would still receive a nonzero (unwanted) weight, as large as that of a real position whose score is 0. The −∞ fill makes the mask exact.
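The difference between the two fill values shows up in a few lines. This sketch masks one row of made-up attention scores for the query at position 1. `causal_weights` is an illustrative helper, not a library function.

```python
import math

NEG_INF = float("-inf")

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_weights(scores, query_pos, fill=NEG_INF):
    """Replace scores of positions after query_pos with `fill`, then softmax."""
    masked = [s if j <= query_pos else fill for j, s in enumerate(scores)]
    return softmax(masked)

scores = [0.2, -1.0, 0.5, 1.5]   # hypothetical scores for one query row
correct = causal_weights(scores, query_pos=1)            # fill with -inf
leaky = causal_weights(scores, query_pos=1, fill=0.0)    # fill with 0 (wrong)

print([round(w, 3) for w in correct])  # future positions get exactly 0
print([round(w, 3) for w in leaky])    # future positions still get weight
```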
Attention Is All You Need. Vaswani et al. 2017. Uses softmax in both places: in scaled dot-product attention and in the output distribution. Section 3.2.1 defines the attention softmax explicitly.
The Curious Case of Neural Text Degeneration. Holtzman et al. 2020 (arXiv:1904.09751). Introduced nucleus (top-p) sampling. Showed that likelihood-maximizing decoding such as beam search produces bland, repetitive text, while sampling from the full softmax distribution drifts into incoherence because of its unreliable low-probability tail; nucleus sampling truncates that tail.
§ DEMO

Try it: softmax temperature.

Drag the temperature slider. Watch the next-token distribution flatten or sharpen. Sample to see what the model picks.

