Module 11 · Math underneath · 10 min

Why next-token works.

Why predicting the next token produces emergent reasoning, in-context learning, and arithmetic.

Reading time: 10 min · Audio: narration available · Prerequisites: 21, 12 · Source: Track A · Gemini
§ 1

What this lesson covers.

This module is one of 42 in the curriculum. Below is the canonical interactive lesson (tabs, cards, and diagrams from the source repo) rendered inside the course shell. An audio narration of the full Module 11 clip accompanies it.


§ 2

The lesson itself.


Beyond Next-Token Prediction

What LLMs Are Actually Doing

Training objective ≠ purpose · Emergence · In-context learning · The iceberg of capability

Layer 8 — Pretraining
The central question
"If an LLM is just predicting the next word — why can it reason through a maths problem, write working code, explain quantum physics, and hold a coherent conversation?"
The distinction that matters
Training objective ≠ emergent capability ≠ deployed purpose
These three things are often conflated — and conflating them causes enormous confusion.

Training objective: predict the next token. This defines the loss function, the signal by which weights are updated. It is a mathematical operation, not a description of what the model can do.

Emergent capability: reasoning, coding, translation, summarisation, instruction following. The model was never explicitly trained for these. They emerged as side effects of doing next-token prediction at scale on a sufficiently rich and diverse corpus.

Deployed purpose: be a useful assistant, coding partner, reasoning engine, or agent. This is what the model is for, and it is achieved through supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) on top of the pretrained base.
A thought experiment — step by step
Training objective: predict the next token. A mathematical loss. The vehicle, not the destination.
Emergent capability: reasoning, coding, and translation, which arise as side effects of doing prediction at scale.
Deployed purpose: useful assistant, agent, tool. Achieved via SFT + RLHF on top of the base model.
The objective in full
Next-token prediction: the most powerful self-supervised objective ever discovered
The objective is deceptively simple: given all tokens so far, assign a probability to every possible next token. Minimise cross-entropy loss. Repeat on every position in every document in a corpus of trillions of words.
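To make the objective concrete, here is a minimal sketch of the loss for a short sequence: the mean of -log p(true next token | tokens so far) over positions. The four-word vocabulary and the logit values are made up for illustration; a real model produces logits over tens of thousands of subword tokens.

```python
import math

def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, target_ids):
    """Mean cross-entropy (in nats) of the true next tokens."""
    losses = []
    for logits, target in zip(logits_per_position, target_ids):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Hypothetical 4-token vocabulary and made-up model scores for two positions.
vocab = ["the", "cat", "sat", "mat"]
logits = [
    [0.0, 0.0, 0.0, 0.0],  # position 0: the model has no idea (uniform)
    [0.0, 0.0, 5.0, 0.0],  # position 1: the model is confident in "sat"
]
targets = [1, 2]  # true next tokens: "cat", then "sat"

loss = next_token_loss(logits, targets)
print(f"mean cross-entropy: {loss:.3f} nats")
```

The uniform guess costs log(4) nats, while the confident correct guess costs almost nothing. Training pushes every position toward the second case.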

What makes this objective so powerful is what it implicitly requires. To consistently predict the next word in a physics textbook, a model must represent physics. To predict the next line of Python code, it must understand code. To predict the next word in a conversation, it must model the speaker's intent. The objective forces the model to build an internal model of the world from which text is generated — not just to memorise patterns.
Why prediction forces understanding
Physics text · Code · Mathematics · Dialogue · Logic puzzles · Multiple languages
The self-supervision advantage
Traditional supervised learning
Needs human-labelled examples for every task. "This image contains a cat." "This email is spam." Requires massive annotation effort per task. Cannot generalise beyond labelled categories. Scaling requires more human work proportionally.
Next-token prediction
Zero human labels needed. Any text is training data. The text itself is the supervision, because the next word is always known. Scaling is cheap: more data means collecting more text, not paying for more annotation. The model learns to do thousands of tasks as a side effect of learning to predict text.
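The point that "the text itself is the supervision" can be shown in a few lines. This sketch turns one raw sentence into (context, target) training pairs. Splitting on whitespace is a simplifying assumption; production systems use subword tokenisers.

```python
def next_token_examples(text):
    """Build (context, target) pairs from raw text, with no human labels."""
    tokens = text.split()  # assumption: whitespace tokens stand in for subwords
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

examples = next_token_examples("the cat sat on the mat")
for context, target in examples:
    print(" ".join(context), "->", target)
```

A six-word sentence yields five training examples for free, and the same holds for every document in the corpus.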
The compression argument
To predict well, you must compress the world into weights
Predicting the next token in a corpus of trillions of words is an extreme compression problem. A model that merely memorises could not generalise to new sentences. A model that truly predicts must extract the deep regularities — the grammar, the facts, the causal structures, the social norms — that generate the text. The lower the loss, the richer the internal representation. In this sense, a well-trained language model is a compressed model of the world, constructed entirely from the statistics of human writing.

This is why Ilya Sutskever (OpenAI co-founder) has argued that predicting the next token well means understanding the underlying reality that produced the text.
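The link between prediction and compression can be illustrated with a toy experiment. A predictor that assigns probability p to the symbol that actually occurs lets an ideal coder spend about -log2(p) bits on it, so better prediction means a shorter encoding. The sketch below compares a uniform character model with a character bigram model on held-out text. The training and test strings are invented, and a bigram model is only a stand-in for a neural language model.

```python
import math
from collections import Counter, defaultdict

def bits_uniform(text, alphabet):
    """Code length if every character is treated as equally likely."""
    return len(text) * math.log2(len(alphabet))

def train_bigram(text, alphabet):
    """Count character bigrams and return a smoothed p(next | prev)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1

    def prob(prev, nxt):
        # Add-one smoothing keeps unseen pairs at non-zero probability.
        seen = sum(counts[prev].values())
        return (counts[prev][nxt] + 1) / (seen + len(alphabet))

    return prob

def bits_bigram(text, prob, alphabet):
    """Code length under the bigram model (first character coded uniformly)."""
    total = math.log2(len(alphabet))
    for prev, nxt in zip(text, text[1:]):
        total += -math.log2(prob(prev, nxt))
    return total

# Toy data, invented for this illustration.
train_text = "the cat sat on the mat. the cat ate the rat. " * 20
test_text = "the rat sat on the cat."
alphabet = sorted(set(train_text + test_text))

prob = train_bigram(train_text, alphabet)
uniform_bits = bits_uniform(test_text, alphabet)
bigram_bits = bits_bigram(test_text, prob, alphabet)
print(f"uniform: {uniform_bits:.1f} bits, bigram: {bigram_bits:.1f} bits")
```

The bigram model has picked up only the shallowest regularities of the text, yet it already encodes an unseen sentence in fewer bits than the uniform baseline.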
Language Models are Unsupervised Multitask Learners (GPT-2) — Radford et al., OpenAI 2019. The first paper to explicitly show that next-token prediction at scale produces zero-shot task performance — translation, summarisation, QA — with no task-specific training whatsoever.
Scaling Laws for Neural Language Models — Kaplan et al., OpenAI 2020 (arXiv:2001.08361). Showed that predictive loss falls smoothly as a power law in model size, dataset size, and training compute, and that loss on other text distributions improves in step with loss on the training distribution.
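As a rough illustration of what "loss scales as a power law" means, the sketch below evaluates the functional form L(N) = (N_c / N)^alpha used for model size in the scaling-laws paper. The constants n_c and alpha here are hypothetical placeholders chosen for readability, not the paper's fitted values.

```python
def power_law_loss(n_params, n_c=1.0e13, alpha=0.08):
    """L(N) = (N_c / N) ** alpha. Both constants are hypothetical placeholders."""
    return (n_c / n_params) ** alpha

sizes = [10 ** k for k in range(6, 12)]  # 1M to 100B parameters
losses = [power_law_loss(n) for n in sizes]

# Under a power law, every 10x increase in size multiplies loss by the same factor.
ratios = [losses[i + 1] / losses[i] for i in range(len(losses) - 1)]

for n, loss in zip(sizes, losses):
    print(f"N = {n:.0e}  loss = {loss:.3f}")
print("loss ratio per 10x:", [round(r, 4) for r in ratios])
```

A constant ratio per tenfold increase is what makes the curve a straight line on log-log axes, and it is why loss at a larger scale can be extrapolated from smaller runs.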
Emergent capabilities
Abilities that appear suddenly at scale — and were never explicitly trained for
Emergence in AI refers to capabilities that are not present in small models but appear abruptly and unpredictably as model size increases. These abilities were not programmed, not listed as training objectives, and not present in scaled-down versions of the same architecture. They arise from the interaction of scale, data diversity, and the pressure of predicting text well. The simulator below shows what a model at different training scales can and cannot do.
Emergence simulator — select training scale
Context: "The capital of France is Paris. The capital of Germany is Berlin. The capital of Japan is"
The emergence timeline — what appears at each scale
Why emergence is surprising
Loss improves smoothly — but capability appears suddenly
This is what makes emergence scientifically fascinating. If you plot cross-entropy loss against training compute, you get a smooth power-law curve — completely predictable. But if you plot "can the model do 3-digit addition?" against compute, you get a flat line at zero... then a sudden jump to near-perfect performance. The capability was not gradually improving — it crossed some threshold and appeared.

Wei et al. (2022) documented this across 137 tasks and 8 model families. Nobody fully understands why this happens — it remains one of the deepest open questions in AI research.
Emergent Abilities of Large Language Models — Wei et al., Google Brain 2022 (arXiv:2206.07682). Documented 137 tasks where emergent abilities appear. The paper that made "emergence" a central concept in AI research.
Are Emergent Abilities of Large Language Models a Mirage? — Schaeffer et al., Stanford 2023 (arXiv:2304.15004). Important counter-argument: some apparent emergence is an artifact of evaluation metrics, not a real discontinuity in capability. The debate is ongoing.
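The metric argument can be sketched numerically. Suppose per-token accuracy improves smoothly with scale, and a task is scored by exact match on a multi-token answer. If token errors are treated as independent (a simplifying assumption), exact-match accuracy is the per-token accuracy raised to the answer length, which stays near zero for a long time and then rises steeply. The accuracy values below are invented.

```python
def exact_match_rate(per_token_acc, answer_len):
    """Chance that every token of an answer is right, assuming independent errors."""
    return per_token_acc ** answer_len

# Invented per-token accuracies standing in for successively larger models.
per_token = [0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
for p in per_token:
    em = exact_match_rate(p, answer_len=10)
    print(f"per-token {p:.2f} -> exact match on 10 tokens {em:.3f}")
```

The same smooth underlying improvement looks gradual under one metric and abrupt under another. This does not settle whether every reported emergent ability is a metric artifact.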
In-context learning
The ability that shouldn't exist — learning from examples without updating any weights
Standard machine learning requires training: you show the model examples, run backpropagation, and update the weights. In-context learning (ICL) is different: the model is shown examples inside the prompt itself, as text, and immediately generalises to new examples without any weight updates. No backpropagation. No gradient step. The model "learns" from context that is just tokens in the input. This behaviour was first demonstrated convincingly with GPT-3 and surprised the research community.
Prompt sent to the model — no fine-tuning, no weight updates
How ICL works — three theories
Locate and copy · Task compression · Meta-learning
Zero-shot: no examples at all
Just ask the model to do something. "Translate this to French." "Summarise this article." "Solve this maths problem." At sufficient scale, models can perform many tasks zero-shot, purely from the instruction and their training. GPT-2 showed early signs of this, and GPT-3 demonstrated it at much larger scale. ChatGPT's conversational ability is largely zero-shot generalisation from instruction tuning.
Few-shot: examples in the prompt
Provide 3–10 (input, output) examples before the actual question. The model adapts to the pattern immediately. GPT-3's few-shot results matched or exceeded fine-tuned models on many benchmarks — without updating a single weight. This was the key result that made LLMs practically useful: no task-specific training needed for many tasks.
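A few-shot prompt is just a string. The sketch below assembles one from (input, output) pairs. The "Input:" / "Output:" labels and the function name build_few_shot_prompt are illustrative choices, not a required format, and sending the prompt to a model is left out.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Lay out worked examples, then the new input with its output left blank."""
    lines = [instruction] if instruction else []
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model continues from here
    return "\n".join(lines)

examples = [("cheese", "fromage"), ("bread", "pain"), ("water", "eau")]
prompt = build_few_shot_prompt(examples, "apple", "Translate English to French.")
print(prompt)
```

Nothing about the model changes between tasks. Swapping the examples swaps the task, and all of the adaptation happens in the forward pass over these tokens.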
Chain-of-thought prompting
Showing the model how to think, not just what to answer
Wei et al. (2022) showed that including reasoning steps in few-shot examples substantially improves performance on complex tasks. Instead of showing (question, answer) pairs, you show (question, step-by-step reasoning, answer) triples. The model then generates its own reasoning chain before answering, which raises accuracy on maths, logic, and multi-step problems.

This works because reasoning is text. If the model can predict text well, and reasoning appears in training text, then the model has learned to generate reasoning. The chain-of-thought examples in the prompt simply activate this latent ability. No training required — just prompting.
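The difference from a plain few-shot prompt is only that each worked example carries its reasoning. A minimal sketch follows, with an invented arithmetic example and a hypothetical helper name build_cot_prompt. The "Q:" / "A:" layout is one common convention, not a requirement.

```python
def build_cot_prompt(worked_examples, question):
    """Each example shows question, reasoning, then answer; the last answer is left open."""
    parts = []
    for q, reasoning, answer in worked_examples:
        parts.append(f"Q: {q}\nA: {reasoning} The answer is {answer}.")
    parts.append(f"Q: {question}\nA:")  # the model writes its own reasoning here
    return "\n\n".join(parts)

# Invented worked example for illustration.
worked = [(
    "A shelf holds 3 boxes with 4 books each. How many books are there?",
    "Each box has 4 books and there are 3 boxes, so 3 * 4 = 12.",
    "12",
)]
cot_prompt = build_cot_prompt(
    worked, "A tray holds 5 rows of 6 eggs. How many eggs are there?"
)
print(cot_prompt)
```

Because the example answer begins with reasoning, the most likely continuation after the final "A:" is also reasoning followed by an answer.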
Language Models are Few-Shot Learners (GPT-3) — Brown et al., OpenAI 2020 (arXiv:2005.14165). Introduced the term "in-context learning." Showed that a 175B parameter model can match fine-tuned models on many tasks using only examples in the prompt.
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models — Wei et al., Google Brain 2022 (arXiv:2201.11903). Showed that adding reasoning steps to few-shot examples produced dramatic improvements on arithmetic, commonsense, and symbolic reasoning tasks.
The iceberg metaphor
Next-token prediction is the visible tip — a vast structure of capability lies beneath
Think of an iceberg. What is visible above the waterline is the training objective: predict the next token. What is beneath the waterline — invisible but responsible for the entire structure — is the knowledge, reasoning capacity, and world model that the model must build in order to predict well. The objective is the surface. The capability is the depth.
The LLM iceberg
▲ Visible: the training objective
"Predict the next token"
Minimise cross-entropy loss · Self-supervised · No human labels · Runs on any text
▼ Beneath: what the model must learn to predict well
What must be learned to predict each type of text
The world model hypothesis
Some researchers (Sutskever and others) argue that a sufficiently capable language model has implicitly built a "world model": an internal representation of causal structure, physical laws, social dynamics, and factual knowledge. The evidence offered: LLMs can answer counterfactual questions ("what would happen if..."), perform analogical reasoning, and generalise to tasks they were never explicitly trained on.
The stochastic parrot counterargument
Bender et al. (2021) argued that LLMs are "stochastic parrots": sophisticated pattern matchers that reproduce statistical regularities without genuine understanding. On this view, the lack of grounding in perception and action means there is no meaning behind the tokens. The debate is unresolved, and it is one of the most important open questions in AI research.
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? — Bender et al. 2021. Influential critique of scale-maximising LLM development. Argued that fluency ≠ understanding.
Emergent Abilities of Large Language Models — Wei et al. 2022 (arXiv:2206.07682). Counter-evidence: many capabilities that would require understanding appear at scale.
Common misconceptions
Six things people say about LLMs that are wrong — or at least incomplete
These misconceptions are everywhere — in news articles, classroom discussions, and even some technical papers. They come from conflating the training objective with the model's capability, or from misunderstanding what "prediction" means at this scale.
The right mental model
How to think about it
An LLM is not a "next word predictor." It is a system trained, via next-word prediction, to build a compressed representation of human knowledge and reasoning — and then deployed, via SFT and RLHF, to make that representation useful for specific purposes. The prediction objective is the training method. The knowledge representation is the result. The assistant is the application.
The analogy: how humans learn
A child learns language by predicting — not because prediction is the goal
A human child learning to speak is implicitly doing something similar. They hear language and build internal models of what words mean and which words can follow which. This learning pressure builds their conceptual understanding of the world. You would not say a child's "purpose" is to predict the next word in a sentence, even if prediction is roughly the training signal that drove language acquisition. The same applies to LLMs. The training objective is a means to an end. The end is a rich, useful model of how language and the world work together.
Language Models are Unsupervised Multitask Learners (GPT-2) — Radford et al. 2019. First demonstration that prediction at scale produces multi-task generalisation.
Emergent Abilities of Large Language Models — Wei et al. 2022 (arXiv:2206.07682). The canonical study of what abilities appear at what scale.
Sparks of Artificial General Intelligence: Early experiments with GPT-4 — Bubeck et al., Microsoft 2023 (arXiv:2303.12712). Controversial argument that GPT-4 shows "sparks" of AGI — wide-ranging capabilities far beyond its training objective.