Token bytes, positional encoding, the semantic word map, and how scale changes what the vectors mean.
Reading time: 15-20 min · Audio: narration available · Prerequisites: 21 · Source: Track A · Gemini
§ VIDEO
3D walk through embedding space.
Pre-rendered animation from the Sonnet track
§ 1
What this lesson covers.
This module is one of 42 in the curriculum. Below is the canonical interactive lesson — tabs, cards, and diagrams from the source repo, rendered inside the course shell. An audio narration runs alongside it - the sticky player at the top of the page plays the full Module 05 clip.
If you prefer to read first and play with the demos after, the interactive lesson sits below this section. If you'd rather hear it narrated while you scroll, hit play on the sticky audio bar at the top — or just let it autoplay.
§ 2
The lesson itself.
Interactive lesson · ported from Gemini track
Click tabs to navigate · hover cards for details
Every word becomes a number — but some words become many numbers
BPE (Byte-Pair Encoding) starts from individual bytes and repeatedly merges the most frequent adjacent pair into a new token until the vocabulary reaches its target size. Common English words become single tokens. Rare words, technical terms, and non-English text get split into multiple tokens. This directly affects how much each piece of text "costs" the model to process, and it is one reason LLMs sometimes struggle with certain languages and tasks.
Click any token to see its byte breakdown and explanation.
Why byte count matters for LLMs
Cost at inference
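You pay per token, not per character. "neurotransmitter" (16 chars) costs about 4 tokens in GPT-4; the exact split depends on the tokenizer. The gap is wider for other scripts: a Chinese document with far fewer characters than its English translation can cost as many tokens, or more, because each character typically occupies 3 bytes in UTF-8 and may be split into several tokens.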
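A byte-level tokenizer starts from UTF-8 bytes, so the character count of a string can understate how much raw material it has to merge. The sketch below only counts bytes; it does not reproduce any real tokenizer's splits, and the helper name `utf8_byte_count` and the sample strings are illustrative choices rather than anything from the lesson.

```python
def utf8_byte_count(text):
    """Number of bytes a byte-level tokenizer would start from for this text."""
    return len(text.encode("utf-8"))

# Escapes keep the sample strings unambiguous: "na\u00efve" is "naive" with a
# diaeresis on the i, and "\u4f60\u597d" is a two-character Chinese greeting.
samples = ["the", "neurotransmitter", "na\u00efve", "\u4f60\u597d"]
byte_counts = {s: utf8_byte_count(s) for s in samples}

for s, n in byte_counts.items():
    print(f"{s!r}: {len(s)} chars, {n} bytes")
```

ASCII text uses one byte per character, while the two-character Chinese sample already needs six bytes before any merges are applied.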
Why arithmetic is hard
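Some tokenizers split numbers digit-by-digit: "1234" → ["1","2","3","4"] = 4 tokens. Others group digits into chunks of up to three, so the same number may split differently depending on its length. Either way, the model must learn that several separate tokens represent a single number, which is harder than treating "1234" as one unit.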
Vocabulary sizes across models
GPT-2 and GPT-3: 50,257 tokens · GPT-3.5/GPT-4 (cl100k_base): about 100,000 tokens · Llama 3: about 128,000 tokens · Gemma: about 256,000 tokens. Larger vocabularies give more common words and word-pieces single-token treatment: fewer splits and lower cost, but a bigger embedding table.
The BPE algorithm
Start with individual bytes (256 possible). Count all adjacent byte pairs in a large corpus. Merge the most frequent pair into a new token. Repeat until vocabulary is full. Running this on English text produces tokens that roughly correspond to common words and word-pieces.
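The loop described above can be sketched in a few lines. This is a toy trainer under simplifying assumptions: it works on one short byte string instead of a large corpus, has no pre-tokenisation step, and stops early once no pair repeats. The function name `train_bpe` and the sample corpus are made up for illustration.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Learn up to num_merges merges; return (merges, final token sequence)."""
    tokens = [bytes([b]) for b in corpus]  # start from single bytes
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))  # adjacent pairs
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not compress anything
        merges.append((left, right))
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)  # replace the pair with one new token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

corpus = b"low low low lower lowest"
merges, tokens = train_bpe(corpus, num_merges=3)
print(tokens)
```

After three merges the frequent piece " low" (with its leading space) has become a single token, while the rarer suffixes "er" and "est" are still spelled out byte by byte.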
Neural Machine Translation of Rare Words with Subword Units (BPE) — Sennrich et al. 2016 (arXiv:1508.07909). Introduced BPE for NLP. Now the standard tokenisation approach for all major LLMs.
Layer 1 — positional embedding
Transformers are order-blind — positional encoding gives them a sense of position
A transformer's self-attention mechanism has no built-in notion of order: it is permutation-equivariant, so shuffling the input tokens just shuffles the outputs in the same way. Without positional information, each word in "The cat sat" and "sat cat the" would receive exactly the same representation. The classic fix is to add a position-dependent signal to each token's embedding before it enters the transformer; later schemes inject position inside the attention computation instead.
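A small experiment makes the order-blindness concrete. The sketch below assumes a stripped-down single-head attention with identity projections (Q = K = V = the input) and hand-picked 2D toy embeddings; neither is taken from a real model.

```python
import math

def self_attention(X):
    """Single-head attention with identity projections (Q = K = V = X)."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out

# Toy embeddings, no positional signal added.
the, cat, sat = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
out_a = self_attention([the, cat, sat])  # "The cat sat"
out_b = self_attention([sat, cat, the])  # "sat cat the"

# "the" is at index 0 in the first sentence and index 2 in the second.
same_for_the = all(abs(x - y) < 1e-9 for x, y in zip(out_a[0], out_b[2]))
print(same_for_the)
```

Each word ends up with the same output vector in both orderings; only the row it appears in changes, which is why some positional signal has to be supplied separately.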
Why position matters
"The dog bit the man" and "The man bit the dog" contain the same words but have completely different meanings. Without positional encoding, a transformer would compute identical representations for both sentences. Position tells the model which word is the subject, which is the verb, which is the object.
The evolution
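Sinusoidal (fixed formula, defined for any position, including lengths unseen in training) → Learned absolute (trained, hard ceiling at the maximum length seen during training) → Relative (T5, encodes distance between tokens) → ALiBi (linear distance penalty on attention scores, no position embeddings) → RoPE (rotates Q and K vectors, the most widely used scheme in open-weight models since about 2022) → YaRN (rescales RoPE frequencies to extend context far beyond the training length; the original paper reports 128K tokens).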
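The first step in this chain, sinusoidal encoding, is simple enough to write out. The sketch below assumes the commonly used formulation with base 10000 and alternating sine and cosine entries; the function names and the tiny embeddings are illustrative.

```python
import math

def sinusoidal_encoding(position, d_model):
    """Position vector with sin on even indices and cos on odd indices."""
    vec = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))  # lower frequency as i grows
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:d_model]

def add_position(token_embeddings):
    """Add the position vector for index pos to the embedding at index pos."""
    d_model = len(token_embeddings[0])
    return [
        [x + p for x, p in zip(emb, sinusoidal_encoding(pos, d_model))]
        for pos, emb in enumerate(token_embeddings)
    ]

# The same token embedding at two positions no longer looks identical.
encoded = add_position([[1.0, 1.0], [1.0, 1.0]])
print(encoded)
```

Because the formula is fixed instead of learned, it can be evaluated at any position index, which is the sense in which it extends to unseen lengths.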
RoFormer: Enhanced Transformer with Rotary Position Embedding — Su et al. 2021 (arXiv:2104.09864). Introduced RoPE. Used in Llama, Mistral, Falcon, Gemma, and most open-weight frontier models.
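A minimal sketch of the rotation idea, assuming the interleaved pairing of dimensions (x0 with x1, x2 with x3, and so on) and a base of 10000. Production implementations often pair dimensions differently and operate on whole tensors, so treat the function `rope` and the sample vectors as illustrative only.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of dimensions by position-dependent angles."""
    d = len(vec)  # assumed to be even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]  # 2D rotation of the pair (x, y)
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]

# Same relative offset (2 positions apart) at two different absolute positions.
score_near = dot(rope(q, 5), rope(k, 3))
score_shifted = dot(rope(q, 105), rope(k, 103))
print(score_near, score_shifted)
```

The two scores agree up to floating-point error: after rotation, the query-key dot product depends only on the distance between the two positions, not on where the pair sits in the sequence.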
Train Short, Test Long: Attention with Linear Biases (ALiBi) — Press et al. 2022 (arXiv:2108.12409). Adds a linear penalty to attention scores based on distance. No position embeddings at all.
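The linear penalty can be sketched as a bias matrix that is added to the attention scores before the softmax. The slope schedule below is the geometric sequence described for head counts that are powers of two; the function names are illustrative, and negative infinity is used here to mark masked future positions.

```python
def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... for n heads (n a power of two)."""
    return [2 ** (-8 * (h + 1) / num_heads) for h in range(num_heads)]

def alibi_bias(seq_len, slope):
    """Causal bias: -slope * distance for past tokens, -inf for future tokens."""
    neg_inf = float("-inf")
    return [
        [-slope * (i - j) if j <= i else neg_inf for j in range(seq_len)]
        for i in range(seq_len)
    ]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])
for row in bias:
    print(row)
```

Heads with a steep slope attend mostly to nearby tokens, while heads with a shallow slope can still look far back, and nothing in the scheme is tied to a maximum sequence length.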
Layer 1 — the geometry of meaning
Similar words cluster together in high-dimensional embedding space
Each word is represented as a point in a space with hundreds or thousands of dimensions. Words with similar meanings end up near each other. Relationships become directions: the direction from "man" to "woman" is approximately the same as the direction from "king" to "queen". The 2D map below is a simplified projection of these high-dimensional clusters.
Hover over a word to see its semantic cluster
Vector arithmetic
Meaning encodes as direction and distance
The famous equation: king − man + woman ≈ queen. This works because the offset from "man" to "woman" (the gender direction) is roughly consistent across related word pairs, so the nearest word to the result, once the three input words are excluded, is "queen". The same direction works for actor→actress, he→she, his→her. Distance encodes semantic similarity: "cat" and "kitten" are close; "cat" and "democracy" are far apart.
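The arithmetic can be tried on hand-made vectors. The three dimensions below (loosely royalty, gender, and cat-ness) and every number in the table are invented for illustration; real embeddings are learned and have no such labelled axes.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not learned embeddings.
toy = {
    "king":   [0.9,  0.9, 0.0],
    "queen":  [0.9, -0.9, 0.0],
    "man":    [0.1,  0.9, 0.0],
    "woman":  [0.1, -0.9, 0.0],
    "cat":    [0.0,  0.0, 0.9],
    "kitten": [0.05, 0.0, 0.8],
}

def analogy(a, b, c, vectors):
    """Word nearest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

result = analogy("king", "man", "woman", toy)
print(result)
print(cosine(toy["cat"], toy["kitten"]), cosine(toy["cat"], toy["king"]))
```

Excluding the three input words from the candidates matters: without that step the nearest vector to the result is often one of the inputs themselves.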
Context reshapes meaning
The same word, different vectors
Static embeddings (Word2Vec) assign one fixed vector per word, so "bank" always has the same representation. Contextual embeddings (BERT, GPT) produce a different vector for the same word depending on its surroundings: "bank" in "river bank" and "bank" in "savings bank" come out as clearly different vectors, which lets downstream layers tell the two senses apart.
Efficient Estimation of Word Representations in Vector Space (Word2Vec) — Mikolov et al., Google 2013 (arXiv:1301.3781). Showed that embedding geometry encodes semantic relationships. Made "word arithmetic" famous.
GloVe: Global Vectors for Word Representation — Pennington, Socher, Manning, Stanford 2014. Global co-occurrence matrix factorisation. Produced some of the cleanest vector analogies.
From Word2Vec to GPT-4 — 11 years of scaling
How embedding dimensions, parameters, training data, and cost changed at each generation
The transformer revolution was not just architectural — it was a massive scaling of every dimension simultaneously. Parameters grew from millions to trillions, training data from billions of words to trillions of tokens, and embedding dimensions from 100 to 12,288. The bars below show relative scale across four representative models.
×10,000 · parameter growth, Word2Vec → GPT-3
×50,000 · training data growth, 2013 → 2024
128 · per-head dimension (d_head) in Llama 3; this is the width of one attention head, not the full embedding width
Why bigger embeddings help
A 100-dimensional embedding space has at most 100 mutually orthogonal directions. A 12,288-dimensional space (GPT-3) has 12,288, which leaves room for more nuance, more relationships, and more precise disambiguation. Each additional dimension adds capacity to encode finer semantic distinctions.
Diminishing returns
Scaling is not free. Doubling the embedding dimension d roughly quadruples the weights and compute of the Q, K, V, and output projections (each is a d×d matrix, so O(d²) per token), while per-token activations and the Q·K score computation roughly double. The field is now exploring sparse attention, mixture-of-experts, and quantization to keep improving quality without proportional compute costs.
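The scaling argument is easy to check with back-of-the-envelope arithmetic. The sketch below assumes standard multi-head attention in which the combined width of all heads equals the embedding dimension, ignores biases, and counts multiply-adds; the function names are illustrative.

```python
def attention_projection_params(d_model):
    """Weights in the Q, K, V and output projections of one attention layer."""
    return 4 * d_model * d_model  # four d_model x d_model matrices

def score_multiply_adds(seq_len, d_model):
    """Multiply-adds for the query-key scores, summed over all heads."""
    return seq_len * seq_len * d_model  # one d_model-long dot product per token pair

d = 4096
param_ratio = attention_projection_params(2 * d) / attention_projection_params(d)
score_ratio = score_multiply_adds(1024, 2 * d) / score_multiply_adds(1024, d)
print(param_ratio, score_ratio)
```

Doubling the width quadruples the projection weights but only doubles the score computation; the quadratic term in the scores comes from sequence length, not from embedding width.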
§ DEMO
Try it: embedding walk.
Drag a token through a 2D semantic space; vectors, neighbors, and clusters update live.