Embeddings — AI Learning Course

§ 01

1 — Token bytes

Layer 0 — input tokenisation

Every word becomes a number — but some words become many numbers

BPE (Byte-Pair Encoding) repeatedly merges the most frequent adjacent pair of bytes (or of previously merged tokens) until the vocabulary reaches its target size. Common English words become single tokens. Rare words, technical terms, and non-English text get split into multiple tokens. This directly affects how much the model "costs" to process each piece of text, and it is one reason LLMs sometimes struggle with certain languages and tasks.

Why byte count matters for LLMs

Cost at inference

You pay per token, not per character. "neurotransmitter" (16 chars) costs 4 tokens in GPT-4. A Chinese document with a third as many characters as an English one might cost the same number of tokens, or more.

Why arithmetic is hard

In GPT-2/Llama-2-era tokenizers, numbers were often split digit-by-digit: "1234" → ["1","2","3","4"] = 4 tokens, forcing the model to learn that 4 separate tokens represent one number. Many newer tokenizers (cl100k and Llama 3, for example) instead group digits in threes from the left, so "1234" → ["123","4"] = 2 tokens. That shortens the sequence the model has to reason over, although it still has to learn how the chunks line up with place value.
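The two number-splitting behaviours can be imitated with plain regular expressions. This is only a sketch of the digit-grouping rule, not a real tokenizer (real ones also apply learned merges), and both function names are made up for illustration.

```python
import re

def split_digit_by_digit(number: str) -> list[str]:
    """Imitate tokenizers that emit every digit as its own token."""
    return re.findall(r"\d", number)

def split_in_threes(number: str) -> list[str]:
    """Imitate a pre-tokenizer that greedily takes up to three digits, left to right."""
    return re.findall(r"\d{1,3}", number)

print(split_digit_by_digit("1234"))  # ['1', '2', '3', '4']
print(split_in_threes("1234"))       # ['123', '4']
print(split_in_threes("1234567"))    # ['123', '456', '7']
```

Note that the left-to-right grouping puts the leftover digit at the end, so the chunks do not align with thousands separators.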

Vocabulary sizes across models

GPT-2 and GPT-3: 50,257 tokens (the same GPT-2 tokenizer) · GPT-3.5/GPT-4 (cl100k): ~100,000 tokens · Llama 3: 128,256 tokens · GPT-4o-era models (o200k): ~200,000 tokens · Gemma: 256,000 tokens. The trend is clearly toward larger vocabularies: more common words get single-token treatment — fewer splits, lower cost, but a bigger embedding table.
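To put "a bigger embedding table" into numbers: the table has one row per vocabulary entry, so its size is the vocabulary size times the embedding width. The sketch below uses two vocabulary sizes from the list above; the widths (768 for GPT-2 small, 4096 for Llama 3 8B) are not stated in this course and are assumptions added for illustration.

```python
def embedding_table_params(vocab_size: int, d_model: int) -> int:
    """One d_model-wide row per vocabulary entry."""
    return vocab_size * d_model

# Vocabulary sizes come from the list above; the d_model values are
# assumptions (GPT-2 small: 768, Llama 3 8B: 4096).
gpt2_small = embedding_table_params(50_257, 768)
llama3_8b = embedding_table_params(128_256, 4096)

print(f"GPT-2 small: {gpt2_small:,} parameters")  # 38,597,376
print(f"Llama 3 8B:  {llama3_8b:,} parameters")   # 525,336,576
```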

The BPE algorithm

Start with individual bytes (256 possible). Count all adjacent byte pairs in a large corpus. Merge the most frequent pair into a new token. Repeat until vocabulary is full. Running this on English text produces tokens that roughly correspond to common words and word-pieces.
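The four steps above can be sketched in a few lines. This is a minimal illustration, assuming a tiny in-memory corpus and no pre-tokenization; production tokenizers add regex splitting, special tokens, and much faster pair counting. The function name `train_bpe` is hypothetical.

```python
from collections import Counter

def train_bpe(corpus: bytes, num_merges: int):
    """Learn up to num_merges merges; return (merges, final token ids)."""
    tokens = list(corpus)   # start from raw byte values 0..255
    merges = []             # learned pairs, in the order they were merged
    next_id = 256           # first free id after the 256 byte tokens
    for _ in range(num_merges):
        pair_counts = Counter(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break           # nothing repeats, so merging gains nothing
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best_pair:
                merged.append(next_id)   # replace the pair with the new token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        merges.append(best_pair)
        next_id += 1
    return merges, tokens

merges, tokens = train_bpe(b"aaabdaaabac", num_merges=10)
print(merges[0])  # (97, 97): the byte pair "aa" is the most frequent, so it merges first
print(tokens)
```

The loop stops early here because after a few merges no pair occurs more than once; on a real corpus it would keep going until the target vocabulary size is reached.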

§ 02

2 — Positional encoding

Layer 1 — positional embedding

Transformers are order-blind — positional encoding gives them a sense of position

A transformer's self-attention mechanism has no built-in notion of order: it is permutation-equivariant, so shuffling the input tokens only shuffles the outputs in the same way. Without positional information, each word in "The cat sat" and "sat cat the" would receive exactly the same representation in both orderings. Positional encoding solves this by injecting a position-dependent signal, classically by adding it to each token's embedding before it enters the transformer.
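The order-blindness, and the fix, can be checked on a toy example. The sketch below assumes a single attention head with identity projections (Q = K = V = input) and uses the sinusoidal scheme from the original Transformer as the position signal; the 4-dimensional word vectors are invented for illustration.

```python
import math

def sinusoidal_encoding(position: int, d_model: int) -> list[float]:
    """Fixed sin/cos encoding: even indices use sin, odd indices use cos."""
    encoding = []
    for i in range(d_model):
        # Each sin/cos pair shares one frequency; frequencies fall geometrically.
        frequency = 1.0 / (10000 ** ((i - i % 2) / d_model))
        angle = position * frequency
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def add_positions(embeddings: list[list[float]]) -> list[list[float]]:
    """Add the encoding for position p to the embedding at index p."""
    d_model = len(embeddings[0])
    return [[x + pe for x, pe in zip(emb, sinusoidal_encoding(p, d_model))]
            for p, emb in enumerate(embeddings)]

def self_attention(x: list[list[float]]) -> list[list[float]]:
    """Single-head softmax attention with identity projections."""
    d = len(x[0])
    outputs = []
    for query in x:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in x]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([sum(w * value[j] for w, value in zip(weights, x)) / total
                        for j in range(d)])
    return outputs

def same_vector(a: list[float], b: list[float]) -> bool:
    return all(math.isclose(p, q, abs_tol=1e-9) for p, q in zip(a, b))

# Hypothetical embeddings for three words.
the, cat, sat = [1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5], [1.0, 1.0, 0.0, 0.0]
forward, shuffled = [the, cat, sat], [cat, sat, the]

# Without positions, "cat" gets the same output in either order.
print(same_vector(self_attention(forward)[1], self_attention(shuffled)[0]))  # True

# With positions added, the two orderings give "cat" different outputs.
print(same_vector(self_attention(add_positions(forward))[1],
                  self_attention(add_positions(shuffled))[0]))  # False
```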

Why position matters

"The dog bit the man" and "The man bit the dog" contain the same words but have completely different meanings. Without positional encoding, a transformer would compute the same representation for each word in both sentences, so nothing downstream could tell the two apart. Position is what lets the model work out which word is the subject, which is the verb, which is the object.

The evolution

Sinusoidal (hardcoded, generalises to unseen lengths in principle) → Learned absolute (trained, hard ceiling at the maximum length seen during training) → Relative (T5, encodes distance between tokens) → ALiBi (linear distance penalty on attention scores, no position embedding) → RoPE (rotates Q and K vectors, the dominant approach in open LLMs since about 2022) → YaRN (rescales RoPE frequencies to stretch context to 128K tokens and beyond).
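What "rotates Q and K vectors" buys is easiest to see in code. The sketch below assumes the pairing of consecutive dimensions used in the original RoPE formulation and the common base of 10000; the query and key values are invented. After rotation, the attention score depends only on the distance between the two positions, not on where they sit in the sequence.

```python
import math

def rope(vector: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    d = len(vector)
    rotated = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # later pairs rotate more slowly
        x, y = vector[i], vector[i + 1]
        rotated += [x * math.cos(theta) - y * math.sin(theta),
                    x * math.sin(theta) + y * math.cos(theta)]
    return rotated

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors (dimension must be even).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

near = dot(rope(q, 5), rope(k, 3))       # positions 5 and 3, offset 2
far = dot(rope(q, 105), rope(k, 103))    # positions 105 and 103, offset 2
other = dot(rope(q, 5), rope(k, 4))      # offset 1

print(math.isclose(near, far, abs_tol=1e-9))  # True: same offset, same score
print(near, other)                            # different offsets give different scores
```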

§ 03

3 — Semantic word map

Layer 1 — the geometry of meaning

Similar words cluster together in high-dimensional embedding space

Each word is represented as a point in a space with hundreds or thousands of dimensions. Words with similar meanings end up near each other. Relationships become directions: the direction from "man" to "woman" is roughly the same as the direction from "king" to "queen". Any 2D map of this space is a simplified projection of these high-dimensional clusters.

Vector arithmetic

Meaning encodes as direction and distance

The famous equation: king − man + woman ≈ queen. This works because the direction from "man" to "woman" (the gender direction) is roughly consistent across related word pairs. The same direction approximately holds for actor→actress, he→she, his→her. The distance between words encodes semantic similarity: "cat" and "kitten" are close; "cat" and "democracy" are far apart.
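The arithmetic can be tried on a toy vocabulary. The vectors below are built by hand so that the axes mean (royalty, gender, feline, young); real embeddings are learned and their axes are not interpretable like this, so treat it as a sketch of the mechanics only. The `analogy` helper is hypothetical and, like most analogy evaluations, excludes the three input words from the candidates.

```python
import math

# Hand-built, hypothetical vectors: (royalty, gender, feline, young).
vectors = {
    "king":   [1.0,  1.0, 0.0, 0.0],
    "queen":  [1.0, -1.0, 0.0, 0.0],
    "man":    [0.0,  1.0, 0.0, 0.0],
    "woman":  [0.0, -1.0, 0.0, 0.0],
    "cat":    [0.0,  0.0, 1.0, 0.0],
    "kitten": [0.0,  0.0, 1.0, 0.5],
}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1 means same direction, 0 means unrelated."""
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

def analogy(a: str, b: str, c: str) -> str:
    """Word closest to vec(a) - vec(b) + vec(c), excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("king", "man", "woman"))                # queen
print(cosine(vectors["cat"], vectors["kitten"]))      # close to 1
print(cosine(vectors["cat"], vectors["queen"]))       # 0.0
```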

Context reshapes meaning

The same word, different vectors

Static embeddings (Word2Vec) assign one fixed vector per word, so "bank" always has the same representation. Contextual embeddings (BERT, GPT) produce different vectors for the same word depending on context. "river bank" and "savings bank" produce clearly different vectors for "bank", which is what lets the model disambiguate word senses.

§ 04

4 — Scale comparison

From Word2Vec to GPT-4 — 11 years of scaling

How embedding dimensions, parameters, training data, and cost changed at each generation

The transformer revolution was not just architectural: it was a massive scaling of every dimension simultaneously. Parameters grew from millions to trillions, training data from billions of words to trillions of tokens, and embedding dimensions from 100 to 12,288. The figures below give a sense of the relative scale.

~200×: parameter growth, GoogleNews Word2Vec (~900M params) → GPT-3 (175B)

×50,000: training data growth, 2013 → 2024

embedding dims, Llama 3 (d_head per head)

Why bigger embeddings help

The naive view is that d dimensions hold d concepts — one per axis. The reality is stranger and better: in high dimensions, exponentially many directions can be nearly orthogonal to each other, so a model can pack far more features than dimensions into the same space, with each feature stored as a direction that barely interferes with the others. Anthropic's interpretability work calls this superposition — models represent many more concepts than they have dimensions by accepting a little interference between them. Bigger d means more room: less interference, finer distinctions, more precise disambiguation.
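The claim that high-dimensional directions barely interfere can be checked numerically. The sketch below samples random directions (an assumption: learned features are not random, so this only illustrates the geometry) and measures how much they overlap on average. With 50 vectors there are already more directions than dimensions at d = 8, and the overlap shrinks as d grows.

```python
import math
import random

def mean_abs_cosine(d: int, n_vectors: int = 50, seed: int = 0) -> float:
    """Average |cosine| over all pairs of random unit vectors in d dimensions."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(n_vectors):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v])
    total, pairs = 0.0, 0
    for i in range(n_vectors):
        for j in range(i + 1, n_vectors):
            # For unit vectors the dot product is the cosine of the angle.
            total += abs(sum(a * b for a, b in zip(vectors[i], vectors[j])))
            pairs += 1
    return total / pairs

for d in (8, 64, 512):
    print(d, round(mean_abs_cosine(d), 3))
```

The printed overlap falls roughly in proportion to 1/√d, which is the "less interference" that a bigger d provides.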

Diminishing returns

Scaling is not free. Doubling embedding dimensions roughly quadruples the weight-matrix memory and compute (the projection matrices are O(d²)), while activation memory doubles. The field is now exploring sparse attention, mixture-of-experts, and quantization to continue improving quality without proportional compute costs.
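The quadratic-versus-linear growth is simple arithmetic. The sketch below assumes one square d × d projection matrix and one hidden state of shape (sequence length × d); a real layer has several such matrices, but each scales the same way.

```python
def projection_params(d_model: int) -> int:
    """Parameters in one square d_model x d_model projection matrix."""
    return d_model * d_model

def activation_values(d_model: int, seq_len: int) -> int:
    """Values in one hidden state for a sequence of seq_len tokens."""
    return seq_len * d_model

for d in (1024, 2048, 4096):
    # Each doubling of d multiplies the weights by 4 and the activations by 2.
    print(d, projection_params(d), activation_values(d, seq_len=512))
```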

Two things the word "embedding" means today

Weight tying, and the embedding models inside RAG

Two clarifications worth carrying forward. First, weight tying. In many language models (GPT-2 is the classic example) the input embedding table is shared with the output head: the same matrix that turns tokens into vectors is reused (transposed) to turn the final hidden state back into vocabulary logits. One table, two jobs, a large parameter saving, although some models keep two separate matrices. Second, when a RAG pipeline talks about its "embedding model" (the model that vectorises documents for retrieval), that is typically a full transformer producing contextual sentence-level vectors, not Word2Vec. Static word vectors and modern embedding models share a name and a goal, but not an architecture.
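Weight tying in miniature: one table serves as a lookup on the way in and as a scoring matrix on the way out. The 4-token, 3-dimensional table below is invented for illustration, and the transformer layers that would sit between the two steps are left out.

```python
# Hypothetical toy table: 4 tokens, 3 dimensions.
embedding_table = [
    [0.2, -0.1, 0.7],
    [0.9, 0.3, -0.4],
    [-0.5, 0.8, 0.1],
    [0.0, -0.6, 0.5],
]

def embed(token_id: int) -> list[float]:
    """Input side: look up one row of the table."""
    return embedding_table[token_id]

def logits(hidden_state: list[float]) -> list[float]:
    """Output side: score every token by its dot product with the same rows."""
    return [sum(h * w for h, w in zip(hidden_state, row)) for row in embedding_table]

# Pretend the final hidden state equals token 1's own embedding.
scores = logits(embed(1))
predicted = max(range(len(scores)), key=scores.__getitem__)
print(predicted)  # 1: in this toy table, token 1 scores highest against itself
```

Multiplying by every row of the table is the same as multiplying by its transpose, which is why the text describes the matrix as being reused transposed.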

§ 06

What an embedding actually is.