Demo · Modules 06 + 21 · Interactive

Type a sentence. Watch it become tokens.

Stylized BPE tokenizer · ~50K vocab
Live byte-count + token-count
Try non-English text to see fragmentation
Tokens you'd never want to see

Numbers can split digit by digit: under that scheme "1234" becomes ["1","2","3","4"], four tokens of one character each, so the model has to learn arithmetic across four positions instead of one. (Some tokenizers, GPT-4's included, group digits into chunks of up to three instead.) Non-ASCII characters cost more because the BPE merges were learned from mostly-English data. A typical CJK character is 3 bytes in UTF-8 and most emoji are 4, so when few merges cover them they can cost 3-4 tokens per character. Whitespace is part of the token in modern tokenizers: the leading space of " cat" is bundled with the word, which is why " cat" and "cat" have different token IDs.
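The three effects above can be sketched in a few lines of Python. This is an illustration only, not the code behind the demo or any real tokenizer: `utf8_cost`, `split_digits`, and `SPACE_VOCAB` are hypothetical names, and the byte count is the worst case that applies when no merge covers a character.

```python
# Illustrative helpers only; none of these names come from a real tokenizer.

def utf8_cost(ch: str) -> int:
    """UTF-8 byte length of one character: the worst-case token count
    when the tokenizer falls back to one token per byte."""
    return len(ch.encode("utf-8"))


def split_digits(number: str) -> list[str]:
    """Digit-by-digit splitting: one token per digit."""
    return list(number)


# Hypothetical toy vocabulary: the leading space is part of the token text,
# so " cat" and "cat" are separate entries with separate IDs.
SPACE_VOCAB = {"cat": 0, " cat": 1}

# "a", e-acute, the CJK character for "cat", and a grinning-face emoji.
for ch in ["a", "\u00e9", "\u732b", "\U0001F600"]:
    print(repr(ch), utf8_cost(ch))  # byte counts: 1, 2, 3, 4

print(split_digits("1234"))                       # ['1', '2', '3', '4']
print(SPACE_VOCAB["cat"], SPACE_VOCAB[" cat"])    # 0 1
```

A real tokenizer usually does better than the byte count for frequent non-English characters, because some of their byte sequences have learned merges; the byte count is the ceiling, not the typical cost.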

What this demo fakes vs what a real tokenizer does. A real BPE tokenizer applies thousands of learned merge rules to your text in a specific order to produce a sequence of integer token IDs. This demo uses a simplified BPE-flavored heuristic that splits on common English suffixes and prefixes, keeps short common words intact, and falls back to bytes for non-ASCII characters. That is enough to get a feel for how text fragments, but the exact splits will differ from any real tokenizer. For scale, GPT-4's vocabulary has 100,256 tokens, Llama 3's has 128,000, and Gemma's has 256,000. Larger vocabularies mean more common words get single-token treatment, but also a bigger embedding table, since the table stores one vector per token.
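To make "learned merge rules applied in a specific order" concrete, here is a minimal sketch of the merge loop at the heart of BPE encoding. It assumes a tiny hand-written merge list and vocabulary (`TOY_MERGES` and `MERGE_VOCAB` are invented for illustration); a real tokenizer learns its merges from data and also pre-splits text and works on bytes, both of which are skipped here.

```python
def apply_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Apply ordered BPE merge rules to one word, starting from characters.

    Earlier rules in `merges` have higher priority (lower rank).
    """
    rank = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Rank every adjacent pair; pairs without a rule get infinite rank.
        candidates = [
            (rank.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
        ]
        best_rank, i = min(candidates)
        if best_rank == float("inf"):
            break  # no learned merge applies any more
        # Replace the two symbols at positions i and i+1 with their merge.
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Hypothetical merge rules and vocabulary, in priority order.
TOY_MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]
MERGE_VOCAB = {"low": 0, "er": 1}

pieces = apply_merges("lower", TOY_MERGES)
print(pieces)                              # ['low', 'er']
print([MERGE_VOCAB[p] for p in pieces])    # [0, 1]
```

Text that no rule covers stays as single characters, which is the fragmentation the demo shows for unfamiliar input. On the embedding-table side, the cost is simple multiplication: with a hypothetical embedding width of 4,096, a 100,256-token vocabulary needs 100,256 × 4,096 = 410,648,576 parameters for that table alone.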