Tokenization
Tokenization is the first step in preparing text for an LLM: raw text is split into discrete units called tokens, and each token is mapped to an integer ID from a fixed vocabulary. Neural networks can only process numbers, so this conversion is mandatory.
Why Tokenization Matters
- LLMs operate on sequences of token IDs, not raw characters or words
- Vocabulary size is a design parameter: too small → many unknown words (or, with subword tokens, longer sequences); too large → a bigger embedding table with rarely trained entries and a slower output softmax
- The tokenizer must be consistent at training and inference time; a mismatch corrupts outputs
- Token count determines how much of the context window a text consumes and how much compute it costs (more tokens = more attention computations)
Tokenization Pipeline
Raw text
→ Tokenize (split into tokens)
→ Build vocabulary (token → integer ID mapping)
→ Encode (text → list of IDs) [at training and inference]
→ Decode (list of IDs → text) [at inference output]
A tokenizer typically exposes two methods: encode(text) → [int] and decode([int]) → str.
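The pipeline above can be sketched as a toy word-level tokenizer. This is a minimal illustration, not a production design: the class name SimpleTokenizer, the splitting regex, and the punctuation handling in decode are all assumptions made for the example.

```python
import re


class SimpleTokenizer:
    """Toy word-level tokenizer (hypothetical name, for illustration only)."""

    def __init__(self, corpus: str):
        tokens = self.split(corpus)
        # Build vocabulary: sorted unique tokens -> integer IDs
        self.token_to_id = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

    @staticmethod
    def split(text: str) -> list[str]:
        # Assumed rule: runs of word characters and single punctuation marks
        # become tokens; whitespace is discarded.
        return re.findall(r"\w+|[^\w\s]", text)

    def encode(self, text: str) -> list[int]:
        # Raises KeyError for any token that was not in the corpus
        return [self.token_to_id[tok] for tok in self.split(text)]

    def decode(self, ids: list[int]) -> str:
        text = " ".join(self.id_to_token[i] for i in ids)
        # Remove the space that the join placed before punctuation
        return re.sub(r"\s+([^\w\s])", r"\1", text)


tokenizer = SimpleTokenizer("the cat sat on the mat.")
ids = tokenizer.encode("the mat sat.")
print(ids)                    # [5, 2, 4, 0]
print(tokenizer.decode(ids))  # the mat sat.
```

Because whitespace is discarded during splitting, decode only approximates the original text; tokenizers that need an exact round trip keep whitespace inside the tokens.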
Vocabulary Building
A vocabulary is constructed from the training corpus. Simple word-level tokenization creates one entry per unique word; the vocabulary is then sorted and given integer indices. Special tokens are added to handle edge cases.
| Special Token | Purpose |
|---|---|
| <|unk|> | Unknown word not seen during vocabulary construction |
| <|endoftext|> | Separator between independent documents in concatenated training data |
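As a rough sketch of how the two special tokens might be used, the example below builds a word-level vocabulary, appends the special tokens, and encodes several documents into one ID stream. The function names build_vocab and encode_corpus, and the choice to split on whitespace, are assumptions made for illustration.

```python
UNK, EOT = "<|unk|>", "<|endoftext|>"


def build_vocab(documents: list[str]) -> dict[str, int]:
    # One entry per unique word, sorted, then numbered from 0
    words = sorted({w for doc in documents for w in doc.split()})
    vocab = {w: i for i, w in enumerate(words)}
    # Special tokens take the next free IDs
    for special in (UNK, EOT):
        vocab[special] = len(vocab)
    return vocab


def encode_corpus(documents: list[str], vocab: dict[str, int]) -> list[int]:
    ids = []
    for doc in documents:
        # Words missing from the vocabulary fall back to the <|unk|> ID
        ids.extend(vocab.get(w, vocab[UNK]) for w in doc.split())
        # Mark the boundary between independent documents
        ids.append(vocab[EOT])
    return ids


vocab = build_vocab(["a b", "b c"])
print(vocab)                               # {'a': 0, 'b': 1, 'c': 2, '<|unk|>': 3, '<|endoftext|>': 4}
print(encode_corpus(["a z", "c"], vocab))  # [0, 3, 4, 2, 4]
```

Every unseen word collapses to the same ID here, so the original word cannot be recovered when decoding.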
Byte Pair Encoding (BPE)
BPE is the tokenization algorithm used by GPT-2 and GPT-3, and variants of it are used by many modern LLMs. It avoids the unknown-word problem by operating at the subword level.
Algorithm:
- Start with a base vocabulary of individual characters (byte-level BPE, as in GPT-2, starts from the 256 possible byte values instead)
- Count all adjacent token pairs in the training corpus
- Merge the most frequent pair into a new token
- Repeat until the target vocabulary size is reached
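The merge loop can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it works on characters rather than bytes, it lets merges cross whitespace (practical tokenizers usually pre-split text first), and the function name train_bpe and the num_merges parameter are hypothetical.

```python
from collections import Counter


def train_bpe(text: str, num_merges: int):
    """Toy BPE training: returns the final token sequence and the merge list."""
    tokens = list(text)  # start from individual characters
    merges = []
    for _ in range(num_merges):
        # Count all adjacent token pairs
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # nothing repeats, so further merges gain nothing
        merges.append((a, b))
        # Replace every occurrence of the pair with the merged token
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges


tokens, merges = train_bpe("low low lower", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(tokens)  # ['low', ' ', 'low', ' ', 'low', 'e', 'r']
```

Here the stopping rule is a fixed number of merges; stopping at a target vocabulary size is equivalent, since each merge adds exactly one token to the vocabulary.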
Result: Common words appear as single tokens; rare or unknown words are decomposed into known subwords or characters. No <|unk|> token is needed.
Example: The word “Tokenization” might be tokenized as ["Token", "ization"]; a completely unknown word like “flumpf” might become ["f", "l", "um", "pf"].
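To make the "flumpf" case concrete, the sketch below encodes a word by applying learned merges in order. The merge list is invented for the example (it is not taken from any real tokenizer), and apply_merges is a hypothetical helper name.

```python
def apply_merges(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Encode a word by replaying BPE merges in the order they were learned."""
    tokens = list(word)  # fall back to single characters
    for a, b in merges:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens


# Hypothetical merges, assumed to have been learned from some corpus
merges = [("u", "m"), ("p", "f")]
print(apply_merges("flumpf", merges))  # ['f', 'l', 'um', 'pf']
```

Characters that no merge covers simply stay as single-character tokens, which is why no unknown-word token is required.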
GPT-2 vocabulary size: 50,257 tokens (50,000 BPE merges + 256 base bytes + 1 end-of-text token).
Data Sampling with a Sliding Window
Once text is tokenized, training data is created using a sliding window:
- Input: a window of context_length consecutive token IDs
- Target: the same window shifted one position to the right (next-token prediction)
- The window slides by a stride across the entire corpus
This generates the (input, target) pairs that the model trains on via cross-entropy loss.
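A minimal sketch of the sliding window, using plain Python lists. The function name sliding_window_pairs is hypothetical, and real training code would typically wrap this logic in a dataset or data-loader class that returns tensors.

```python
def sliding_window_pairs(token_ids: list[int], context_length: int, stride: int):
    """Return (input, target) pairs for next-token prediction."""
    pairs = []
    # Stop early enough that the target window still fits in the data
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start : start + context_length]
        # Same window shifted one position to the right
        targets = token_ids[start + 1 : start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs


ids = list(range(10))  # stand-in for real token IDs
for inputs, targets in sliding_window_pairs(ids, context_length=4, stride=4):
    print(inputs, "->", targets)
# [0, 1, 2, 3] -> [1, 2, 3, 4]
# [4, 5, 6, 7] -> [5, 6, 7, 8]
```

A stride equal to context_length gives non-overlapping windows; a smaller stride produces more, overlapping training examples from the same corpus.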
Tokenization and LLM Behaviour
- Arithmetic: LLMs struggle with multi-digit arithmetic partly because numbers tokenize unpredictably (e.g., “1234” may be one token or several)
- Languages: Languages with richer morphology (Finnish, Turkish) typically need more tokens per word than English, especially when the vocabulary was learned from mostly English text, increasing effective sequence length
- Code: Code tokenizers often treat identifiers, whitespace, and operators as distinct units
- Token efficiency: Token count directly determines inference cost; prompt engineering often tries to reduce unnecessary tokens