Tokenization

Tokenization is the first step in preparing text for an LLM: raw text is split into discrete units called tokens, and each token is mapped to an integer ID from a fixed vocabulary. Neural networks can only process numbers, so this conversion is mandatory.


Why Tokenization Matters

  • LLMs operate on sequences of token IDs, not raw characters or words
  • Vocabulary size is a design parameter: too small → many unknown words (or, with subword schemes, longer token sequences); too large → rarely seen tokens with poorly trained embeddings and a slower output softmax
  • The tokenizer must be consistent at training and inference time; a mismatch corrupts outputs
  • Token count determines how much of the model's context window a text consumes and drives compute cost (more tokens = more attention computations)

Tokenization Pipeline

Raw text
  → Tokenize (split into tokens)
  → Build vocabulary (token → integer ID mapping)
  → Encode (text → list of IDs)          [at training and inference]
  → Decode (list of IDs → text)           [at inference output]

Every tokenizer exposes two methods: encode(text) → [int] and decode([int]) → str.
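
To make the encode/decode interface concrete, here is a minimal sketch. The ByteTokenizer class is hypothetical and not taken from the source; it assumes the simplest possible scheme, in which each UTF-8 byte is its own token and the byte value serves as the token ID.

```python
class ByteTokenizer:
    """Hypothetical toy tokenizer: every UTF-8 byte is one token (IDs 0-255)."""

    def encode(self, text: str) -> list[int]:
        # Convert the text to UTF-8 bytes; each byte value doubles as a token ID.
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        # Reassemble the bytes and interpret them as UTF-8 text.
        return bytes(ids).decode("utf-8")


tokenizer = ByteTokenizer()
ids = tokenizer.encode("Hi!")
print(ids)                    # [72, 105, 33]
print(tokenizer.decode(ids))  # Hi!
```

The useful property to check in any tokenizer is the round trip: decode(encode(text)) should return the original text.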


Vocabulary Building

A vocabulary is constructed from the training corpus. Simple word-level tokenization creates one entry per unique word; the vocabulary is then sorted and given integer indices. Special tokens are added to handle edge cases.

Special token    Purpose
<|unk|>          Unknown word not seen during vocabulary construction
<|endoftext|>    Separator between independent documents in concatenated training data
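
The vocabulary-building step and the two special tokens can be sketched as follows. This is an illustration under stated assumptions, not the source's implementation: the SimpleWordTokenizer class, its regex splitting rule (words and single punctuation marks), and the placement of the special tokens at the end of the vocabulary are all choices made for this example.

```python
import re

SPECIAL_TOKENS = ["<|endoftext|>", "<|unk|>"]


class SimpleWordTokenizer:
    """Hypothetical word-level tokenizer with <|unk|> and <|endoftext|>."""

    def __init__(self, corpus: str):
        # Sort the unique tokens, then append the special tokens at the end.
        tokens = sorted(set(self._split(corpus)) - set(SPECIAL_TOKENS))
        tokens += SPECIAL_TOKENS
        self.str_to_id = {tok: i for i, tok in enumerate(tokens)}
        self.id_to_str = {i: tok for tok, i in self.str_to_id.items()}

    @staticmethod
    def _split(text: str) -> list[str]:
        # Assumed splitting rule: the end-of-text marker, words, or punctuation.
        return re.findall(r"<\|endoftext\|>|\w+|[^\w\s]", text)

    def encode(self, text: str) -> list[int]:
        # Tokens missing from the vocabulary map to the <|unk|> ID.
        unk_id = self.str_to_id["<|unk|>"]
        return [self.str_to_id.get(tok, unk_id) for tok in self._split(text)]

    def decode(self, ids: list[int]) -> str:
        text = " ".join(self.id_to_str[i] for i in ids)
        # Remove the space that the join inserted before punctuation.
        return re.sub(r"\s+([,.!?;:])", r"\1", text)


tokenizer = SimpleWordTokenizer("the cat sat on the mat.")
ids = tokenizer.encode("the dog sat.")
print(ids)                    # [5, 7, 4, 0]
print(tokenizer.decode(ids))  # the <|unk|> sat.
```

Note that "dog" never appeared in the corpus, so it is lost on the round trip: decoding yields <|unk|> rather than the original word. This information loss is the main weakness of word-level vocabularies.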

Byte Pair Encoding (BPE)

BPE is the tokenization algorithm used by GPT-2 and GPT-3, and variants of it are used by most modern LLMs. It avoids the unknown-word problem by operating at the subword level.

Algorithm:

  1. Start with a character-level vocabulary (all individual characters)
  2. Count all adjacent token pairs in the training corpus
  3. Merge the most frequent pair into a new token
  4. Repeat until the target vocabulary size is reached

Result: Common words appear as single tokens; rare or unknown words are decomposed into known subwords or characters. Because any word can fall back to these base units, no <|unk|> token is needed.

Example: The word “Tokenization” might be tokenized as ["Token", "ization"]; a completely unknown word like “flumpf” might become ["f", "l", "um", "pf"].
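
The four algorithm steps can be sketched in a few lines of Python. This is a simplified illustration, not GPT-2's actual tokenizer: it assumes a whitespace-split corpus and character-level start symbols (GPT-2 works on bytes with a regex pre-tokenizer), it stops after a fixed number of merges rather than at a target vocabulary size, and the function names train_bpe, merge_pair, and bpe_tokenize are hypothetical.

```python
from collections import Counter


def merge_pair(symbols: tuple, pair: tuple) -> tuple:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus: str, num_merges: int) -> list:
    """Learn up to `num_merges` merge rules, most frequent pair first."""
    # Step 1: start from characters; store each distinct word with its count.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Step 3: merge the most frequent pair everywhere in the corpus.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_words = Counter()
        for symbols, freq in words.items():
            new_words[merge_pair(symbols, best)] += freq
        words = new_words  # Step 4: repeat with the updated corpus
    return merges


def bpe_tokenize(word: str, merges: list) -> list:
    """Apply the learned merges, in training order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = "low " * 5 + "lower " * 2 + "newest " * 6 + "widest " * 3
merges = train_bpe(corpus, num_merges=4)
print(bpe_tokenize("lowest", merges))  # ['low', 'est']
print(bpe_tokenize("slow", merges))    # ['s', 'low']
```

Neither "lowest" nor "slow" occurs in the toy corpus, yet both are encoded from learned subwords plus single characters, which is the behaviour the Result paragraph describes.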

GPT-2 vocabulary size: 50,257 tokens (50,000 BPE merges + 256 base bytes + 1 end-of-text token).


Data Sampling with a Sliding Window

Once text is tokenized, training data is created using a sliding window:

  • Input: a window of context_length consecutive token IDs
  • Target: the same window shifted one position to the right (next-token prediction)
  • The window advances by a fixed stride (a number of tokens) across the entire corpus

This generates the (input, target) pairs that the model trains on via cross-entropy loss.
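
A minimal sketch of this sampling scheme, using plain Python lists rather than any particular deep learning framework. The function name sliding_window_pairs and the example token IDs are made up for illustration.

```python
def sliding_window_pairs(token_ids, context_length, stride):
    """Return (input, target) pairs; each target is its input shifted by one."""
    pairs = []
    # Stop early enough that the shifted target window still fits in the data.
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start : start + context_length]
        targets = token_ids[start + 1 : start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs


token_ids = [10, 11, 12, 13, 14, 15, 16]
for inputs, targets in sliding_window_pairs(token_ids, context_length=4, stride=2):
    print(inputs, "->", targets)
# [10, 11, 12, 13] -> [11, 12, 13, 14]
# [12, 13, 14, 15] -> [13, 14, 15, 16]
```

The stride controls overlap between consecutive windows: a stride of 1 produces the most pairs with heavy overlap, while a stride equal to context_length produces non-overlapping windows.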


Tokenization and LLM Behaviour

  • Arithmetic: LLMs struggle with multi-digit arithmetic partly because numbers tokenize unpredictably (e.g., “1234” may be one token or several)
  • Languages: Languages with richer morphology (Finnish, Turkish) use far more tokens per word than English, increasing effective sequence length
  • Code: Code tokenizers often treat identifiers, whitespace, and operators as distinct units
  • Token efficiency: Token count directly determines inference cost; prompt engineering often tries to reduce unnecessary tokens

Source: raschka-2024-build-llm-from-scratch