Tokenization

Tokenization is the first step in preparing text for an LLM: raw text is split into discrete units called tokens, and each token is mapped to an integer ID from a fixed vocabulary. Neural networks can only process numbers, so this conversion is mandatory.

Why Tokenization Matters

LLMs operate on sequences of token IDs, not raw characters or words
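Vocabulary size is a design parameter: too small → many unknown words (or, with subword tokenizers, longer token sequences); too large → a bigger embedding matrix with rarely trained entries and a slower output softmax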
The tokenizer must be consistent at training and inference time; a mismatch corrupts outputs
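Token count determines how much of the context window a text consumes and what it costs to process (self-attention compares every token with every other token, so compute grows roughly quadratically with sequence length)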

Tokenization Pipeline

Raw text
  → Tokenize (split into tokens)
  → Build vocabulary (token → integer ID mapping)
  → Encode (text → list of IDs)          [at training and inference]
  → Decode (list of IDs → text)           [at inference output]

A tokenizer typically exposes two methods: encode(text) → [int] and decode([int]) → str.
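As a minimal sketch of this interface, the word-level tokenizer below splits text on whitespace and punctuation, builds a sorted vocabulary, and round-trips a sentence. The class name SimpleTokenizer, the splitting pattern, and the sample sentences are illustrative assumptions, not the API of any particular library.

```python
import re

# Assumed splitting rule: punctuation becomes its own token, whitespace is dropped.
SPLIT_PATTERN = r'([,.:;?_!"()\']|--|\s)'

def split_text(text):
    """Split raw text into word and punctuation tokens."""
    pieces = re.split(SPLIT_PATTERN, text)
    return [p.strip() for p in pieces if p.strip()]

class SimpleTokenizer:
    def __init__(self, corpus):
        # Vocabulary: every unique token, sorted, mapped to an integer ID.
        tokens = sorted(set(split_text(corpus)))
        self.str_to_int = {tok: i for i, tok in enumerate(tokens)}
        self.int_to_str = {i: tok for tok, i in self.str_to_int.items()}

    def encode(self, text):
        # Raises KeyError for a token that is not in the vocabulary.
        return [self.str_to_int[tok] for tok in split_text(text)]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # Remove the space that the join inserted before punctuation.
        return re.sub(r'\s+([,.:;?!"()\'])', r"\1", text)

tokenizer = SimpleTokenizer("The cat sat on the mat. The dog sat, too.")
ids = tokenizer.encode("The dog sat on the mat.")
print(ids)
print(tokenizer.decode(ids))
```

Encoding a word that was not in the corpus raises a KeyError here, which is the gap that special tokens and subword methods are meant to close.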

Vocabulary Building

A vocabulary is constructed from the training corpus. Simple word-level tokenization creates one entry per unique word; the vocabulary is then sorted and given integer indices. Special tokens are added to handle edge cases.

Special Token	Purpose
<|unk|>	Unknown word not seen during vocabulary construction
<|endoftext|>	Separator between independent documents in concatenated training data
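The sketch below shows one way these two special tokens could be wired into a word-level vocabulary. The function names build_vocab, encode, and decode, the whitespace-only splitting, and the sample documents are assumptions made for illustration.

```python
UNK = "<|unk|>"
EOT = "<|endoftext|>"

def build_vocab(documents):
    """Map each unique whitespace-separated word to an ID, then append the special tokens."""
    words = sorted({w for doc in documents for w in doc.split()})
    return {tok: i for i, tok in enumerate(words + [EOT, UNK])}

def encode(documents, vocab):
    """Encode documents into one ID stream, with <|endoftext|> between documents."""
    ids = []
    for n, doc in enumerate(documents):
        if n > 0:
            ids.append(vocab[EOT])
        # Words missing from the vocabulary fall back to <|unk|>.
        ids.extend(vocab.get(w, vocab[UNK]) for w in doc.split())
    return ids

def decode(ids, vocab):
    inverse = {i: tok for tok, i in vocab.items()}
    return " ".join(inverse[i] for i in ids)

vocab = build_vocab(["the cat sat", "the dog ran"])
ids = encode(["the cat ran", "a dog sat"], vocab)
print(ids)                 # [4, 0, 2, 5, 6, 1, 3]
print(decode(ids, vocab))  # the cat ran <|endoftext|> <|unk|> dog sat
```

The unseen word "a" decodes to <|unk|>, so the original text cannot be recovered. That information loss is the main drawback of the unknown-token approach.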

Byte Pair Encoding (BPE)

BPE is the tokenization algorithm used by GPT-2 and GPT-3 (and, in some variant, by most modern LLMs). It avoids the unknown-word problem by operating at the subword level.

Algorithm:

Start with a character-level vocabulary (all individual characters; GPT-2 starts from the 256 possible byte values instead)
Count all adjacent token pairs in the training corpus
Merge the most frequent pair into a new token
Repeat the count-and-merge steps until the target vocabulary size is reached
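The four steps above can be sketched in a few lines of Python. This is a simplified, character-level illustration rather than GPT-2's byte-level implementation: it assumes whitespace pre-splitting (merges never cross word boundaries), breaks ties by taking the first pair seen, and uses the hypothetical names count_pairs, merge_pair, and train_bpe.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    # Step 1: each word starts as a tuple of single characters.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)          # Step 2: count adjacent pairs
        if not pairs:
            break                           # nothing left to merge
        best = max(pairs, key=pairs.get)    # most frequent pair (first seen wins ties)
        words = merge_pair(words, best)     # Step 3: merge it into a new token
        merges.append(best)                 # Step 4: repeat
    return merges, words

merges, words = train_bpe("low low low lower lowest newest newest", num_merges=3)
print(merges)
print(words)
```

Here num_merges stands in for the target vocabulary size: each merge adds exactly one token to the vocabulary.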

Result: Common words appear as single tokens; rare or unknown words are decomposed into known subwords or characters. No <|unk|> token is needed.

Example: The word “Tokenization” might be tokenized as ["Token", "ization"]; a completely unknown word like “flumpf” might become ["f", "l", "um", "pf"].
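To make the decomposition concrete, the sketch below replays an ordered merge list on new words. The merge list is hypothetical, chosen so that it reproduces the “flumpf” split above; a real tokenizer would have tens of thousands of learned merges.

```python
def apply_merges(word, merges):
    """Split `word` into characters, then replay the learned merges in order."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical merges, listed in the order they would have been learned.
merges = [("u", "m"), ("p", "f"), ("t", "o"), ("to", "k"), ("e", "n"), ("tok", "en")]

print(apply_merges("flumpf", merges))  # ['f', 'l', 'um', 'pf']
print(apply_merges("token", merges))   # ['token']
```

Because every single character is already in the vocabulary, any input can be encoded, and the worst case is one token per character.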

GPT-2 vocabulary size: 50,257 tokens (50,000 BPE merges + 256 base bytes + 1 end-of-text token).

Data Sampling with a Sliding Window

Once text is tokenized, training data is created using a sliding window:

Input: a window of context_length consecutive token IDs
Target: the same window shifted one position to the right (next-token prediction)
The window advances by a fixed stride (a number of tokens) across the entire corpus; a stride smaller than context_length produces overlapping windows

This generates the (input, target) pairs that the model trains on via cross-entropy loss.
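A minimal sketch of this sampling scheme in plain Python follows. The function name sliding_window_pairs and the stand-in token IDs are assumptions; a real training pipeline would usually wrap the same logic in a framework dataset class and return tensors.

```python
def sliding_window_pairs(token_ids, context_length, stride):
    """Return (input, target) windows; each target is its input shifted right by one token."""
    pairs = []
    # Stop early enough that the target window still fits inside token_ids.
    for start in range(0, len(token_ids) - context_length, stride):
        inputs = token_ids[start : start + context_length]
        targets = token_ids[start + 1 : start + context_length + 1]
        pairs.append((inputs, targets))
    return pairs

token_ids = list(range(10, 20))  # stand-in for 10 encoded tokens
for inputs, targets in sliding_window_pairs(token_ids, context_length=4, stride=4):
    print(inputs, "->", targets)
# [10, 11, 12, 13] -> [11, 12, 13, 14]
# [14, 15, 16, 17] -> [15, 16, 17, 18]
```

With stride equal to context_length, the windows do not overlap. A stride of 1 yields the maximum number of training pairs, at the cost of heavy overlap between them.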

Tokenization and LLM Behaviour

Arithmetic: LLMs struggle with multi-digit arithmetic partly because numbers tokenize unpredictably (e.g., “1234” may be one token or several)
Languages: Languages with richer morphology (Finnish, Turkish) typically need more tokens per word than English, especially when the vocabulary was learned from mostly English text, which increases effective sequence length
Code: Code tokenizers often treat identifiers, whitespace, and operators as distinct units
Token efficiency: Token count directly determines inference cost; prompt engineering often tries to reduce unnecessary tokens

Source: raschka-2024-build-llm-from-scratch
