Word Embeddings
Word embeddings (or token embeddings) are dense, continuous vector representations of discrete tokens. They are the mechanism by which an LLM converts integer token IDs into numerical vectors that neural network layers can process.
Token Embeddings
A token embedding layer is a learnable lookup table of shape [vocab_size, embed_dim]. Each row is the embedding vector for one token in the vocabulary.
- At initialisation: random vectors
- After training: semantically meaningful — similar words have similar vectors; directions encode relationships (e.g., king − man + woman ≈ queen)
- GPT-2 (124M): vocab size 50,257, embed dim 768
Operation: given a token ID, return the corresponding row vector. In code: embedding_layer[token_id].
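The lookup can be sketched without any framework. This is a minimal illustration, not how a real model stores its weights: the toy sizes, the `embed` helper, and the token IDs are all assumptions made for the example.

```python
import random

# Toy sizes for illustration only (GPT-2 124M uses 50,257 x 768).
vocab_size, embed_dim = 10, 4

rng = random.Random(0)
# One row per token ID, initialised with small random values.
embedding_table = [
    [rng.gauss(0.0, 0.02) for _ in range(embed_dim)]
    for _ in range(vocab_size)
]

def embed(token_ids):
    """Return the embedding row for each token ID (a plain table lookup)."""
    return [embedding_table[t] for t in token_ids]

token_ids = [3, 7, 3]  # hypothetical token IDs
vectors = embed(token_ids)
```

The same token ID always maps to the same row, so `vectors[0]` and `vectors[2]` are identical here. Nothing in the lookup depends on where the token sits in the sequence.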
Positional Embeddings
Self-attention has no built-in sense of order: without positional information, permuting the input tokens simply permutes the outputs (strictly, the operation is permutation-equivariant). Positional embeddings inject sequence position information into the input.
GPT-2 uses absolute positional embeddings:
- A second learnable lookup table of shape [context_length, embed_dim]
- Each position (0, 1, 2, …) gets its own learned vector
- Learned jointly with the rest of the model
Alternative approaches (not in GPT-2 but common in modern models):
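- Sinusoidal positional encoding: fixed (not learned) sine/cosine functions of the position; used in the original transformer (Vaswani et al., 2017)
- Rotary Position Embeddings (RoPE): rotates query and key vectors by a position-dependent angle inside the attention computation, so attention scores depend on relative position; often reported to generalise better to longer sequences; used in Llama, Mistral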
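Both alternatives can be sketched in a few lines of plain Python. The function names, toy vectors, and `base=10000.0` default are assumptions made for illustration. The RoPE sketch rotates adjacent pairs of dimensions, which is one common convention; some implementations pair dimensions differently.

```python
import math

def sinusoidal_encoding(context_length, embed_dim):
    """Fixed sine/cosine table of shape [context_length, embed_dim]."""
    table = []
    for pos in range(context_length):
        row = []
        for j in range(embed_dim):
            # Dimensions 2i and 2i+1 share one frequency: 10000^(-2i/embed_dim).
            angle = pos / (10000 ** ((j - j % 2) / embed_dim))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = pos / (base ** (i / dim))
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

pe = sinusoidal_encoding(context_length=8, embed_dim=4)

# Toy query/key vectors; both pairs of positions are 2 apart.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_b = dot(rope_rotate(q, 12), rope_rotate(k, 10))
```

`score_a` and `score_b` agree up to floating-point error because the rotated dot product depends only on the offset between the two positions, not on the positions themselves. The sinusoidal table, by contrast, is added to the input in the same way as a learned positional table.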
Final Input Representation
input_embedding = token_embedding[token_id] + positional_embedding[position]
The two vectors are added element-wise. The resulting vector carries both semantic content (what the token means) and positional context (where it sits in the sequence). This combined embedding is the input to the first transformer block.
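A minimal sketch of this sum, assuming toy table sizes and randomly initialised (untrained) tables. The `random_table` and `input_embeddings` helpers are hypothetical names introduced for the example.

```python
import random

vocab_size, context_length, embed_dim = 10, 6, 4  # toy sizes, not GPT-2's
rng = random.Random(0)

def random_table(rows, cols):
    """Stand-in for a learnable table: small random values, no training."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

token_embedding = random_table(vocab_size, embed_dim)
positional_embedding = random_table(context_length, embed_dim)

def input_embeddings(token_ids):
    """Add each token's vector to the vector for its position, element-wise."""
    return [
        [t + p for t, p in zip(token_embedding[tok], positional_embedding[pos])]
        for pos, tok in enumerate(token_ids)
    ]

token_ids = [3, 7, 3]  # hypothetical token IDs
x = input_embeddings(token_ids)
```

Token 3 appears at positions 0 and 2, yet `x[0]` and `x[2]` differ because a different positional vector was added to each. The output keeps the shape [sequence_length, embed_dim].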
Embedding Dimension as a Design Parameter
Larger embedding dimensions give the model more representational capacity but increase compute and memory:
| Model | Embed dim |
|---|---|
| GPT-2 (124M) | 768 |
| GPT-2 (1.5B) | 1,600 |
| GPT-3 (175B) | 12,288 |
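To make the memory cost concrete, the size of the token embedding table can be computed directly from its shape. This sketch assumes all three models share the 50,257-token vocabulary quoted above for GPT-2 (124M). It counts only the token embedding table, not positional embeddings or the transformer blocks.

```python
# Assumption: the same vocabulary size for every model in the table.
VOCAB_SIZE = 50_257

embed_dims = {
    "GPT-2 (124M)": 768,
    "GPT-2 (1.5B)": 1_600,
    "GPT-3 (175B)": 12_288,
}

def token_embedding_params(vocab_size, embed_dim):
    """Number of weights in a [vocab_size, embed_dim] lookup table."""
    return vocab_size * embed_dim

params = {
    name: token_embedding_params(VOCAB_SIZE, dim)
    for name, dim in embed_dims.items()
}

for name, count in params.items():
    print(f"{name}: {count:,} token-embedding parameters")
```

The count grows linearly with the embedding dimension: for GPT-2 (124M) it comes to 38,597,376 weights for the token table alone.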