Word Embeddings

Word embeddings (or token embeddings) are dense, continuous vector representations of discrete tokens. They are the mechanism by which an LLM converts integer token IDs into numerical vectors that neural network layers can process.

Token Embeddings

A token embedding layer is a learnable lookup table of shape [vocab_size, embed_dim]. Each row is the embedding vector for one token in the vocabulary.

At initialisation: random vectors
After training: semantically meaningful. Tokens used in similar contexts tend to have nearby vectors, and some directions encode relationships (e.g., king − man + woman ≈ queen)
GPT-2 (124M): vocab size 50,257, embed dim 768

Operation: given a token ID, return the corresponding row vector. In code: embedding_layer[token_id].
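A minimal sketch of the lookup, written in plain Python so it runs without dependencies. The helper names (make_embedding_table, embed), the toy sizes, and the initialisation scale of 0.02 are assumptions for illustration; in practice a framework layer such as torch.nn.Embedding would play this role.

```python
import random

def make_embedding_table(vocab_size, embed_dim, seed=0):
    """Return a vocab_size x embed_dim table of small random values."""
    rng = random.Random(seed)
    # Standard deviation 0.02 is an assumed initialisation scale.
    return [[rng.gauss(0.0, 0.02) for _ in range(embed_dim)]
            for _ in range(vocab_size)]

def embed(table, token_ids):
    """Look up one row of the table per token ID."""
    return [table[t] for t in token_ids]

# Toy sizes (hypothetical); GPT-2 (124M) would use 50,257 and 768.
table = make_embedding_table(vocab_size=10, embed_dim=4)
vectors = embed(table, [3, 7, 3])
print(len(vectors), len(vectors[0]))  # 3 4
```

Because the operation is a pure row lookup, the same token ID always maps to the same vector, wherever it appears in the sequence.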

Positional Embeddings

Self-attention on its own has no built-in sense of order: it is permutation-equivariant, so reordering the input tokens simply reorders the outputs without changing them. Positional embeddings inject sequence position information into the input.
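To make the lack of built-in order concrete, here is a small sketch of single-head self-attention without any positional information. As a simplifying assumption it uses identity query/key/value projections; the function name and toy vectors are hypothetical. Shuffling the input only shuffles the outputs in the same way, so nothing in the computation depends on where a token sits.

```python
import math

def self_attention(xs):
    """Single-head self-attention with identity Q/K/V projections (a simplification)."""
    d = len(xs[0])
    out = []
    for q in xs:
        # Scaled dot-product scores of this query against every key.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, xs)) for i in range(d)])
    return out

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
shuffled = [tokens[i] for i in perm]

out = self_attention(tokens)
out_shuffled = self_attention(shuffled)

# Row j of the shuffled output should match row perm[j] of the original output.
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row, i in zip(out_shuffled, perm)
    for a, b in zip(row, out[i])
)
print(same)  # True
```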

GPT-2 uses absolute positional embeddings:

A second learnable lookup table of shape [context_length, embed_dim]
Each position (0, 1, 2, …) gets its own learned vector
Learned jointly with the rest of the model

Alternative approaches (not in GPT-2 but common in modern models):

Sinusoidal positional encoding — fixed (not learned) sine/cosine functions; original transformer (Vaswani et al., 2017)
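Rotary Position Embeddings (RoPE) — rotates query and key vectors by position-dependent angles inside the attention computation, so attention scores depend on relative position; often reported to generalise better to longer sequences; used in Llama, Mistral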
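As an illustration of the fixed (not learned) alternative, the sinusoidal encoding can be sketched as follows. It follows the formula from Vaswani et al. (2017), with sine on even dimensions and cosine on odd dimensions; the function name and toy sizes are assumptions.

```python
import math

def sinusoidal_encoding(context_length, embed_dim):
    """Fixed position table: sin on even dimensions, cos on odd dimensions."""
    table = []
    for pos in range(context_length):
        row = []
        for i in range(embed_dim):
            # Each (even, odd) pair of dimensions shares one frequency.
            angle = pos / (10000 ** ((i - i % 2) / embed_dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_encoding(context_length=8, embed_dim=4)
print(pe[0])  # [0.0, 1.0, 0.0, 1.0]
```

Unlike GPT-2's learned table, this one has no trainable parameters and can be computed for any position on demand.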

Final Input Representation

input_embedding = token_embedding[token_id] + positional_embedding[position]

The two vectors are added element-wise. The resulting vector carries both semantic content (what the token means) and positional context (where it sits in the sequence). This combined embedding is the input to the first transformer block.
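The addition can be sketched in plain Python as below. The helper names, toy table sizes, and random initialisation are assumptions for illustration; a real model would learn both tables during training.

```python
import random

def random_table(rows, cols, rng):
    """Build a rows x cols table of small random values (stand-in for learned weights)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

def input_embeddings(token_ids, tok_table, pos_table):
    """Add the token vector and the position vector element-wise for each position."""
    if len(token_ids) > len(pos_table):
        raise ValueError("sequence longer than context_length")
    return [
        [t + p for t, p in zip(tok_table[tok], pos_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

rng = random.Random(0)
tok_table = random_table(rows=10, cols=4, rng=rng)  # [vocab_size, embed_dim]
pos_table = random_table(rows=6, cols=4, rng=rng)   # [context_length, embed_dim]

x = input_embeddings([3, 7, 3], tok_table, pos_table)
# The same token at positions 0 and 2 now has different input vectors.
print(x[0] != x[2])  # True
```

Both tables must share the same embed_dim for the element-wise sum to be defined, which is why the two lookup tables have the same number of columns.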

Embedding Dimension as a Design Parameter

Larger embedding dimensions give the model more representational capacity but increase compute and memory, since both lookup tables (and the width of every later layer) scale with embed_dim:

Model	Embed dim
GPT-2 (124M)	768
GPT-2 (1.5B)	1,600
GPT-3 (175B)	12,288
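A quick way to see the memory cost is to count the parameters in the two lookup tables. The sketch below uses the GPT-2 (124M) sizes given earlier; the context length of 1,024 is GPT-2's published value rather than something stated in this note, and the function name is hypothetical.

```python
def embedding_params(vocab_size, context_length, embed_dim):
    """Return the parameter counts of the token table and the positional table."""
    token_params = vocab_size * embed_dim
    position_params = context_length * embed_dim
    return token_params, position_params

tok, pos = embedding_params(vocab_size=50_257, context_length=1_024, embed_dim=768)
print(tok, pos)  # 38597376 786432
```

The token table dominates: roughly 38.6 million parameters against under one million for positions, and both grow linearly as embed_dim increases.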

Source: raschka-2024-build-llm-from-scratch
