GPT Architecture

GPT (Generative Pre-trained Transformer) is the decoder-only transformer architecture introduced by OpenAI (Radford et al., 2018) and scaled up through GPT-2, GPT-3, and the models behind ChatGPT. As of 2026, most frontier LLMs (Llama, Mistral, Qwen, DeepSeek) are variations of this architecture. It is the architecture sebastian-raschka implements from scratch in raschka-2024-build-llm-from-scratch.


GPT vs. Original Transformer

The original “Attention Is All You Need” transformer (2017) has two components:

  • Encoder — bidirectional attention; reads and contextualises the input sequence
  • Decoder — causal (masked) attention; generates output sequence token by token

GPT uses only the decoder. This simplification suits autoregressive text generation: the model produces one token at a time, conditioning on all previously generated tokens.


Full Architecture

Input token IDs
  ↓
Token Embedding (vocab_size × d_model)
  +
Positional Embedding (context_length × d_model)
  ↓
[ Transformer Block ] × N
  ↓
Final Layer Normalisation
  ↓
Linear head (d_model → vocab_size)
  ↓
Logits → Softmax → Token probabilities
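The first step of the diagram, adding a token embedding and a positional embedding for each input position, can be sketched in plain Python. This is a minimal illustration with nested lists standing in for tensors; the function name `embed` and the tiny tables are assumptions made up for the example, not part of any library.

```python
def embed(token_ids, tok_emb, pos_emb):
    """Return one d_model-sized vector per input position.

    tok_emb has shape [vocab_size][d_model]; pos_emb has shape
    [context_length][d_model]. Both are plain nested lists here.
    """
    return [
        [t + p for t, p in zip(tok_emb[token_id], pos_emb[position])]
        for position, token_id in enumerate(token_ids)
    ]


# Toy tables: vocab_size = 3, context_length = 2, d_model = 2 (hypothetical values).
tok_emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
pos_emb = [[0.5, 0.5], [0.25, 0.25]]

embedded = embed([2, 0], tok_emb, pos_emb)
```

The same token ID produces a different vector at each position, which is how the model distinguishes word order before any attention is applied.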

Transformer Block

Each transformer block contains two sub-layers, each wrapped in a residual connection:

Sub-layer 1: Multi-Head Causal Self-Attention

  1. Pre-LayerNorm — normalise input before attention (GPT-2 and later use pre-norm; the original GPT used post-norm)
  2. Multi-Head Attention — causal mask prevents attending to future tokens; see attention-mechanism
  3. Dropout — applied to attention output
  4. Residual add — add original input (skip connection)
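The causal mask in step 2 can be illustrated on a small matrix of raw attention scores. The sketch below is a simplification: it assumes a single head, takes the scores as given, and uses plain lists. The function name `causal_attention_weights` is hypothetical.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores with future positions masked out.

    scores[i][j] is the raw score of query position i for key position j.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend only to positions 0..i; later ones get -inf.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        row_max = max(masked)
        exps = [math.exp(s - row_max) for s in masked]  # exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# With identical scores, each row spreads its weight evenly over the past.
weights = causal_attention_weights([[0.0] * 3 for _ in range(3)])
```

Masking with negative infinity before the softmax, rather than zeroing weights afterwards, keeps every row summing to 1 without a renormalisation step.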

Sub-layer 2: Feed-Forward Network (FFN)

  1. Pre-LayerNorm — normalise before FFN
  2. Linear (d_model → 4×d_model) — expand dimension
  3. GELU activation — smooth non-linearity; better gradient properties than ReLU
  4. Linear (4×d_model → d_model) — contract back
  5. Dropout
  6. Residual add
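The wiring of the two sub-layers can be summarised in a few lines. This sketch shows only the pre-norm and residual structure: the normalisation, attention, and FFN functions are passed in as placeholders, dropout is left out (as it would be at inference time), and a flat list of floats stands in for the activations. All names are assumptions for illustration.

```python
def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def transformer_block(x, norm1, attention, norm2, ffn):
    """Pre-norm wiring of one block: normalise, transform, add the input back."""
    x = add(x, attention(norm1(x)))  # sub-layer 1: causal self-attention
    x = add(x, ffn(norm2(x)))        # sub-layer 2: feed-forward network
    return x


# Placeholder sub-layers, chosen only to make the data flow visible.
identity = lambda v: v
zeros = lambda v: [0.0 for _ in v]
double = lambda v: [2.0 * t for t in v]

out = transformer_block([1.0, -2.0, 3.0], identity, zeros, identity, double)
```

With a sub-layer that outputs zeros, the block reduces to the identity, which is one way to see why a freshly initialised deep stack still passes its input through.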

Why Residual Connections?

In deep networks (12–96 layers), gradients can vanish before reaching early layers during backpropagation. Residual (skip) connections provide gradient highways: gradients flow directly through the identity path, keeping training stable.
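The "gradient highway" can be made explicit with a short derivation. Here F denotes a sub-layer (including its normalisation), y the output of the residual connection, and L the training loss; these symbols are introduced for this note and do not come from the source.

```latex
y = x + F(x)
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}
    \left( I + \frac{\partial F}{\partial x} \right)
  = \frac{\partial \mathcal{L}}{\partial y}
  + \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}
```

The first term reaches x unchanged however small the sub-layer's own Jacobian becomes, so stacking many blocks does not multiply the gradient down to zero.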

Why Pre-LayerNorm?

Layer normalisation stabilises training by ensuring each layer’s inputs have consistent mean and variance. Pre-norm (applied before each sub-layer) is more stable than post-norm (applied after), especially in deep models. Modern LLMs often use RMSNorm (a simplified LayerNorm) for efficiency.
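The difference between LayerNorm and RMSNorm is easiest to see side by side. The sketch below normalises a single vector and omits the learnable scale (and, for LayerNorm, shift) parameters that real implementations apply afterwards; the `eps` default is an assumed typical value.

```python
import math


def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-5):
    """Divide by the root mean square; no mean subtraction."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


x = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(x)
rn = rms_norm(x)
```

RMSNorm skips the mean computation, which is where its efficiency gain comes from; its output keeps the sign pattern of the input, while LayerNorm's output is centred on zero.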

GELU Activation

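The Gaussian Error Linear Unit, GELU(x) = x·Φ(x) with Φ the standard normal CDF, is a smooth alternative to ReLU that lets small negative values through instead of cutting them to zero. Its gradient is therefore non-zero for moderately negative inputs, unlike ReLU's hard cutoff. It is standard in GPT-2 and later; modern LLMs often use SwiGLU (a gated FFN variant built on the Swish/SiLU activation).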
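A small numerical comparison makes the contrast with ReLU concrete. The sketch gives the exact GELU definition and the tanh approximation used in GPT-2's original code; the function names are chosen for this example.

```python
import math


def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def relu(x):
    return max(0.0, x)


for value in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={value:+.1f}  relu={relu(value):.4f}  "
          f"gelu={gelu(value):.4f}  gelu_tanh={gelu_tanh(value):.4f}")
```

For negative inputs ReLU returns exactly zero, while GELU returns a small negative value, so some signal and gradient still pass.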


GPT Configuration Sizes

Model         Parameters  Layers (N)  Heads  d_model  Context
GPT-2 small   124M        12          12     768      1,024
GPT-2 medium  355M        24          16     1,024    1,024
GPT-2 large   774M        36          20     1,280    1,024
GPT-2 XL      1,558M      48          25     1,600    1,024
GPT-3         175B        96          96     12,288   2,048

In Raschka’s implementation, all GPT-2 sizes share the same Python class; only the configuration dict differs.
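The idea of one class with different configuration dicts can be illustrated by deriving parameter counts from the configuration alone. This is a back-of-the-envelope sketch, not Raschka's code. The dict keys, the name `count_parameters`, and three modelling choices are assumptions: GPT-2's vocabulary of 50,257 tokens, bias terms on every linear layer, and an output head that shares its weights with the token embedding.

```python
# Hypothetical configuration dicts for the four GPT-2 sizes.
GPT2_CONFIGS = {
    "gpt2-small": {"n_layers": 12, "n_heads": 12, "d_model": 768},
    "gpt2-medium": {"n_layers": 24, "n_heads": 16, "d_model": 1024},
    "gpt2-large": {"n_layers": 36, "n_heads": 20, "d_model": 1280},
    "gpt2-xl": {"n_layers": 48, "n_heads": 25, "d_model": 1600},
}


def count_parameters(cfg, vocab_size=50257, context_length=1024):
    """Parameter count of a GPT-2 style model, output head tied to the embedding."""
    d = cfg["d_model"]
    embeddings = vocab_size * d + context_length * d
    attention = 4 * (d * d + d)                    # Q, K, V, output projection (+ biases)
    ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)    # expand, then contract (+ biases)
    layer_norms = 2 * (2 * d)                      # scale and shift, two norms per block
    final_norm = 2 * d
    return embeddings + cfg["n_layers"] * (attention + ffn + layer_norms) + final_norm


for name, cfg in GPT2_CONFIGS.items():
    print(name, f"{count_parameters(cfg):,}")
```

Under these assumptions the small configuration comes to 124,439,808 parameters, and the larger ones to roughly 355M, 774M, and 1,558M (the GPT-2 paper's own table reported slightly lower figures). The head count does not change the total, because the heads split d_model rather than add to it.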


Text Generation

GPT generates text autoregressively (one token at a time):

  1. Encode input text → token IDs
  2. Forward pass → logits [batch, seq_len, vocab_size]
  3. Take logits for the last position
  4. Convert to probabilities via softmax
  5. Select next token (greedy: argmax; or via sampling)
  6. Append to input sequence → repeat from step 2

Without training, the model generates incoherent text because weights are random — the architecture is correct, but nothing has been learned yet.
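Steps 2 to 6 can be sketched as a short loop. The example below is only an illustration: `toy_model` is a made-up stand-in that returns random logits, mimicking an untrained network. The batch dimension and the tokenizer of step 1 are left out, and cropping the input to the context length is an added assumption.

```python
import math
import random


def softmax(logits):
    row_max = max(logits)
    exps = [math.exp(v - row_max) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_model(token_ids, vocab_size, rng):
    """Stand-in for an untrained GPT: one row of random logits per position."""
    return [[rng.gauss(0.0, 1.0) for _ in range(vocab_size)] for _ in token_ids]


def generate_greedy(model, token_ids, max_new_tokens, context_length):
    token_ids = list(token_ids)
    for _ in range(max_new_tokens):
        context = token_ids[-context_length:]   # keep only what the model can see
        logits = model(context)                 # shape [seq_len][vocab_size]
        probs = softmax(logits[-1])             # last position only
        next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy argmax
        token_ids.append(next_id)               # feed the result back in
    return token_ids


rng = random.Random(0)
generated = generate_greedy(lambda ids: toy_model(ids, 10, rng), [1, 2, 3],
                            max_new_tokens=5, context_length=4)
```

Because the stand-in model ignores its input, the generated IDs are arbitrary, which mirrors the incoherent output of an untrained network: the loop works, but nothing has been learned.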

Decoding strategies:

  • Greedy (argmax) — always picks the highest-probability token; deterministic; often repetitive
  • Temperature scaling — divide logits by T before softmax; T < 1 = more deterministic; T > 1 = more varied
  • Top-k sampling — restrict sampling to the k most probable tokens, so very-low-probability tokens are never selected
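Temperature scaling and top-k filtering combine naturally into one sampling function. The sketch below uses only the standard library and assumes a temperature strictly greater than zero; the name `sample_next_token` and its parameters are chosen for this example.

```python
import math
import random


def softmax(logits):
    row_max = max(logits)
    exps = [math.exp(v - row_max) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token index after optional top-k filtering and temperature scaling."""
    if top_k is not None:
        k = min(top_k, len(logits))
        threshold = sorted(logits, reverse=True)[k - 1]
        # Everything below the k-th largest logit gets probability zero.
        # Logits tied with the threshold are all kept.
        logits = [v if v >= threshold else float("-inf") for v in logits]
    scaled = [v / temperature for v in logits]  # T < 1 sharpens, T > 1 flattens
    probs = softmax(scaled)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]
rng = random.Random(0)
samples = [sample_next_token(logits, temperature=0.8, top_k=2, rng=rng)
           for _ in range(200)]
```

With `top_k=2` only indices 0 and 1 can appear in `samples`. Setting `top_k=1` makes the function equivalent to greedy decoding, whatever the temperature.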

Modern Variations (Beyond GPT-2)

Innovation                         What Changes
mixture-of-experts (MoE)           Replace dense FFN with sparse mixture of expert FFNs
Grouped-Query Attention (GQA)      Share K/V heads across groups of Q heads; reduce KV cache size
RMSNorm                            Simpler layer normalisation (no mean subtraction)
SwiGLU                             Gated FFN using the Swish (SiLU) activation; used in the Llama family
Rotary Position Embeddings (RoPE)  Encodes relative position in attention; better length generalisation
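The KV cache saving from GQA is simple arithmetic. The sketch below computes the cache size for a single sequence; the model dimensions are hypothetical values picked for illustration, and 2 bytes per value assumes 16-bit storage.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the KV cache for a single sequence.

    Each layer stores two tensors (K and V) of shape
    [seq_len, n_kv_heads, head_dim].
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model: 32 layers, 32 query heads of size 128, 4,096-token context.
mha_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)

print(f"one K/V head per query head: {mha_bytes / 2**30:.2f} GiB")
print(f"8 shared K/V heads:          {gqa_bytes / 2**30:.2f} GiB")
```

Sharing each K/V head among four query heads cuts the cache by a factor of four in this configuration, while the number of query heads, and so the attention computation per token, stays the same.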

As sebastian-raschka notes: “the transformer architecture is fundamentally unchanged from GPT-2” — these are engineering tweaks, not paradigm shifts.


Sources: raschka-2024-build-llm-from-scratch | fridman-lambert-raschka-2026-state-of-ai