GPT Architecture
GPT (Generative Pretrained Transformer) is the specific decoder-only transformer architecture introduced by OpenAI (Radford et al., 2018) and scaled through GPT-2, GPT-3, and ChatGPT. As of 2026, most frontier LLMs (Llama, Mistral, Qwen, DeepSeek) are variations of this architecture. It is the architecture sebastian-raschka implements from scratch in raschka-2024-build-llm-from-scratch.
GPT vs. Original Transformer
The original “Attention Is All You Need” transformer (2017) has two components:
- Encoder — bidirectional attention; reads and contextualises the input sequence
- Decoder — causal (masked) attention; generates output sequence token by token
GPT uses only the decoder. This simplification suits autoregressive text generation: the model produces one token at a time, conditioning on all previously generated tokens.
Full Architecture
Input token IDs
↓
Token Embedding (vocab_size × d_model)
+
Positional Embedding (context_length × d_model)
↓
[ Transformer Block ] × N
↓
Final Layer Normalisation
↓
Linear head (d_model → vocab_size)
↓
Logits → Softmax → Token probabilities
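The first two stages of the diagram (token embedding plus positional embedding) can be sketched in plain Python. This is only an illustration: the tiny sizes, the random initial values, and the helper name `embed` are assumptions made for the example, not values from any real GPT model.

```python
import random

random.seed(0)

# Toy sizes, assumed for illustration only.
vocab_size, context_length, d_model = 10, 4, 3

# Lookup tables that would normally be learned; random here because nothing is trained.
tok_emb = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(vocab_size)]
pos_emb = [[random.gauss(0, 0.02) for _ in range(d_model)] for _ in range(context_length)]


def embed(token_ids):
    """Return one d_model vector per position: token embedding + positional embedding."""
    if len(token_ids) > context_length:
        raise ValueError("sequence is longer than context_length")
    return [
        [t + p for t, p in zip(tok_emb[tok], pos_emb[pos])]
        for pos, tok in enumerate(token_ids)
    ]


x = embed([7, 2, 7])
print(len(x), len(x[0]))  # sequence length, d_model
```

Token 7 appears twice in the input, but its two vectors differ because each position adds its own positional row. This is how the model can tell identical tokens apart by position.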
Transformer Block
Each transformer block contains two sub-layers, each wrapped in a residual connection:
Sub-layer 1: Multi-Head Causal Self-Attention
- Pre-LayerNorm — normalise input before attention (GPT-2 and later use pre-norm; the original 2018 GPT used post-norm)
- Multi-Head Attention — causal mask prevents attending to future tokens; see attention-mechanism
- Dropout — applied to attention output
- Residual add — add original input (skip connection)
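The effect of the causal mask can be shown on a small matrix of raw attention scores. In this minimal sketch the scores stand in for the scaled query-key dot products of a single head. The function name `causal_attention_weights` and the numbers are assumptions for illustration.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over a square score matrix with future positions masked out."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Position i may attend to positions 0..i only.
        visible = scores[i][: i + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        # Future positions get weight 0, as if their score had been set to -inf.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


scores = [
    [0.1, 0.9, 0.3],
    [0.5, 0.2, 0.8],
    [0.4, 0.4, 0.4],
]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row still sums to 1, but everything above the diagonal is zero, so the first token can only attend to itself.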
Sub-layer 2: Feed-Forward Network (FFN)
- Pre-LayerNorm — normalise before FFN
- Linear (d_model → 4×d_model) — expand dimension
- GELU activation — smooth non-linearity; better gradient properties than ReLU
- Linear (4×d_model → d_model) — contract back
- Dropout
- Residual add
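The wiring of the two sub-layers can be sketched as a single function. This is a simplified illustration, not Raschka's implementation: it works on one position's vector, omits dropout and the learned LayerNorm parameters, and takes the sub-layers as callables. `toy_attention` and `toy_ffn` are hypothetical stand-ins, and real attention would also mix information across positions.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and unit variance (learned scale/shift omitted)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_block(x, attention, ffn):
    """Pre-norm wiring: x + attention(LN(x)), then x + ffn(LN(x))."""
    a = attention(layer_norm(x))           # sub-layer 1 sees the normalised input
    x = [xi + ai for xi, ai in zip(x, a)]  # residual add
    f = ffn(layer_norm(x))                 # sub-layer 2 sees the normalised input
    return [xi + fi for xi, fi in zip(x, f)]  # residual add


def toy_attention(h):
    # Hypothetical stand-in for multi-head causal self-attention.
    return [0.1 * v for v in h]


def toy_ffn(h):
    # Hypothetical stand-in for the Linear -> GELU -> Linear feed-forward network.
    return [0.0 for _ in h]


out = pre_norm_block([1.0, 2.0, 3.0], toy_attention, toy_ffn)
print(out)
```

Because normalisation sits inside each branch, the skip path carries the unnormalised `x` straight through the block.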
Why Residual Connections?
In deep networks (12–96 layers), gradients can vanish before reaching early layers during backpropagation. Residual (skip) connections provide gradient highways: gradients flow directly through the identity path, keeping training stable.
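A scalar toy model makes the gradient argument concrete. The sketch below assumes a made-up layer f(x) = 0.5·tanh(x), chosen only because its derivative is at most 0.5. It compares the end-to-end gradient through a plain stack with the gradient through a stack of residual layers.

```python
import math

W = 0.5  # assumed weight of the toy layer


def f(x):
    """Toy layer: f(x) = W * tanh(x)."""
    return W * math.tanh(x)


def f_prime(x):
    """Derivative of the toy layer; never larger than W."""
    return W * (1.0 - math.tanh(x) ** 2)


def chain_gradient(x, depth, residual):
    """d(output)/d(input) through `depth` stacked toy layers, via the chain rule."""
    grad = 1.0
    for _ in range(depth):
        if residual:
            grad *= 1.0 + f_prime(x)  # y = x + f(x)  =>  dy/dx = 1 + f'(x)
            x = x + f(x)
        else:
            grad *= f_prime(x)        # y = f(x)      =>  dy/dx = f'(x)
            x = f(x)
    return grad


plain = chain_gradient(1.0, depth=24, residual=False)
skip = chain_gradient(1.0, depth=24, residual=True)
print(f"plain stack: {plain:.3e}   residual stack: {skip:.3e}")
```

In the plain stack every factor is at most 0.5, so the product shrinks geometrically with depth. With the skip connection every factor is at least 1, so the gradient cannot vanish along the identity path.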
Why Pre-LayerNorm?
Layer normalisation stabilises training by ensuring each layer’s inputs have consistent mean and variance. Pre-norm (applied before each sub-layer) is more stable than post-norm (applied after), especially in deep models. Modern LLMs often use RMSNorm (a simplified LayerNorm) for efficiency.
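The difference between LayerNorm and RMSNorm is easiest to see side by side. This minimal sketch omits the learned gain (and, for LayerNorm, the bias). The function names and the sample vector are illustrative assumptions.

```python
import math


def layer_norm(x, eps=1e-5):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-5):
    """Divide by the root mean square only; there is no mean subtraction."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(mean_sq + eps) for v in x]


x = [2.0, 4.0, 6.0, 8.0]
ln = layer_norm(x)
rn = rms_norm(x)
print([round(v, 3) for v in ln])
print([round(v, 3) for v in rn])
```

RMSNorm skips the mean computation and the centring step, which is where its efficiency gain comes from. The output is rescaled but not re-centred, so the ratios between elements are preserved.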
GELU Activation
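Gaussian Error Linear Unit: a smooth alternative to ReLU that lets small negative values through rather than clamping every negative input to zero, so the gradient does not cut off abruptly for inputs just below zero. Used since the original GPT; modern LLMs often use SwiGLU (a gated FFN variant built on the Swish/SiLU activation).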
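For reference, GELU is x·Φ(x), where Φ is the standard normal CDF. The sketch below implements this exact form alongside the common tanh approximation and plain ReLU. The function names are chosen for this example only.

```python
import math


def gelu(x):
    """Exact GELU: x * Phi(x), with Phi expressed through the error function."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x):
    """Widely used tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def relu(x):
    return max(0.0, x)


for v in (-3.0, -1.0, -0.5, 0.0, 0.5, 1.0, 3.0):
    print(f"{v:5.1f}  relu={relu(v):.4f}  gelu={gelu(v):.4f}  tanh-approx={gelu_tanh(v):.4f}")
```

For moderately negative inputs GELU returns a small negative value where ReLU returns exactly zero, and for large positive inputs the two are nearly identical.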
GPT Configuration Sizes
| Model | Parameters | Layers (N) | Heads | d_model | Context |
|---|---|---|---|---|---|
| GPT-2 small | 124M | 12 | 12 | 768 | 1,024 |
| GPT-2 medium | 355M | 24 | 16 | 1,024 | 1,024 |
| GPT-2 large | 774M | 36 | 20 | 1,280 | 1,024 |
| GPT-2 XL | 1,558M | 48 | 25 | 1,600 | 1,024 |
| GPT-3 | 175B | 96 | 96 | 12,288 | 2,048 |
All share the same Python class in Raschka’s implementation — only the configuration dict differs.
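To illustrate how a configuration dict alone fixes the model size, the sketch below estimates the parameter count from the table's GPT-2 small row. It is not Raschka's code. The key names and `count_parameters` are hypothetical. The vocabulary size of 50,257 (GPT-2's BPE vocabulary), bias terms on every linear layer, and an output head that shares weights with the token embedding are all assumptions of this example.

```python
GPT2_SMALL = {
    "vocab_size": 50257,     # assumed: GPT-2 BPE vocabulary, not listed in the table
    "context_length": 1024,
    "d_model": 768,
    "n_layers": 12,
    "n_heads": 12,           # splits d_model across heads; does not change the count
}


def count_parameters(cfg):
    """Estimate trainable parameters, assuming biases everywhere and a tied output head."""
    d = cfg["d_model"]
    embeddings = cfg["vocab_size"] * d + cfg["context_length"] * d
    attention = 4 * (d * d + d)                  # Q, K, V and output projections
    ffn = (d * 4 * d + 4 * d) + (4 * d * d + d)  # expand to 4*d_model, contract back
    layer_norms = 2 * (2 * d)                    # two LayerNorms per block, scale + shift
    block = attention + ffn + layer_norms
    final_norm = 2 * d
    return embeddings + cfg["n_layers"] * block + final_norm


total = count_parameters(GPT2_SMALL)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the estimate lands at roughly 124M for GPT-2 small. Changing only `d_model`, `n_layers`, and `n_heads` in the dict yields the larger variants.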
Text Generation
GPT generates text autoregressively (one token at a time):
- Encode input text → token IDs
- Forward pass → logits [batch, seq_len, vocab_size]
- Take logits for the last position
- Convert to probabilities via softmax
- Select next token (greedy: argmax; or via sampling)
- Append to input sequence → repeat from the forward pass
Without training, the model generates incoherent text because weights are random — the architecture is correct, but nothing has been learned yet.
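The loop above can be sketched without a real model. Here `toy_model` is a hypothetical stand-in for the GPT forward pass: it returns one row of logits per position and always favours the next token ID, so the output is predictable. Tokenisation and the batch dimension are omitted, and the softmax step is skipped because the argmax of the logits equals the argmax of the probabilities.

```python
def toy_model(token_ids, vocab_size=5):
    """Hypothetical forward pass: logits[seq_len][vocab_size].

    Each position puts its highest logit on (token + 1) % vocab_size.
    """
    return [
        [1.0 if v == (tok + 1) % vocab_size else 0.0 for v in range(vocab_size)]
        for tok in token_ids
    ]


def generate_greedy(model, token_ids, max_new_tokens, context_length):
    """Autoregressive greedy decoding: append one argmax token per step."""
    ids = list(token_ids)
    for _ in range(max_new_tokens):
        window = ids[-context_length:]  # crop to the context the model supports
        logits = model(window)          # forward pass
        last = logits[-1]               # logits for the last position only
        next_id = max(range(len(last)), key=last.__getitem__)  # greedy: argmax
        ids.append(next_id)             # feed the new token back in
    return ids


result = generate_greedy(toy_model, [0], max_new_tokens=6, context_length=4)
print(result)  # [0, 1, 2, 3, 4, 0, 1]
```

Swapping the argmax line for a sampling function changes the decoding strategy without touching the rest of the loop.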
Decoding strategies:
- Greedy (argmax) — always picks the highest-probability token; deterministic; often repetitive
- Temperature scaling — divide logits by T before softmax; T < 1 = more deterministic; T > 1 = more varied
- Top-k sampling — restrict sampling to the k most probable tokens; prevents very-low-probability tokens from being selected
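Temperature scaling and top-k filtering compose naturally in one function. The following is a minimal standard-library sketch. The name `sample_next_token` and its parameters are assumptions for this example. Ties at the k-th logit are all kept, and `temperature` must be positive.

```python
import math
import random


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample a token ID after optional top-k filtering and temperature scaling."""
    if top_k is not None:
        # Keep the k largest logits; the rest become -inf, i.e. probability 0.
        threshold = sorted(logits, reverse=True)[top_k - 1]
        logits = [l if l >= threshold else float("-inf") for l in logits]
    probs = softmax([l / temperature for l in logits])
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]
rng = random.Random(0)
print([sample_next_token(logits, temperature=1.0, top_k=2, rng=rng) for _ in range(10)])
print([round(p, 3) for p in softmax([l / 0.5 for l in logits])])  # T < 1: sharper
print([round(p, 3) for p in softmax([l / 2.0 for l in logits])])  # T > 1: flatter
```

With `top_k=1` the function reduces to greedy decoding, since only the highest logit survives the filter.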
Modern Variations (Beyond GPT-2)
| Innovation | What Changes |
|---|---|
| mixture-of-experts (MoE) | Replace dense FFN with sparse mixture of expert FFNs |
| Grouped-Query Attention (GQA) | Share each K/V head across a group of Q heads; reduces KV cache size |
| RMSNorm | Simpler layer normalisation |
| SwiGLU | Gated FFN built on the Swish (SiLU) activation, replacing the GELU FFN; used across the Llama family |
| Rotary Position Embeddings (RoPE) | Encodes relative position in attention; better length generalisation |
As sebastian-raschka notes: “the transformer architecture is fundamentally unchanged from GPT-2” — these are engineering tweaks, not paradigm shifts.
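The KV cache saving in the GQA row of the table can be made concrete with a little arithmetic. The configuration below (32 layers, head dimension 128, 4,096 cached tokens, 2 bytes per value) is hypothetical and does not describe any specific model. `kv_cache_bytes` is a name made up for this sketch.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Memory for cached keys and values of one sequence."""
    # Leading factor 2: one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration: 32 query heads in both cases.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)  # one K/V head per Q head
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)   # 4 Q heads share each K/V head

print(f"MHA: {mha / 1024**3:.2f} GiB   GQA: {gqa / 1024**3:.2f} GiB")
```

The cache scales with the number of K/V heads rather than query heads, so sharing each K/V head among four query heads cuts it to a quarter.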
Sources: raschka-2024-build-llm-from-scratch | fridman-lambert-raschka-2026-state-of-ai