GPT Architecture

GPT (Generative Pre-trained Transformer) is the decoder-only transformer architecture introduced by OpenAI (Radford et al., 2018) and scaled up through GPT-2, GPT-3, and the models behind ChatGPT. As of 2026, most frontier LLMs (Llama, Mistral, Qwen, DeepSeek) are variations of this architecture. It is the architecture sebastian-raschka implements from scratch in raschka-2024-build-llm-from-scratch.

GPT vs. Original Transformer

The original “Attention Is All You Need” transformer (2017) has two components:

Encoder — bidirectional attention; reads and contextualises the input sequence
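Decoder — causal (masked) self-attention plus cross-attention over the encoder output; generates the output sequence token by token

GPT uses only the decoder, and with no encoder to attend to, the cross-attention sub-layer is dropped as well. This simplification suits autoregressive text generation: the model produces one token at a time, conditioning on the prompt and all previously generated tokens.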

Full Architecture

Input token IDs
  ↓
Token Embedding (vocab_size × d_model)
  +
Positional Embedding (context_length × d_model)
  ↓
[ Transformer Block ] × N
  ↓
Final Layer Normalisation
  ↓
Linear head (d_model → vocab_size)
  ↓
Logits → Softmax → Token probabilities
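
The first two stages of the diagram (token embedding plus positional embedding) can be sketched in plain Python. This is a toy illustration under stated assumptions: the weights are random, the sizes are made up, and `make_table` and `embed` are hypothetical helper names. A real implementation would use a tensor library's embedding layers.

```python
import random

def make_table(rows, cols, rng):
    """Return a rows x cols table of small random floats (stand-in for learned weights)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

def embed(token_ids, tok_emb, pos_emb):
    """Sum the token embedding and the positional embedding at each position."""
    if len(token_ids) > len(pos_emb):
        raise ValueError("sequence is longer than context_length")
    return [
        [t + p for t, p in zip(tok_emb[tok], pos_emb[pos])]
        for pos, tok in enumerate(token_ids)
    ]

rng = random.Random(0)
vocab_size, context_length, d_model = 10, 8, 4  # toy sizes, not GPT-2's
tok_emb = make_table(vocab_size, d_model, rng)      # vocab_size x d_model
pos_emb = make_table(context_length, d_model, rng)  # context_length x d_model

x = embed([3, 1, 3], tok_emb, pos_emb)
print(len(x), len(x[0]))  # 3 4  (seq_len, d_model)
```

Token 3 appears twice in the input, but its two vectors differ because a different positional row is added at each position.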

Transformer Block

Each transformer block contains two sub-layers, each wrapped in a residual connection:

Sub-layer 1: Multi-Head Causal Self-Attention

Pre-LayerNorm — normalise the input before attention (GPT-2 and later use pre-norm; the original 2018 GPT and the 2017 transformer used post-norm)
Multi-Head Attention — causal mask prevents attending to future tokens; see attention-mechanism
Dropout — applied to attention output
Residual add — add original input (skip connection)
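
A minimal single-head sketch of the causal masking described above, in dependency-free Python. It assumes the query, key, and value vectors have already been produced by the learned projections, and it leaves out multiple heads, dropout, and the output projection. The function names are illustrative.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of seq_len vectors. Position i attends only to
    positions 0..i, which is equivalent to setting the scores of all
    later positions to -inf before the softmax.
    """
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = [
            sum(a * b for a, b in zip(qi, k[j])) / math.sqrt(d)
            for j in range(i + 1)  # future positions j > i are never scored
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
    return out

seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = causal_attention(seq, seq, seq)
print(out[0])  # [1.0, 0.0]: the first position can only attend to itself
```

Changing the last token leaves the outputs at earlier positions untouched, which is the property that makes token-by-token generation consistent with training.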

Sub-layer 2: Feed-Forward Network (FFN)

Pre-LayerNorm — normalise before FFN
Linear (d_model → 4×d_model) — expand dimension
GELU activation — smooth non-linearity; unlike ReLU, its gradient stays non-zero for small negative inputs
Linear (4×d_model → d_model) — contract back
Dropout
Residual add
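
The wiring of the two sub-layers can be sketched as a small function. This is an illustration of the pre-norm ordering only: the normalisation, attention, and FFN steps are passed in as callables, dropout is omitted (it is the identity at inference time), and all names are hypothetical.

```python
def add_seq(a, b):
    """Element-wise sum of two sequences of vectors (the residual add)."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]

def transformer_block(x, norm1, attention, norm2, ffn):
    """Pre-norm block: each sub-layer sees a normalised input, and its
    output is added back onto the un-normalised input."""
    x = add_seq(x, attention(norm1(x)))  # sub-layer 1
    x = add_seq(x, ffn(norm2(x)))        # sub-layer 2
    return x

# Stand-in sub-layers so the wiring can be run and inspected
def identity(seq):
    return seq

def zeros(seq):
    return [[0.0] * len(row) for row in seq]

def double(seq):
    return [[2.0 * v for v in row] for row in seq]

x = [[1.0, 2.0], [3.0, 4.0]]
print(transformer_block(x, identity, zeros, identity, zeros))   # [[1.0, 2.0], [3.0, 4.0]]
print(transformer_block(x, identity, zeros, identity, double))  # [[3.0, 6.0], [9.0, 12.0]]
```

When both sub-layers output zeros, the block returns its input unchanged: the skip path carries the signal through regardless of what the sub-layers do.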

Why Residual Connections?

In deep networks (12–96 layers), gradients can vanish before reaching early layers during backpropagation. Residual (skip) connections provide gradient highways: gradients flow directly through the identity path, keeping training stable.
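
A scalar toy calculation makes the "gradient highway" concrete. Assume, purely for illustration, that every layer has the same small local derivative of 0.01. Without skip connections the backpropagated gradient is the product of those derivatives; with a skip connection each layer computes x + f(x), so its derivative is 1 + f'(x).

```python
def chain_gradient(local_grads, residual):
    """Gradient of the output w.r.t. the input of a stack of scalar layers.

    Without residuals each layer contributes f'(x); with residuals each
    layer computes x + f(x) and therefore contributes 1 + f'(x).
    """
    g = 1.0
    for d in local_grads:
        g *= (1.0 + d) if residual else d
    return g

local_grads = [0.01] * 12  # hypothetical per-layer derivatives, 12 layers

plain = chain_gradient(local_grads, residual=False)
skip = chain_gradient(local_grads, residual=True)
print(plain)  # about 1e-24: effectively vanished
print(skip)   # about 1.13: the identity path keeps the gradient alive
```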

Why Pre-LayerNorm?

Layer normalisation stabilises training by rescaling each token's activation vector to zero mean and unit variance, followed by a learned scale and shift. Pre-norm (applied to a sub-layer's input, inside the residual branch) trains more stably than post-norm (applied after the residual add), especially in deep models. Modern LLMs often use RMSNorm, a simplified LayerNorm that skips the mean subtraction, for efficiency.
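
The difference between LayerNorm and RMSNorm is easiest to see side by side. The sketch below operates on a single token's vector in plain Python. The `eps` default and the parameter names `gamma` and `beta` are conventional choices made for this example.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Subtract the mean, divide by the standard deviation, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """Divide by the root mean square only: no mean subtraction, no shift."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4

ln = layer_norm(x, ones, zeros)
rn = rms_norm(x, ones)
print(sum(ln) / len(ln))                 # close to 0: mean removed
print(sum(v * v for v in rn) / len(rn))  # close to 1: unit RMS, mean untouched
```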

GELU Activation

Gaussian Error Linear Unit — GELU(x) = x·Φ(x), where Φ is the standard normal CDF. It is a smooth alternative to ReLU that lets small negative values through, so its gradient does not cut off abruptly at zero. Standard across the GPT series; modern LLMs often use SwiGLU (a gated unit built on the Swish/SiLU activation).
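
For reference, here is GELU written out in plain Python, both in its exact form x·Φ(x) and in the widely used tanh approximation, next to ReLU for comparison. The function names are illustrative.

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def relu(x):
    return max(0.0, x)

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(x, relu(x), round(gelu(x), 4), round(gelu_tanh(x), 4))
```

At x = -1, ReLU outputs exactly 0 while GELU outputs a small negative value (about -0.159), which is the "allows small negative values" behaviour described above.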

GPT Configuration Sizes

Model	Parameters	Layers (N)	Heads	d_model	Context
GPT-2 small	124M	12	12	768	1,024
GPT-2 medium	355M	24	16	1,024	1,024
GPT-2 large	774M	36	20	1,280	1,024
GPT-2 XL	1,558M	48	25	1,600	1,024
GPT-3	175B	96	96	12,288	2,048

All share the same Python class in Raschka’s implementation — only the configuration dict differs.
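
As a sanity check on the table, the parameter count can be derived from the configuration alone. The sketch below makes assumptions that are not stated above: a vocabulary of 50,257 tokens (the GPT-2 BPE vocabulary), bias terms on every linear layer, a scale and shift in every LayerNorm, and an output head that shares its weights with the token embedding. The number of heads does not appear because splitting d_model across heads does not change the parameter count.

```python
def gpt2_param_count(vocab_size, context_length, d_model, n_layers):
    """Count parameters of a GPT-2-style model with a weight-tied output head."""
    tok_emb = vocab_size * d_model
    pos_emb = context_length * d_model

    # Attention: fused Q/K/V projection plus output projection, both with bias
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)

    # FFN: expand to 4 * d_model, then contract back, both with bias
    d_ff = 4 * d_model
    ffn = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)

    norms = 2 * (2 * d_model)  # two LayerNorms per block, scale + shift each
    block = attn + ffn + norms

    final_norm = 2 * d_model
    return tok_emb + pos_emb + n_layers * block + final_norm

small = gpt2_param_count(vocab_size=50257, context_length=1024, d_model=768, n_layers=12)
print(small)  # 124439808, i.e. the 124M in the table
```

If the output head is not tied to the token embedding, it adds another vocab_size × d_model parameters on top of this figure.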

Text Generation

GPT generates text autoregressively (one token at a time):

1. Encode input text → token IDs
2. Forward pass → logits [batch, seq_len, vocab_size]
3. Take logits for the last position
4. Convert to probabilities via softmax
5. Select next token (greedy: argmax; or via sampling)
6. Append to input sequence → repeat from step 2
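
The loop above can be sketched with a stand-in model. Here `toy_model` is a hypothetical function that returns one row of logits per input position and always favours "previous token + 1"; it only exists so the loop can run without trained weights. Tokenisation and batching are left out, and the softmax is skipped because the argmax of the logits equals the argmax of the probabilities.

```python
VOCAB_SIZE = 5  # toy vocabulary for the stand-in model

def toy_model(token_ids):
    """Return logits of shape [seq_len][VOCAB_SIZE]; favours (token + 1) % VOCAB_SIZE."""
    return [
        [1.0 if v == (t + 1) % VOCAB_SIZE else 0.0 for v in range(VOCAB_SIZE)]
        for t in token_ids
    ]

def generate_greedy(model, token_ids, max_new_tokens, context_length):
    """Autoregressive greedy decoding: append the argmax token, then repeat."""
    ids = list(token_ids)
    for _ in range(max_new_tokens):
        window = ids[-context_length:]  # the model sees at most context_length tokens
        logits = model(window)          # forward pass
        last = logits[-1]               # logits for the last position only
        next_id = max(range(len(last)), key=last.__getitem__)  # greedy argmax
        ids.append(next_id)
    return ids

print(generate_greedy(toy_model, [0], max_new_tokens=3, context_length=4))  # [0, 1, 2, 3]
```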

Without training, the model generates incoherent text because weights are random — the architecture is correct, but nothing has been learned yet.

Decoding strategies:

Greedy (argmax) — always picks the highest-probability token; deterministic; often repetitive
Temperature scaling — divide logits by T before softmax; T < 1 sharpens the distribution (more deterministic); T > 1 flattens it (more varied)
Top-k sampling — restrict sampling to the k most probable tokens; prevents very-low-probability tokens from being selected
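
Temperature scaling and top-k filtering compose naturally: filter the logits first, then scale and apply softmax. The sketch below is one plain-Python way to do this, with illustrative function names. Note that when several logits tie at the k-th value, this version keeps all of them, so slightly more than k tokens can survive.

```python
import math
import random

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Apply optional top-k filtering, then temperature-scaled softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k is not None:
        if top_k < 1:
            raise ValueError("top_k must be at least 1")
        kth = sorted(logits, reverse=True)[min(top_k, len(logits)) - 1]
        # Tokens below the k-th largest logit get -inf, i.e. probability 0
        logits = [l if l >= kth else float("-inf") for l in logits]
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one token ID from the filtered, temperature-scaled distribution."""
    probs = next_token_probs(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]
print(next_token_probs(logits, temperature=0.5))  # sharper: more mass on token 0
print(next_token_probs(logits, temperature=2.0))  # flatter: mass spread out
print(next_token_probs(logits, top_k=2))          # tokens 2 and 3 get probability 0
```

With `top_k=1` the sampler reduces to greedy decoding, since only the highest-scoring token keeps non-zero probability.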

Modern Variations (Beyond GPT-2)

Innovation	What Changes
mixture-of-experts (MoE)	Replace dense FFN with sparse mixture of expert FFNs
Grouped-Query Attention (GQA)	Share each K/V head across a group of Q heads; reduces KV cache size
RMSNorm	Simpler layer normalisation: rescales by the root mean square, with no mean subtraction
SwiGLU	Gated FFN built on the Swish/SiLU activation (the GELU-based counterpart is GEGLU); used throughout the Llama family
Rotary Position Embeddings (RoPE)	Encodes relative position in attention; better length generalisation

As sebastian-raschka notes: “the transformer architecture is fundamentally unchanged from GPT-2” — these are engineering tweaks, not paradigm shifts.
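
To give a sense of scale for one of these tweaks, the KV cache saving from GQA is simple arithmetic: the cache stores one key and one value vector per layer, per K/V head, per token. The configuration below is hypothetical (32 layers, 32 query heads, head dimension 128, 4,096 tokens, 2 bytes per value) and is not taken from any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key/value cache for one sequence."""
    keys_and_values = 2  # one K tensor and one V tensor per layer
    return keys_and_values * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

GIB = 1024 ** 3

# Standard multi-head attention: one K/V head per query head (32 of each)
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)

# GQA: the same 32 query heads share 8 K/V heads (groups of 4)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)

print(mha / GIB)  # 2.0
print(gqa / GIB)  # 0.5
```

The cache shrinks in direct proportion to the number of K/V heads, while the number of query heads, and therefore the shape of the attention output, stays the same.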

Sources: raschka-2024-build-llm-from-scratch | fridman-lambert-raschka-2026-state-of-ai
