Transformer Architecture
The transformer is a neural network architecture built on self-attention, introduced in the “Attention Is All You Need” paper (Vaswani et al., 2017). With GPT-2 (OpenAI, 2019) it became the universal substrate for large language models. As of 2026, the architecture in its autoregressive, decoder-only form is fundamentally unchanged from GPT-2 — what has accumulated is a set of engineering tweaks for efficiency.
Raschka’s Assessment
sebastian-raschka states plainly: modern frontier LLMs are still GPT-2 under the hood (fridman-lambert-raschka-2026-state-of-ai). The innovations are optimisations, not paradigm shifts:
| Innovation | What it is |
|---|---|
| mixture-of-experts (MoE) | Sparse FFN activation; router sends tokens to subset of experts; larger capacity without proportional compute |
| Group Query Attention (GQA) | Shares key/value heads across query heads; reduces KV cache memory for long-context inference |
| Multi-Head Latent Attention (MLA) | deepseek’s KV cache compression technique; further reduces memory bandwidth requirements |
| RMSNorm | Replaces LayerNorm; simpler computation, marginal quality improvement |
| SwiGLU activation | Gated nonlinearity in FFN blocks, replacing ReLU/GELU; standard since the original Llama |
| Sliding Window Attention | Local attention window instead of full attention; used in OLMo 3 among others; reduces the quadratic attention cost |
| Rotary Position Embeddings (RoPE) | Encodes relative token position in attention; better length generalisation than absolute embeddings |
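The MoE row can be made concrete with a toy top-k router — a minimal NumPy sketch, not any particular model's implementation (real routers run batched, add load-balancing losses, and use SwiGLU experts; all names here are hypothetical). Each token activates only `top_k` of the expert FFNs, which is why parameter count can grow without proportional per-token compute:

```python
import numpy as np

def moe_ffn(x, experts, router_W, top_k=2):
    # x: (T, d); experts: list of (W1, W2) ReLU-FFN weights.
    # The router scores all experts but each token runs only top_k of them,
    # mixing their outputs by softmax-renormalised router weights.
    logits = x @ router_W                           # (T, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]   # indices of chosen experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                                # softmax over chosen experts only
        for weight, e in zip(w, top[t]):
            W1, W2 = experts[e]
            out[t] += weight * (np.maximum(x[t] @ W1, 0.0) @ W2)
    return out
```

With 4 experts and `top_k=2`, half the expert parameters sit idle for any given token — the sparse-capacity trade the table describes.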
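The GQA row comes down to one move: store few KV heads, repeat them across query-head groups at attention time. A minimal NumPy sketch (illustrative shapes and names, not a production kernel):

```python
import numpy as np

def gqa_attention(q, k, v, n_q_heads, n_kv_heads):
    # q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d).
    # The KV cache holds only n_kv_heads heads; each serves a group of
    # n_q_heads // n_kv_heads query heads, shrinking cache memory by that factor.
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)                 # broadcast KV heads to groups
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # scaled dot-product
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)           # softmax over keys
    return w @ v
```

With 8 query heads and 2 KV heads the cache is 4× smaller, which is the point for long-context inference where KV memory, not FLOPs, is the bottleneck.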
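The RMSNorm row is small enough to state exactly — a NumPy sketch of the common formulation (epsilon placement varies slightly between implementations). Unlike LayerNorm there is no mean subtraction and no bias, just a rescale by the root-mean-square of the features:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Scale each feature vector by the reciprocal RMS of its components,
    # then apply a learned per-feature gain. No centering, no bias.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain
```

Dropping the mean/bias terms saves a reduction pass and some parameters — the "simpler computation" in the table.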
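The SwiGLU row likewise: the FFN input is projected twice, one projection is passed through SiLU (x·sigmoid(x)) and used to gate the other elementwise. A minimal sketch with hypothetical weight names:

```python
import numpy as np

def silu(x):
    # SiLU / "Swish": x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    # Gated FFN: silu(x @ W_gate) elementwise-gates (x @ W_up),
    # then the result is projected back down.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down
```

Note the three weight matrices instead of the classic FFN's two; implementations typically shrink the hidden width to keep parameter count comparable.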
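The sliding-window row reduces to a mask: each token attends causally but only within a fixed local window, so attention cost grows linearly in sequence length rather than quadratically. A small sketch (window convention varies; here the window includes the token itself):

```python
import numpy as np

def sliding_window_mask(T, window):
    # Boolean (T, T) mask: token i may attend to tokens j with
    # i - window < j <= i  (causal AND within the local window).
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return (j <= i) & (j > i - window)
```

Models typically interleave sliding-window layers with occasional full-attention layers so distant context is not lost entirely.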
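The RoPE row's key property — "encodes relative position" — can be checked directly: rotating consecutive feature pairs by position-dependent angles leaves query–key dot products depending only on the position *difference*. A minimal NumPy sketch (the adjacent-pair layout here is one common convention; the split-half layout is another):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    # x: (T, d) with even d. Each consecutive feature pair (2i, 2i+1) is
    # rotated by angle position * base^(-2i/d); lower-frequency rotations
    # for higher feature indices.
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,)
    ang = positions[:, None] * inv_freq[None, :]   # (T, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because rotations preserve norms and their composition depends only on angle differences, a query at position 10 scores a key at position 12 identically to positions 3 and 5 — the relative-position property behind RoPE's better length generalisation.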
Alternative Architectures
Two alternative paradigms are being explored as complements or replacements:
- text-diffusion-models — generate tokens in parallel (not autoregressively), potentially faster; already deployed in code-diff startups as of early 2026
- Mamba / SSM hybrids — fixed-state RNN-like models; cheaper long-context than attention but lossy (cannot attend to arbitrary past tokens); appear in some hybrid architectures (Jamba, Zamba)
Neither has displaced the transformer at the frontier as of early 2026.
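The fixed-state trade-off in the Mamba/SSM bullet is visible in the recurrence itself. A deliberately simplified linear state-space scan (real Mamba makes A, B, C input-dependent and uses a parallel scan; all names here are illustrative):

```python
import numpy as np

def ssm_scan(A, B, C, u):
    # Linear state-space recurrence over a scalar input sequence u:
    #   h_t = A @ h_{t-1} + B * u_t ;   y_t = C . h_t
    # The state h has FIXED size regardless of sequence length: O(T) time
    # and O(1) memory (the cheapness), but past tokens survive only as a
    # compressed summary in h -- no direct lookup of an arbitrary earlier
    # token (the lossiness).
    h = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        h = A @ h + B * u_t
        ys.append(C @ h)
    return np.array(ys)
```

With a decaying `A` (eigenvalues below 1), an input's contribution to the state shrinks geometrically over time — exactly why hybrids like Jamba and Zamba keep some attention layers alongside the SSM blocks.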