Transformer Architecture
The transformer is a neural network architecture built on self-attention, introduced in the “Attention Is All You Need” paper (Vaswani et al., 2017). With GPT-2 (OpenAI, 2019) it became the universal substrate for large language models. As of 2026, the architecture in its autoregressive, decoder-only form is fundamentally unchanged from GPT-2 — what has accumulated is a set of engineering tweaks for efficiency.
Raschka’s Assessment
sebastian-raschka states plainly: modern frontier LLMs are still GPT-2 under the hood (fridman-lambert-raschka-2026-state-of-ai). The innovations are optimisations, not paradigm shifts:
| Innovation | What it is |
|---|---|
| mixture-of-experts (MoE) | Sparse FFN activation; router sends tokens to subset of experts; larger capacity without proportional compute |
| Group Query Attention (GQA) | Shares key/value heads across query heads; reduces KV cache memory for long-context inference |
| Multi-Head Latent Attention (MLA) | deepseek’s KV cache compression technique; further reduces memory bandwidth requirements |
| RMSNorm | Replaces LayerNorm; simpler computation, marginal quality improvement |
| SwiGLU activation | Gated nonlinearity in FFN blocks, replacing ReLU/GELU; standard since the original Llama |
| Sliding Window Attention | Local attention window instead of full attention; used in OLMo 3 among others; reduces the quadratic attention cost |
| Rotary Position Embeddings (RoPE) | Encodes relative token position in attention; better length generalisation than absolute embeddings |
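The MoE row can be made concrete with a toy top-k router — a minimal NumPy sketch, not any particular model's implementation (real routers run batched, add load-balancing losses, and use SwiGLU experts; all names here are hypothetical). Each token activates only `top_k` of the expert FFNs, which is why parameter count can grow without proportional per-token compute:

```python
import numpy as np

def moe_ffn(x, experts, router_W, top_k=2):
    # x: (T, d); experts: list of (W1, W2) ReLU-FFN weights.
    # The router scores all experts but each token runs only top_k of them,
    # mixing their outputs by softmax-renormalised router weights.
    logits = x @ router_W                           # (T, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]   # indices of chosen experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, top[t]]
        w = np.exp(sel - sel.max())
        w /= w.sum()                                # softmax over chosen experts only
        for weight, e in zip(w, top[t]):
            W1, W2 = experts[e]
            out[t] += weight * (np.maximum(x[t] @ W1, 0.0) @ W2)
    return out
```

With 4 experts and `top_k=2`, half the expert parameters sit idle for any given token — the sparse-capacity trade the table describes.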
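The GQA row comes down to one move: store few KV heads, repeat them across query-head groups at attention time. A minimal NumPy sketch (illustrative shapes and names, not a production kernel):

```python
import numpy as np

def gqa_attention(q, k, v, n_q_heads, n_kv_heads):
    # q: (n_q_heads, T, d); k, v: (n_kv_heads, T, d).
    # The KV cache holds only n_kv_heads heads; each serves a group of
    # n_q_heads // n_kv_heads query heads, shrinking cache memory by that factor.
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=0)                 # broadcast KV heads to groups
    v = np.repeat(v, group, axis=0)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)  # scaled dot-product
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)           # softmax over keys
    return w @ v
```

With 8 query heads and 2 KV heads the cache is 4× smaller, which is the point for long-context inference where KV memory, not FLOPs, is the bottleneck.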
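The RMSNorm row is small enough to state exactly — a NumPy sketch of the common formulation (epsilon placement varies slightly between implementations). Unlike LayerNorm there is no mean subtraction and no bias, just a rescale by the root-mean-square of the features:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Scale each feature vector by the reciprocal RMS of its components,
    # then apply a learned per-feature gain. No centering, no bias.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain
```

Dropping the mean/bias terms saves a reduction pass and some parameters — the "simpler computation" in the table.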
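The SwiGLU row likewise: the FFN input is projected twice, one projection is passed through SiLU (x·sigmoid(x)) and used to gate the other elementwise. A minimal sketch with hypothetical weight names:

```python
import numpy as np

def silu(x):
    # SiLU / "Swish": x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    # Gated FFN: silu(x @ W_gate) elementwise-gates (x @ W_up),
    # then the result is projected back down.
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down
```

Note the three weight matrices instead of the classic FFN's two; implementations typically shrink the hidden width to keep parameter count comparable.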
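The sliding-window row reduces to a mask: each token attends causally but only within a fixed local window, so attention cost grows linearly in sequence length rather than quadratically. A small sketch (window convention varies; here the window includes the token itself):

```python
import numpy as np

def sliding_window_mask(T, window):
    # Boolean (T, T) mask: token i may attend to tokens j with
    # i - window < j <= i  (causal AND within the local window).
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return (j <= i) & (j > i - window)
```

Models typically interleave sliding-window layers with occasional full-attention layers so distant context is not lost entirely.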
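The RoPE row's key property — "encodes relative position" — can be checked directly: rotating consecutive feature pairs by position-dependent angles leaves query–key dot products depending only on the position *difference*. A minimal NumPy sketch (the adjacent-pair layout here is one common convention; the split-half layout is another):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    # x: (T, d) with even d. Each consecutive feature pair (2i, 2i+1) is
    # rotated by angle position * base^(-2i/d); lower-frequency rotations
    # for higher feature indices.
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # (d/2,)
    ang = positions[:, None] * inv_freq[None, :]   # (T, d/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because rotations preserve norms and their composition depends only on angle differences, a query at position 10 scores a key at position 12 identically to positions 3 and 5 — the relative-position property behind RoPE's better length generalisation.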
Alternative Architectures
Two alternative paradigms are being explored as complements or replacements:
- text-diffusion-models — generate tokens in parallel (not autoregressively), potentially faster; already deployed in code-diff startups as of early 2026
- Mamba / SSM hybrids — fixed-state RNN-like models; cheaper long-context than attention but lossy (cannot attend to arbitrary past tokens); appear in some hybrid architectures (Jamba, Zamba)
Neither has displaced the transformer at the frontier as of early 2026.
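The fixed-state trade-off in the Mamba/SSM bullet is visible in the recurrence itself. A deliberately simplified linear state-space scan (real Mamba makes A, B, C input-dependent and uses a parallel scan; all names here are illustrative):

```python
import numpy as np

def ssm_scan(A, B, C, u):
    # Linear state-space recurrence over a scalar input sequence u:
    #   h_t = A @ h_{t-1} + B * u_t ;   y_t = C . h_t
    # The state h has FIXED size regardless of sequence length: O(T) time
    # and O(1) memory (the cheapness), but past tokens survive only as a
    # compressed summary in h -- no direct lookup of an arbitrary earlier
    # token (the lossiness).
    h = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        h = A @ h + B * u_t
        ys.append(C @ h)
    return np.array(ys)
```

With a decaying `A` (eigenvalues below 1), an input's contribution to the state shrinks geometrically over time — exactly why hybrids like Jamba and Zamba keep some attention layers alongside the SSM blocks.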