Transformer Architecture

The transformer is a neural network architecture based on self-attention, introduced in the “Attention Is All You Need” paper (Vaswani et al., 2017). It became the universal substrate for large-language-models with GPT-2 (OpenAI, 2019). As of 2026, the architecture is fundamentally unchanged from GPT-2 in its autoregressive, decoder-only form — but with numerous engineering tweaks for efficiency.

Raschka’s Assessment

sebastian-raschka states plainly: modern frontier LLMs are still GPT-2 under the hood (fridman-lambert-raschka-2026-state-of-ai). The innovations are optimisations, not paradigm shifts:

| Innovation | What it is |
| --- | --- |
| mixture-of-experts (MoE) | Sparse FFN activation; a router sends each token to a subset of experts; larger capacity without proportional compute |
| Group Query Attention (GQA) | Shares key/value heads across query heads; reduces KV-cache memory for long-context inference |
| Multi-Head Latent Attention (MLA) | deepseek’s KV-cache compression technique; further reduces memory-bandwidth requirements |
| RMSNorm | Replaces LayerNorm; simpler computation, marginal quality improvement |
| SwiGLU activation | Gated nonlinearity in FFN blocks; replaced ReLU/GELU; standard since Llama 2 |
| Sliding Window Attention | Local attention window instead of full attention; used in OLMo 3; reduces the quadratic attention cost |
| Rotary Position Embeddings (RoPE) | Encodes relative token position in the attention computation; better length generalisation than absolute embeddings |
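A minimal sketch of the RMSNorm idea from the table: unlike LayerNorm, there is no mean subtraction and no bias term, just a scale by the root-mean-square of the activations. This is an illustrative NumPy version, not any particular library’s implementation.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Scale each vector by its root-mean-square over the last axis.
    # No mean subtraction, no bias: one fewer reduction than LayerNorm.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

x = np.array([3.0, 4.0])        # RMS = sqrt((9 + 16) / 2) = sqrt(12.5)
out = rms_norm(x, np.ones(2))   # output has RMS ~= 1
```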
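The GQA memory saving in the table is just arithmetic: the KV cache stores keys and values per KV head, so sharing KV heads across query heads shrinks it proportionally. A sketch with hypothetical model dimensions (32 layers, 32 query heads, head dimension 128, fp16) chosen for illustration only:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V each store (seq_len, n_kv_heads, head_dim) per layer,
    # hence the leading factor of 2.
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem

mha = kv_cache_bytes(32, 32, 128, seq_len=8192)  # one KV head per query head
gqa = kv_cache_bytes(32, 8, 128, seq_len=8192)   # 8 KV heads shared by 32 query heads
# GQA with a 4:1 grouping cuts the cache to a quarter of the MHA size.
```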
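The MoE routing step in the table can be sketched as top-k selection over router scores: only the chosen experts’ FFNs run for a given token, so compute scales with k rather than with the total expert count. The scores below are hypothetical logits, and real routers add softmax weighting and load balancing omitted here.

```python
def route_top_k(scores, k=2):
    # Return the indices of the k experts with the highest router scores.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

scores = [0.1, 2.3, -0.5, 1.7]   # hypothetical router logits for 4 experts
chosen = route_top_k(scores)     # only these experts' FFNs execute
```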
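RoPE’s “relative position” property from the table can be shown on a single 2-d pair: each position rotates the query/key pair by a position-dependent angle, and because rotations compose, the dot product depends only on the positional offset. A toy sketch (single pair, arbitrary base angle, not a full implementation):

```python
import math

def rotate(vec, pos, theta=0.1):
    # RoPE on one 2-d pair: rotate (x, y) by angle pos * theta.
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 2.0), (0.5, -1.0)
# The score depends only on the relative offset between positions:
s1 = dot(rotate(q, 5), rotate(k, 3))   # positions 5 and 3, offset 2
s2 = dot(rotate(q, 9), rotate(k, 7))   # positions 9 and 7, offset 2
```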

Alternative Architectures

Two alternative paradigms are being explored as complements or replacements:

  • text-diffusion-models — generate tokens in parallel (not autoregressively), potentially faster; already deployed in code-diff startups as of early 2026
  • Mamba / SSM hybrids — fixed-state RNN-like models; cheaper long-context than attention but lossy (cannot attend to arbitrary past tokens); appear in some hybrid architectures (Jamba, Zamba)
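The “fixed-state, lossy” trade-off in the Mamba/SSM bullet can be illustrated with the simplest possible linear recurrence. The state is constant-size regardless of sequence length (unlike a growing KV cache), but past tokens survive only as a decayed summary, so an arbitrary exact token cannot be recovered. This is a toy scan, not Mamba’s actual (input-dependent, gated) recurrence:

```python
def fixed_state_scan(inputs, decay=0.9):
    # O(1) memory in sequence length: one scalar of state per channel.
    # Old tokens fade exponentially instead of staying addressable.
    state = 0.0
    outputs = []
    for x in inputs:
        state = decay * state + x
        outputs.append(state)
    return outputs

outs = fixed_state_scan([1.0, 0.0, 0.0])  # the impulse decays: 1.0, 0.9, 0.81
```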

Neither has displaced the transformer at the frontier as of early 2026.


Source: fridman-lambert-raschka-2026-state-of-ai