Mixture of Experts (MoE)

Mixture of Experts (MoE) is a neural network architecture in which a learned router selects a subset of specialised feed-forward layers (called experts) to process each input token. Only the selected experts receive compute; the rest are idle. This sparse activation pattern allows models to have far more total parameters than any single forward pass uses, yielding higher model capacity at manageable inference cost.

How It Works

In a standard (dense) transformer, every token passes through every feed-forward network (FFN) layer. In an MoE transformer, each FFN block is replaced by N expert FFN sub-networks plus a lightweight gating/routing network. The router assigns each token to its top-K experts (typically K = 2 out of 8–64). Only those K experts perform computation for that token, although all N experts' weights remain resident in memory.
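The routing step above can be sketched in a few lines. This is a minimal illustration with toy shapes and toy experts, not any production implementation; all names and dimensions below are hypothetical:

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Route each token to its top-k experts and mix their outputs.

    x:        (tokens, d_model) token activations
    router_w: (d_model, n_experts) router weights
    experts:  list of callables, each mapping (d_model,) -> (d_model,)
    """
    logits = x @ router_w                       # (tokens, n_experts)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(logits[t])[-k:]        # indices of the k largest logits
        # softmax over the selected logits only -> mixing weights
        w = np.exp(logits[t, top] - logits[t, top].max())
        w /= w.sum()
        # only the chosen experts run; the rest stay idle for this token
        out[t] = sum(wi * experts[e](x[t]) for wi, e in zip(w, top))
    return out

rng = np.random.default_rng(0)
d, n_experts = 8, 4
# toy "experts": each is just a distinct random linear map
experts = [lambda v, W=rng.standard_normal((d, d)) * 0.1: v @ W
           for _ in range(n_experts)]
x = rng.standard_normal((5, d))
router_w = rng.standard_normal((d, n_experts))
y = moe_forward(x, router_w, experts, k=2)
print(y.shape)  # (5, 8)
```

Real implementations batch tokens per expert rather than looping token by token, but the routing logic — top-K selection, renormalised gate weights, weighted sum of expert outputs — is the same.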

The result:

  • Total parameter count scales with N (number of experts)
  • Active parameters per token scale with K (number of selected experts), which is far smaller
  • Inference throughput improves relative to a dense model of equivalent total parameter count
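A back-of-envelope calculation makes the N-versus-K scaling concrete. The dimensions below are illustrative, not taken from any specific model:

```python
# Parameter accounting for one MoE FFN block
# (illustrative numbers, not any specific model's configuration).
d_model, d_ff = 4096, 14336        # hidden and FFN inner dimensions
n_experts, k = 64, 2               # total experts vs experts used per token

params_per_expert = 2 * d_model * d_ff   # up- and down-projection matrices
total = n_experts * params_per_expert    # stored parameters: scales with N
active = k * params_per_expert           # parameters multiplied per token: scales with K

print(f"total:  {total / 1e9:.1f}B")     # 7.5B per block
print(f"active: {active / 1e9:.2f}B")    # 0.23B per block -> 32x fewer
```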

Adoption in Frontier Models

MoE moved from research proposal to dominant production architecture in the 2022–2025 period. By early 2026, the majority of state-of-the-art large models are MoE:

Model                        Developer          Size (est.)
GPT-5 routing architecture   OpenAI             Classified; multi-model routing confirmed
DeepSeek V3 / R1             deepseek           ~671B total, ~37B active
GLM-4 / Z.ai models          Z.ai (Zhipu AI)    MoE confirmed
MiniMax                      MiniMax            MoE architecture
Nemotron 3 Super             nvidia             120B-parameter MoE, open weights
Llama-series MoE             Meta               Various

jensen-huang identifies MoE as the primary architectural driver behind nvidia’s NVLink 72 rack design: large sparse models require all expert weights resident in memory simultaneously (to serve arbitrary routing decisions), which demands high aggregate memory bandwidth across many GPUs connected at near-NVRAM speeds. Without NVLink’s 3.6 TB/s bandwidth, routing latency would bottleneck the entire forward pass (fridman-huang-2026-nvidia-ai-revolution).

sebastian-raschka frames MoE as the most significant architectural addition to the standard GPT-2-lineage transformer, alongside Grouped-Query Attention (GQA) and Multi-Head Latent Attention (MLA). While the core architecture — autoregressive decoder-only transformer with attention + FFN blocks — remains unchanged from GPT-2, MoE is the addition with the largest capacity/compute impact (fridman-lambert-raschka-2026-state-of-ai).

Relationship to Scaling Laws

MoE enables a favourable trade-off on the ai-scaling-laws pre-training axis: you can train a model with ~600B total parameters at roughly the per-token compute cost of a ~30–40B dense model. This means research labs can push parameter counts (and model capacity for rare/specialised knowledge) further without hitting per-query inference cost limits.
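The trade-off can be quantified with the rough rule that a forward pass costs about two FLOPs per active parameter per token. The figures below use the DeepSeek estimate quoted above and are approximate:

```python
# Rough per-token forward-pass FLOPs: ~2 * active parameters
# (standard dense-matmul estimate; illustrative figures).
total_params  = 671e9   # e.g. a DeepSeek-V3-scale MoE
active_params = 37e9    # parameters actually multiplied per token

moe_flops   = 2 * active_params
dense_flops = 2 * total_params  # a dense 671B model would pay this per token

# The MoE pays the per-token price of a ~37B dense model
print(f"{dense_flops / moe_flops:.0f}x cheaper per token "
      f"than a dense model of equal total size")
```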

However, MoE introduces serving complexity:

  • Load balancing — the router must distribute tokens roughly evenly across experts; uneven routing degrades GPU utilisation
  • Memory residency — all experts must be loaded even though only K are used per token; favours high-memory, high-bandwidth hardware
  • Expert collapse — without careful training regularisation, the router may route most tokens to a few experts, collapsing the benefit
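The load-balancing and expert-collapse problems are commonly addressed with an auxiliary training loss; the sketch below follows the widely used Switch-Transformer-style recipe (variable names are illustrative):

```python
import numpy as np

def load_balance_loss(router_probs, expert_assignment, n_experts):
    """Switch-Transformer-style auxiliary load-balancing loss (sketch).

    router_probs:      (tokens, n_experts) softmax router outputs
    expert_assignment: (tokens,) index of the expert each token was sent to
    """
    # f_i: fraction of tokens dispatched to expert i
    f = np.bincount(expert_assignment, minlength=n_experts) / len(expert_assignment)
    # P_i: mean router probability mass on expert i
    p = router_probs.mean(axis=0)
    # n * sum(f_i * P_i) reaches its minimum of 1.0 under uniform routing,
    # penalising routers that collapse onto a few experts
    return n_experts * float(np.dot(f, p))
```

A perfectly balanced router scores 1.0 and a collapsed router scores higher, so adding this term (scaled by a small coefficient) to the training loss nudges routing toward an even spread across experts.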

Chinese Labs and MoE

As noted by nathan-lambert and sebastian-raschka, Chinese AI labs — particularly deepseek and MiniMax — heavily favour large open-weight MoEs with permissive licences (fewer restrictions than Meta’s Llama user-cap terms). This makes them attractive for enterprise fine-tuning without API dependency. The structural advantage: large MoEs look expensive on paper (total parameter count) but cost less per query than the equivalent dense model.


Sources: fridman-huang-2026-nvidia-ai-revolution | fridman-lambert-raschka-2026-state-of-ai