Large Language Models (LLMs)

Large language models (LLMs) are neural networks — principally transformer-based — trained on large text corpora to predict the next token given a context. Scale in parameters, data, and compute produces emergent capabilities: reasoning, code generation, question answering, and multimodal understanding.
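The training objective itself is simple to state: maximise the probability of the observed next token. A minimal numpy sketch of one prediction step (the toy vocabulary and logits are illustrative, not from any real model):

```python
import numpy as np

def next_token_loss(logits, target_id):
    """Cross-entropy loss for a single next-token prediction step.

    logits: unnormalised scores over the vocabulary, shape [V].
    target_id: index of the token that actually came next in the corpus.
    """
    # Softmax turns logits into a probability distribution over the vocabulary.
    shifted = logits - logits.max()                  # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    # The training signal is the negative log-probability of the true token.
    return -np.log(probs[target_id])

# Toy vocabulary of 5 tokens; the model strongly favours token 2.
logits = np.array([0.1, 0.2, 3.0, 0.1, 0.1])
print(next_token_loss(logits, 2))  # low loss: the favoured token was correct
print(next_token_loss(logits, 4))  # high loss: an unlikely token came next
```

Averaged over trillions of such steps, minimising this loss is the whole of pre-training; everything else in the pipeline refines what that objective produces.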

Architecture Evolution

Model architectures evolve roughly every six months (Huang, 2026), while hardware architectures evolve every three years — creating a co-design challenge. Key architectural developments referenced in fridman-huang-2026-nvidia-ai-revolution:

  • Transformers — original architecture; self-attention across all tokens
  • Mixture-of-Experts (MoE) — sparse activation; only a subset of model parameters engaged per token; enables larger parameter counts at lower inference cost. Drove nvidia’s NVLink 72 design.
  • SSM + Transformer hybrids — nvidia’s Nemotron 3 (120B parameters, open weights) combines transformers with state-space models (SSMs), enabling efficient sequence modelling.
  • Diffusion models — NVIDIA contributed to progressive GANs and the path to diffusion, now used for image/video generation
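The sparse activation behind MoE can be sketched in a few lines. This is a toy single-token version with plain matrices standing in for experts; real MoE layers route batches of tokens through feed-forward experts with load-balancing losses, and all names and shapes here are illustrative:

```python
import numpy as np

def moe_forward(x, expert_weights, router_weights, top_k=2):
    """Sparse mixture-of-experts layer: route a token to its top_k experts.

    x: token activations, shape [d].
    expert_weights: list of per-expert matrices, each [d, d].
    router_weights: router matrix, shape [num_experts, d].
    """
    scores = router_weights @ x                 # one routing score per expert
    top = np.argsort(scores)[-top_k:]           # indices of the top_k experts
    gate = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over chosen
    # Only top_k expert matrices do any work; the rest stay idle, which is
    # why a huge total parameter count costs relatively little per token.
    return sum(g * (expert_weights[i] @ x) for g, i in zip(gate, top))

rng = np.random.default_rng(0)
d, num_experts = 8, 16
experts = [rng.standard_normal((d, d)) for _ in range(num_experts)]
router = rng.standard_normal((num_experts, d))
y = moe_forward(rng.standard_normal(d), experts, router)
print(y.shape)  # (8,)
```

With 16 experts and top_k=2, only an eighth of the expert parameters touch each token, which is the inference-cost advantage the note attributes to MoE.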

LLMs and ai-scaling-laws

LLM capability follows power-law scaling with model size and compute. The four scaling axes (ai-scaling-laws) all apply to LLMs: pre-training, post-training refinement, test-time reasoning (chain-of-thought, search), and agentic deployment (using LLMs as the reasoning core of autonomous agents).
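As a concrete illustration of the power-law form, a Chinchilla-style parametric loss L(N, D) = E + A/N^alpha + B/D^beta can be evaluated directly. The constants below are the widely quoted Chinchilla fit, used purely for illustration; any specific model family would have different fitted values:

```python
def scaling_loss(n_params, n_tokens,
                 E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Chinchilla-style parametric loss: E + A/N^alpha + B/D^beta.

    E is the irreducible loss; the two power-law terms shrink as
    parameter count (N) and training tokens (D) grow. Constants are
    the widely quoted Chinchilla fit, here for illustration only.
    """
    return E + A / n_params**alpha + B / n_tokens**beta

# Scaling both parameters and data by 10x moves loss towards E,
# but with diminishing returns — the signature of a power law.
print(scaling_loss(1e9, 2e10))    # ~1B params, ~20B tokens
print(scaling_loss(1e10, 2e11))   # ~10B params, ~200B tokens
```

The same diminishing-returns shape is what motivates the other three axes: once pre-training gains per dollar flatten, post-training, test-time compute, and agentic scaffolding become the cheaper places to buy capability.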

From Language to Physical AI

jensen-huang argues that AI is not just language: biology, chemistry, physics, weather modelling, and robotics all require domain-specific models. NVIDIA’s open-source strategy targets these modalities to ensure every industry can access frontier AI capabilities. See physical-ai and open-source-ai.

The Open-Weight Landscape (Early 2026)

nathan-lambert and sebastian-raschka document a rich open-weight ecosystem as of early 2026 (fridman-lambert-raschka-2026-state-of-ai):

| Model | Developer | Notes |
| --- | --- | --- |
| DeepSeek V3 / R1 | deepseek | MoE; permissive licence; RLVR breakthrough |
| Qwen 2.5 series | Alibaba | 50T training tokens; various sizes |
| MiniMax / Kimi K2 Thinking | MiniMax / Moonshot | Large MoE; thinking model |
| GLM-4 | Z.ai (Zhipu AI) | Challenging DeepSeek by early 2026 |
| Mistral Large 3 | Mistral | EU-based; well-documented |
| gpt-oss-120b | OpenAI | OpenAI’s first open model since GPT-2 |
| Nemotron 3 Super | nvidia | 120B MoE; open weights + data + recipe |
| OLMo 3 | allen-institute-for-ai | Fully open data, code, and weights |
| SmolLM | HuggingFace | Small, efficient models |

Chinese models dominate the large-MoE tier with permissive licences, while US/EU labs lead in smaller, well-documented models. The motivations for open release: gaining developer mindshare globally (especially where API-security concerns block Chinese-hosted inference), enabling fine-tuning on proprietary data, and (for OpenAI) offloading inference compute to the community.

Training Pipeline

nathan-lambert describes a three-phase pipeline:

  1. Pre-training — Next-token prediction on trillions of tokens. Encodes most of the model’s knowledge. Synthetic data and OCR-extracted academic PDFs (arXiv, Semantic Scholar) are among the highest-quality sources.
  2. Mid-training — The same next-token objective, focused on specific capabilities (long context, reasoning traces). Instils the desired skills without catastrophic forgetting of pre-trained knowledge, and prepares the model for RLVR.
  3. Post-training — SFT → rlvr → RLHF. RLVR unlocks new skills; RLHF finishes style, tone, and formatting.
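What distinguishes RLVR from RLHF is the reward itself: a programmatic check rather than a learned preference model. A minimal sketch of such a reward function (the `#### answer` convention is a GSM8K-style assumption for illustration; real pipelines use unit tests, checkers, or graders):

```python
import re

def verifiable_reward(completion: str, expected_answer: str) -> float:
    """RLVR-style reward: 1.0 if the completion's final answer checks out.

    Unlike RLHF's learned preference model, the reward is computed by a
    program, so it cannot be gamed by fluent-but-wrong reasoning. Here the
    check is exact-match on a '#### answer' suffix; in practice it might
    be a test suite, a proof checker, or a symbolic evaluator.
    """
    match = re.search(r"####\s*(.+)$", completion.strip())
    return 1.0 if match and match.group(1).strip() == expected_answer else 0.0

# Hypothetical chain-of-thought completions ending in '#### <answer>'.
print(verifiable_reward("17 + 25 = 42 #### 42", "42"))  # 1.0
print(verifiable_reward("17 + 25 = 41 #### 41", "42"))  # 0.0
```

During RLVR the policy model samples many completions per problem, and this binary signal (rather than human preference scores) drives the policy-gradient update.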

LLM Selection Patterns

By early 2026, users pick models based on a single memorable win, stick with them until a notable failure, then switch — analogous to browser or OS loyalty. Lambert’s personal mix: Claude Opus 4.5 for coding and philosophy; GPT-5.2 Thinking for information retrieval; Gemini for fast/search queries; Grok 4 Heavy as a debugging fallback.

See transformer-architecture for architecture details and mixture-of-experts for the dominant architectural pattern.


Sources: fridman-huang-2026-nvidia-ai-revolution | fridman-lambert-raschka-2026-state-of-ai