RLVR (Reinforcement Learning with Verifiable Rewards)

RLVR is a post-training technique in which a language model generates answers to problems with objectively verifiable solutions (mathematics, code tests, logic puzzles), receives a binary correct/incorrect reward signal, and takes RL gradient updates to maximise correct answers. No human labellers are required.
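The reward signal above can be sketched in a few lines. This is a minimal illustration, not any lab's actual implementation; the `Answer:` output format and function names are assumptions.

```python
# Minimal sketch of an RLVR reward signal (illustrative; the "Answer:" marker
# and function names are assumed, not from any specific training recipe).
# A verifier compares the model's final answer to ground truth and returns a
# binary reward -- no learned reward model, no human labellers.

def extract_final_answer(completion: str) -> str:
    """Take the text after the last 'Answer:' marker (assumed output format)."""
    return completion.rsplit("Answer:", 1)[-1].strip()

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the verifier accepts the final answer, 0.0 otherwise."""
    return 1.0 if extract_final_answer(completion) == ground_truth else 0.0

# A correct completion earns 1.0; anything else earns 0.0.
r = verifiable_reward("Let me think... 12 * 12 = 144. Answer: 144", "144")
```

The RL step then updates the policy to make high-reward completions more likely, using any standard policy-gradient method.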

Origin

The term was coined by the team at allen-institute-for-ai (AI2) in the Tulu 3 post-training recipe, and popularised when deepseek R1 (January 2025) demonstrated RLVR’s scaling properties at a scale and cost that shocked the industry (fridman-lambert-raschka-2026-state-of-ai).

nathan-lambert was part of the AI2 team that coined the term.

Why It Outperforms RLHF

reinforcement-learning-from-human-feedback (RLHF) trains a reward model from human preferences, then uses RL to optimise against it. This plateaus quickly: the policy over-optimises against the reward model and learns to produce outputs that score highly but are actually wrong (reward hacking).

RLVR sidesteps this by using an external verifier (a math checker, a code executor) instead of a learned reward model. There is no reward model to hack — the answer is either correct or it is not.
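For code tasks, "external verifier" concretely means executing the candidate against unit tests. A toy sketch, assuming the model is prompted to emit a function named `solution` (that name and the helper are hypothetical):

```python
# Toy code-execution verifier (hypothetical helper, not a specific library's
# API): run the model's candidate function against unit tests. The reward is
# grounded in actual program behaviour, so there is no reward model to hack.

def code_verifier(candidate_src: str, tests: list) -> float:
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)   # define the candidate function
        fn = namespace["solution"]       # assumed entry-point name
        ok = all(fn(*args) == expected for args, expected in tests)
    except Exception:
        ok = False                       # crashes and wrong answers both score 0
    return 1.0 if ok else 0.0

reward = code_verifier(
    "def solution(a, b):\n    return a + b",
    [((1, 2), 3), ((0, 0), 0)],
)
```

(Production verifiers sandbox the execution; bare `exec` is for illustration only.)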

Key scaling property: log-linear — each 10× increase in compute yields a roughly constant gain on evaluation benchmarks, with no observed plateau as of early 2026. Grok 4 reportedly spent compute on RL comparable to its pre-training compute budget.
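The log-linear claim can be stated as a formula: score grows linearly in log10 of compute, so every 10× step adds the same number of points. The slope and intercept below are made-up illustrative values, not fitted to real data.

```python
# Illustrative log-linear scaling law (made-up coefficients, not real data):
# benchmark score is linear in log10(compute), so each 10x of compute adds a
# constant `slope` points rather than a constant multiple.
import math

def predicted_score(compute: float, slope: float = 5.0, intercept: float = 10.0) -> float:
    return intercept + slope * math.log10(compute)

# Each 10x step in compute adds `slope` points:
gain = predicted_score(1e21) - predicted_score(1e20)
```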

Inference-Time Scaling

RLVR training naturally produces models that generate extended chain-of-thought reasoning before their final answer. This is the mechanism by which inference-time scaling (thinking tokens spent at generation time) emerges. The model learns, via RL, that taking more reasoning steps before committing improves accuracy — and that this reasoning is worth the extra compute budget.

The “Aha Moment”

During RLVR training, models reliably develop a self-correction behaviour: mid-reasoning, the model generates tokens like “I made an error, let me retry” and corrects its approach. This emerges from RL without being explicitly programmed. nathan-lambert and sebastian-raschka cite this as evidence that RLVR is qualitatively different from simple supervised learning — it produces novel problem-solving strategies (fridman-lambert-raschka-2026-state-of-ai).

RLVR 2.0

Predicted extensions:

  • Process reward models — grading intermediate reasoning steps (not just final answers), enabling denser feedback signals
  • Expansion to open-ended domains — science, medicine, law (where ground truth is less binary but still verifiable)
  • Value functions — bringing deep RL value estimation techniques into language model training
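The first prediction — process reward models — contrasts with the outcome-only reward sketched earlier: instead of one binary score on the final answer, a grader scores every intermediate step. A hedged sketch of the idea (all names and the toy grader are hypothetical):

```python
# Hedged sketch of the process-reward idea (names hypothetical): grade each
# intermediate reasoning step, not just the final answer, for a denser signal.

def outcome_reward(steps: list, verifier) -> float:
    """Classic RLVR: a single binary reward on the final answer."""
    return 1.0 if verifier(steps[-1]) else 0.0

def process_rewards(steps: list, step_grader) -> list:
    """Process reward model: score each reasoning step independently in [0, 1]."""
    return [step_grader(s) for s in steps]

# With a toy grader that accepts any step containing '=', the dense signal
# localises *where* the reasoning went wrong, not just that it did:
dense = process_rewards(
    ["2+2 = 4", "4*3 = 12", "so the answer is 13"],
    lambda s: 1.0 if "=" in s else 0.0,
)
```

In practice the step grader would itself be a learned model, which reintroduces some reward-hacking risk — the trade-off behind this prediction.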

See also: reinforcement-learning-from-human-feedback, ai-scaling-laws, synthetic-data


Source: fridman-lambert-raschka-2026-state-of-ai