Reinforcement Learning from Human Feedback (RLHF)
RLHF is a post-training technique in which human raters compare model outputs and assign preferences. These preferences train a reward model, which is then used as the optimisation target for reinforcement learning (typically PPO) on the language model.
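The reward model is typically trained on pairwise preferences with a Bradley-Terry objective. A minimal sketch, assuming scalar rewards for the chosen and rejected outputs (the function name and values are illustrative, not from the source):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    Minimising this pushes the reward model to score the human-preferred
    output above the rejected one. Names and values are illustrative.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A larger positive margin gives a smaller loss:
loss_correct_ranking = preference_loss(2.0, -1.0)  # small loss
loss_wrong_ranking = preference_loss(-1.0, 2.0)    # large loss
```

The trained reward model then scores sampled completions during the RL phase, where PPO maximises that score subject to a KL penalty against the starting policy.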
Role in the Post-Training Pipeline
nathan-lambert describes the standard post-training pipeline (fridman-lambert-raschka-2026-state-of-ai) as SFT → rlvr → RLHF:
- SFT (supervised fine-tuning) — teaches the model format and basic instruction-following
- RLVR — develops reasoning and task-solving ability from verifiable feedback; this is where skills are unlocked
- RLHF — refines style, tone, formatting, and personality; finishes the surface-layer polish
Limitations vs. RLVR
RLHF reaches diminishing returns quickly because of reward hacking: the language model learns to optimise the reward model’s scores rather than genuine quality. The reward model itself is imperfect and begins assigning high scores to outputs that look correct or helpful without being so.
RLVR’s external verifier (math checker, code executor) is far harder to game: the answer either passes the check or it does not. This gives RLVR a log-linear scaling property that RLHF lacks. See rlvr.
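The contrast with a learned reward model can be made concrete. A minimal sketch of a verifiable reward, using exact-match checking as a stand-in for a real verifier (production systems use symbolic math checkers or sandboxed code execution; this function is hypothetical):

```python
def verifiable_reward(model_answer: str, reference: str) -> float:
    """Binary reward from an external check (illustrative exact-match).

    Unlike a learned reward model, there is no score to drift: the
    answer matches the reference or it does not, so outputs that merely
    *look* correct earn nothing.
    """
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

verifiable_reward(" 42 ", "42")  # correct answer, reward 1.0
verifiable_reward("41", "42")    # wrong answer, reward 0.0
```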
Book Reference
nathan-lambert is the author of the RLHF Book, a widely cited reference on post-training alignment techniques.