Reinforcement Learning from Human Feedback (RLHF)
RLHF is a post-training technique in which human raters compare model outputs and assign preferences. These preferences train a reward model, which is then used as the optimisation target for reinforcement learning (typically PPO) on the language model.
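The reward model is typically trained on pairwise preferences with a Bradley-Terry objective. A minimal sketch, assuming scalar rewards for the chosen and rejected outputs (the function name and values are illustrative, not from the source):

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    Minimising this pushes the reward model to score the human-preferred
    output above the rejected one. Names and values are illustrative.
    """
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A larger positive margin gives a smaller loss:
loss_correct_ranking = preference_loss(2.0, -1.0)  # small loss
loss_wrong_ranking = preference_loss(-1.0, 2.0)    # large loss
```

The trained reward model then scores sampled completions during the RL phase, where PPO maximises that score subject to a KL penalty against the starting policy.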
Role in the Post-Training Pipeline
nathan-lambert describes the standard post-training pipeline (fridman-lambert-raschka-2026-state-of-ai) as SFT → rlvr → RLHF:
- SFT (supervised fine-tuning) — teaches the model format and basic instruction-following
- RLVR — develops reasoning and task-solving ability from verifiable feedback; this is where skills are unlocked
- RLHF — refines style, tone, formatting, and personality; finishes the surface-layer polish
Limitations vs. RLVR
RLHF reaches diminishing returns quickly because of reward hacking: the language model learns to optimise the reward model’s scores rather than genuine quality. The reward model itself is imperfect and begins assigning high scores to outputs that look correct or helpful without being so.
RLVR’s external verifier (math checker, code executor) is far harder to game: the answer either passes the check or it does not. This gives RLVR a log-linear scaling property that RLHF lacks. See rlvr.
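The contrast with a learned reward model can be made concrete. A minimal sketch of a verifiable reward, using exact-match checking as a stand-in for a real verifier (production systems use symbolic math checkers or sandboxed code execution; this function is hypothetical):

```python
def verifiable_reward(model_answer: str, reference: str) -> float:
    """Binary reward from an external check (illustrative exact-match).

    Unlike a learned reward model, there is no score to drift: the
    answer matches the reference or it does not, so outputs that merely
    *look* correct earn nothing.
    """
    return 1.0 if model_answer.strip() == reference.strip() else 0.0

verifiable_reward(" 42 ", "42")  # correct answer, reward 1.0
verifiable_reward("41", "42")    # wrong answer, reward 0.0
```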
Book Reference
nathan-lambert is the author of the RLHF Book, a widely cited reference on post-training alignment techniques.