RLHF is a three-stage technique for aligning an LLM with human preferences: collect comparison data, train a reward model on it, then optimise the LLM against that reward with reinforcement learning. The approach went mainstream between OpenAI's early preference-learning experiments in 2017 and InstructGPT and ChatGPT in 2022; much of the 'helpful, harmless and honest' behaviour of modern assistants comes from this pipeline. Humans rank model outputs, a reward model learns those preferences, and RL algorithms such as PPO then push the LLM toward higher reward. Because RLHF is expensive and finicky, the field has rapidly explored simpler alternatives such as DPO, RLAIF and Constitutional AI.
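The two trainable stages can be compressed into a short sketch. The snippet below is a minimal, hypothetical illustration in PyTorch, not the InstructGPT implementation: the `reward_head`, the 768-dimensional embeddings, the variable names and the `kl_coef` value are all assumptions made for the example. It shows the pairwise (Bradley-Terry style) loss that teaches the reward model to score the human-preferred answer higher, and the KL-penalised reward that an RL algorithm like PPO then maximises.

```python
import torch
import torch.nn as nn

# Stage 2 (assumed setup): a scalar reward head on top of pooled response
# embeddings. In practice this sits on a full LLM backbone.
reward_head = nn.Linear(768, 1)

def reward_model_loss(chosen_emb, rejected_emb):
    """Pairwise preference loss: push r(chosen) above r(rejected)."""
    r_chosen = reward_head(chosen_emb).squeeze(-1)
    r_rejected = reward_head(rejected_emb).squeeze(-1)
    # Maximise the log-probability that the human-preferred output scores higher.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

def shaped_reward(reward, policy_logprob, ref_logprob, kl_coef=0.1):
    """Stage 3: reward the sampled response, minus a KL penalty that keeps
    the policy close to the reference (pre-RLHF) model."""
    return reward - kl_coef * (policy_logprob - ref_logprob)

# Toy usage with random tensors, just to show the shapes line up.
chosen = torch.randn(4, 768)
rejected = torch.randn(4, 768)
loss = reward_model_loss(chosen, rejected)
loss.backward()
```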
Glossary · Advanced · 2017
RLHF — Reinforcement Learning from Human Feedback
An alignment technique that trains a reward model from human preferences and then optimises the LLM against it.
- EN (English term) — RLHF (Reinforcement Learning from Human Feedback)
- TR (Turkish term) — RLHF (İnsan Geri Bildirimiyle Pekiştirmeli Öğrenme)