RLHF is a three-stage technique for aligning an LLM with human preferences: collect comparison data, train a reward model on it, then optimise the LLM against that reward with reinforcement learning. The approach went mainstream between OpenAI's early preference-learning experiments in 2017 and InstructGPT and ChatGPT in 2022; much of the 'helpful, harmless and honest' behaviour of modern assistants comes from this pipeline. Humans rank model outputs, a reward model learns those preferences, and RL algorithms such as PPO then push the LLM toward higher reward. Because RLHF is expensive and finicky, the field has rapidly explored simpler alternatives such as DPO, RLAIF and Constitutional AI.
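The two trainable stages can be compressed into a short sketch. The snippet below is a minimal, hypothetical illustration in PyTorch, not the InstructGPT implementation: the `reward_head`, the 768-dimensional embeddings, the variable names and the `kl_coef` value are all assumptions made for the example. It shows the pairwise (Bradley-Terry style) loss that teaches the reward model to score the human-preferred answer higher, and the KL-penalised reward that an RL algorithm like PPO then maximises.

```python
import torch
import torch.nn as nn

# Stage 2 (assumed setup): a scalar reward head on top of pooled response
# embeddings. In practice this sits on a full LLM backbone.
reward_head = nn.Linear(768, 1)

def reward_model_loss(chosen_emb, rejected_emb):
    """Pairwise preference loss: push r(chosen) above r(rejected)."""
    r_chosen = reward_head(chosen_emb).squeeze(-1)
    r_rejected = reward_head(rejected_emb).squeeze(-1)
    # Maximise the log-probability that the human-preferred output scores higher.
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

def shaped_reward(reward, policy_logprob, ref_logprob, kl_coef=0.1):
    """Stage 3: reward the sampled response, minus a KL penalty that keeps
    the policy close to the reference (pre-RLHF) model."""
    return reward - kl_coef * (policy_logprob - ref_logprob)

# Toy usage with random tensors, just to show the shapes line up.
chosen = torch.randn(4, 768)
rejected = torch.randn(4, 768)
loss = reward_model_loss(chosen, rejected)
loss.backward()
```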
Glossary · Advanced · 2017
RLHF — Reinforcement Learning from Human Feedback
An alignment technique that trains a reward model from human preferences and then optimises the LLM against it.
- EN (English term) — RLHF (Reinforcement Learning from Human Feedback)
- TR (Turkish term) — RLHF (İnsan Geri Bildirimiyle Pekiştirmeli Öğrenme)