What Is RLHF (Reinforcement Learning from Human Feedback)?
RLHF (Reinforcement Learning from Human Feedback) is a training technique that uses human preference judgments to align an AI model's outputs with human values and intended behaviour.
Source: Christiano et al. (2017); Ouyang et al. (2022)
Plain-language explanation
RLHF is the dominant method for aligning large language models. It works in three steps: human raters rank model outputs, those rankings train a reward model, and the language model is then optimised against that reward model. RLHF is responsible for much of the helpfulness and safety behaviour of modern chatbots. It matters for governance for two reasons: alignment achieved through RLHF is imperfect and can be circumvented (for example, by jailbreaking), and the human feedback can encode the raters' own biases.
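To make the middle step concrete, here is a minimal sketch of how pairwise rankings can train a reward model. It assumes a pairwise (Bradley-Terry style) preference loss of the kind described in the cited papers, applied to a toy linear model. The two-number feature vectors, the function names, and the learning rate are all hypothetical choices for illustration. Real systems score text with a neural network, and the final step (optimising the language model against the reward model) is not shown.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def reward(weights, features):
    # Toy linear reward model: a weighted sum of the output's features.
    return sum(w * f for w, f in zip(weights, features))

def preference_loss(weights, chosen, rejected):
    # Pairwise loss: small when the preferred output scores higher.
    margin = reward(weights, chosen) - reward(weights, rejected)
    return -math.log(sigmoid(margin))

def train_reward_model(pairs, n_features, lr=0.1, epochs=200):
    # Plain stochastic gradient descent on the pairwise loss.
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of -log(sigmoid(margin)) with respect to the weights is
            # -(1 - sigmoid(margin)) * (chosen - rejected); step against it.
            scale = 1.0 - sigmoid(margin)
            for i in range(n_features):
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights

# Hypothetical features per output: [helpfulness cue, harmful-content cue].
# Each pair is (output the rater preferred, output the rater rejected).
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [0.0, 0.0]),
    ([0.0, 0.0], [0.0, 1.0]),
]
weights = train_reward_model(pairs, n_features=2)
```

After training, the learned weight on the helpfulness cue is positive and the weight on the harmful-content cue is negative, so the model reproduces the raters' rankings. The sketch also shows where rater bias enters: the reward model learns whatever the pairs encode, including any systematic preferences the raters happen to share.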