Governance Concept

What Is RLHF (Reinforcement Learning from Human Feedback)?

RLHF (Reinforcement Learning from Human Feedback) is a training technique that uses human preference judgments to align an AI model's outputs with human values and intended behaviour.

Definition

RLHF is one of the dominant methods for aligning large language models. It proceeds in three steps: human raters rank model outputs; those rankings train a reward model that predicts which outputs people prefer; and the language model is then optimised against that reward model using reinforcement learning. RLHF is responsible for much of the helpfulness and safety behaviour of modern chatbots. Its governance relevance is twofold: alignment achieved through RLHF is imperfect and can be circumvented (for example, by jailbreaking), and the human feedback can encode the raters' own biases.

Source: Christiano et al. (2017); Ouyang et al. (2022)
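The reward-model step described above can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the method of the cited papers: the reward model is a two-weight linear function rather than a neural network, the features and comparison data are invented for illustration, and the final step uses simple best-of-n selection as a stand-in for the reinforcement-learning optimisation used in practice. All function names are hypothetical.

```python
import math

def reward(weights, features):
    """Linear reward model: score = w . x (a stand-in for a neural network)."""
    return sum(w * x for w, x in zip(weights, features))

def train_reward_model(comparisons, dim, lr=0.5, epochs=200):
    """Fit weights so that preferred outputs score higher than rejected ones.

    comparisons: list of (preferred_features, rejected_features) pairs.
    Minimises the pairwise loss -log(sigmoid(r_preferred - r_rejected)).
    """
    weights = [0.0] * dim
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            p = 1.0 / (1.0 + math.exp(-margin))  # model's P(preferred wins)
            scale = 1.0 - p                      # larger when the model is unsure
            for i in range(dim):
                # Gradient descent step on the pairwise loss.
                weights[i] += lr * scale * (preferred[i] - rejected[i])
    return weights

def pick_best(candidates, weights):
    """Best-of-n selection: return the candidate the reward model scores highest.

    Assumption: a simplified stand-in for optimising the language model itself.
    """
    return max(candidates, key=lambda c: reward(weights, c))

# Hypothetical toy data: each output is described by two made-up features,
# (helpfulness cue, harmfulness cue). Raters preferred the first item of each pair.
comparisons = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.3, 0.9]),
    ([0.9, 0.2], [0.5, 0.6]),
]
weights = train_reward_model(comparisons, dim=2)
best = pick_best([[0.2, 0.9], [0.9, 0.1]], weights)
```

The sketch also makes the governance point concrete: the learned weights reflect only what the raters preferred in the comparison data, so any systematic bias in those judgments is carried into the reward model and, through it, into the behaviour that gets selected.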
