RLHF

RLHF (Reinforcement Learning from Human Feedback) is a technique that trains a reward model on human preference data and then optimizes that reward signal with reinforcement learning to align a large language model (LLM) with human intent and values. Its goal is to steer model behavior toward the outputs people actually prefer, even among answers that would otherwise look equally valid.

  • RLHF trains a reward model on human preference comparisons and then optimizes that reward with reinforcement learning to align an LLM with human intent.
  • Standard RLHF runs in three stages: supervised fine-tuning (SFT) → reward model training → PPO reinforcement learning.
  • In OpenAI's InstructGPT paper, human raters preferred outputs from a 1.3B-parameter RLHF model over those of the 175B GPT-3, a model more than 100x larger.
  • The reinforcement learning stage adds a KL-divergence penalty that keeps the model from drifting too far from its starting point and exploiting flaws in the reward model (reward hacking).
  • More recently, DPO has emerged as a simpler, more stable alternative that drops the separate reward model and RL loop entirely.

Overview

RLHF is an alignment technique: it trains a reward model on human preference data, then uses reinforcement learning to optimize a large language model against that reward so that its behavior matches human intent and values. A model that has only been pretrained predicts the next token well, but that alone does not guarantee it will help people in the way they actually want. RLHF closes this gap by converting human judgments about which answer is better into a learning signal that corrects the model's behavior. It is the core alignment method shared by today's leading conversational models, including ChatGPT and Claude.

RLHF matters because what counts as a "good answer" is hard to capture in an explicit formula. Qualities such as helpfulness, honesty, and harmlessness are subjective and context-dependent, so instead of ground-truth labels, RLHF collects relative human preferences (A is better than B) and uses them to define, indirectly, the direction the model should move in.

The three-stage pipeline

The standard form of RLHF, established by OpenAI's InstructGPT paper (Ouyang et al., 2022), consists of three stages.

Stage 1: Supervised fine-tuning (SFT)

First, the pretrained model is fine-tuned in a supervised fashion on exemplar answers (demonstration data) written by human labelers. In this stage the model learns the basic format and tone for responding to instructions, producing the initial policy that serves as the starting point for the later reinforcement learning.
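
As a rough illustration of the supervised objective in this stage, the sketch below computes the mean negative log-likelihood of a demonstration answer. The function name `sft_loss` and its inputs are hypothetical, and excluding prompt tokens from the loss is a common choice assumed here, not something the InstructGPT paper is quoted as specifying.

```python
import math
from typing import Sequence


def sft_loss(token_probs: Sequence[float], is_response_token: Sequence[bool]) -> float:
    """Mean negative log-likelihood over the labeler-written response tokens.

    token_probs[i] is the probability the model assigned to the i-th
    reference token; is_response_token[i] marks tokens that belong to the
    demonstration answer (assumption: prompt tokens are excluded).
    """
    nll = [-math.log(p) for p, keep in zip(token_probs, is_response_token) if keep]
    if not nll:
        raise ValueError("at least one response token is required")
    return sum(nll) / len(nll)
```

The loss falls toward zero as the model assigns higher probability to the tokens the labeler actually wrote.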

Stage 2: Reward model training

For a given prompt, human labelers rank several model-generated outputs from best to worst. Collecting comparisons and rankings rather than absolute scores reduces variance across raters and yields more stable data. This preference data is then used to train a reward model that assigns a scalar score to any prompt-answer pair.
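
One common way to turn such rankings into a training signal is a pairwise logistic loss over every (better, worse) pair, which is the form used in the InstructGPT paper. The sketch below is a minimal version that operates on already-computed scalar scores; the function names are hypothetical, and a real reward model would produce these scores with a neural network.

```python
import math
from itertools import combinations
from typing import Sequence


def pairwise_loss(score_preferred: float, score_rejected: float) -> float:
    """-log(sigmoid(score_preferred - score_rejected)), computed stably."""
    margin = score_preferred - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def ranking_loss(scores_best_to_worst: Sequence[float]) -> float:
    """Average pairwise loss over all pairs implied by one human ranking.

    The input holds the reward model's scores for the outputs, ordered
    from the labeler's best choice to the worst.
    """
    # combinations() keeps input order, so the first item of each pair
    # is always the output the labeler ranked higher.
    pairs = list(combinations(scores_best_to_worst, 2))
    if not pairs:
        raise ValueError("a ranking needs at least two outputs")
    return sum(pairwise_loss(w, l) for w, l in pairs) / len(pairs)
```

The loss is small when the reward model scores the outputs in the same order as the human ranking and large when it reverses that order.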

Stage 3: Reinforcement learning (PPO)

Finally, the reward model's score is used as the reward to train the LLM with reinforcement learning, typically via the PPO (Proximal Policy Optimization) algorithm. The reward function also includes a KL-divergence penalty that keeps the current policy from straying too far from the SFT starting model. Without this constraint, the model can exploit weaknesses in the reward model, a failure known as "reward hacking", and produce text that scores highly yet is awkward or meaningless in practice. Anthropic's work on a helpful and harmless assistant (Bai et al., 2022) also reported an approximately linear relationship between the RL reward and the square root of the KL divergence between the policy and its initialization.
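
The KL-penalized reward described above can be sketched as follows. This is a simplified, sequence-level illustration and not a full PPO implementation: the function name, the per-token log-probability inputs, and the default `beta` value are all assumptions chosen for the example.

```python
from typing import Sequence


def kl_penalized_reward(
    reward_model_score: float,
    policy_logprobs: Sequence[float],
    reference_logprobs: Sequence[float],
    beta: float = 0.1,  # hypothetical penalty strength
) -> float:
    """Reward model score minus beta times a sampled KL estimate.

    policy_logprobs[i] and reference_logprobs[i] are the log-probabilities
    that the current policy and the frozen SFT model assign to the i-th
    generated token.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-probability sequences must have equal length")
    # Summed log-ratio over the sampled tokens: a single-sample estimate
    # of the KL divergence between the policy and the SFT model.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward_model_score - beta * kl_estimate
```

When the policy assigns its own output much higher probability than the SFT model would, the penalty grows and offsets part of the reward model's score, which is what discourages drift.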

Significance and evidence

InstructGPT is the result that most clearly demonstrated the impact of RLHF. According to the paper, human raters preferred outputs from a 1.3B-parameter model aligned with RLHF over those of the 175B GPT-3, a model more than 100x larger. At the same time, alignment training improved truthfulness and reduced toxic outputs, with only minimal performance regressions on standard NLP benchmarks. The result shows that aligning a model with human feedback can do more to produce "genuinely useful answers" than simply scaling the model up.

Anthropic's research (Bai et al., 2022) applied RLHF to training a helpful and harmless assistant, proposing an approach that iteratively refreshes the reward model and policy with new human feedback each week. It also reported that alignment training improved performance on almost all NLP evaluations and did not conflict with learning specialized skills such as coding and summarization.

That said, standard RLHF can be complex and unstable to implement because it requires both reward model training and a PPO loop. A simplified alternative is DPO (Direct Preference Optimization, Rafailov et al., 2023). DPO re-parameterizes the reward in closed form in terms of the policy, so preference data can be optimized directly with a simple classification loss, with no separate reward model or reinforcement learning loop. The paper reports that DPO is simpler to implement and train, more stable, and less computationally demanding than PPO-based RLHF, while matching or exceeding its performance on tasks such as sentiment control and dialogue quality. Anthropic has also proposed Constitutional AI (Bai et al., 2022, a separate paper from the helpful-and-harmless work cited above), a form of RLAIF (reinforcement learning from AI feedback) in which an AI rather than humans generates feedback according to a constitution of principles. This is a scalable approach that reduces the cost of collecting large volumes of human labels.
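
To make the DPO classification loss above more concrete, here is a minimal sketch for a single preference pair. It follows the loss form given in the DPO paper, but the function name, argument names, and default `beta` are assumptions for illustration, and the inputs are taken to be summed sequence log-probabilities computed elsewhere.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical strength of the implicit KL constraint
) -> float:
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    # How much more likely each answer is under the policy than under the
    # frozen reference model; this log-ratio acts as an implicit reward.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss decreases as the policy raises the probability of the preferred answer relative to the reference model and lowers that of the rejected one, which is why no separate reward model is needed.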

References