RLHF


Also known as: Reinforcement Learning from Human Feedback, Human Feedback Training, Preference Learning

Definition

A training technique that aligns LLMs with human preferences by using human feedback to train a reward model, then optimizing the LLM against that reward model with reinforcement learning. RLHF is a central step in turning raw pre-trained models into helpful assistants: it teaches them to produce outputs humans prefer rather than merely the most likely continuation of the text.
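As a rough illustration, the sketch below shows the pairwise (Bradley-Terry) loss commonly used to train the reward model on human comparisons. The function and tensor names, and the toy values, are illustrative assumptions rather than any specific library's API.

```python
# Minimal sketch of the pairwise preference loss used to train a reward model.
# `preference_loss` and the tensors below are illustrative placeholders.
import torch
import torch.nn.functional as F

def preference_loss(chosen_rewards: torch.Tensor,
                    rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Push the reward model to score the human-preferred response higher.

    loss = -log(sigmoid(r_chosen - r_rejected)), averaged over the batch.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: scalar reward-model scores for a batch of 4 preference pairs.
chosen = torch.tensor([1.2, 0.3, 2.0, -0.5])
rejected = torch.tensor([0.4, 0.1, 1.5, -1.0])
print(float(preference_loss(chosen, rejected)))
```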

What this is NOT

  • Not supervised fine-tuning alone (RLHF adds RL optimization)
  • Not prompting (RLHF changes model weights)
  • Not pre-training (RLHF is a post-training alignment step)

Alternative Interpretations

Different communities use this term differently:

LLM practitioners

The process used by OpenAI, Anthropic, and others to create chat models from base models. It involves: (1) supervised fine-tuning on human-written demonstrations, (2) training a reward model on human preference comparisons, and (3) reinforcement learning (typically PPO) that optimizes the model against the reward model, usually with a KL penalty keeping it close to the supervised fine-tuned model.

Sources: InstructGPT paper (Ouyang et al., 2022), Anthropic Constitutional AI, OpenAI RLHF documentation
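For step (3), a common formulation shapes per-token rewards from the reward model's sequence score plus a KL penalty toward the reference (SFT) model. The sketch below illustrates that shaping under assumed names and tensor shapes; real pipelines (e.g. trl's PPOTrainer) embed this logic inside a full PPO training loop.

```python
# Sketch of the step-(3) objective: maximize the reward model's score while a
# per-token KL penalty keeps the policy close to the SFT/reference model.
# All names and shapes here are illustrative assumptions.
import torch

def shaped_rewards(rm_score: torch.Tensor,        # reward-model score per sequence, shape (B,)
                   policy_logprobs: torch.Tensor, # log p_policy per generated token, shape (B, T)
                   ref_logprobs: torch.Tensor,    # log p_ref per generated token, shape (B, T)
                   kl_coef: float = 0.1) -> torch.Tensor:
    """Per-token rewards: KL penalty at every token, RM score added at the final token."""
    kl_penalty = kl_coef * (policy_logprobs - ref_logprobs)  # approximate per-token KL
    rewards = -kl_penalty
    rewards[:, -1] += rm_score  # sequence-level reward credited to the last token
    return rewards

# Toy usage: batch of 2 responses, 5 generated tokens each.
policy_lp = torch.randn(2, 5)
ref_lp = torch.randn(2, 5)
score = torch.tensor([0.8, -0.2])
print(shaped_rewards(score, policy_lp, ref_lp))
```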

Examples

  • Training ChatGPT from GPT-3.5 base with RLHF
  • Anthropic's Constitutional AI (an RLHF variant in which AI feedback supplements human feedback)
  • Open-source RLHF with the trl library
  • DPO (Direct Preference Optimization) as a simpler alternative to PPO-based RLHF; see the sketch after this list
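For comparison, the following sketch shows the DPO loss, which replaces the explicit reward model and RL loop with a single supervised objective over preference pairs. The function name, arguments, and toy values are illustrative assumptions.

```python
# Minimal sketch of the DPO loss: directly widen the margin between policy and
# reference log-likelihoods of chosen vs. rejected responses, no RL loop needed.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """-log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))), averaged over the batch."""
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage: summed sequence log-probs for a batch of 3 preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5, -20.1]),
                torch.tensor([-13.4, -9.0, -22.3]),
                torch.tensor([-12.5, -9.8, -21.0]),
                torch.tensor([-13.0, -9.2, -21.5]))
print(float(loss))
```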

Counterexamples

Things that might seem like RLHF but are not:

  • Supervised fine-tuning alone (no RL step)
  • Prompting to get better outputs (no training)
  • Pre-training on web text (not preference-based)

Relations

  • specializes fine-tuning (RLHF is a type of fine-tuning using RL)
  • overlapsWith instruction-tuning (often combined with instruction tuning)
  • overlapsWith alignment (RLHF is an alignment technique)

Implementations

Tools and frameworks that implement this concept:
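  • trl (Hugging Face) – open-source library with trainers for supervised fine-tuning, reward modeling, PPO, and DPO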