RLHF
Also known as: Reinforcement Learning from Human Feedback, Human Feedback Training, Preference Learning
Definition
A training technique that aligns LLMs with human preferences: human feedback is used to train a reward model, and the LLM is then optimized against that reward model with reinforcement learning. RLHF is a key step in turning raw pre-trained models into helpful assistants; it teaches them to produce outputs humans prefer rather than merely predicting likely text.
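At its core, the reward model is trained on pairwise human comparisons with a Bradley-Terry style loss. A minimal sketch of that loss, assuming PyTorch; the function and variable names are illustrative, not taken from any specific library:

```python
# Pairwise (Bradley-Terry) loss used to train an RLHF reward model.
# Illustrative sketch only, not a specific library's implementation.
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """Scalar rewards assigned to the human-preferred and rejected responses
    for the same prompt, each of shape (batch,)."""
    # Push the preferred response's reward above the rejected one's:
    # loss = -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# The loss falls as the reward model ranks the chosen response higher.
chosen = torch.tensor([1.5, 0.2])
rejected = torch.tensor([0.3, 0.9])
print(reward_model_loss(chosen, rejected))
```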
What this is NOT
- Not supervised fine-tuning alone (RLHF adds RL optimization)
- Not prompting (RLHF changes model weights)
- Not pre-training (RLHF is a post-training alignment step)
Alternative Interpretations
Different communities use this term differently:
LLM practitioners
The process used by OpenAI, Anthropic, and others to create chat models from base models. It involves three stages: (1) supervised fine-tuning on human-written demonstrations, (2) training a reward model on human preference comparisons, and (3) reinforcement learning (PPO or similar) to optimize the model against the reward model.
Sources: InstructGPT paper (Ouyang et al., 2022), Anthropic Constitutional AI, OpenAI RLHF documentation
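A minimal sketch of the reward signal optimized in stage (3), assuming a PyTorch setup with illustrative names: the reward model's scalar score for the full response is combined with a per-token KL penalty that keeps the policy close to the supervised fine-tuned reference model.

```python
# Sketch of the KL-penalized reward used in the RL stage of RLHF.
# Shapes and names are illustrative; real pipelines also handle padding
# masks, reward whitening, and advantage estimation.
import torch

def kl_penalized_rewards(rm_score: torch.Tensor,      # (batch,) reward-model score per response
                         logprobs: torch.Tensor,      # (batch, seq) policy log-probs of generated tokens
                         ref_logprobs: torch.Tensor,  # (batch, seq) reference-model log-probs of the same tokens
                         beta: float = 0.1) -> torch.Tensor:
    # Per-token KL penalty, approximated by the log-probability difference.
    kl = logprobs - ref_logprobs                      # (batch, seq)
    rewards = -beta * kl                              # penalty applied at every generated token
    rewards[:, -1] += rm_score                        # reward-model score credited at the final token
    return rewards                                    # fed into PPO advantage estimation
```

Libraries such as trl wrap a signal like this in a full PPO loop: rollout generation, advantage estimation, and the clipped policy update.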
Examples
- Training ChatGPT from GPT-3.5 base with RLHF
- Anthropic's Constitutional AI (an RLHF variant in which AI-generated feedback, RLAIF, replaces part of the human feedback)
- Open-source RLHF with the trl library
- DPO as a simpler alternative to PPO-based RLHF (see the sketch after this list)
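For the DPO example above, a minimal sketch of its loss, again assuming PyTorch with illustrative names: DPO drops the explicit reward model and RL loop and instead optimizes the policy directly on preference pairs, using the log-probability ratio against a frozen reference model as an implicit reward.

```python
# Sketch of the DPO loss on a batch of preference pairs.
# Inputs are summed log-probs of each full response; names are illustrative.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,    # log pi(y_chosen | x), shape (batch,)
             policy_rejected_logps: torch.Tensor,  # log pi(y_rejected | x)
             ref_chosen_logps: torch.Tensor,       # log pi_ref(y_chosen | x)
             ref_rejected_logps: torch.Tensor,     # log pi_ref(y_rejected | x)
             beta: float = 0.1) -> torch.Tensor:
    # Implicit reward of each response: beta * log(pi / pi_ref)
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Bradley-Terry loss on the implicit rewards
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```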
Counterexamples
Things that might seem like RLHF but are not:
- Supervised fine-tuning alone (no RL step)
- Prompting to get better outputs (no training)
- Pre-training on web text (not preference-based)
Relations
- specializes fine-tuning (RLHF is a type of fine-tuning using RL)
- overlapsWith instruction-tuning (Often combined with instruction tuning)
- overlapsWith alignment (RLHF is an alignment technique)
Implementations
Tools and frameworks that implement this concept:
- Axolotl (secondary)