Human Feedback
Also known as: Human Preference Data, Preference Labels, Human Evaluation
Definition
Data capturing human judgments about AI outputs: preferences between responses, quality ratings, safety flags, or corrections. Human feedback is the key input for RLHF and related alignment techniques. It represents what humans actually want from AI systems, as opposed to what a model would produce from statistical patterns in its training data alone.
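As a minimal sketch (an illustration, not something prescribed by this entry) of how pairwise preference labels typically become a training signal for RLHF's reward model, the snippet below computes a Bradley-Terry-style loss of the kind used in the InstructGPT setup. The function name and toy reward values are assumptions for illustration.

```python
# Sketch: turning pairwise human preferences into a reward-model loss.
# Assumes PyTorch; names and values are illustrative only.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Loss is small when the reward model scores the human-preferred
    response above the rejected one (Bradley-Terry-style objective)."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy example: scalar rewards the model assigned to two response pairs.
chosen = torch.tensor([1.8, 0.3])    # human-preferred responses
rejected = torch.tensor([0.5, 0.9])  # dispreferred responses
print(preference_loss(chosen, rejected))  # lower when chosen > rejected
```

Minimizing this loss pushes the reward model to rank preferred responses higher, which is what makes human preference data usable as a reinforcement-learning signal.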
What this is NOT
- Not model-generated preferences (human, not AI)
- Not implicit feedback like clicks (explicit judgments)
- Not general annotation (specifically about AI output quality)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
Collected human judgments used to train reward models and align LLMs. Includes pairwise preferences ("A is better than B"), ratings ("4/5"), and corrections ("this response should instead say...").
Sources: InstructGPT paper, Anthropic HH dataset, OpenAssistant dataset
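The three feedback shapes mentioned above (pairwise preferences, ratings, corrections) might be recorded roughly as follows. The field names and example values are illustrative assumptions, not a standard schema.

```python
# Illustrative records for the three common feedback shapes.
# Field names are assumptions for this sketch, not a published format.
pairwise_preference = {
    "prompt": "Explain photosynthesis to a child.",
    "response_a": "...",
    "response_b": "...",
    "preferred": "a",               # "A is better than B"
}

rating = {
    "prompt": "Summarize this article.",
    "response": "...",
    "score": 4,                     # e.g. 4 out of 5
}

correction = {
    "prompt": "What year did the Berlin Wall fall?",
    "response": "1991",
    "corrected_response": "1989",   # "this response should instead say..."
}
```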
Examples
- Anthropic HH dataset: preference comparisons (see the loading sketch after this list)
- InstructGPT contractor ratings
- OpenAssistant crowdsourced conversations and preferences
- Thumbs up/down on ChatGPT responses
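As a concrete sketch of the first example, the Anthropic HH preference comparisons can be inspected with the Hugging Face `datasets` library. The dataset id and the `chosen`/`rejected` field names reflect the dataset as published and should be verified before relying on them.

```python
# Sketch: inspecting pairwise preference data from the Anthropic HH dataset.
# Assumes the Hugging Face `datasets` library is installed.
from datasets import load_dataset

hh = load_dataset("Anthropic/hh-rlhf", split="train")
example = hh[0]
print(example["chosen"][:200])    # human-preferred conversation
print(example["rejected"][:200])  # dispreferred alternative
```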
Counterexamples
Things that might seem like Human Feedback but are not:
- LLM-as-judge evaluations (AI, not human)
- Automated metrics (BLEU, etc.)
- Click-through rates (implicit, not explicit)
Relations
- requiredBy rlhf (RLHF uses human feedback to train reward models)
- overlapsWith annotation (collecting human feedback is a specialized form of annotation)
- overlapsWith alignment (Human feedback enables alignment)
Implementations
Tools and frameworks that implement this concept: