Alignment

property · evaluation · published

Also known as: AI Alignment, Value Alignment, Model Alignment

Definition

The degree to which an AI system's behavior matches the goals, values, and constraints intended by its designers and users. For LLMs, alignment is commonly summarized as being helpful, harmless, and honest: the model does what users intend, avoids harmful outputs, and does not deceive. Alignment is typically instilled through training (e.g., RLHF) and enforced at inference time through guardrails.
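
To illustrate the "enforced through guardrails" part, here is a minimal sketch of an inference-time guardrail that checks both the prompt and the response. The `harmful_topics` list and the substring check are hypothetical stand-ins for a trained moderation classifier; real guardrail stacks are considerably more involved.

```python
# Minimal sketch of an inference-time guardrail (illustrative only).
# The substring check and `harmful_topics` list are placeholders for a
# real moderation classifier.

REFUSAL = "I can't help with that."

harmful_topics = ["build a weapon", "write malware"]  # hypothetical patterns

def flagged(text: str) -> bool:
    """Crude stand-in for a trained moderation classifier."""
    lowered = text.lower()
    return any(topic in lowered for topic in harmful_topics)

def guarded_generate(prompt: str, generate) -> str:
    """Check the prompt before generation and the response after it."""
    if flagged(prompt):
        return REFUSAL
    response = generate(prompt)
    if flagged(response):
        return REFUSAL
    return response

# Usage with a dummy generator:
if __name__ == "__main__":
    echo = lambda p: f"Echo: {p}"
    print(guarded_generate("Summarize this article.", echo))
    print(guarded_generate("Explain how to build a weapon.", echo))
```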

What this is NOT

  • Not just safety (alignment includes helpfulness and honesty)
  • Not just instruction following (alignment includes values)
  • Not solved (alignment is an ongoing research challenge)

Alternative Interpretations

Different communities use this term differently:

llm-practitioners

Techniques and properties that make LLMs behave as intended: following instructions, refusing harmful requests, being truthful. Achieved via RLHF, Constitutional AI, and similar training approaches.

Sources: Anthropic HHH framework, InstructGPT paper, Constitutional AI paper
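
As a concrete illustration of how RLHF-style training connects to alignment, the sketch below shows the pairwise (Bradley-Terry) preference loss commonly used to fit a reward model to human comparisons. The numeric rewards are made-up values; in practice they come from a neural reward model, which is then used to fine-tune the policy with RL.

```python
# Sketch of the pairwise preference loss used to train a reward model in RLHF:
# the model is trained so the response humans preferred scores higher than the
# rejected one. The scalar rewards here are hypothetical inputs.

import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)); small when chosen outscores rejected."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the reward model learns to rank the chosen response higher.
print(round(preference_loss(1.2, -0.3), 3))   # ~0.201 (good ranking)
print(round(preference_loss(-0.3, 1.2), 3))   # ~1.701 (bad ranking)
```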

ai-safety

The broader research agenda of ensuring AI systems pursue goals aligned with human values, even as systems become more capable. Includes technical approaches, governance, and interpretability.

Sources: AI alignment research literature, Alignment Forum

Examples

  • Model refusing to provide instructions for harmful activities
  • Model admitting uncertainty rather than hallucinating
  • Model following user instructions while respecting safety bounds
  • Constitutional AI training with principles (sketched after this list)
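
To give a flavor of the Constitutional AI example above, here is a minimal sketch of a single critique-and-revise step driven by a written principle. The `model` callable and the one-principle constitution are placeholders; this is an illustration of the idea, not the published training pipeline.

```python
# Minimal sketch of a Constitutional-AI-style critique-and-revise step.
# `model` is a hypothetical callable mapping a prompt string to a completion;
# the single PRINCIPLE below stands in for a full constitution of principles.

PRINCIPLE = "Choose the response that is most helpful while avoiding harmful content."

def critique_and_revise(model, user_prompt: str, draft: str) -> str:
    """Have the model critique its own draft against a principle, then revise it."""
    critique = model(
        f"Principle: {PRINCIPLE}\n"
        f"User prompt: {user_prompt}\n"
        f"Draft response: {draft}\n"
        "Point out any way the draft violates the principle."
    )
    revised = model(
        f"Principle: {PRINCIPLE}\n"
        f"User prompt: {user_prompt}\n"
        f"Draft response: {draft}\n"
        f"Critique: {critique}\n"
        "Rewrite the draft so it satisfies the principle."
    )
    return revised
```

In the published Constitutional AI method, revised responses like these become supervised fine-tuning data, and AI-generated preference labels are then used to train a preference model for a reinforcement learning stage.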

Counterexamples

Behaviors that indicate a model is not aligned:

  • Model that generates harmful content readily
  • Model that confidently states false information
  • Model that deceives users

Relations

  • overlapsWith rlhf (RLHF is an alignment technique)
  • overlapsWith guardrails (Guardrails enforce alignment at inference)
  • overlapsWith jailbreak (Jailbreaks bypass alignment safeguards)
  • overlapsWith red-teaming (Red-teaming tests alignment)

Implementations

Tools and frameworks that implement this concept: