Alignment

property · evaluation · published

Also known as: AI Alignment, Value Alignment, Model Alignment

Definition

The degree to which an AI system's behavior matches the goals, values, and constraints intended by its designers and users. For LLMs, alignment is commonly summarized as being helpful, harmless, and honest: the model does what users intend, avoids harmful outputs, and does not deceive. Alignment is typically instilled through training (e.g., RLHF) and enforced at inference time through guardrails.
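
To illustrate the "enforced through guardrails" part, here is a minimal sketch of an inference-time guardrail that checks both the prompt and the response. The `harmful_topics` list and the substring check are hypothetical stand-ins for a trained moderation classifier; real guardrail stacks are considerably more involved.

```python
# Minimal sketch of an inference-time guardrail (illustrative only).
# The substring check and `harmful_topics` list are placeholders for a
# real moderation classifier.

REFUSAL = "I can't help with that."

harmful_topics = ["build a weapon", "write malware"]  # hypothetical patterns

def flagged(text: str) -> bool:
    """Crude stand-in for a trained moderation classifier."""
    lowered = text.lower()
    return any(topic in lowered for topic in harmful_topics)

def guarded_generate(prompt: str, generate) -> str:
    """Check the prompt before generation and the response after it."""
    if flagged(prompt):
        return REFUSAL
    response = generate(prompt)
    if flagged(response):
        return REFUSAL
    return response

# Usage with a dummy generator:
if __name__ == "__main__":
    echo = lambda p: f"Echo: {p}"
    print(guarded_generate("Summarize this article.", echo))
    print(guarded_generate("Explain how to build a weapon.", echo))
```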

What this is NOT

  • Not just safety (alignment includes helpfulness and honesty)
  • Not just instruction following (alignment includes values)
  • Not solved (alignment is an ongoing research challenge)

Alternative Interpretations

Different communities use this term differently:

llm-practitioners

Techniques and properties that make LLMs behave as intended: following instructions, refusing harmful requests, being truthful. Achieved via RLHF, Constitutional AI, and similar training approaches.

Sources: Anthropic HHH framework, InstructGPT paper, Constitutional AI paper
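
As a concrete illustration of how RLHF-style training connects to alignment, the sketch below shows the pairwise (Bradley-Terry) preference loss commonly used to fit a reward model to human comparisons. The numeric rewards are made-up values; in practice they come from a neural reward model, which is then used to fine-tune the policy with RL.

```python
# Sketch of the pairwise preference loss used to train a reward model in RLHF:
# the model is trained so the response humans preferred scores higher than the
# rejected one. The scalar rewards here are hypothetical inputs.

import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)); small when chosen outscores rejected."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the reward model learns to rank the chosen response higher.
print(round(preference_loss(1.2, -0.3), 3))   # ~0.201 (good ranking)
print(round(preference_loss(-0.3, 1.2), 3))   # ~1.701 (bad ranking)
```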

ai-safety

The broader research agenda of ensuring AI systems pursue goals aligned with human values, even as systems become more capable. Includes technical approaches, governance, and interpretability.

Sources: AI alignment research literature, Alignment Forum

Examples

  • Model refusing to provide instructions for harmful activities
  • Model admitting uncertainty rather than hallucinating
  • Model following user instructions while respecting safety bounds
  • Constitutional AI training with principles (sketched after this list)
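
To give a flavor of the Constitutional AI example above, here is a minimal sketch of a single critique-and-revise step driven by a written principle. The `model` callable and the one-principle constitution are placeholders; this is an illustration of the idea, not the published training pipeline.

```python
# Minimal sketch of a Constitutional-AI-style critique-and-revise step.
# `model` is a hypothetical callable mapping a prompt string to a completion;
# the single PRINCIPLE below stands in for a full constitution of principles.

PRINCIPLE = "Choose the response that is most helpful while avoiding harmful content."

def critique_and_revise(model, user_prompt: str, draft: str) -> str:
    """Have the model critique its own draft against a principle, then revise it."""
    critique = model(
        f"Principle: {PRINCIPLE}\n"
        f"User prompt: {user_prompt}\n"
        f"Draft response: {draft}\n"
        "Point out any way the draft violates the principle."
    )
    revised = model(
        f"Principle: {PRINCIPLE}\n"
        f"User prompt: {user_prompt}\n"
        f"Draft response: {draft}\n"
        f"Critique: {critique}\n"
        "Rewrite the draft so it satisfies the principle."
    )
    return revised
```

In the published Constitutional AI method, revised responses like these become supervised fine-tuning data, and AI-generated preference labels are then used to train a preference model for a reinforcement learning stage.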

Counterexamples

Behaviors that indicate a model is not aligned:

  • Model that generates harmful content readily
  • Model that confidently states false information
  • Model that deceives users

Relations

  • overlapsWith rlhf (RLHF is an alignment technique)
  • overlapsWith guardrails (Guardrails enforce alignment at inference)
  • overlapsWith jailbreak (Jailbreaks bypass alignment safeguards)
  • overlapsWith red-teaming (Red-teaming tests alignment)

Implementations

Tools and frameworks that implement this concept: