Guardrails

System evaluation published

Also known as: Safety Guardrails, Content Filters, Safety Layers

Definition

Systems that monitor, filter, or constrain LLM inputs and outputs to prevent harmful, unsafe, or policy-violating content. Guardrails act as safety layers around LLM applications, catching problems that the model itself might not prevent. They can filter both user inputs (prompt attacks) and model outputs (harmful content).

What this is NOT

  • Not model training (guardrails are inference-time)
  • Not RLHF (that's training-time alignment)
  • Not just content moderation (guardrails include structural checks)

Alternative Interpretations

Different communities use this term differently:

llm-practitioners

Additional layers that check LLM inputs/outputs for safety, compliance, and quality. Implemented via content classifiers, rule-based filters, or LLM-based moderation. Examples: NeMo Guardrails, Guardrails AI.

Sources: NVIDIA NeMo Guardrails, Guardrails AI documentation, OpenAI Moderation API

Examples

  • NVIDIA NeMo Guardrails for conversational safety
  • OpenAI Moderation API checking for harmful content
  • Custom classifier detecting prompt injection
  • Schema validation ensuring structured output format

Counterexamples

Things that might seem like Guardrails but are not:

  • RLHF training for safety (that's training-time)
  • System prompt asking model to be safe (weak, not a guardrail)
  • Hope that the model will behave (not a guardrail)

Relations

  • overlapsWith prompt-injection (Guardrails can detect injection attempts)
  • overlapsWith jailbreak (Guardrails can detect jailbreak attempts)
  • overlapsWith hallucination (Some guardrails detect hallucination)
  • overlapsWith alignment (Guardrails enforce alignment at inference time)

Implementations

Tools and frameworks that implement this concept: