Red Teaming

Process evaluation published

Also known as: Adversarial Testing, Safety Testing, Attack Testing

Definition

Systematically testing AI systems by attempting to make them fail, produce harmful outputs, or behave in unintended ways. Red teams act as adversaries, probing for vulnerabilities through prompt injection, jailbreaks, edge cases, and creative attacks. The goal is to find problems before malicious users do.

What this is NOT

  • Not regular testing (red teaming is adversarial)
  • Not benchmarking (red teaming looks for failures, not average performance)
  • Not jailbreaking for fun (red teaming is systematic and documented)

Alternative Interpretations

Different communities use this term differently:

llm-practitioners

The practice of attacking LLM systems to find safety failures, jailbreaks, and harmful outputs. Can be manual (expert adversaries) or automated (adversarial prompt generation). Required before deploying sensitive applications.

Sources: OpenAI red teaming practices, Anthropic model cards, AI safety red teaming literature

security

A security testing methodology where testers adopt an adversary mindset to find vulnerabilities. Borrowed from cybersecurity practice.

Sources: Security red teaming literature

Examples

  • Team attempting to elicit harmful content from ChatGPT
  • Automated jailbreak prompt generation
  • Testing for prompt injection in a RAG application
  • Probing for bias in hiring assistance tool

Counterexamples

Things that might seem like Red Teaming but are not:

  • Regular functional testing
  • Performance benchmarking
  • User acceptance testing

Relations

  • overlapsWith jailbreak (Jailbreaking is one red team technique)
  • overlapsWith prompt-injection (Testing for injection is part of red teaming)
  • overlapsWith alignment (Red teaming tests alignment)
  • overlapsWith benchmark (Red team results can become benchmarks)

Implementations

Tools and frameworks that implement this concept: