Red Teaming
Also known as: Adversarial Testing, Safety Testing, Attack Testing
Definition
Systematically testing AI systems by attempting to make them fail, produce harmful outputs, or behave in unintended ways. Red teams act as adversaries, probing for vulnerabilities through prompt injection, jailbreaks, edge cases, and creative attacks. The goal is to find problems before malicious users do.
What this is NOT
- Not regular testing (red teaming is adversarial)
- Not benchmarking (red teaming looks for failures, not average performance)
- Not jailbreaking for fun (red teaming is systematic and documented)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
The practice of attacking LLM systems to find safety failures, jailbreaks, and harmful outputs. Can be manual (expert adversaries) or automated (adversarial prompt generation). Required before deploying sensitive applications.
Sources: OpenAI red teaming practices, Anthropic model cards, AI safety red teaming literature
security
A security testing methodology where testers adopt an adversary mindset to find vulnerabilities. Borrowed from cybersecurity practice.
Sources: Security red teaming literature
Examples
- Team attempting to elicit harmful content from ChatGPT
- Automated jailbreak prompt generation
- Testing for prompt injection in a RAG application
- Probing for bias in hiring assistance tool
Counterexamples
Things that might seem like Red Teaming but are not:
- Regular functional testing
- Performance benchmarking
- User acceptance testing
Relations
- overlapsWith jailbreak (Jailbreaking is one red team technique)
- overlapsWith prompt-injection (Testing for injection is part of red teaming)
- overlapsWith alignment (Red teaming tests alignment)
- overlapsWith benchmark (Red team results can become benchmarks)
Implementations
Tools and frameworks that implement this concept:
- Llama Guard secondary
- Promptfoo primary