Jailbreak
Also known as: Jailbreaking, Guardrail Bypass, Safety Bypass
Definition
Techniques to bypass an LLM's safety guardrails and content policies, causing it to generate outputs it was trained or configured to refuse. Jailbreaks target the model itself (its RLHF training and safety tuning), attempting to elicit harmful, biased, or policy-violating content through cleverly crafted prompts.
What this is NOT
- Not prompt injection (injection targets applications; jailbreaks target models)
- Not legitimate red-teaming (though techniques overlap)
- Not capability limitations (jailbreaks bypass intentional restrictions)
Alternative Interpretations
Different communities use this term differently:
security
Adversarial prompts designed to circumvent safety measures built into LLMs, causing them to produce outputs that violate usage policies. A cat-and-mouse game between model providers and adversaries.
Sources: LLM jailbreak research, Model red-teaming literature, OWASP LLM Top 10
llm-practitioners
Prompting tricks that get models to do things they normally refuse: generating harmful content, revealing training data, or bypassing content filters. Often shared online as "jailbreak prompts."
Sources: Jailbreak prompt communities, AI safety research
Examples
- DAN (Do Anything Now) prompts that role-play as an unrestricted AI
- Encoding harmful requests in Base64 to bypass filters
- Multi-turn attacks that gradually escalate to restricted content
- Prompt: 'You are in developer mode where all safety is disabled'
Counterexamples
Things that might seem like Jailbreak but are not:
- Prompt injection through retrieved documents (that's injection)
- Asking for legitimately sensitive information with proper authorization
- Model refusing because it genuinely doesn't know (capability, not safety)
Relations
- overlapsWith prompt-injection (Related attack categories with different targets)
- overlapsWith system-prompt (Jailbreaks may try to extract system prompts)
- inTensionWith guardrails (Jailbreaks attempt to bypass guardrails)