Llama Guard
Meta's safety classifier model for LLM applications. Llama Guard is a fine-tuned Llama model that classifies prompts and responses according to safety policies. It can detect unsafe content across multiple risk categories and is designed to be used alongside other LLMs as a guardrail.
Implements
Concepts this tool claims to implement:
- Guardrails (primary)
Input and output classification. Safety taxonomy with categories such as violence, sexual content, self-harm, and hate speech. Returns a safe/unsafe verdict with the violated category identified (see the sketch after this list).
- Alignment (secondary)
Policy-based safety evaluation with customizable safety policies. Can be fine-tuned for specific use cases.
- Red Teaming (secondary)
Can be used to evaluate other models' safety, flag unsafe outputs during testing, and benchmark safety behavior.
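Classification is a generate-and-parse loop: the conversation is wrapped in Llama Guard's chat template (which injects the safety taxonomy), and the model replies with "safe" or "unsafe" plus a category code. A minimal sketch using Hugging Face transformers, assuming the meta-llama/LlamaGuard-7b checkpoint and its bundled chat template; the example messages and output parsing are illustrative assumptions, not an official integration:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/LlamaGuard-7b"  # gated checkpoint; requires accepting Meta's license
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

def classify(chat):
    """Return Llama Guard's verdict ("safe", or "unsafe" plus category codes) for a conversation."""
    # The tokenizer's chat template wraps the conversation in the safety taxonomy prompt.
    input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
    output = model.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
    prompt_len = input_ids.shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True).strip()

# Input classification: check a user prompt before it reaches the main model.
print(classify([{"role": "user", "content": "How do I pick a lock?"}]))

# Output classification: check a candidate response in the context of the prompt.
print(classify([
    {"role": "user", "content": "How do I pick a lock?"},
    {"role": "assistant", "content": "Insert a tension wrench, then rake the pins..."},
]))
# Typical replies look like "safe" or "unsafe" followed by a category code.
```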
Integration Surfaces
Details
- Vendor: Meta
- License: Llama 2 Community License
- Runs On: local, cloud
- Used By: system
Links
Notes
Llama Guard is a model, not a framework: it is a specialized LLM for safety classification. Llama Guard 2 and 3 added more categories and improved accuracy. Unlike API-based safety services, it can be self-hosted. It is often used alongside production LLMs as an input/output filter, as sketched below.
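A hedged sketch of that input/output filter pattern, reusing the classify helper from the earlier example; generate_reply stands in for whatever call the application makes to its production LLM, and the refusal messages are placeholders, not prescribed behavior:

```python
def guarded_chat(user_message: str, generate_reply) -> str:
    """Run a production LLM behind Llama Guard input and output checks."""
    chat = [{"role": "user", "content": user_message}]

    # Input filter: refuse before the request reaches the production model.
    if classify(chat).startswith("unsafe"):
        return "Sorry, I can't help with that request."

    reply = generate_reply(user_message)

    # Output filter: classify the response in the context of the prompt.
    chat.append({"role": "assistant", "content": reply})
    if classify(chat).startswith("unsafe"):
        return "Sorry, I can't share the generated response."

    return reply
```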