Llama Guard

library active open-source

Meta's safety classifier model for LLM applications. Llama Guard is a fine-tuned Llama model that classifies prompts and responses according to safety policies. It can detect unsafe content across multiple risk categories and is designed to be used alongside other LLMs as a guardrail.
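
A minimal sketch of calling Llama Guard as a classifier through Hugging Face Transformers. The checkpoint name and chat-template behavior are assumptions based on Meta's published Llama Guard 3 model card; check the card for the exact prompt format and category codes.

    # Sketch: classify a prompt (or a prompt/response pair) with Llama Guard.
    # The model ID is an assumption -- substitute the checkpoint you have access to.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-Guard-3-8B"  # gated checkpoint on the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    def moderate(conversation):
        # The checkpoint's chat template wraps the conversation in the safety-policy
        # prompt; the model then generates "safe" or "unsafe" plus category codes.
        input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt").to(model.device)
        output = model.generate(input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
        return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()

    # Classify a lone user prompt, or append an assistant turn to check a response.
    print(moderate([{"role": "user", "content": "How do I pick a lock?"}]))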

Implements

Concepts this tool claims to implement:

  • Guardrails primary

    Classifies both user inputs and model outputs against a safety taxonomy with categories such as violence, sexual content, self-harm, and hate speech. Returns a safe/unsafe verdict with the violated categories identified (see the parsing sketch after this list).

  • Alignment secondary

    Policy-based safety evaluation. Customizable safety policies. Can be fine-tuned for specific use cases.

  • Red Teaming secondary

    Can be used to evaluate other models' safety, flag unsafe outputs during testing, and benchmark safety behavior.
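
The raw verdict is plain text: "safe", or "unsafe" followed by the violated category codes on the next line. A small parsing sketch; the "S1"-style codes follow Meta's published taxonomy for Llama Guard 2/3, and the exact code list varies by model version.

    # Sketch: turn Llama Guard's text verdict (e.g. "unsafe\nS1,S10") into a
    # structured result. Map codes to names using the model card for your version.
    from dataclasses import dataclass, field

    @dataclass
    class Verdict:
        safe: bool
        categories: list[str] = field(default_factory=list)

    def parse_verdict(raw: str) -> Verdict:
        lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
        if not lines or lines[0].lower() == "safe":
            return Verdict(safe=True)
        codes = lines[1].split(",") if len(lines) > 1 else []
        return Verdict(safe=False, categories=[c.strip() for c in codes])

    print(parse_verdict("unsafe\nS1,S10"))  # Verdict(safe=False, categories=['S1', 'S10'])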

Integration Surfaces

  • Hugging Face Transformers
  • vLLM serving (see the serving sketch after this list)
  • Ollama
  • Any LLM serving infrastructure
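
As one example of these surfaces, a hedged sketch of querying Llama Guard behind vLLM's OpenAI-compatible server. The model ID, port, and launch command are illustrative, not prescribed by the project.

    # Sketch: classify via a Llama Guard checkpoint served with something like
    #   vllm serve meta-llama/Llama-Guard-3-8B --port 8000
    # vLLM applies the model's chat template, so the completion text is the verdict.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    response = client.chat.completions.create(
        model="meta-llama/Llama-Guard-3-8B",  # must match the served model name
        messages=[{"role": "user", "content": "How do I pick a lock?"}],
        max_tokens=32,
        temperature=0.0,
    )
    print(response.choices[0].message.content.strip())  # "safe", or "unsafe" plus codes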

Details

Vendor
Meta
License
Llama 2 Community License (original model); Llama Guard 2 and 3 ship under the corresponding Llama 3 community licenses
Runs On
local, cloud
Used By
system

Notes

Llama Guard is a model, not a framework: a specialized LLM fine-tuned for safety classification. Llama Guard 2 and 3 expanded the category taxonomy and improved classification accuracy. Unlike API-based safety services, it can be self-hosted, and it is often deployed alongside production LLMs as an input/output filter (see the sketch below).
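
A sketch of that filtering pattern. moderate is assumed to behave like the classification call sketched earlier, and generate_reply is a placeholder for the production model call; both names are hypothetical.

    # Sketch: Llama Guard as an input/output filter around a production LLM.
    REFUSAL = "Sorry, I can't help with that."

    def guarded_chat(user_message: str, moderate, generate_reply) -> str:
        # 1. Screen the incoming prompt before the production model sees it.
        if moderate([{"role": "user", "content": user_message}]).startswith("unsafe"):
            return REFUSAL

        # 2. Let the production model answer.
        reply = generate_reply(user_message)

        # 3. Screen the draft reply in the context of the prompt.
        verdict = moderate([
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": reply},
        ])
        return REFUSAL if verdict.startswith("unsafe") else reply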