Llama Guard

library active open-source

Meta's safety classifier model for LLM applications. Llama Guard is a fine-tuned Llama model that classifies prompts and responses according to safety policies. It can detect unsafe content across multiple risk categories and is designed to be used alongside other LLMs as a guardrail.
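
A minimal sketch of calling Llama Guard as a classifier through Hugging Face Transformers. The checkpoint name and chat-template behavior are assumptions based on Meta's published Llama Guard 3 model card; check the card for the exact prompt format and category codes.

    # Sketch: classify a prompt (or a prompt/response pair) with Llama Guard.
    # The model ID is an assumption -- substitute the checkpoint you have access to.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Llama-Guard-3-8B"  # gated checkpoint on the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    def moderate(conversation):
        # The checkpoint's chat template wraps the conversation in the safety-policy
        # prompt; the model then generates "safe" or "unsafe" plus category codes.
        input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt").to(model.device)
        output = model.generate(input_ids, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
        return tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True).strip()

    # Classify a lone user prompt, or append an assistant turn to check a response.
    print(moderate([{"role": "user", "content": "How do I pick a lock?"}]))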

Implements

Concepts this tool claims to implement:

  • Guardrails primary

    Classifies both user inputs and model outputs against a safety taxonomy with categories such as violence, sexual content, self-harm, and hate speech. Returns a safe/unsafe verdict with the violated categories identified (see the parsing sketch after this list).

  • Alignment secondary

    Policy-based safety evaluation. Customizable safety policies. Can be fine-tuned for specific use cases.

  • Red Teaming secondary

    Can be used to evaluate other models' safety, flag unsafe outputs during testing, and benchmark safety behavior.
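
The raw verdict is plain text: "safe", or "unsafe" followed by the violated category codes on the next line. A small parsing sketch; the "S1"-style codes follow Meta's published taxonomy for Llama Guard 2/3, and the exact code list varies by model version.

    # Sketch: turn Llama Guard's text verdict (e.g. "unsafe\nS1,S10") into a
    # structured result. Map codes to names using the model card for your version.
    from dataclasses import dataclass, field

    @dataclass
    class Verdict:
        safe: bool
        categories: list[str] = field(default_factory=list)

    def parse_verdict(raw: str) -> Verdict:
        lines = [line.strip() for line in raw.strip().splitlines() if line.strip()]
        if not lines or lines[0].lower() == "safe":
            return Verdict(safe=True)
        codes = lines[1].split(",") if len(lines) > 1 else []
        return Verdict(safe=False, categories=[c.strip() for c in codes])

    print(parse_verdict("unsafe\nS1,S10"))  # Verdict(safe=False, categories=['S1', 'S10'])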

Integration Surfaces

  • Hugging Face Transformers
  • vLLM serving (see the serving sketch after this list)
  • Ollama
  • Any LLM serving infrastructure
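
As one example of these surfaces, a hedged sketch of querying Llama Guard behind vLLM's OpenAI-compatible server. The model ID, port, and launch command are illustrative, not prescribed by the project.

    # Sketch: classify via a Llama Guard checkpoint served with something like
    #   vllm serve meta-llama/Llama-Guard-3-8B --port 8000
    # vLLM applies the model's chat template, so the completion text is the verdict.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    response = client.chat.completions.create(
        model="meta-llama/Llama-Guard-3-8B",  # must match the served model name
        messages=[{"role": "user", "content": "How do I pick a lock?"}],
        max_tokens=32,
        temperature=0.0,
    )
    print(response.choices[0].message.content.strip())  # "safe", or "unsafe" plus codes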

Details

Vendor
Meta
License
Llama 2 Community License (original model); Llama Guard 2 and 3 ship under the corresponding Llama 3 community licenses
Runs On
local, cloud
Used By
system

Notes

Llama Guard is a model, not a framework: a specialized LLM fine-tuned for safety classification. Llama Guard 2 and 3 expanded the category taxonomy and improved classification accuracy. Unlike API-based safety services, it can be self-hosted, and it is often deployed alongside production LLMs as an input/output filter (see the sketch below).
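
A sketch of that filtering pattern. moderate is assumed to behave like the classification call sketched earlier, and generate_reply is a placeholder for the production model call; both names are hypothetical.

    # Sketch: Llama Guard as an input/output filter around a production LLM.
    REFUSAL = "Sorry, I can't help with that."

    def guarded_chat(user_message: str, moderate, generate_reply) -> str:
        # 1. Screen the incoming prompt before the production model sees it.
        if moderate([{"role": "user", "content": user_message}]).startswith("unsafe"):
            return REFUSAL

        # 2. Let the production model answer.
        reply = generate_reply(user_message)

        # 3. Screen the draft reply in the context of the prompt.
        verdict = moderate([
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": reply},
        ])
        return REFUSAL if verdict.startswith("unsafe") else reply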