llama.cpp
A C/C++ implementation of LLM inference optimized for consumer hardware, including machines without dedicated GPUs. llama.cpp pioneered efficient CPU inference and the GGUF quantized-model format, making LLMs practical on laptops and edge devices.
Implements
Concepts this tool claims to implement:
- Inference (primary)
  Core purpose is efficient LLM inference: CPU-optimized with SIMD, plus Metal (Apple), CUDA, and other accelerator backends. A minimal loading sketch follows this list.
- Quantization (primary)
  Pioneered the GGUF format with multiple quantization levels (Q4_K_M, Q5_K_M, etc.), enabling large models to run on limited hardware; see the quantization sketch after this list.
- Edge Deployment (primary)
  Runs on consumer CPUs, Apple Silicon, and resource-constrained devices, with no cloud dependency.
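As a rough illustration of the inference surface, here is a minimal C++ sketch of loading a GGUF model and creating a context through llama.cpp's C API. The model path is a hypothetical placeholder, and the API evolves between releases (some of these function names have since been renamed), so treat this as a sketch against a 2024-era llama.h rather than a pinned example.

```cpp
// Minimal llama.cpp loading sketch. API names follow an older llama.h;
// newer releases rename some of them. The model path is a placeholder.
#include "llama.h"
#include <cstdio>

int main() {
    llama_backend_init();                      // initialize backends (CPU/Metal/CUDA, as built)

    llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 0;                  // 0 = pure CPU; raise to offload layers to an accelerator

    llama_model * model = llama_load_model_from_file(
        "models/model-Q4_K_M.gguf", mparams);  // hypothetical path to a quantized GGUF file
    if (model == NULL) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 2048;                      // context window, in tokens

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    if (ctx == NULL) {
        fprintf(stderr, "failed to create context\n");
        llama_free_model(model);
        return 1;
    }

    // Tokenization, llama_decode(), and sampling would follow here;
    // see examples/simple in the llama.cpp repository for a full loop.

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```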
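Quantization can also be driven programmatically. A minimal sketch, assuming the llama_model_quantize() entry point and the LLAMA_FTYPE_MOSTLY_Q4_K_M ftype exposed in llama.h; the input and output file names are placeholders. In practice most users run the bundled llama-quantize command-line tool, which wraps the same call.

```cpp
// Sketch: convert a full-precision GGUF file to Q4_K_M.
// File names are placeholders; check llama.h for the current params struct.
#include "llama.h"
#include <cstdio>

int main() {
    const char * fin  = "model-F16.gguf";     // hypothetical full-precision input
    const char * fout = "model-Q4_K_M.gguf";  // hypothetical quantized output

    llama_model_quantize_params qparams = llama_model_quantize_default_params();
    qparams.ftype   = LLAMA_FTYPE_MOSTLY_Q4_K_M;  // target quantization level
    qparams.nthread = 8;                          // worker threads

    if (llama_model_quantize(fin, fout, &qparams) != 0) {
        fprintf(stderr, "quantization failed\n");
        return 1;
    }
    printf("wrote %s\n", fout);
    return 0;
}
```

Higher-numbered levels (Q5_K_M, Q6_K) keep more precision at the cost of a larger file; Q4_K_M is a common middle ground between size and quality.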
Details
- Vendor: Georgi Gerganov (community)
- License: MIT
- Runs On: local
- Used By: system
Notes
llama.cpp made local LLM inference practical. Its GGUF format has become the de facto standard for quantized models, and many tools (Ollama, LM Studio) use llama.cpp as their inference backend.