llama.cpp

tool · active · open-source

A C/C++ implementation of LLM inference optimized to run on consumer hardware without a GPU. llama.cpp pioneered efficient CPU inference and the GGUF quantized-model format, making LLMs accessible on laptops and edge devices.

Implements

Concepts this tool claims to implement:

  • Inference (primary)

    Core purpose is efficient LLM inference. CPU-optimized with SIMD, with additional backends for Metal (Apple), CUDA, and other accelerators.

  • Quantization

    Pioneered the GGUF format with multiple quantization levels (Q4_K_M, Q5_K_M, etc.), enabling large models to run on limited hardware.

  • Local / edge execution

    Runs on consumer CPUs, Apple Silicon, and resource-constrained devices. No cloud dependency. A minimal usage sketch follows this list.

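The bullets above translate into very little code in practice. A minimal sketch via the llama-cpp-python bindings (listed under Integration Surfaces below), assuming a locally downloaded Q4_K_M GGUF file; the model path and parameter values are placeholders, not recommendations:

    # Load a quantized GGUF model and run CPU-only inference.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path
        n_ctx=2048,        # context window
        n_threads=8,       # CPU threads; tune to the machine
        n_gpu_layers=0,    # 0 = pure CPU; raise to offload layers to Metal/CUDA
    )

    out = llm(
        "Q: Why quantize an LLM for laptop inference?\nA:",
        max_tokens=64,
        stop=["\n"],
    )
    print(out["choices"][0]["text"])
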
Integration Surfaces

  • CLI
  • C/C++ library
  • Python bindings (llama-cpp-python)
  • Server mode (HTTP API; see the client sketch below)

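Server mode exposes an HTTP API, and recent builds include an OpenAI-compatible /v1/chat/completions endpoint. A rough client sketch, assuming a server started locally (e.g. llama-server -m model.gguf) and listening on its default port 8080; the model name and prompt are placeholders:

    # Query a locally running llama.cpp server via its OpenAI-compatible endpoint.
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "local-model",  # the server answers with whatever model it loaded
            "messages": [
                {"role": "user", "content": "Explain GGUF quantization in one sentence."}
            ],
            "max_tokens": 128,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])
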
Details

Vendor: Georgi Gerganov (community)
License: MIT
Runs On: local
Used By: system

Notes

llama.cpp made local LLM inference practical. Its GGUF format is the de facto standard for distributing quantized models. Many tools (Ollama, LM Studio) use llama.cpp as their inference backend.