llama.cpp

tool · active · open-source

A C/C++ implementation of LLM inference optimized to run on consumer hardware without a GPU. llama.cpp pioneered efficient CPU inference and the GGUF quantized-model format, making LLMs accessible on laptops and edge devices.

Implements

Concepts this tool claims to implement:

  • Inference (primary)

    Core purpose is efficient LLM inference. CPU-optimized with SIMD, with additional backends for Metal (Apple), CUDA, and other accelerators.

  • Quantization

    Pioneered the GGUF format with multiple quantization levels (Q4_K_M, Q5_K_M, etc.), enabling large models to run on limited hardware.

  • Local / edge execution

    Runs on consumer CPUs, Apple Silicon, and resource-constrained devices. No cloud dependency. A minimal usage sketch follows this list.

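The bullets above translate into very little code in practice. A minimal sketch via the llama-cpp-python bindings (listed under Integration Surfaces below), assuming a locally downloaded Q4_K_M GGUF file; the model path and parameter values are placeholders, not recommendations:

    # Load a quantized GGUF model and run CPU-only inference.
    from llama_cpp import Llama

    llm = Llama(
        model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path
        n_ctx=2048,        # context window
        n_threads=8,       # CPU threads; tune to the machine
        n_gpu_layers=0,    # 0 = pure CPU; raise to offload layers to Metal/CUDA
    )

    out = llm(
        "Q: Why quantize an LLM for laptop inference?\nA:",
        max_tokens=64,
        stop=["\n"],
    )
    print(out["choices"][0]["text"])
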
Integration Surfaces

  • CLI
  • C/C++ library
  • Python bindings (llama-cpp-python)
  • Server mode (HTTP API; see the client sketch below)

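Server mode exposes an HTTP API, and recent builds include an OpenAI-compatible /v1/chat/completions endpoint. A rough client sketch, assuming a server started locally (e.g. llama-server -m model.gguf) and listening on its default port 8080; the model name and prompt are placeholders:

    # Query a locally running llama.cpp server via its OpenAI-compatible endpoint.
    import requests

    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={
            "model": "local-model",  # the server answers with whatever model it loaded
            "messages": [
                {"role": "user", "content": "Explain GGUF quantization in one sentence."}
            ],
            "max_tokens": 128,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["message"]["content"])
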
Details

Vendor: Georgi Gerganov (community)
License: MIT
Runs On: local
Used By: system

Notes

llama.cpp made local LLM inference practical. Its GGUF format is the de facto standard for distributing quantized models. Many tools (Ollama, LM Studio) use llama.cpp as their inference backend.