Quantization

Also known as: Model Quantization, Weight Quantization, INT8/INT4

Definition

Reducing the numerical precision of model weights and/or activations (e.g., from 32-bit floats to 8-bit or 4-bit integers) to decrease memory usage and speed up inference, typically at a small cost in output quality. Quantization makes it possible to run large models on consumer hardware.
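
As a rough illustration, the sketch below (NumPy; all names and shapes are illustrative) quantizes a weight tensor to INT8 with a single symmetric scale and then dequantizes it. Production quantizers use per-channel or per-group scales and calibration data, but the core idea is the same: store small integers plus a scale, and reconstruct approximate floats at compute time.

    import numpy as np

    def quantize_int8(w):
        # Map the largest absolute weight onto the INT8 range [-127, 127].
        scale = np.abs(w).max() / 127.0
        q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize_int8(q, scale):
        # Approximate reconstruction: 4x smaller storage than FP32, small rounding error.
        return q.astype(np.float32) * scale

    w = np.random.randn(4096, 4096).astype(np.float32)   # stand-in for a weight matrix
    q, scale = quantize_int8(w)
    print("bytes:", w.nbytes, "->", q.nbytes)             # 4 bytes/param -> 1 byte/param
    print("max error:", np.abs(w - dequantize_int8(q, scale)).max())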

What this is NOT

  • Not distillation (quantization keeps the same architecture)
  • Not pruning (quantization keeps all weights, just lower precision)
  • Not compression in general (specifically about numerical precision)

Alternative Interpretations

Different communities use this term differently:

LLM practitioners

Converting LLM weights to lower-precision formats (INT8, INT4, FP16) to reduce VRAM requirements and speed up inference. Methods and tools such as llama.cpp, GPTQ, AWQ, and bitsandbytes enable quantized inference.

Sources: llama.cpp documentation, GPTQ paper, AWQ paper, bitsandbytes documentation
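
A minimal sketch of the bitsandbytes path via the transformers integration, assuming a placeholder model ID and recent library versions (argument names may differ across releases; check the current documentation):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # 4-bit NF4 weights with FP16 compute, as popularized by QLoRA.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
    )

    model_id = "meta-llama/Llama-2-7b-hf"   # placeholder; any causal LM on the Hub
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",                  # spread layers across available devices
    )

    prompt = "Quantization reduces memory because"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))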

Examples

  • Running Llama-2-70B in 4-bit with GPTQ across two 24GB GPUs (see the memory arithmetic after this list)
  • llama.cpp Q4_K_M quantization for local inference
  • bitsandbytes 8-bit and 4-bit (NF4/QLoRA) quantization for memory-efficient fine-tuning
  • AWQ 4-bit quantization, which uses activation statistics to protect salient weight channels and preserve accuracy
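
The memory arithmetic behind these examples is straightforward. The snippet below counts weight storage only; the KV cache, activations, and per-group quantization metadata add overhead on top:

    # Back-of-the-envelope weight memory for a 70B-parameter model.
    params = 70e9
    for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
        print(f"{name}: {params * bits / 8 / 1e9:.0f} GB")
    # FP32: 280 GB, FP16: 140 GB, INT8: 70 GB, INT4: 35 GB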

Counterexamples

Things that might seem like Quantization but are not:

  • Distillation (trains a smaller model)
  • Pruning (removes weights)
  • Running full-precision FP32 inference

Relations

Implementations

Tools and frameworks that implement this concept: