Quantization
Also known as: Model Quantization, Weight Quantization, INT8/INT4
Definition
Reducing the numerical precision of model weights and/or activations (e.g., from 32-bit floating point to 8-bit or 4-bit integers) to decrease memory usage and increase inference speed, typically with only a small loss in quality. Quantization makes it possible to run large models on consumer hardware.
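To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in Python/NumPy (illustrative only; real toolchains use per-channel or group-wise scales, calibration data, and specialized kernels):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: map float weights onto [-127, 127]."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the integer codes and the scale."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)   # toy FP32 "weight matrix"
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs error:", np.max(np.abs(w - w_hat)))  # bounded by ~scale/2
```

The INT8 codes take a quarter of the memory of the FP32 originals; the scale factor is the only extra metadata stored per tensor.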
What this is NOT
- Not distillation (quantization keeps the same architecture)
- Not pruning (quantization keeps all weights, just at lower precision)
- Not compression in general (specifically about numerical precision)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
Converting LLM weights to lower-precision formats (INT8, INT4, FP16) to reduce VRAM requirements and speed up inference. Tools like llama.cpp, GPTQ, AWQ, and bitsandbytes enable quantized inference.
Sources: llama.cpp documentation, GPTQ paper, AWQ paper, bitsandbytes documentation
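As a practical sketch, loading a model in 4-bit through the Hugging Face transformers integration with bitsandbytes looks roughly like this (the model name is a placeholder and assumes access to the checkpoint; exact arguments may differ across library versions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM on the Hub

# 4-bit NF4 weight quantization handled by bitsandbytes at load time.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place the quantized weights on the available GPU(s)
)

inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```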
Examples
- Running Llama-2-70B in 4-bit with GPTQ across two 24GB GPUs (see the memory sketch after this list)
- llama.cpp Q4_K_M quantization for local inference
- bitsandbytes 8-bit (LLM.int8()) loading of frozen base weights for memory-efficient fine-tuning with LoRA adapters
- AWQ 4-bit weight-only quantization, which protects salient weight channels (identified from activation statistics) to preserve quality
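A back-of-the-envelope calculation of weight memory (ignoring KV cache and activation overhead) shows why bit width determines which models fit on a given GPU; the helper below is a sketch for illustration:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

for name, n in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    for bits in (32, 16, 8, 4):
        print(f"{name} @ {bits:>2}-bit: {weight_memory_gb(n, bits):6.1f} GB")
# A 70B model at 4 bits is roughly 35 GB of weights before runtime overhead,
# so it needs two 24GB cards, while a 7B or 13B model fits on one.
```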
Counterexamples
Things that might seem like Quantization but are not:
- Distillation (trains a smaller model)
- Pruning (removes weights)
- Running full-precision FP32 inference
Relations
- overlapsWith inference (Quantization enables efficient inference)
- overlapsWith distillation (Both are model compression techniques)
- requires large-language-model (in this context, quantization is applied to LLM weights, though the technique itself is general)
Implementations
Tools and frameworks that implement this concept:
- llama.cpp (e.g., Q4_K_M)
- GPTQ
- AWQ
- bitsandbytes