Quantization
Also known as: Model Quantization, Weight Quantization, INT8/INT4
Definition
Reducing the numerical precision of model weights and/or activations (e.g., from 32-bit floating point to 8-bit or 4-bit integers) to decrease memory usage and increase inference speed, typically with only a small loss in quality. Quantization makes it possible to run large models on consumer hardware.
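To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in Python/NumPy (illustrative only; real toolchains use per-channel or group-wise scales, calibration data, and specialized kernels):

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: map float weights onto [-127, 127]."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the integer codes and the scale."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)   # toy FP32 "weight matrix"
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)
print("max abs error:", np.max(np.abs(w - w_hat)))  # bounded by ~scale/2
```

The INT8 codes take a quarter of the memory of the FP32 originals; the scale factor is the only extra metadata stored per tensor.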
What this is NOT
- Not distillation (quantization keeps the same architecture)
- Not pruning (quantization keeps all weights, just at lower precision)
- Not compression in general (specifically about numerical precision)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
Converting LLM weights to lower-precision formats (INT8, INT4, FP16) to reduce VRAM requirements and speed up inference. Tools like llama.cpp, GPTQ, AWQ, and bitsandbytes enable quantized inference.
Sources: llama.cpp documentation, GPTQ paper, AWQ paper, bitsandbytes documentation
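As a practical sketch, loading a model in 4-bit through the Hugging Face transformers integration with bitsandbytes looks roughly like this (the model name is a placeholder and assumes access to the checkpoint; exact arguments may differ across library versions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any causal LM on the Hub

# 4-bit NF4 weight quantization handled by bitsandbytes at load time.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place the quantized weights on the available GPU(s)
)

inputs = tokenizer("Quantization reduces", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```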
Examples
- Running Llama-2-70B in 4-bit with GPTQ across two 24GB GPUs (see the memory sketch after this list)
- llama.cpp Q4_K_M quantization for local inference
- bitsandbytes 8-bit (LLM.int8()) loading of frozen base weights for memory-efficient fine-tuning with LoRA adapters
- AWQ 4-bit weight-only quantization, which protects salient weight channels (identified from activation statistics) to preserve quality
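A back-of-the-envelope calculation of weight memory (ignoring KV cache and activation overhead) shows why bit width determines which models fit on a given GPU; the helper below is a sketch for illustration:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

for name, n in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    for bits in (32, 16, 8, 4):
        print(f"{name} @ {bits:>2}-bit: {weight_memory_gb(n, bits):6.1f} GB")
# A 70B model at 4 bits is roughly 35 GB of weights before runtime overhead,
# so it needs two 24GB cards, while a 7B or 13B model fits on one.
```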
Counterexamples
Things that might seem like Quantization but are not:
- Distillation (trains a smaller model)
- Pruning (removes weights)
- Running full-precision FP32 inference
Relations
- overlapsWith inference (Quantization enables efficient inference)
- overlapsWith distillation (Both are model compression techniques)
- requires large-language-model (in this context, quantization is applied to LLM weights, though the technique itself is general)
Implementations
Tools and frameworks that implement this concept:
- llama.cpp (e.g., Q4_K_M)
- GPTQ
- AWQ
- bitsandbytes