Distillation


Also known as: Knowledge Distillation, Model Distillation, Teacher-Student Learning

Definition

Training a smaller "student" model to mimic the behavior of a larger "teacher" model, transferring the teacher's knowledge into a more compact form. The student learns from the teacher's outputs (soft labels) rather than from ground-truth labels alone, capturing nuances that hard labels discard. Distillation produces smaller, faster models that retain much of the teacher's capability.
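
As a rough illustration, one training step typically combines a soft-label term (matching the teacher's full output distribution) with the usual hard-label loss. The sketch below assumes PyTorch, a frozen teacher, and an illustrative weighting factor alpha; it is one common formulation, not the only one.

```python
# Minimal sketch (PyTorch assumed) of one teacher-student training step:
# the student is penalised for diverging from the teacher's output
# distribution (soft labels) as well as from the ground-truth labels.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, x, labels, alpha=0.5):
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(x), dim=-1)   # teacher is frozen
    student_log_probs = F.log_softmax(student(x), dim=-1)

    # Soft-label term: match the teacher's full output distribution.
    soft_loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    # Hard-label term: ordinary cross-entropy against the ground truth.
    hard_loss = F.nll_loss(student_log_probs, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss
```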

What this is NOT

  • Not quantization (distillation trains a new model; quantization compresses weights)
  • Not fine-tuning (distillation creates a new smaller model)
  • Not just using a smaller model (distillation transfers knowledge)

Alternative Interpretations

Different communities use this term differently:

llm-practitioners

Using outputs from a capable LLM (GPT-4, Claude) to train a smaller, cheaper model. Common for creating specialized models that are faster and cheaper to run while maintaining quality on specific tasks.

Sources: Alpaca (Llama distilled from GPT-3.5), Vicuna and similar projects, OpenAI distillation policies
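
In practice this usually means collecting teacher responses as a synthetic fine-tuning set for the student. The sketch below is illustrative only; query_teacher is a hypothetical placeholder for whatever API the teacher model (e.g. GPT-4 or Claude) is served through.

```python
# Minimal sketch of the llm-practitioner workflow: collect teacher
# outputs as synthetic training data for a smaller student model.
import json

def query_teacher(prompt: str) -> str:
    # Hypothetical placeholder: call the teacher model's API here.
    raise NotImplementedError

def build_distillation_dataset(prompts, path="teacher_outputs.jsonl"):
    with open(path, "w") as f:
        for prompt in prompts:
            response = query_teacher(prompt)
            # Each record becomes one supervised fine-tuning example
            # for the student (prompt -> teacher response).
            f.write(json.dumps({"prompt": prompt, "completion": response}) + "\n")
```

The resulting file is then used to fine-tune the smaller student model with ordinary supervised training.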

ml-research

A knowledge transfer technique where a student network is trained on the soft probability distributions output by a teacher network, often with temperature scaling to reveal more information.

Sources: Hinton et al. 'Distilling the Knowledge in a Neural Network' (2015), Knowledge distillation surveys
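
The temperature point is easy to see numerically: dividing the logits by a temperature T > 1 before the softmax spreads probability mass onto the non-target classes, exposing the teacher's relative preferences among them. A small illustration (PyTorch assumed, printed values approximate):

```python
# Illustrative only: the same logits softened with temperature T > 1
# reveal more about how the teacher ranks the non-target classes.
import torch
import torch.nn.functional as F

logits = torch.tensor([6.0, 2.0, 1.0])
print(F.softmax(logits, dim=-1))        # ~[0.98, 0.02, 0.01] - nearly one-hot
print(F.softmax(logits / 4.0, dim=-1))  # ~[0.60, 0.22, 0.17] - smoother, more informative
```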

Examples

  • Alpaca: Llama fine-tuned on instruction-following data generated by GPT-3.5 (text-davinci-003)
  • DistilBERT: BERT distilled to 40% fewer parameters
  • Training a custom model on GPT-4 outputs for a specific task
  • Phi-3: Small model trained on synthetic data from larger models

Counterexamples

Things that might seem like Distillation but are not:

  • Quantization (same model, lower precision)
  • Pruning (same model, fewer weights)
  • Training from scratch on raw data

Relations

  • overlapsWith fine-tuning (Distillation often involves fine-tuning on teacher outputs)
  • overlapsWith quantization (Both are model compression techniques)
  • requires large-language-model (Teacher models are typically large LLMs)