Synthetic Data

Artifact data published

Also known as: Generated Data, AI-Generated Data, Artificial Data

Definition

Data generated by AI models rather than collected from real-world sources. Synthetic data is increasingly used to train and fine-tune LLMs, especially when real data is scarce, expensive to collect, or raises privacy concerns. A common pattern is for a stronger model (such as GPT-4) to generate training data for a smaller model.

What this is NOT

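  • Not human-written data (synthetic data is model-generated)
  • Not web-scraped data (scraped text is collected from real sources, not generated)
  • Not data augmentation in general (augmentation also covers rule-based transformations of existing data; synthetic data is specifically model-generated)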
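
One way to read the last distinction, as a minimal sketch: augmentation in the broad sense perturbs an example that already exists, while synthetic generation asks a model for a new one. The function names, the word-dropout rule, and the stub generator below are hypothetical illustrations and do not come from any specific library.

```python
import random
from typing import Callable


def augment(example: str, rng: random.Random) -> str:
    """Rule-based augmentation: perturb an existing example by dropping one word."""
    words = example.split()
    if len(words) < 2:
        return example  # nothing sensible to drop
    del words[rng.randrange(len(words))]
    return " ".join(words)


def synthesize(generator: Callable[[str], str], topic: str) -> str:
    """Synthetic generation: a model writes a new example from a prompt alone."""
    return generator(f"Write one short customer review about {topic}.")


def stub_generator(prompt: str) -> str:
    """Assumption: stands in for a real LLM call so the sketch runs offline."""
    return f"[model output for: {prompt}]"


real_review = "The battery lasts all day and charges quickly"
augmented = augment(real_review, random.Random(0))   # derived from real text
generated = synthesize(stub_generator, "a phone")    # produced by the (stub) model
```

The augmented string can only contain words from the real review, whereas the generated string depends only on the prompt and the model.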

Alternative Interpretations

Different communities use this term differently:

llm-practitioners

Text, conversations, or examples generated by LLMs for the purpose of training other models. Used extensively for instruction tuning (Alpaca, Phi) and data augmentation.

Sources: Alpaca (generated with text-davinci-003, a GPT-3.5-series model), Phi models (trained in part on synthetic data), Self-Instruct paper

Examples

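  • Alpaca dataset: 52K instruction-following examples generated by text-davinci-003 (GPT-3.5 series) from 175 seed instructions
  • Phi-1 trained in part on synthetic textbooks and exercises generated with GPT-3.5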
  • Synthetic Q&A pairs for fine-tuning
  • Generated persona conversations for chat training
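
The seed-instruction pattern in the first example can be sketched roughly as follows. This is a simplified illustration in the spirit of Self-Instruct, not the actual Alpaca or Self-Instruct pipeline: the prompt format, the exact-match deduplication, the record fields, and the stub teacher are all assumptions, and a real setup would replace `make_stub_teacher` with a call to an actual model.

```python
import itertools
import random
from typing import Callable, Dict, List, Optional

PROMPT_HEADER = "Here are some task instructions:"

# Hypothetical seed pool; real pipelines start from a curated, human-written set.
SEED_INSTRUCTIONS = [
    "Explain what overfitting is.",
    "Summarize the following paragraph.",
    "Translate this sentence into French.",
]


def build_prompt(examples: List[str]) -> str:
    """Show a few instructions in context and leave the next slot open."""
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(examples))
    return f"{PROMPT_HEADER}\n{numbered}\n{len(examples) + 1}."


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial duplicates compare equal."""
    return " ".join(text.lower().split())


def generate_dataset(
    teacher: Callable[[str], str],
    seeds: List[str],
    target_size: int,
    max_attempts: int = 100,
    rng: Optional[random.Random] = None,
) -> List[Dict[str, str]]:
    """Grow a synthetic instruction dataset from seeds using a teacher model."""
    rng = rng or random.Random()
    pool = list(seeds)                      # instructions usable as in-context examples
    seen = {normalize(s) for s in seeds}    # used to reject duplicates
    records: List[Dict[str, str]] = []
    for _ in range(max_attempts):
        if len(records) >= target_size:
            break
        examples = rng.sample(pool, k=min(3, len(pool)))
        instruction = teacher(build_prompt(examples)).strip()
        key = normalize(instruction)
        if not key or key in seen:
            continue                        # drop empty or repeated proposals
        seen.add(key)
        output = teacher(instruction).strip()
        records.append(
            {"instruction": instruction, "output": output, "source": "synthetic"}
        )
        pool.append(instruction)            # new instructions can seed later prompts
    return records


def make_stub_teacher() -> Callable[[str], str]:
    """Assumption: an offline stand-in for an LLM that repeats each proposal once."""
    calls = itertools.count()

    def teacher(prompt: str) -> str:
        if prompt.startswith(PROMPT_HEADER):
            return f"Describe topic {next(calls) // 2} in one sentence."
        return f"[stub answer to: {prompt}]"

    return teacher


if __name__ == "__main__":
    data = generate_dataset(make_stub_teacher(), SEED_INSTRUCTIONS, target_size=3)
    for row in data:
        print(row)
```

Tagging each record with a `source` field is a design choice of this sketch; it keeps generated examples distinguishable from human-written ones if the two are later mixed.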

Counterexamples

Things that might seem like Synthetic Data but are not:

  • Human-written instructions
  • Web-scraped text
  • Manually annotated examples

Relations

  • overlapsWith distillation (Distillation often uses synthetic data)
  • overlapsWith dataset (Synthetic data is a type of dataset)
  • overlapsWith training-data (Synthetic data can be training data)

Implementations

Tools and frameworks that implement this concept: