Synthetic Data


Also known as: Generated Data, AI-Generated Data, Artificial Data

Definition

Data generated by AI models rather than collected from real-world sources. Synthetic data is increasingly used to train and fine-tune LLMs, especially when real data is scarce, expensive to collect, or raises privacy concerns. A common pattern is for a stronger model (such as GPT-4) to generate training data for a smaller model.
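
As a rough illustration of the "stronger model generates training data for a smaller model" pattern, the sketch below builds prompt/completion records and serializes them as JSONL. It is a minimal sketch under assumptions: `build_synthetic_records`, `fake_teacher`, and the record fields (`prompt`, `completion`, `source`) are hypothetical names chosen for this example, and `fake_teacher` is a stand-in for a real model call, not any provider's API.

```python
import json
from typing import Callable, Dict, List


def build_synthetic_records(
    topics: List[str], teacher: Callable[[str], str]
) -> List[Dict[str, str]]:
    """Ask a teacher model for one completion per topic and tag each record."""
    records = []
    for topic in topics:
        prompt = f"Write one question about {topic} and answer it."
        completion = teacher(prompt)
        # The "source" field is an assumed convention for marking provenance,
        # so synthetic rows can be told apart from human-written ones later.
        records.append(
            {"prompt": prompt, "completion": completion, "source": "synthetic"}
        )
    return records


def to_jsonl(records: List[Dict[str, str]]) -> str:
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r) for r in records)


def fake_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a call to a stronger model."""
    return f"[teacher output for: {prompt}]"


if __name__ == "__main__":
    rows = build_synthetic_records(["photosynthesis", "binary search"], fake_teacher)
    print(to_jsonl(rows))
```

In practice the `teacher` callable would wrap a real model client, and the resulting file would be passed to whatever fine-tuning pipeline the smaller model uses.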

What this is NOT

Not human-written data (synthetic data is model-generated)
Not web-scraped data (scraped text is collected from real sources, not generated)
Not data augmentation in general (augmentation usually transforms existing examples; synthetic data is generated by a model)

Alternative Interpretations

Different communities use this term differently:

llm-practitioners

Text, conversations, or examples generated by LLMs for the purpose of training other models. Used extensively for instruction tuning (e.g., Alpaca), for training on textbook-style generated text (e.g., Phi), and for data augmentation.

Sources: Alpaca (generated with OpenAI's text-davinci-003, a GPT-3.5-series model), Phi models (trained largely on synthetic data), Self-Instruct paper

Examples

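Alpaca dataset: 52K instruction-following examples generated with text-davinci-003 from 175 seed tasks
Phi models trained largely on LLM-generated textbook-style data (phi-1 used GPT-3.5-generated textbooks and exercises)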
Synthetic Q&A pairs for fine-tuning
Generated persona conversations for chat training
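
The seed-instruction approach behind examples like Alpaca can be sketched as a small bootstrapping loop: sample a few existing instructions as in-context examples, ask a generator for a new one, and keep it only if it is not too similar to anything already in the pool. This is a loose sketch, not the Self-Instruct or Alpaca implementation. The names `expand_instructions`, `jaccard`, and `make_stub_generator` are hypothetical, the generator is a canned stub rather than a real model, and word-level Jaccard overlap is swapped in here as a simple stand-in for the ROUGE-L similarity filter described in the Self-Instruct paper.

```python
import random
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity (assumed stand-in for a ROUGE-L filter)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def expand_instructions(
    seeds: List[str],
    generate: Callable[[str], str],
    rounds: int,
    max_similarity: float = 0.7,
    k: int = 3,
    rng_seed: int = 0,
) -> List[str]:
    """Grow an instruction pool from seeds; return only the newly added items."""
    rng = random.Random(rng_seed)
    pool = list(seeds)
    for _ in range(rounds):
        # Show the generator a few instructions drawn from the current pool.
        examples = rng.sample(pool, min(k, len(pool)))
        prompt = (
            "Here are some tasks:\n"
            + "\n".join(f"- {e}" for e in examples)
            + "\nWrite one new, different task:"
        )
        candidate = generate(prompt).strip()
        # Keep the candidate only if it is non-empty and sufficiently novel.
        if candidate and all(
            jaccard(candidate, existing) < max_similarity for existing in pool
        ):
            pool.append(candidate)
    return pool[len(seeds):]


def make_stub_generator(outputs: List[str]) -> Callable[[str], str]:
    """Hypothetical generator that replays canned outputs instead of calling an LLM."""
    it = iter(outputs)

    def generate(prompt: str) -> str:
        return next(it, "")

    return generate


if __name__ == "__main__":
    seeds = [
        "Summarize the following paragraph.",
        "Translate the sentence into French.",
    ]
    stub = make_stub_generator(
        [
            "Summarize the following paragraph.",  # duplicate of a seed, rejected
            "Write a haiku about autumn.",  # novel, accepted
            "write a haiku about autumn.",  # near-duplicate, rejected
        ]
    )
    print(expand_instructions(seeds, stub, rounds=3))
```

The similarity threshold and the number of in-context examples are illustrative defaults; a real pipeline would typically add further quality filters before using the generated instructions for training.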

Counterexamples

Things that might seem like Synthetic Data but are not:

Human-written instructions
Web-scraped text
Manually annotated examples

Relations

overlapsWith distillation (Distillation often uses synthetic data)
overlapsWith dataset (Synthetic data is a type of dataset)
overlapsWith training-data (Synthetic data can be training data)

Implementations

Tools and frameworks that implement this concept:

Snorkel AI (primary)