Synthetic Data
Also known as: Generated Data, AI-Generated Data, Artificial Data
Definition
Data generated by AI models rather than collected from real-world sources. Synthetic data is increasingly used to train and fine-tune LLMs, especially when real data is scarce, expensive, or raises privacy concerns. A powerful model (like GPT-4) can generate training data for smaller models.
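The teacher-to-student idea in the last sentence can be illustrated with a minimal sketch. It assumes a hypothetical `teacher` callable that maps a prompt string to a model response; `stub_teacher` stands in for a real model call, and the record fields (`prompt`, `response`, `source`) are illustrative rather than any standard schema.

```python
import json
from typing import Callable, Iterable


def build_synthetic_pairs(
    questions: Iterable[str], teacher: Callable[[str], str]
) -> list[dict]:
    """Ask the teacher model each question and keep non-empty answers."""
    records = []
    for question in questions:
        answer = teacher(question).strip()
        if not answer:
            continue  # skip empty generations instead of training on them
        # Tag provenance so synthetic rows can be separated from real ones later.
        records.append(
            {"prompt": question, "response": answer, "source": "synthetic"}
        )
    return records


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines, one training example per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def stub_teacher(prompt: str) -> str:
    # Hypothetical placeholder: a real pipeline would call a strong model here.
    return f"Stub answer to: {prompt}"


pairs = build_synthetic_pairs(
    ["What is overfitting?", "Define a token."], stub_teacher
)
print(to_jsonl(pairs))
```

Keeping a provenance field on every record is a design choice in this sketch, not something the definition requires; it makes it easier to audit or filter the synthetic portion of a mixed dataset.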
What this is NOT
- Not human-written data (synthetic is model-generated)
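- Not web-scraped data (scraped text is collected from real sources, not generated)
- Not data augmentation in general (augmentation usually transforms existing examples; synthetic data is produced from scratch by a model)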
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
Text, conversations, or examples generated by LLMs for the purpose of training other models. Used extensively for instruction tuning (Alpaca), for textbook-style training corpora (Phi), and for data augmentation.
Sources: Alpaca (generated with text-davinci-003, a GPT-3.5-series model), Phi models (trained partly on synthetic data), Self-Instruct paper
Examples
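- Alpaca dataset: 52K instruction-following examples generated by text-davinci-003 (GPT-3.5 series) from a small set of seed instructions
- phi-1 trained partly on synthetic textbooks and exercises generated with GPT-3.5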
- Synthetic Q&A pairs for fine-tuning
- Generated persona conversations for chat training
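The seed-instruction approach in the first example above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Alpaca or Self-Instruct code: `generate` is a hypothetical callable wrapping a model, `stub_generate` returns canned strings, and a simple token-level Jaccard similarity is swapped in for the ROUGE-L based filter that Self-Instruct describes, to avoid an extra dependency.

```python
import random
from typing import Callable, Optional


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, in the range [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def expand_seeds(
    seeds: list[str],
    generate: Callable[[str], str],
    rounds: int = 3,
    shots: int = 2,
    max_sim: float = 0.7,
    rng: Optional[random.Random] = None,
) -> list[str]:
    """Grow an instruction pool from seeds, rejecting near-duplicates."""
    rng = rng or random.Random(0)
    pool = list(seeds)
    for _ in range(rounds):
        # Show the model a few existing instructions as in-context examples.
        examples = rng.sample(pool, k=min(shots, len(pool)))
        prompt = "Write one new task instruction.\n" + "\n".join(
            f"- {e}" for e in examples
        )
        candidate = generate(prompt).strip()
        # Keep the candidate only if it is sufficiently different from the pool.
        if candidate and all(jaccard(candidate, p) < max_sim for p in pool):
            pool.append(candidate)
    return pool[len(seeds):]  # only the newly generated instructions


# Hypothetical stand-in for a model: returns canned outputs in order.
canned = iter(
    [
        "Summarize the following paragraph.",
        "Translate this sentence into French.",  # duplicate of a seed
        "List three uses of a hash map.",
    ]
)


def stub_generate(prompt: str) -> str:
    return next(canned)


seeds = ["Translate this sentence into French.", "Write a haiku about autumn."]
new_instructions = expand_seeds(seeds, stub_generate)
print(new_instructions)
```

In this run the second canned output is rejected because it repeats a seed, so only the other two instructions are added to the pool.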
Counterexamples
Things that might seem like Synthetic Data but are not:
- Human-written instructions
- Web-scraped text
- Manually annotated examples
Relations
- overlapsWith distillation (Distillation often uses synthetic data)
- overlapsWith dataset (Synthetic data is a type of dataset)
- overlapsWith training-data (Synthetic data can be training data)
Implementations
Tools and frameworks that implement this concept:
- Snorkel AI (primary)