Training Data
Also known as: Pre-training Data, Training Corpus
Definition
The text corpus used to train an LLM's weights during pre-training. Training data determines what the model "knows"—its vocabulary, facts, reasoning patterns, and biases all come from the training corpus. Modern LLMs are trained on trillions of tokens from web pages, books, code, and other text sources.
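The "trillions of tokens" scale can be made concrete with a back-of-envelope calculation. A minimal sketch using the Chinchilla heuristic of roughly 20 training tokens per model parameter (the heuristic is from the scaling-laws literature, not from this entry; the 70B parameter count is a hypothetical example):

```python
# Back-of-envelope data budget using the Chinchilla heuristic:
# ~20 training tokens per model parameter (an approximation).
def chinchilla_tokens(n_params: float) -> float:
    return 20 * n_params

# A hypothetical 70B-parameter model:
tokens = chinchilla_tokens(70e9)
print(f"{tokens / 1e12:.1f} trillion tokens")  # 1.4 trillion tokens
```

This is only a rule of thumb; real data budgets depend on compute, data quality, and how many epochs the corpus is repeated.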
What this is NOT
- Not the model weights (training data produces weights)
- Not inference data (training happens before deployment)
- Not fine-tuning data specifically (though related)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
The dataset used for pre-training (Common Crawl, books, Wikipedia, code) and/or for fine-tuning. The quality, diversity, and size of the training data are critical factors in model capability.
Sources: LLaMA paper (training data description), GPT-3 paper, Pile dataset documentation
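Because quality matters so much, pre-training pipelines typically filter raw documents with simple heuristics before training. A minimal sketch in that spirit (the thresholds and rules here are illustrative, not taken from any specific published pipeline):

```python
# Sketch of heuristic quality filtering for pretraining text.
# Thresholds are made up for illustration.
def keep_document(text: str, min_words: int = 5,
                  min_alpha_ratio: float = 0.6) -> bool:
    words = text.split()
    if len(words) < min_words:
        return False  # too short to be useful training text
    # Reject documents dominated by symbols, digits, or markup debris.
    alpha = sum(c.isalpha() for c in text)
    return alpha / max(len(text), 1) >= min_alpha_ratio

print(keep_document("The quick brown fox jumps over the lazy dog."))  # True
print(keep_document("404 | | | error ###"))  # False
```

Production pipelines layer many more signals on top of rules like these: language identification, deduplication, and model-based quality classifiers.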
Examples
- Common Crawl web scrape for pre-training
- The Pile (an 800GB diverse dataset)
- The Stack (code scraped from GitHub)
- Wikipedia dumps in multiple languages
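Corpora like those above are rarely used in isolation; training batches are usually drawn from a weighted mixture of sources. A minimal sketch of mixture sampling (the weights below are illustrative, not from any published training recipe):

```python
import random
from collections import Counter

# Illustrative mixture weights over pretraining sources.
sources = {"common_crawl": 0.6, "wikipedia": 0.1, "books": 0.15, "code": 0.15}

rng = random.Random(0)  # fixed seed for reproducibility
draws = rng.choices(list(sources), weights=list(sources.values()), k=10_000)
counts = Counter(draws)
print(counts.most_common(1)[0][0])  # common_crawl dominates the mixture
```

Real pipelines sample documents (or token spans) rather than source labels, but the weighted-mixture idea is the same.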
Counterexamples
Things that might seem like Training Data but are not:
- User prompts at inference time
- Retrieved documents in RAG
- The model's weights (output of training)
Relations
- overlapsWith fine-tuning (Fine-tuning uses additional training data)
- overlapsWith dataset (Training data is a type of dataset)
- overlapsWith large-language-model (LLMs are trained on training data)
Implementations
Tools and frameworks that implement this concept:
- Hugging Face (secondary)
- Labelbox (primary)
- Snorkel AI (primary)
- Weights & Biases (secondary)