Training Data

Artifact · Data · Published

Also known as: Pre-training Data, Training Corpus

Definition

The text corpus used to train an LLM's weights during pre-training. Training data determines what the model "knows"—its vocabulary, facts, reasoning patterns, and biases all come from the training corpus. Modern LLMs are trained on trillions of tokens from web pages, books, code, and other text sources.
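
To make the phrase "trillions of tokens" concrete, the sketch below tokenizes a tiny in-memory corpus and counts the resulting tokens. It assumes the tiktoken library and a generic BPE vocabulary; a real pre-training pipeline would use the target model's own tokenizer and stream terabytes of text.

```python
# Minimal sketch: turning raw text into training tokens and counting them.
# Assumes the `tiktoken` library; vocabulary choice is illustrative only.
import tiktoken

corpus = [
    "Web pages, books, and code make up most pre-training corpora.",
    "def add(a, b):\n    return a + b",
]

enc = tiktoken.get_encoding("cl100k_base")  # a common BPE vocabulary

total_tokens = 0
for document in corpus:
    token_ids = enc.encode(document)  # text -> integer token IDs
    total_tokens += len(token_ids)

print(f"{total_tokens} tokens across {len(corpus)} documents")
```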

What this is NOT

  • Not the model weights (training data produces weights)
  • Not inference data (training happens before deployment)
  • Not fine-tuning data specifically (though related)

Alternative Interpretations

Different communities use this term differently:

llm-practitioners

The dataset used for pre-training (Common Crawl, books, Wikipedia, code) and/or fine-tuning. The quality, diversity, and size of the training data are critical factors in model capability.

Sources: LLaMA paper (training data description), GPT-3 paper, Pile dataset documentation
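
As a concrete way to inspect corpora of this kind, the sketch below streams a few documents from a public text dataset via the Hugging Face datasets library. The dataset identifier is only an example; the corpora named above are distributed under various names and licenses.

```python
# Minimal sketch: streaming documents from a public text corpus.
# Assumes the Hugging Face `datasets` library; the dataset name is
# illustrative -- pre-training corpora ship under many identifiers.
from datasets import load_dataset

# Stream so the full corpus is never downloaded to disk at once.
stream = load_dataset("wikitext", "wikitext-103-raw-v1",
                      split="train", streaming=True)

for i, record in enumerate(stream):
    text = record["text"]
    if text.strip():
        print(text[:80])  # preview the first few non-empty documents
    if i >= 4:
        break
```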

Examples

  • Common Crawl web scrape for pre-training
  • The Pile (an ~800 GB corpus of diverse text)
  • The Stack (code from GitHub)
  • Wikipedia dumps in multiple languages
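
Corpora like those listed above are rarely used verbatim; pre-training runs typically sample from each source with a per-source weight (the LLaMA and GPT-3 papers both report such mixture proportions). The sketch below shows one hypothetical way to express and sample such a mixture; the source names and weights are made up for illustration.

```python
# Hypothetical training-data mixture: per-source sampling weights.
# Source names and proportions are illustrative, not from any paper.
import random

mixture = {
    "common_crawl": 0.67,
    "books":        0.15,
    "wikipedia":    0.05,
    "github_code":  0.13,
}

def sample_source(rng: random.Random) -> str:
    """Pick which corpus the next training document is drawn from."""
    sources = list(mixture)
    weights = [mixture[s] for s in sources]
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {s: 0 for s in mixture}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly proportional to the mixture weights
```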

Counterexamples

Things that might seem like Training Data but are not:

  • User prompts at inference time
  • Retrieved documents in RAG
  • The model's weights (output of training)

Relations

  • overlapsWith fine-tuning (fine-tuning continues training on additional, task-specific data; see the sketch after this list)
  • overlapsWith dataset (a training corpus is a kind of dataset)
  • overlapsWith large-language-model (an LLM's weights are learned from its training data)
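
To make the fine-tuning relation concrete, the sketch below contrasts the shape of raw pre-training text with a typical instruction-style fine-tuning record. The field names are hypothetical; frameworks differ in their schemas.

```python
# Illustrative contrast between pre-training and fine-tuning records.
# Field names are hypothetical; actual schemas vary by framework.

# Pre-training data: raw, unlabeled text consumed as-is.
pretraining_example = "Photosynthesis converts light energy into chemical energy."

# Fine-tuning data: additional, structured examples of desired behavior.
finetuning_example = {
    "instruction": "Summarize photosynthesis in one sentence.",
    "response": "Plants convert light, water, and CO2 into sugar and oxygen.",
}
```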

Implementations

Tools and frameworks that implement this concept: