Dataset
Also known as: Data Set, Corpus, Data Collection
Definition
A structured collection of data used for training, fine-tuning, or evaluating AI models. Datasets can contain text, images, labels, or structured records. In the LLM context, datasets supply pre-training corpora, instruction-tuning examples, preference data for RLHF, evaluation benchmarks, and the documents behind RAG knowledge bases.
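To make "a structured collection of records" concrete, here is a minimal sketch using only the Python standard library. The records, the field names (`instruction`, `response`), and the helper functions are hypothetical and chosen for illustration; real datasets define their own schemas and are usually far larger.

```python
import json
import random

# Hypothetical instruction-tuning records; the field names are illustrative only.
records = [
    {"instruction": "Translate to French: cat", "response": "chat"},
    {"instruction": "Add 2 and 3", "response": "5"},
    {"instruction": "Name a primary color", "response": "red"},
    {"instruction": "Give the opposite of hot", "response": "cold"},
    {"instruction": "Spell 'dog' backwards", "response": "god"},
]


def to_jsonl(rows):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in rows)


def from_jsonl(text):
    """Parse JSON Lines text back into a list of records."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def split_dataset(rows, eval_fraction=0.2, seed=0):
    """Shuffle a copy with a fixed seed, then split into (train, eval)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_fraction))
    return shuffled[n_eval:], shuffled[:n_eval]


# Round-trip through JSONL, then hold out a small evaluation subset.
dataset = from_jsonl(to_jsonl(records))
train_rows, eval_rows = split_dataset(dataset)
print(len(train_rows), "training examples,", len(eval_rows), "evaluation examples")
```

The fixed seed keeps the split reproducible, so the same examples land in the evaluation subset on every run.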
What this is NOT
- Not a database (a dataset is a collection of examples for ML, not a system for transactional reads and writes)
- Not the model (datasets train models)
- Not a single example (datasets are collections)
Alternative Interpretations
Different communities use this term differently:
llm-practitioners
Collections of examples used for various purposes: pre-training corpora, instruction-tuning datasets (Alpaca, ShareGPT), evaluation benchmarks (MMLU), or knowledge bases. Often hosted on Hugging Face.
Sources: Hugging Face Datasets, various dataset papers
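As a rough illustration of the evaluation-benchmark use, the sketch below scores a stand-in "model" against a tiny multiple-choice dataset. The three items, the `question`/`choices`/`answer` layout, and the `always_pick_second` predictor are all hypothetical; they are not taken from MMLU or any real benchmark.

```python
# Hypothetical benchmark items; "answer" is the index of the correct choice.
benchmark = [
    {"question": "What is 2 + 2?",
     "choices": ["3", "4", "5", "6"], "answer": 1},
    {"question": "Which gas do plants absorb?",
     "choices": ["Oxygen", "Nitrogen", "Carbon dioxide", "Helium"], "answer": 2},
    {"question": "How many days are in a week?",
     "choices": ["5", "6", "7", "8"], "answer": 2},
]


def always_pick_second(question, choices):
    """Stand-in for a model: ignores its inputs and returns index 1."""
    return 1


def accuracy(items, predict):
    """Fraction of items where the predicted index matches the answer."""
    correct = sum(
        1 for item in items
        if predict(item["question"], item["choices"]) == item["answer"]
    )
    return correct / len(items)


score = accuracy(benchmark, always_pick_second)
print(f"accuracy: {score:.2f}")
```

Swapping `always_pick_second` for a function that queries an actual model would leave the dataset and the scoring loop unchanged, which is what makes a fixed evaluation dataset useful for comparing models.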
Examples
- Alpaca dataset (52K instruction-response pairs)
- Common Crawl (petabytes of crawled web data)
- ShareGPT (conversations shared from ChatGPT)
- MMLU (about 14K test questions across 57 subjects)
Counterexamples
Things that might seem like a dataset but are not:
- A single prompt-response pair
- The trained model weights
- A live database with changing data
Relations
- generalizes training-data (Training data is a type of dataset)
- overlapsWith knowledge-base (Knowledge bases are datasets for retrieval)
- overlapsWith benchmark (Benchmarks include evaluation datasets)