Dataset

Artifact data published

Also known as: Data Set, Corpus, Data Collection

Definition

A structured collection of data used for training, fine-tuning, or evaluating AI models. Datasets can contain text, images, labels, or structured records. In the LLM context, datasets serve as pre-training corpora, instruction-tuning data, preference data for reinforcement learning from human feedback (RLHF), evaluation benchmarks, and knowledge bases for retrieval-augmented generation (RAG).
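As a rough illustration of "a structured collection" used for both training and evaluation, the sketch below models a dataset as a list of records and splits it into two parts. The field names (`text`, `label`), the helper `split_dataset`, and the example records are hypothetical and chosen only for illustration; real datasets vary widely in schema and tooling.

```python
import random
from typing import TypedDict


class Example(TypedDict):
    """One labeled record. The field names are illustrative assumptions."""
    text: str
    label: str


def split_dataset(examples, eval_fraction=0.25, seed=0):
    """Shuffle a copy of the examples and return (train, eval) lists."""
    if not 0.0 <= eval_fraction <= 1.0:
        raise ValueError("eval_fraction must be between 0 and 1")
    shuffled = list(examples)              # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# A tiny, made-up dataset: a collection of records, not a single example.
dataset: list[Example] = [
    {"text": "great movie", "label": "positive"},
    {"text": "terrible plot", "label": "negative"},
    {"text": "loved the cast", "label": "positive"},
    {"text": "fell asleep", "label": "negative"},
]

train, eval_set = split_dataset(dataset, eval_fraction=0.25)
```

The fixed seed reflects a common design choice: a dataset split is usually frozen so that evaluation results stay comparable across runs.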

What this is NOT

Not a database (datasets are for ML, not transactional use)
Not the model (datasets train models)
Not a single example (datasets are collections)

Alternative Interpretations

Different communities use this term differently:

llm-practitioners

Collections of examples used for various purposes: pre-training corpora, instruction-tuning datasets (Alpaca, ShareGPT), evaluation benchmarks (MMLU), or knowledge bases. Often hosted on Hugging Face.

Sources: Hugging Face Datasets, various dataset papers

Examples

Alpaca dataset (52K instruction-response pairs)
Common Crawl (petabytes of web crawl data, widely used as a source of pre-training text)
ShareGPT (conversations shared from ChatGPT)
MMLU (about 14K multiple-choice test questions across 57 subjects)
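The examples above differ in record shape as well as size. The sketch below contrasts an instruction-tuning record with a multiple-choice evaluation record. The field names follow the commonly distributed versions of Alpaca (`instruction`, `input`, `output`) and MMLU (`question`, `choices`, `answer` as an index), but treat them as assumptions and check the copy you actually use. The record contents and the `accuracy` helper are made up for illustration.

```python
# Instruction-tuning record (Alpaca-style field names; content is made up).
alpaca_style = {
    "instruction": "Translate the sentence to French.",
    "input": "Good morning.",
    "output": "Bonjour.",
}

# Evaluation record (MMLU-style field names; content is made up).
mmlu_style = {
    "question": "What is 2 + 2?",
    "choices": ["3", "4", "5", "6"],
    "answer": 1,  # index into `choices`
}


def accuracy(records, predicted_indices):
    """Fraction of multiple-choice records whose predicted index matches `answer`."""
    if len(records) != len(predicted_indices):
        raise ValueError("need exactly one prediction per record")
    if not records:
        return 0.0
    correct = sum(
        1 for record, predicted in zip(records, predicted_indices)
        if predicted == record["answer"]
    )
    return correct / len(records)


score = accuracy([mmlu_style, mmlu_style], [1, 3])
```

An instruction-tuning record supplies a target output to learn from, while a benchmark record supplies a reference answer to score against; both are still examples inside a dataset.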

Counterexamples

Things that might seem like a dataset but are not:

A single prompt-response pair
The trained model weights
A live database with changing data

Relations

generalizes training-data (Training data is a type of dataset)
overlapsWith knowledge-base (Knowledge bases are datasets for retrieval)
overlapsWith benchmark (Benchmarks include evaluation datasets)