Dataset

Artifact data published

Also known as: Data Set, Corpus, Data Collection

Definition

A structured collection of data used for training, fine-tuning, or evaluating AI models. Datasets can contain text, images, labels, or structured records. In the LLM context, datasets supply the examples for pre-training, instruction tuning, reinforcement learning from human feedback (RLHF), and evaluation, and the documents behind retrieval-augmented generation (RAG) knowledge bases.
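As a rough illustration of the definition, the sketch below treats a dataset as a plain collection of records and splits it into a training part and an evaluation part. The `Example` class, its field names, the `split_dataset` helper, and the 80/20 split are assumptions made for this example only; they are not taken from any particular library or dataset.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One instruction-tuning record; the field names are illustrative."""
    instruction: str
    response: str


def split_dataset(examples, eval_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and split it into (train, eval) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# A toy dataset: ten records, each one a single example.
dataset = [
    Example(f"Add {i} and {i + 1}.", str(2 * i + 1)) for i in range(10)
]
train_set, eval_set = split_dataset(dataset)
print(len(train_set), len(eval_set))
```

The point of the sketch is only that the dataset is the whole collection, while each `Example` is one member of it; real datasets usually add metadata, documentation, and fixed, published splits.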

What this is NOT

  • Not a database (datasets are for ML, not transactional use)
  • Not the model (datasets train models)
  • Not a single example (datasets are collections)

Alternative Interpretations

Different communities use this term differently:

llm-practitioners

Collections of examples used for various purposes: pre-training corpora, instruction-tuning datasets (Alpaca, ShareGPT), evaluation benchmarks (MMLU), or knowledge bases. Often hosted on Hugging Face.

Sources: Hugging Face Datasets, various dataset papers

Examples

  • Alpaca dataset (52K instruction-response pairs)
  • Common Crawl (petabytes of crawled web data)
  • ShareGPT (conversations shared from ChatGPT)
  • MMLU (about 14K test questions across 57 subjects)

Counterexamples

Things that might seem like a dataset but are not:

  • A single prompt-response pair
  • The trained model weights
  • A live database with changing data
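One way to picture the last counterexample is to compare a fixed snapshot with a source that keeps changing. The sketch below is only an illustration under stated assumptions: the `fingerprint` helper, the record fields, and the use of a SHA-256 digest over canonical JSON are choices made for this example, not a standard dataset-versioning scheme.

```python
import hashlib
import json


def fingerprint(records):
    """Return a SHA-256 hex digest of the records in a canonical JSON form."""
    canonical = json.dumps(records, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# A fixed snapshot of records (hypothetical content).
snapshot = [
    {"prompt": "Capital of France?", "response": "Paris"},
    {"prompt": "2 + 2?", "response": "4"},
]
released = fingerprint(snapshot)

# Re-reading the same snapshot gives the same fingerprint.
same_again = fingerprint(list(snapshot)) == released

# A live source that has gained a row no longer matches the snapshot.
live = snapshot + [{"prompt": "3 + 3?", "response": "6"}]
still_matches = fingerprint(live) == released

print(same_again, still_matches)
```

Under these assumptions, a dataset behaves like the snapshot: it can be identified and reproduced. A live database behaves like `live`: what it contains depends on when it is read.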

Relations

  • generalizes training-data (Training data is a type of dataset)
  • overlapsWith knowledge-base (Knowledge bases are datasets for retrieval)
  • overlapsWith benchmark (Benchmarks include evaluation datasets)