Benchmark

Artifact · evaluation · published

Also known as: Evaluation Benchmark, Test Suite, Eval

Definition

A standardized dataset and evaluation protocol for measuring LLM performance on specific capabilities or tasks. Benchmarks enable comparison across models and tracking of progress over time. They typically include test cases, expected outputs or criteria, and scoring methods.
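A minimal sketch of these three pieces in Python, with a fixed test set, references, and a scoring rule; names such as TestCase, exact_match, and run_benchmark are illustrative and not from any particular framework:

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class TestCase:
        prompt: str       # input shown to the model
        reference: str    # expected output (or grading criterion)

    def exact_match(prediction: str, reference: str) -> float:
        """Scoring method: 1.0 if the normalized strings match, else 0.0."""
        return float(prediction.strip().lower() == reference.strip().lower())

    def run_benchmark(model_fn: Callable[[str], str],
                      cases: List[TestCase],
                      scorer: Callable[[str, str], float] = exact_match) -> float:
        """Run a fixed test set under a fixed scoring rule; return the mean score."""
        scores = [scorer(model_fn(c.prompt), c.reference) for c in cases]
        return sum(scores) / len(scores)

    # The same fixed cases and scorer can be applied to different models,
    # which is what makes results comparable across models and over time.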

What this is NOT

  • Not production metrics (benchmarks are standardized tests)
  • Not unit tests (benchmarks compare across models)
  • Not A/B testing (benchmarks have fixed test sets)

Alternative Interpretations

Different communities use this term differently:

LLM practitioners

Standard evaluations like MMLU, HumanEval, GSM8K that are used to compare model capabilities. Results are often reported in model cards and used for model selection.

Sources: MMLU paper, HumanEval (OpenAI), Stanford HELM, LM Evaluation Harness
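A hedged sketch of running a standard task through the LM Evaluation Harness; the entry point, argument names, and result layout below are assumptions and may differ between harness versions:

    # Approximate usage of EleutherAI's lm-evaluation-harness (assumption:
    # exact function name and result structure vary across versions).
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",                          # Hugging Face backend
        model_args="pretrained=gpt2",        # any HF model id
        tasks=["mmlu"],                      # standard benchmark task
        num_fewshot=5,                       # 5-shot, as conventionally reported
        limit=100,                           # subsample for a quick check
    )
    print(results["results"])                # per-task metric dicts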

Examples

  • MMLU: 5-shot multiple choice across 57 subjects
  • HumanEval: Generate Python functions from docstrings, scored as pass@k (see the sketch after this list)
  • MT-Bench: Multi-turn conversation quality
  • AlpacaEval: Instruction-following comparison
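HumanEval results are conventionally reported as pass@k, estimated from n sampled completions per problem of which c pass the unit tests. A minimal sketch of the standard unbiased estimator (illustrative code, not taken from the official repository):

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased estimate of pass@k given n samples, c of which are correct.

        pass@k = 1 - C(n - c, k) / C(n, k)
        """
        if n - c < k:
            return 1.0  # too few failing samples to fill a draw of size k
        return 1.0 - comb(n - c, k) / comb(n, k)

    # e.g. 200 samples per problem, 37 passing -> estimated pass@10
    print(round(pass_at_k(n=200, c=37, k=10), 3))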

Counterexamples

Things that might seem like a Benchmark but are not:

  • Production monitoring metrics (not standardized)
  • User satisfaction surveys (subjective; no fixed test set or scoring protocol)
  • Internal test suites (not cross-model comparable)

Relations

Implementations

Tools and frameworks that implement this concept: