Benchmark

Artifact · evaluation · published

Also known as: Evaluation Benchmark, Test Suite, Eval

Definition

A standardized dataset and evaluation protocol for measuring LLM performance on specific capabilities or tasks. Benchmarks enable comparison across models and tracking of progress over time. They typically include test cases, expected outputs or criteria, and scoring methods.
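A minimal sketch of these three pieces in Python, with a fixed test set, references, and a scoring rule; names such as TestCase, exact_match, and run_benchmark are illustrative and not from any particular framework:

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class TestCase:
        prompt: str       # input shown to the model
        reference: str    # expected output (or grading criterion)

    def exact_match(prediction: str, reference: str) -> float:
        """Scoring method: 1.0 if the normalized strings match, else 0.0."""
        return float(prediction.strip().lower() == reference.strip().lower())

    def run_benchmark(model_fn: Callable[[str], str],
                      cases: List[TestCase],
                      scorer: Callable[[str, str], float] = exact_match) -> float:
        """Run a fixed test set under a fixed scoring rule; return the mean score."""
        scores = [scorer(model_fn(c.prompt), c.reference) for c in cases]
        return sum(scores) / len(scores)

    # The same fixed cases and scorer can be applied to different models,
    # which is what makes results comparable across models and over time.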

What this is NOT

  • Not production metrics (benchmarks are standardized tests)
  • Not unit tests (benchmarks compare across models)
  • Not A/B testing (benchmarks have fixed test sets)

Alternative Interpretations

Different communities use this term differently:

LLM practitioners

Standard evaluations like MMLU, HumanEval, GSM8K that are used to compare model capabilities. Results are often reported in model cards and used for model selection.

Sources: MMLU paper, HumanEval (OpenAI), Stanford HELM, LM Evaluation Harness
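A hedged sketch of running a standard task through the LM Evaluation Harness; the entry point, argument names, and result layout below are assumptions and may differ between harness versions:

    # Approximate usage of EleutherAI's lm-evaluation-harness (assumption:
    # exact function name and result structure vary across versions).
    import lm_eval

    results = lm_eval.simple_evaluate(
        model="hf",                          # Hugging Face backend
        model_args="pretrained=gpt2",        # any HF model id
        tasks=["mmlu"],                      # standard benchmark task
        num_fewshot=5,                       # 5-shot, as conventionally reported
        limit=100,                           # subsample for a quick check
    )
    print(results["results"])                # per-task metric dicts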

Examples

  • MMLU: 5-shot multiple choice across 57 subjects
  • HumanEval: Generate Python functions from docstrings, scored as pass@k (see the sketch after this list)
  • MT-Bench: Multi-turn conversation quality
  • AlpacaEval: Instruction-following comparison
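HumanEval results are conventionally reported as pass@k, estimated from n sampled completions per problem of which c pass the unit tests. A minimal sketch of the standard unbiased estimator (illustrative code, not taken from the official repository):

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Unbiased estimate of pass@k given n samples, c of which are correct.

        pass@k = 1 - C(n - c, k) / C(n, k)
        """
        if n - c < k:
            return 1.0  # too few failing samples to fill a draw of size k
        return 1.0 - comb(n - c, k) / comb(n, k)

    # e.g. 200 samples per problem, 37 passing -> estimated pass@10
    print(round(pass_at_k(n=200, c=37, k=10), 3))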

Counterexamples

Things that might seem like a Benchmark but are not:

  • Production monitoring metrics (not standardized)
  • User satisfaction surveys (subjective; no fixed test set or scoring protocol)
  • Internal test suites (not cross-model comparable)

Relations

Implementations

Tools and frameworks that implement this concept: