Benchmark
Also known as: Evaluation Benchmark, Test Suite, Eval
Definition
A standardized dataset and evaluation protocol for measuring LLM performance on specific capabilities or tasks. Benchmarks enable comparison across models and tracking of progress over time. They typically include test cases, expected outputs or criteria, and scoring methods.
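For concreteness, here is a minimal sketch of what a benchmark bundles together: fixed test cases, a scoring method, and an aggregation step that yields a single comparable number. The names (`TestCase`, `exact_match`, `run_benchmark`) are illustrative only and not taken from any particular framework.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestCase:
    prompt: str    # input shown to the model
    expected: str  # reference answer or acceptance criterion

def exact_match(prediction: str, expected: str) -> float:
    """Score 1.0 if the normalized prediction equals the reference, else 0.0."""
    return float(prediction.strip().lower() == expected.strip().lower())

def run_benchmark(cases: list[TestCase],
                  model: Callable[[str], str],
                  score: Callable[[str, str], float] = exact_match) -> float:
    """Run one model over every test case and return the mean score."""
    scores = [score(model(case.prompt), case.expected) for case in cases]
    return sum(scores) / len(scores)

# Because the cases and the scorer stay fixed, swapping in a different
# model callable yields directly comparable accuracy numbers.
cases = [TestCase("What is 2 + 2?", "4"), TestCase("Capital of France?", "Paris")]
accuracy = run_benchmark(cases, model=lambda prompt: "4")  # stub model for illustration
```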
What this is NOT
- Not production metrics (benchmarks are standardized tests)
- Not unit tests (benchmarks compare across models)
- Not A/B testing (benchmarks have fixed test sets)
Alternative Interpretations
Different communities use this term differently:
LLM practitioners
Standard evaluations such as MMLU, HumanEval, and GSM8K used to compare model capabilities. Results are often reported in model cards and used for model selection.
Sources: MMLU paper, HumanEval (OpenAI), Stanford HELM, LM Evaluation Harness
Examples
- MMLU: 5-shot multiple choice across 57 subjects
- HumanEval: Generate Python functions from docstrings, scored with pass@k (see the sketch after this list)
- MT-Bench: Multi-turn conversation quality, scored by an LLM judge
- AlpacaEval: Instruction-following win rate against a reference model, judged by an LLM
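The HumanEval entry above reports pass@k: the probability that at least one of k sampled completions per problem passes the unit tests. The sketch below follows the unbiased estimator described in the HumanEval paper; the sample counts in the usage lines are invented for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the chance that at least one of k samples
    drawn without replacement from n generations is correct, given that
    c of the n generations passed the unit tests."""
    if n - c < k:  # every size-k subset must contain a correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 200 samples per problem, 37 of them passing.
print(pass_at_k(200, 37, 1))   # 0.185
print(pass_at_k(200, 37, 10))  # ≈ 0.88
```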
Counterexamples
Things that might seem like a benchmark but are not:
- Production monitoring metrics (not standardized)
- User satisfaction surveys (subjective, not standardized or repeatable)
- Internal test suites (not cross-model comparable)
Relations
- overlapsWith faithfulness (Faithfulness can be benchmarked)
- overlapsWith hallucination (Hallucination rate can be benchmarked)
- overlapsWith red-teaming (Safety benchmarks exist)
Implementations
Tools and frameworks that implement this concept:
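As a concrete example involving one of the frameworks already cited above, the sketch below shows roughly how LM Evaluation Harness can be driven from Python. The `simple_evaluate` entry point, its parameter names, and the choice of `EleutherAI/pythia-160m` are assumptions based on the v0.4-era documentation and may differ in other versions; check the project repository for the current interface.

```python
# Sketch assuming lm-evaluation-harness (pip install lm-eval), v0.4-style API.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",  # small model, illustrative only
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)

# Per-task scores live under results["results"]; exact metric keys
# depend on the task configuration.
print(results["results"])
```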