What Is a Benchmark?

A benchmark is a standardised dataset or task used to measure and compare the performance of AI models on a defined capability.

Definition

Benchmark: a standardised dataset or task used to measure and compare the performance of AI models on a defined capability.

Benchmarks make model comparison possible, but they are an imperfect proxy for real-world performance: models can be tuned to score well on a benchmark without being correspondingly better in practice, and benchmark data can leak into training sets. Treating benchmark scores as the whole story is a common governance mistake.

Source: Machine-learning practice
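
The leakage risk described in the definition can be made concrete with a rough overlap check. The sketch below is only an illustration under stated assumptions: the function names `ngrams` and `contamination_rate`, the whitespace tokenisation, and the toy data are all hypothetical, and word n-gram overlap is just one simple heuristic for spotting benchmark items that may also appear in training data. It is not a standard or complete contamination audit.

```python
def ngrams(text, n):
    """Return the set of word n-grams in text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items, training_docs, n=8):
    """Fraction of benchmark items that share at least one n-gram with the training docs.

    Hypothetical helper: a high value suggests possible leakage and warrants a
    closer look; a low value does not prove the benchmark is clean.
    """
    if not benchmark_items:
        return 0.0
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & train_grams)
    return flagged / len(benchmark_items)


# Toy data, invented for illustration only.
benchmark_items = [
    "what is the capital of france",
    "name the largest planet in the solar system",
]
training_docs = [
    "quiz archive: what is the capital of france answer paris",
    "notes on cooking pasta at home",
]

# A short n is used here because the toy strings are short.
rate = contamination_rate(benchmark_items, training_docs, n=3)
print(f"Benchmark items overlapping training data: {rate:.0%}")
```

In this toy case the first question appears word for word in the training text and the second does not, so half of the items are flagged. A score on the flagged items says little about real-world capability, which is why a benchmark result is better reported alongside some check of this kind than on its own.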

Related terms

Model Evaluation, AI Red Teaming, Ground Truth, Robustness
