A benchmark is a standardized test set with an agreed-on task definition and metric, used to compare models head-to-head. Language understanding has GLUE and SuperGLUE, coding has HumanEval and MBPP, general knowledge has MMLU, grade-school math has GSM8K, and human preference is measured by Chatbot Arena. Benchmarks have historically given the field a common language for progress, but once models saturate one (scores approach the ceiling, leaving little room to tell models apart), harder successors such as MMLU-Pro, GPQA, and Humanity's Last Exam become necessary. Production-quality evaluation usually complements public benchmarks with custom evals that reflect how the product is actually used.
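As a rough illustration of the three ingredients named above (a fixed test set, a task definition, and a metric), here is a minimal Python sketch. Everything in it is hypothetical: the four-item arithmetic `BENCHMARK`, the exact-match metric, and the two stand-in "models" are invented for the example and do not correspond to any real benchmark or library.

```python
import operator
from typing import Callable, Dict, List, Tuple

# Hypothetical toy benchmark: a fixed list of (question, reference answer) pairs.
BENCHMARK: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("3 * 5", "15"),
    ("10 - 7", "3"),
    ("9 / 3", "3"),
]

Model = Callable[[str], str]


def exact_match_accuracy(model: Model, items: List[Tuple[str, str]]) -> float:
    """Metric: fraction of items where the model's answer equals the reference."""
    correct = sum(1 for question, ref in items if model(question).strip() == ref)
    return correct / len(items)


def model_a(question: str) -> str:
    # Stand-in "model" that ignores the question and always answers "3".
    return "3"


_OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.floordiv,
}


def model_b(question: str) -> str:
    # Stand-in "model" that actually computes "<int> <op> <int>".
    left, op, right = question.split()
    return str(_OPS[op](int(left), int(right)))


def leaderboard(models: Dict[str, Model]) -> List[Tuple[str, float]]:
    """Score every model on the same items with the same metric, best first."""
    scores = [(name, exact_match_accuracy(m, BENCHMARK)) for name, m in models.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, score in leaderboard({"model_a": model_a, "model_b": model_b}):
        print(f"{name}: {score:.2f}")
```

Because both models see identical items and are scored by the same function, their numbers are directly comparable. In this toy setup `model_b` already scores at the ceiling, which is the saturation situation described above: any stronger model would also score 1.0, so a harder item set would be needed to separate them.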
MEVZU · N°124 · ISTANBUL · YEAR I · VOL. III
Glossary · Beginner · 2018
Benchmark
A standardized test set and evaluation protocol used to compare models.
- EN (English term): Benchmark
- TR (Turkish term): Kıyaslama (Benchmark)