Benchmark

A benchmark is a standardized test set paired with an agreed-on task definition and metric, used to compare models head-to-head. Well-known examples include GLUE and SuperGLUE for natural language understanding, HumanEval and MBPP for code generation, MMLU for general knowledge, and GSM8K for grade-school math; Chatbot Arena measures human preference through crowdsourced pairwise comparisons rather than a fixed test set. Benchmarks have given the field a common language for measuring progress, but once models saturate one, scoring so close to the ceiling that it no longer separates strong systems, harder successors (MMLU-Pro, GPQA, Humanity's Last Exam) become necessary. Production evaluation usually complements public benchmarks with custom evals built around the inputs and tasks the product actually handles.
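
The three ingredients named above (a fixed test set, a task definition, and a metric) can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the four-item `TEST_SET`, the stand-in functions `model_a` and `model_b`, and the choice of exact-match accuracy as the metric. It is not drawn from any real benchmark or evaluation harness.

```python
from typing import Callable, Dict, List, Tuple

Model = Callable[[str], str]

# Hypothetical toy test set: (prompt, reference answer) pairs.
# A real benchmark would ship thousands of curated items.
TEST_SET: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("3 * 5", "15"),
    ("10 - 7", "3"),
    ("8 / 2", "4"),
]


def exact_match_accuracy(model: Model, test_set: List[Tuple[str, str]]) -> float:
    """Metric: fraction of items whose stripped output equals the reference."""
    correct = sum(1 for prompt, ref in test_set if model(prompt).strip() == ref)
    return correct / len(test_set)


def model_a(prompt: str) -> str:
    # Stand-in "model" (assumption): a lookup table that answers every item.
    answers = {"2 + 2": "4", "3 * 5": "15", "10 - 7": "3", "8 / 2": "4"}
    return answers.get(prompt, "")


def model_b(prompt: str) -> str:
    # Stand-in "model" (assumption): always answers "4".
    return "4"


def leaderboard(models: Dict[str, Model]) -> List[Tuple[str, float]]:
    """Score every model on the same test set and sort best-first."""
    scores = [(name, exact_match_accuracy(fn, TEST_SET)) for name, fn in models.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for name, score in leaderboard({"model_a": model_a, "model_b": model_b}):
        print(f"{name}: {score:.2f}")
```

Because both models are scored on identical items with an identical metric, their numbers are directly comparable, which is the property that makes a benchmark useful. Once every serious model scores near 1.00 on such a set, the ranking stops carrying information, which is the saturation problem described above.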