HumanEval is a coding benchmark of 164 hand-written Python programming tasks introduced in OpenAI's 2021 Codex paper. Each problem ships with a function signature, a docstring, and held-out unit tests; the metric is "pass@k," the probability that at least one of k generated samples passes the tests. Nearly every major code model, from Codex (the model behind the original GitHub Copilot) through GPT-4 and DeepSeek R1, has reported results on it. As top models neared saturation, attention shifted to other coding benchmarks such as MBPP and, later, harder ones like LiveCodeBench and SWE-bench.
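Concretely, the Codex paper estimates pass@k without bias by drawing n ≥ k samples per problem, counting the c that pass, and computing 1 − C(n−c, k)/C(n, k), averaged over problems. Below is a minimal sketch of that estimator in its numerically stable product form; the function name and the example numbers are illustrative, not from the paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples drawn for the problem
    c: samples that passed the unit tests
    k: evaluation budget
    """
    if n - c < k:
        # fewer than k failing samples, so every k-subset contains a pass
        return 1.0
    # stable product form: C(n-c, k) / C(n, k) = prod_{i=n-c+1}^{n} (1 - k/i)
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 42 of them correct
print(pass_at_k(200, 42, 1))   # 0.21 (equals c/n when k=1)
print(pass_at_k(200, 42, 10))  # ~0.91
```

Averaging this quantity over all 164 problems gives the reported pass@k score.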
Glossary · Intermediate · 2021
HumanEval
An OpenAI coding benchmark that evaluates model-generated Python functions against unit tests.
- EN (English term) — HumanEval
- TR (Turkish term) — HumanEval