HumanEval is a coding benchmark of 164 hand-written Python programming tasks introduced in OpenAI's 2021 Codex paper. Each problem ships with a function signature, a docstring, and held-out unit tests; the metric is "pass@k," the probability that at least one of k generated samples passes the tests. Nearly every major code model, from Codex (the model behind the original GitHub Copilot) through GPT-4 and DeepSeek R1, has reported results on it. As top models neared saturation, attention shifted to other coding benchmarks such as MBPP and, later, harder ones like LiveCodeBench and SWE-bench.
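Concretely, the Codex paper estimates pass@k without bias by drawing n ≥ k samples per problem, counting the c that pass, and computing 1 − C(n−c, k)/C(n, k), averaged over problems. Below is a minimal sketch of that estimator in its numerically stable product form; the function name and the example numbers are illustrative, not from the paper.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: total samples drawn for the problem
    c: samples that passed the unit tests
    k: evaluation budget
    """
    if n - c < k:
        # fewer than k failing samples, so every k-subset contains a pass
        return 1.0
    # stable product form: C(n-c, k) / C(n, k) = prod_{i=n-c+1}^{n} (1 - k/i)
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 42 of them correct
print(pass_at_k(200, 42, 1))   # 0.21 (equals c/n when k=1)
print(pass_at_k(200, 42, 10))  # ~0.91
```

Averaging this quantity over all 164 problems gives the reported pass@k score.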
Glossary · Intermediate · 2021
HumanEval
An OpenAI coding benchmark that evaluates model-generated Python functions against unit tests.
- EN (English term) — HumanEval
- TR (Turkish term) — HumanEval