MEVZU N°128 · ISTANBUL

MEVZU N° TAG / VOL. 059

#eval


§03 Wiki · 15 entries
§01 Glossary

ROUGE

A classic summarization metric (Recall-Oriented Understudy for Gisting Evaluation) based on n-gram and sequence overlap.

EN
ROUGE
TR
ROUGE
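
The overlap idea can be sketched as a minimal ROUGE-1 recall score. This is a simplified illustration only; real implementations (e.g. the `rouge-score` package) add stemming, ROUGE-2, and the longest-common-subsequence ROUGE-L variant:

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """ROUGE-1 recall: fraction of reference unigrams found in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    # Clipped overlap: each reference word counts at most as often as it appears.
    overlap = sum(min(cand[w], count) for w, count in ref.items())
    return overlap / sum(ref.values())
```

For example, `rouge1_recall("the cat sat", "the cat sat on the mat")` recovers 3 of the 6 reference unigrams, giving 0.5.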
§02 Glossary

LMSYS Chatbot Arena

A public eval platform that ranks blind pairs of models by human preference.

EN
LMSYS Chatbot Arena
TR
LMSYS Chatbot Arena
§03 Glossary

Eval

A test suite that scores a model or system against predefined criteria.

EN
Eval
TR
Eval — Değerlendirme
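
In its simplest form, an eval is a set of (input, criterion) pairs and an aggregate score. A minimal sketch, where `model` is any hypothetical callable from prompt to output:

```python
def run_eval(model, cases):
    """Score a model callable against predefined pass/fail criteria.

    cases: list of (prompt, check) pairs, where check(output) returns bool.
    Returns the fraction of cases passed.
    """
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)
```

Usage: `run_eval(str.upper, [("hi", lambda out: out == "HI"), ("bye", lambda out: out == "nope")])` passes one of two cases and returns 0.5.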
§04 Glossary

Benchmark

A standardized test set and evaluation protocol used to compare models.

EN
Benchmark
TR
Kıyaslama (Benchmark)
§05 Glossary

Hallucination Rate

A metric that measures how often a model fabricates or generates incorrect information.

EN
Hallucination Rate
TR
Halüsinasyon Oranı
§06 Glossary

BLEU

A classic machine-translation metric (Bilingual Evaluation Understudy) based on n-gram overlap with reference translations.

EN
BLEU
TR
BLEU
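
The core of BLEU is clipped n-gram precision. The sketch below shows only the unigram case; full BLEU combines precisions up to 4-grams with a geometric mean and a brevity penalty (libraries such as sacreBLEU handle all of this):

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: candidate words credited at most as
    often as they occur in the reference (so repetition cannot inflate it)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not cand:
        return 0.0
    clipped = sum(min(count, ref[w]) for w, count in cand.items())
    return clipped / sum(cand.values())
```

Clipping is what penalizes degenerate outputs: `unigram_precision("the the the", "the cat")` is 1/3, not 1, because "the" is credited only once.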
§07 Glossary

MMLU

A broad multiple-choice benchmark (Massive Multitask Language Understanding) that tests knowledge and reasoning across 57 subjects.

EN
MMLU
TR
MMLU
§08 Glossary

GSM8K

A benchmark that measures step-by-step reasoning with grade-school math problems.

EN
GSM8K
TR
GSM8K
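
GSM8K reference solutions end with the marker `#### <answer>`, so scoring reduces to extracting that final number and comparing it with the model's. A minimal extraction sketch:

```python
import re

def extract_gsm8k_answer(solution: str) -> str:
    """Pull the final numeric answer out of a GSM8K-style solution,
    which by convention ends with '#### <answer>'."""
    match = re.search(r"####\s*([-\d,.]+)", solution)
    # Strip thousands separators so "1,200" and "1200" compare equal.
    return match.group(1).replace(",", "") if match else ""
```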
§09 Glossary

Elo Rating

A rating system from chess that derives relative skill scores from pairwise match outcomes.

EN
Elo Rating
TR
Elo Reytingi
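
The standard Elo update is a two-line formula: compute the expected score from the rating gap, then move both ratings toward the observed result:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update. score_a is 1 for an A win, 0 for a loss, 0.5 for a draw.
    k controls how far a single result moves the ratings."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

With equal ratings the expected score is 0.5, so a win moves each rating by k/2: `elo_update(1000, 1000, 1)` gives `(1016.0, 984.0)`.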
§10 Glossary

MBPP

A Google coding benchmark (Mostly Basic Python Problems) of nearly 1,000 basic Python problems.

EN
MBPP
TR
MBPP
§11 Glossary

Pairwise Comparison

An eval method that asks which of two models' answers to the same prompt is better.

EN
Pairwise Comparison
TR
İkili Karşılaştırma
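
Aggregating pairwise verdicts into a headline number is a simple tally; a common convention, assumed here, is to split ties evenly between the two models:

```python
from collections import Counter

def win_rates(verdicts):
    """Turn a list of pairwise verdicts ('A', 'B', or 'tie') into win rates,
    with ties counted as half a win for each side."""
    counts = Counter(verdicts)
    n = len(verdicts)
    rate_a = (counts["A"] + counts["tie"] / 2) / n
    return {"A": rate_a, "B": 1 - rate_a}
```

For example, `win_rates(["A", "A", "B", "tie"])` gives `{"A": 0.625, "B": 0.375}`.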
§12 Glossary

LLM-as-Judge

An evaluation method in which an LLM is used to judge another model's output.

EN
LLM-as-Judge
TR
Yargıç Olarak LLM
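
The pattern is a prompt template plus a strict output contract. A minimal sketch, where `judge_model` is a hypothetical callable (prompt string in, completion string out) standing in for any LLM API:

```python
JUDGE_PROMPT = (
    "You are an impartial judge. Compare the two answers to the question "
    "and reply with exactly 'A', 'B', or 'tie'.\n"
    "Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}"
)

def llm_judge(judge_model, question, answer_a, answer_b):
    """Ask a judge model for a pairwise verdict; reject anything off-format."""
    prompt = JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )
    verdict = judge_model(prompt).strip()
    return verdict if verdict in {"A", "B", "tie"} else "invalid"
```

Validating the verdict matters in practice: judge models drift off-format, and known biases (position, verbosity) are usually mitigated by swapping answer order across repeated calls.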
§13 Glossary

Red Teaming

The practice of probing an AI system's limits and weaknesses with adversarial methods.

EN
Red Teaming
TR
Red Teaming
§14 Glossary

HumanEval

An OpenAI coding benchmark of 164 hand-written problems that evaluates generated Python functions against unit tests.

EN
HumanEval
TR
HumanEval
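
The scoring mechanism behind HumanEval-style benchmarks is execution: run the generated code, then run the problem's unit tests against it. A bare-bones sketch (real harnesses sandbox and time-limit this step; never `exec` untrusted model output directly):

```python
def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Execute a candidate solution, then its unit tests, in a shared
    namespace. Any exception (including AssertionError) means failure."""
    namespace = {}
    try:
        exec(candidate_code, namespace)
        exec(test_code, namespace)
        return True
    except Exception:
        return False
```

Running many samples per problem through a check like this is what feeds the benchmark's pass@k statistic.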
§15 Glossary

Evaluation Loop

A feedback loop that continuously measures and refines an agent's output.

EN
Evaluation Loop
TR
Değerlendirme Döngüsü
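
The loop structure itself is generic: generate, score, refine, repeat until the score clears a threshold or the iteration budget runs out. A minimal sketch, where `generate`, `score`, and `refine` are hypothetical callables supplied by the caller:

```python
def evaluation_loop(generate, score, refine, prompt, threshold=0.9, max_iters=3):
    """Generate an output, then repeatedly score and refine it until it
    clears the threshold or the iteration budget is exhausted."""
    output = generate(prompt)
    for _ in range(max_iters):
        current = score(output)
        if current >= threshold:
            break
        # Refinement sees both the output and its score as feedback.
        output = refine(output, current)
    return output
```

The cap on iterations is the important design choice: without it, an agent whose refinements never improve the score would loop forever.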