MEVZU N°128 · ISTANBUL

MEVZU N° TAG / VOL. 059

#eval


§03 Wiki · 15 entries
§01 Glossary

ROUGE

A classic summarization metric (Recall-Oriented Understudy for Gisting Evaluation) based on n-gram and sequence overlap.

EN
ROUGE
TR
ROUGE
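
The overlap idea can be sketched as a minimal ROUGE-1 recall score. This is a simplified illustration only; real implementations (e.g. the `rouge-score` package) add stemming, ROUGE-2, and the longest-common-subsequence ROUGE-L variant:

```python
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """ROUGE-1 recall: fraction of reference unigrams found in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not ref:
        return 0.0
    # Clipped overlap: each reference word counts at most as often as it appears.
    overlap = sum(min(cand[w], count) for w, count in ref.items())
    return overlap / sum(ref.values())
```

For example, `rouge1_recall("the cat sat", "the cat sat on the mat")` recovers 3 of the 6 reference unigrams, giving 0.5.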
§02 Glossary

LMSYS Chatbot Arena

A public eval platform that ranks blind pairs of models by human preference.

EN
LMSYS Chatbot Arena
TR
LMSYS Chatbot Arena
§03 Glossary

Eval

A test suite that scores a model or system against predefined criteria.

EN
Eval
TR
Eval — Değerlendirme
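
In its simplest form, an eval is a set of (input, criterion) pairs and an aggregate score. A minimal sketch, where `model` is any hypothetical callable from prompt to output:

```python
def run_eval(model, cases):
    """Score a model callable against predefined pass/fail criteria.

    cases: list of (prompt, check) pairs, where check(output) returns bool.
    Returns the fraction of cases passed.
    """
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)
```

Usage: `run_eval(str.upper, [("hi", lambda out: out == "HI"), ("bye", lambda out: out == "nope")])` passes one of two cases and returns 0.5.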
§04 Glossary

Benchmark

A standardized test set and evaluation protocol used to compare models.

EN
Benchmark
TR
Kıyaslama (Benchmark)
§05 Glossary

Hallucination Rate

A metric that measures how often a model fabricates or generates incorrect information.

EN
Hallucination Rate
TR
Halüsinasyon Oranı
§06 Glossary

BLEU

A classic machine-translation metric (Bilingual Evaluation Understudy) based on n-gram overlap with reference translations.

EN
BLEU
TR
BLEU
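
The core of BLEU is clipped n-gram precision. The sketch below shows only the unigram case; full BLEU combines precisions up to 4-grams with a geometric mean and a brevity penalty (libraries such as sacreBLEU handle all of this):

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: candidate words credited at most as
    often as they occur in the reference (so repetition cannot inflate it)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    if not cand:
        return 0.0
    clipped = sum(min(count, ref[w]) for w, count in cand.items())
    return clipped / sum(cand.values())
```

Clipping is what penalizes degenerate outputs: `unigram_precision("the the the", "the cat")` is 1/3, not 1, because "the" is credited only once.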
§07 Glossary

MMLU

A broad multiple-choice benchmark (Massive Multitask Language Understanding) that tests knowledge and reasoning across 57 subjects.

EN
MMLU
TR
MMLU
§08 Glossary

GSM8K

A benchmark that measures step-by-step reasoning with grade-school math problems.

EN
GSM8K
TR
GSM8K
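
GSM8K reference solutions end with the marker `#### <answer>`, so scoring reduces to extracting that final number and comparing it with the model's. A minimal extraction sketch:

```python
import re

def extract_gsm8k_answer(solution: str) -> str:
    """Pull the final numeric answer out of a GSM8K-style solution,
    which by convention ends with '#### <answer>'."""
    match = re.search(r"####\s*([-\d,.]+)", solution)
    # Strip thousands separators so "1,200" and "1200" compare equal.
    return match.group(1).replace(",", "") if match else ""
```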
§09 Glossary

Elo Rating

A rating system from chess that derives relative skill scores from pairwise match outcomes.

EN
Elo Rating
TR
Elo Reytingi
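
The standard Elo update is a two-line formula: compute the expected score from the rating gap, then move both ratings toward the observed result:

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update. score_a is 1 for an A win, 0 for a loss, 0.5 for a draw.
    k controls how far a single result moves the ratings."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

With equal ratings the expected score is 0.5, so a win moves each rating by k/2: `elo_update(1000, 1000, 1)` gives `(1016.0, 984.0)`.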
§10 Glossary

MBPP

A Google coding benchmark (Mostly Basic Python Problems) of nearly 1,000 basic Python problems.

EN
MBPP
TR
MBPP
§11 Glossary

Pairwise Comparison

An eval method that asks which of two models' answers to the same prompt is better.

EN
Pairwise Comparison
TR
İkili Karşılaştırma
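
Aggregating pairwise verdicts into a headline number is a simple tally; a common convention, assumed here, is to split ties evenly between the two models:

```python
from collections import Counter

def win_rates(verdicts):
    """Turn a list of pairwise verdicts ('A', 'B', or 'tie') into win rates,
    with ties counted as half a win for each side."""
    counts = Counter(verdicts)
    n = len(verdicts)
    rate_a = (counts["A"] + counts["tie"] / 2) / n
    return {"A": rate_a, "B": 1 - rate_a}
```

For example, `win_rates(["A", "A", "B", "tie"])` gives `{"A": 0.625, "B": 0.375}`.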
§12 Glossary

LLM-as-Judge

An evaluation method in which an LLM is used to judge another model's output.

EN
LLM-as-Judge
TR
Yargıç Olarak LLM
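
The pattern is a prompt template plus a strict output contract. A minimal sketch, where `judge_model` is a hypothetical callable (prompt string in, completion string out) standing in for any LLM API:

```python
JUDGE_PROMPT = (
    "You are an impartial judge. Compare the two answers to the question "
    "and reply with exactly 'A', 'B', or 'tie'.\n"
    "Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}"
)

def llm_judge(judge_model, question, answer_a, answer_b):
    """Ask a judge model for a pairwise verdict; reject anything off-format."""
    prompt = JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )
    verdict = judge_model(prompt).strip()
    return verdict if verdict in {"A", "B", "tie"} else "invalid"
```

Validating the verdict matters in practice: judge models drift off-format, and known biases (position, verbosity) are usually mitigated by swapping answer order across repeated calls.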
§13 Glossary

Red Teaming

The practice of probing an AI system's limits and weaknesses with adversarial methods.

EN
Red Teaming
TR
Red Teaming
§14 Glossary

HumanEval

An OpenAI coding benchmark of 164 hand-written problems that evaluates generated Python functions against unit tests.

EN
HumanEval
TR
HumanEval
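
The scoring mechanism behind HumanEval-style benchmarks is execution: run the generated code, then run the problem's unit tests against it. A bare-bones sketch (real harnesses sandbox and time-limit this step; never `exec` untrusted model output directly):

```python
def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Execute a candidate solution, then its unit tests, in a shared
    namespace. Any exception (including AssertionError) means failure."""
    namespace = {}
    try:
        exec(candidate_code, namespace)
        exec(test_code, namespace)
        return True
    except Exception:
        return False
```

Running many samples per problem through a check like this is what feeds the benchmark's pass@k statistic.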
§15 Glossary

Evaluation Loop

A feedback loop that continuously measures and refines an agent's output.

EN
Evaluation Loop
TR
Değerlendirme Döngüsü
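
The loop structure itself is generic: generate, score, refine, repeat until the score clears a threshold or the iteration budget runs out. A minimal sketch, where `generate`, `score`, and `refine` are hypothetical callables supplied by the caller:

```python
def evaluation_loop(generate, score, refine, prompt, threshold=0.9, max_iters=3):
    """Generate an output, then repeatedly score and refine it until it
    clears the threshold or the iteration budget is exhausted."""
    output = generate(prompt)
    for _ in range(max_iters):
        current = score(output)
        if current >= threshold:
            break
        # Refinement sees both the output and its score as feedback.
        output = refine(output, current)
    return output
```

The cap on iterations is the important design choice: without it, an agent whose refinements never improve the score would loop forever.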