Pairwise Comparison

An evaluation method that asks which of two models' answers to the same prompt is better.

EN — English term: Pairwise Comparison
TR — Turkish term: İkili Karşılaştırma

Pairwise comparison is an evaluation method that places two models' (or two versions') answers to the same prompt side by side and asks which is better. It is generally more reliable than absolute scoring because both human raters and LLM-as-Judge setups tend to be more consistent in relative judgments than in assigning absolute scores. Chatbot Arena scaled this idea: users rate random, blind pairings of model answers, and the votes are aggregated into an Elo Rating leaderboard. On production teams it is commonly the engine behind A/B tests and version-promotion decisions.
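
As an illustration of how pairwise votes can be turned into a ranking, here is a minimal sketch of the classic Elo update in Python. It is not Chatbot Arena's actual pipeline; the model names, the starting rating of 1000, and the K-factor of 32 are assumptions chosen for the example.

```python
from collections import defaultdict


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings, model_a, model_b, outcome, k=32.0):
    """Apply one pairwise vote in place.

    outcome is from A's point of view: 1.0 = A wins, 0.0 = B wins, 0.5 = tie.
    k (assumed 32 here) controls how far a single vote moves the ratings.
    """
    e_a = expected_score(ratings[model_a], ratings[model_b])
    delta = k * (outcome - e_a)
    ratings[model_a] += delta
    ratings[model_b] -= delta  # zero-sum: B loses exactly what A gains


def rate(votes, initial=1000.0, k=32.0):
    """Fold a sequence of (model_a, model_b, outcome) votes into ratings."""
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, outcome in votes:
        update_elo(ratings, model_a, model_b, outcome, k)
    return dict(ratings)


# Hypothetical votes: model-x wins twice, then the pair ties.
votes = [
    ("model-x", "model-y", 1.0),
    ("model-x", "model-y", 1.0),
    ("model-y", "model-x", 0.5),
]
ratings = rate(votes)
for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

Because sequential Elo updates depend on the order in which votes arrive, a real leaderboard would typically shuffle or bootstrap the votes, or fit all of them jointly, before reporting a ranking.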