LLM-as-Judge is an evaluation method in which one LLM scores another model's output, or makes a pairwise comparison between two candidate responses. LMSYS's MT-Bench work (2023) showed that strong judge models can agree with human preferences at rates comparable to human-human agreement, and the technique was widely adopted. Its main advantages are automation, scalability, and relatively low cost. Its typical biases — self-preference, position bias, and length preference — must be controlled, commonly via response position swapping, multiple judges, and rubric calibration.
MEVZU N°124 · ISTANBUL · YEAR I — VOL. III
Glossary · Intermediate · 2023
LLM-as-Judge
An evaluation method in which an LLM is used to judge another model's output.
- EN — English term
- LLM-as-Judge
- TR — Turkish term
- Yargıç Olarak LLM