BLEU (Bilingual Evaluation Understudy), introduced by Papineni et al. at IBM in 2002, is a classic machine-translation metric based on n-gram overlap with reference translations. For more than two decades it served as the standard yardstick for translation quality: simple, deterministic, and fast, which made it indispensable in practice. The critique is equally well-known: it measures surface-level lexical overlap, captures meaning poorly, and penalizes paraphrases. Modern evaluation has shifted toward COMET, BERTScore, and LLM-as-Judge approaches.
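The core mechanics can be sketched in a few lines: clipped n-gram precisions combined as a geometric mean, scaled by a brevity penalty. This is an illustrative sketch assuming a single reference, uniform weights over 1- to 4-grams, and no smoothing; it is not the exact Papineni et al. implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    # Assumption: one reference, whitespace tokenization, no smoothing.
    cand, ref = candidate.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clipping: each candidate n-gram counts at most as often
        # as it appears in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        if matches == 0:
            return 0.0  # without smoothing, any zero precision zeroes BLEU
        log_prec_sum += math.log(matches / total)
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * math.exp(log_prec_sum / max_n)

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

Note how a perfect match scores 1.0, while a candidate with no matching 4-gram scores 0.0 outright; this all-or-nothing behavior at the n-gram level is exactly the brittleness toward paraphrases mentioned above.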
MEVZU N°124 · ISTANBUL · Year I, Vol. III
Glossary · Intermediate · 2002
BLEU
A classic machine-translation metric based on n-gram overlap with reference translations.
- EN: BLEU
- TR: BLEU