
Are We on the Right Way to Assessing LLM-as-a-Judge?

December 17, 2025
作者: Yuanning Feng, Sinan Wang, Zhengxiang Cheng, Yao Wan, Dongping Chen
cs.AI

Abstract

LLM-as-a-Judge has been widely adopted as an evaluation method and serves as a supervised reward signal in model training. However, existing benchmarks for LLM-as-a-Judge rely mainly on human-annotated ground truth, which introduces human bias that undermines reliability assessment and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without requiring any human annotation. Inspired by the axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pairwise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics and their high correlation with supervised benchmarks such as LLMBar and RewardBench2, confirming Sage's reliability as an evaluation suite for the robustness and accuracy of LLM-as-a-Judge. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon we call situational preference, which explains why explicit rubrics or criteria help a model judge consistently across answer pairs. Further analysis shows that fine-tuning an LLM-as-a-Judge is a feasible way to boost performance, and that both panel-based judging and deep reasoning enhance judging consistency. We also find substantial inconsistency in human judgments, indicating that human annotation may not be a reliable gold standard.
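
To make the two consistency lenses concrete, here is a minimal sketch (not the authors' implementation) of how local self-consistency and global logical consistency could be computed from pairwise verdicts. It assumes a hypothetical callable `judge_prefers(question, a, b)` that returns True when the judge prefers answer `a` over answer `b`, with no ties; the function names and interfaces are illustrative assumptions, not part of Sage.

```python
from itertools import combinations, permutations

def local_self_consistency(judge_prefers, question, answer_pairs):
    """Fraction of answer pairs whose verdict is stable when the
    presentation order of the two answers is swapped."""
    stable = 0
    for a, b in answer_pairs:
        # Consistent if preferring a over b in one order implies
        # not preferring b over a in the swapped order.
        if judge_prefers(question, a, b) == (not judge_prefers(question, b, a)):
            stable += 1
    return stable / len(answer_pairs)

def global_logical_consistency(judge_prefers, question, answers):
    """Fraction of answer triples whose pairwise preferences are transitive."""
    triples = list(combinations(answers, 3))
    transitive = 0
    for x, y, z in triples:
        # Collect all six ordered pairwise verdicts within the triple.
        prefs = {(u, v): judge_prefers(question, u, v)
                 for u, v in permutations((x, y, z), 2)}
        # Transitivity: u > v and v > w must imply u > w, for every ordering.
        ok = all(
            not (prefs[(u, v)] and prefs[(v, w)]) or prefs[(u, w)]
            for u, v, w in permutations((x, y, z))
        )
        transitive += ok
    return transitive / len(triples)
```

Under this reading, local self-consistency penalizes order-dependent flips on a single pair, while global logical consistency penalizes preference cycles (e.g., A over B, B over C, yet C over A) across a full answer set, mirroring the rational-choice axioms the abstract cites.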