
Are We on the Right Way to Assessing LLM-as-a-Judge?

December 17, 2025
作者: Yuanning Feng, Sinan Wang, Zhengxiang Cheng, Yao Wan, Dongping Chen
cs.AI

Abstract

LLM-as-a-Judge has been widely adopted as an evaluation method and as a source of supervised reward signals in model training. However, existing benchmarks for LLM-as-a-Judge rely mainly on human-annotated ground truth, which introduces human bias that undermines reliability assessment and imposes scalability constraints. To overcome these limitations, we introduce Sage, a novel evaluation suite that assesses the quality of LLM judges without requiring any human annotation. Inspired by the axioms of rational choice theory, Sage introduces two new lenses for measuring LLM-as-a-Judge: local self-consistency (pairwise preference stability) and global logical consistency (transitivity across a full set of preferences). We curate a dataset of 650 questions by combining structured benchmark problems with real-world user queries. Our experiments demonstrate both the stability of our metrics and their high correlation with supervised benchmarks such as LLMBar and RewardBench2, confirming Sage's reliability as an evaluation suite for the robustness and accuracy of LLM-as-a-Judge. Based on Sage, we reveal that current state-of-the-art LLMs exhibit significant reliability problems when acting as judges in both scoring and pairwise settings; even the top-performing models, Gemini-2.5-Pro and GPT-5, fail to maintain consistent preferences in nearly a quarter of difficult cases. We attribute this to a new phenomenon called situational preference, which explains why explicit rubrics or criteria can help the model judge consistently across answer pairs. Further analysis shows that a fine-tuned LLM-as-a-Judge is a feasible way to boost performance, and that panel-based judging as well as deep reasoning can enhance judging consistency. We also find substantial inconsistency in human judgments, which indicates that human annotation may not be a reliable gold standard.
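To make the two lenses above concrete, the following is a minimal Python sketch (not the authors' released code) of how local self-consistency and global transitivity could be computed for a single question. The `judge_prefers` callable is a hypothetical stand-in for any LLM judge that, given a question and two candidate answers, returns the index (0 or 1) of the preferred answer; Sage's actual metrics and prompting setup may differ.

```python
# Minimal sketch of the two consistency lenses, assuming a hypothetical
# judge_prefers(question, answer_a, answer_b) -> 0 or 1 (index of the winner).

from itertools import combinations, permutations
from typing import Callable, Sequence


def local_self_consistency(
    question: str,
    answers: Sequence[str],
    judge_prefers: Callable[[str, str, str], int],
) -> float:
    """Fraction of answer pairs whose winner is stable when the presentation order is swapped."""
    pairs = list(combinations(range(len(answers)), 2))
    stable = 0
    for i, j in pairs:
        forward = judge_prefers(question, answers[i], answers[j])   # 0 -> i wins, 1 -> j wins
        backward = judge_prefers(question, answers[j], answers[i])  # 0 -> j wins, 1 -> i wins
        if forward != backward:  # same winner regardless of order
            stable += 1
    return stable / len(pairs) if pairs else 1.0


def global_transitivity(
    question: str,
    answers: Sequence[str],
    judge_prefers: Callable[[str, str, str], int],
) -> float:
    """Fraction of answer triples with no preference cycle (a > b > c > a)."""
    n = len(answers)
    # Build a tournament of pairwise winners (one judge call per pair).
    beats = {}
    for i, j in combinations(range(n), 2):
        winner = i if judge_prefers(question, answers[i], answers[j]) == 0 else j
        beats[(i, j)] = winner
        beats[(j, i)] = winner
    triples = list(combinations(range(n), 3))
    acyclic = 0
    for a, b, c in triples:
        # A triple is cyclic iff some ordering x > y > z > x holds.
        cyclic = any(
            beats[(x, y)] == x and beats[(y, z)] == y and beats[(z, x)] == z
            for x, y, z in permutations((a, b, c))
        )
        if not cyclic:
            acyclic += 1
    return acyclic / len(triples) if triples else 1.0
```

As a usage note, both functions return a value in [0, 1]; a perfectly rational judge scores 1.0 on each, while order sensitivity lowers the first metric and preference cycles lower the second.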