TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
September 25, 2025
Authors: Yidong Wang, Yunze Song, Tingyuan Zhu, Xuanwang Zhang, Zhuohao Yu, Hao Chen, Chiyu Song, Qiufeng Wang, Cunxiang Wang, Zhen Wu, Xinyu Dai, Yue Zhang, Wei Ye, Shikun Zhang
cs.AI
Abstract
The adoption of Large Language Models (LLMs) as automated evaluators
(LLM-as-a-judge) has revealed critical inconsistencies in current evaluation
frameworks. We identify two fundamental types of inconsistencies: (1)
Score-Comparison Inconsistency, where lower-rated responses outperform
higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity
Inconsistency, manifested through circular preference chains (A>B>C>A) and
equivalence contradictions (A=B=C≠A). We argue that these issues stem from
information loss in discrete rating systems and ambiguous tie judgments during
pairwise evaluation. We propose TrustJudge, a probabilistic framework that
addresses these limitations through two key innovations: 1)
distribution-sensitive scoring that computes continuous expectations from
discrete rating probabilities, preserving information entropy for more precise
scoring, and 2) likelihood-aware aggregation that resolves transitivity
violations using bidirectional preference probabilities or perplexity. We also
formalize the theoretical limitations of current LLM-as-a-judge frameworks and
demonstrate how TrustJudge's components overcome them. When evaluated with
Llama-3.1-70B-Instruct as the judge on our dataset, TrustJudge reduces
Score-Comparison Inconsistency by 8.43 percentage points (from 23.32% to
14.89%) and Pairwise Transitivity Inconsistency by 10.82 percentage points
(from 15.22% to 4.40%), while maintaining
higher evaluation accuracy. Our work provides the first systematic analysis of
evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both
theoretical insights and practical solutions for reliable automated assessment.
The framework demonstrates consistent improvements across various model
architectures and scales, enabling more trustworthy LLM evaluation without
requiring additional training or human annotations. The code is available at
https://github.com/TrustJudge/TrustJudge.
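
To make the distribution-sensitive scoring idea concrete, here is a minimal sketch, assuming the judge model exposes log-probabilities for the rating tokens (e.g. "1" through "5") at the position where the score is emitted. The function name `expected_score` and the example numbers are illustrative, not the paper's implementation.

```python
import math

def expected_score(rating_logprobs: dict[str, float]) -> float:
    """Continuous score as the expectation over discrete rating probabilities.

    `rating_logprobs` maps each rating token ("1".."5") to the judge's
    log-probability at the position where the score token is emitted.
    """
    # Softmax-renormalize over the rating tokens only (the full vocabulary
    # distribution also covers non-rating tokens we do not care about).
    max_lp = max(rating_logprobs.values())
    weights = {r: math.exp(lp - max_lp) for r, lp in rating_logprobs.items()}
    total = sum(weights.values())
    # The expectation preserves distributional information that a greedy
    # argmax decode into a single discrete rating would discard.
    return sum(int(r) * w / total for r, w in weights.items())

# Two responses a greedy judge would both score 4; the expectations differ,
# so the continuous scores can still order them consistently with a
# pairwise comparison.
print(round(expected_score({"3": -2.5, "4": -0.3, "5": -1.8}), 2))  # 4.08
print(round(expected_score({"3": -3.0, "4": -0.4, "5": -1.2}), 2))  # 4.25
```

The example illustrates the Score-Comparison Inconsistency at the root of the paper's first finding: two responses that collapse to the same discrete rating can still be separated once the judge's full rating distribution is kept.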
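Likewise, a minimal sketch of likelihood-aware aggregation using bidirectional preference probabilities (the perplexity variant mentioned in the abstract is not shown). It assumes a pairwise judge call that returns probabilities over the verdict tokens {"first", "second", "tie"}; the `JudgeFn` signature, `tie_margin` parameter, and mock judge are assumptions for this sketch.

```python
from typing import Callable, Dict

# A pairwise judge call: given (question, first_response, second_response),
# return the judge's probabilities over the verdict tokens
# {"first", "second", "tie"}. This signature is an assumption of the sketch.
JudgeFn = Callable[[str, str, str], Dict[str, float]]

def likelihood_aware_preference(
    judge: JudgeFn, question: str, resp_a: str, resp_b: str,
    tie_margin: float = 0.02,
) -> str:
    """Aggregate bidirectional preference probabilities into one verdict."""
    p_fwd = judge(question, resp_a, resp_b)  # A shown first
    p_rev = judge(question, resp_b, resp_a)  # B shown first
    # Average each response's win probability across both presentation
    # orders, which cancels position bias in the individual judgments.
    p_a = 0.5 * (p_fwd["first"] + p_rev["second"])
    p_b = 0.5 * (p_fwd["second"] + p_rev["first"])
    # Comparing continuous likelihoods instead of discrete verdicts avoids
    # the ambiguous ties that let circular chains like A>B>C>A arise.
    if abs(p_a - p_b) <= tie_margin:
        return "tie"
    return "A" if p_a > p_b else "B"

# Usage with a mock judge that slightly favors whichever response is longer.
def mock_judge(question: str, first: str, second: str) -> Dict[str, float]:
    bias = 0.1 if len(first) > len(second) else -0.1
    return {"first": 0.45 + bias, "second": 0.45 - bias, "tie": 0.10}

print(likelihood_aware_preference(
    mock_judge, "Explain TCP.", "short", "a longer answer"))  # prints "B"
```

Because every pairwise verdict is derived from the same continuous preference probabilities, rank aggregation over a set of responses stays transitive rather than depending on which discrete verdicts the judge happened to emit.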