TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
September 25, 2025
Authors: Yidong Wang, Yunze Song, Tingyuan Zhu, Xuanwang Zhang, Zhuohao Yu, Hao Chen, Chiyu Song, Qiufeng Wang, Cunxiang Wang, Zhen Wu, Xinyu Dai, Yue Zhang, Wei Ye, Shikun Zhang
cs.AI
Abstract
The adoption of Large Language Models (LLMs) as automated evaluators
(LLM-as-a-judge) has revealed critical inconsistencies in current evaluation
frameworks. We identify two fundamental types of inconsistencies: (1)
Score-Comparison Inconsistency, where lower-rated responses outperform
higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity
Inconsistency, manifested through circular preference chains (A>B>C>A) and
equivalence contradictions (A=B=C≠A). We argue that these issues stem from
information loss in discrete rating systems and ambiguous tie judgments during
pairwise evaluation. We propose TrustJudge, a probabilistic framework that
addresses these limitations through two key innovations: (1)
distribution-sensitive scoring that computes continuous expectations from
discrete rating probabilities, preserving information entropy for more precise
scoring, and (2) likelihood-aware aggregation that resolves transitivity
violations using bidirectional preference probabilities or perplexity. We also
formalize the theoretical limitations of current LLM-as-a-judge frameworks and
demonstrate how TrustJudge's components overcome them. When evaluated with
Llama-3.1-70B-Instruct as the judge on our dataset, TrustJudge reduces
Score-Comparison Inconsistency by 8.43 percentage points (from 23.32% to 14.89%)
and Pairwise Transitivity Inconsistency by 10.82 percentage points (from 15.22%
to 4.40%), while maintaining
higher evaluation accuracy. Our work provides the first systematic analysis of
evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both
theoretical insights and practical solutions for reliable automated assessment.
The framework demonstrates consistent improvements across various model
architectures and scales, enabling more trustworthy LLM evaluation without
requiring additional training or human annotations. The code is available at
https://github.com/TrustJudge/TrustJudge.
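As a concrete illustration of the two inconsistency types named in the abstract, here is a minimal Python sketch of how they can be detected. The judge interfaces `point_score` and `pairwise` are hypothetical placeholders, not the paper's API: `point_score` returns a discrete rating and `pairwise` returns one of `">"`, `"<"`, or `"="`.

```python
from itertools import permutations

def score_comparison_inconsistent(a, b, point_score, pairwise):
    """True if the pointwise ratings and the pairwise verdict disagree,
    e.g. the lower-rated response wins the head-to-head comparison."""
    sa, sb = point_score(a), point_score(b)
    verdict = pairwise(a, b)
    if sa > sb:
        return verdict != ">"
    if sa < sb:
        return verdict != "<"
    return verdict != "="

def has_preference_cycle(responses, pairwise):
    """True if some triple forms a circular chain such as A>B, B>C, C>A."""
    return any(
        pairwise(a, b) == ">" and pairwise(b, c) == ">" and pairwise(c, a) == ">"
        for a, b, c in permutations(responses, 3)
    )
```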
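The distribution-sensitive scoring idea can be sketched in a few lines, under our reading of the abstract: rather than keeping only the argmax rating, renormalize the probabilities the judge assigns to each discrete rating token and take the continuous expectation. How the per-rating log-probabilities are obtained is API-specific and assumed here; this is an illustrative sketch, not the paper's implementation.

```python
import math

def expected_score(rating_logprobs):
    """Return the expectation over a judge's rating distribution.
    `rating_logprobs` maps each discrete rating (e.g. 1..5) to the
    log-probability the judge assigns to that rating token."""
    probs = {r: math.exp(lp) for r, lp in rating_logprobs.items()}
    z = sum(probs.values())  # renormalize over the rating tokens only
    return sum(r * p for r, p in probs.items()) / z

# A judge split 40/60 between ratings 3 and 4 yields ~3.6 rather than a
# hard 4, preserving information a discrete score would discard.
print(expected_score({3: math.log(0.4), 4: math.log(0.6)}))  # ≈ 3.6
```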
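Finally, a sketch of likelihood-aware aggregation for pairwise judgments. The abstract states that transitivity violations are resolved via bidirectional preference probabilities or perplexity; the simple averaging rule below is our illustrative assumption, not the paper's exact formulation. `p_ab` is the judge's probability that the first-shown response wins under order (A, B), and `p_ba` under order (B, A).

```python
def aggregate_preference(p_ab, p_ba):
    """Combine both presentation orders into one continuous preference
    for A. Averaging the two directions (an assumed rule) yields a soft
    score that breaks ambiguous ties and avoids the hard >/</= labels
    that can produce circular chains."""
    p_a = (p_ab + (1.0 - p_ba)) / 2.0
    if p_a > 0.5:
        return "A"
    if p_a < 0.5:
        return "B"
    return "tie"

# Example: a near-tie that flips with presentation order resolves cleanly.
print(aggregate_preference(0.55, 0.60))  # p_a = 0.475 -> "B"
```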