TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
September 25, 2025
Authors: Yidong Wang, Yunze Song, Tingyuan Zhu, Xuanwang Zhang, Zhuohao Yu, Hao Chen, Chiyu Song, Qiufeng Wang, Cunxiang Wang, Zhen Wu, Xinyu Dai, Yue Zhang, Wei Ye, Shikun Zhang
cs.AI
Abstract
The adoption of Large Language Models (LLMs) as automated evaluators
(LLM-as-a-judge) has revealed critical inconsistencies in current evaluation
frameworks. We identify two fundamental types of inconsistencies: (1)
Score-Comparison Inconsistency, where lower-rated responses outperform
higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity
Inconsistency, manifested through circular preference chains (A>B>C>A) and
equivalence contradictions (A=B=C≠A). We argue that these issues stem from
information loss in discrete rating systems and ambiguous tie judgments during
pairwise evaluation. We propose TrustJudge, a probabilistic framework that
addresses these limitations through two key innovations: 1)
distribution-sensitive scoring that computes continuous expectations from
discrete rating probabilities, preserving information entropy for more precise
scoring, and 2) likelihood-aware aggregation that resolves transitivity
violations using bidirectional preference probabilities or perplexity. We also
formalize the theoretical limitations of current LLM-as-a-judge frameworks and
demonstrate how TrustJudge's components overcome them. When evaluated with
Llama-3.1-70B-Instruct as the judge on our dataset, TrustJudge reduces
Score-Comparison Inconsistency by 8.43 percentage points (from 23.32% to
14.89%) and Pairwise Transitivity Inconsistency by 10.82 percentage points
(from 15.22% to 4.40%), while maintaining
higher evaluation accuracy. Our work provides the first systematic analysis of
evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both
theoretical insights and practical solutions for reliable automated assessment.
The framework demonstrates consistent improvements across various model
architectures and scales, enabling more trustworthy LLM evaluation without
requiring additional training or human annotations. The code is available at
https://github.com/TrustJudge/TrustJudge.
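
To make the distribution-sensitive scoring idea concrete, here is a minimal sketch, assuming the judge model exposes log-probabilities for the rating tokens (e.g. "1" through "5") at the position where the score is emitted. The function name `expected_score` and the example numbers are illustrative, not the paper's implementation.

```python
import math

def expected_score(rating_logprobs: dict[str, float]) -> float:
    """Continuous score as the expectation over discrete rating probabilities.

    `rating_logprobs` maps each rating token ("1".."5") to the judge's
    log-probability at the position where the score token is emitted.
    """
    # Softmax-renormalize over the rating tokens only (the full vocabulary
    # distribution also covers non-rating tokens we do not care about).
    max_lp = max(rating_logprobs.values())
    weights = {r: math.exp(lp - max_lp) for r, lp in rating_logprobs.items()}
    total = sum(weights.values())
    # The expectation preserves distributional information that a greedy
    # argmax decode into a single discrete rating would discard.
    return sum(int(r) * w / total for r, w in weights.items())

# Two responses a greedy judge would both score 4; the expectations differ,
# so the continuous scores can still order them consistently with a
# pairwise comparison.
print(round(expected_score({"3": -2.5, "4": -0.3, "5": -1.8}), 2))  # 4.08
print(round(expected_score({"3": -3.0, "4": -0.4, "5": -1.2}), 2))  # 4.25
```

The example illustrates the Score-Comparison Inconsistency at the root of the paper's first finding: two responses that collapse to the same discrete rating can still be separated once the judge's full rating distribution is kept.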
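Likewise, a minimal sketch of likelihood-aware aggregation using bidirectional preference probabilities (the perplexity variant mentioned in the abstract is not shown). It assumes a pairwise judge call that returns probabilities over the verdict tokens {"first", "second", "tie"}; the `JudgeFn` signature, `tie_margin` parameter, and mock judge are assumptions for this sketch.

```python
from typing import Callable, Dict

# A pairwise judge call: given (question, first_response, second_response),
# return the judge's probabilities over the verdict tokens
# {"first", "second", "tie"}. This signature is an assumption of the sketch.
JudgeFn = Callable[[str, str, str], Dict[str, float]]

def likelihood_aware_preference(
    judge: JudgeFn, question: str, resp_a: str, resp_b: str,
    tie_margin: float = 0.02,
) -> str:
    """Aggregate bidirectional preference probabilities into one verdict."""
    p_fwd = judge(question, resp_a, resp_b)  # A shown first
    p_rev = judge(question, resp_b, resp_a)  # B shown first
    # Average each response's win probability across both presentation
    # orders, which cancels position bias in the individual judgments.
    p_a = 0.5 * (p_fwd["first"] + p_rev["second"])
    p_b = 0.5 * (p_fwd["second"] + p_rev["first"])
    # Comparing continuous likelihoods instead of discrete verdicts avoids
    # the ambiguous ties that let circular chains like A>B>C>A arise.
    if abs(p_a - p_b) <= tie_margin:
        return "tie"
    return "A" if p_a > p_b else "B"

# Usage with a mock judge that slightly favors whichever response is longer.
def mock_judge(question: str, first: str, second: str) -> Dict[str, float]:
    bias = 0.1 if len(first) > len(second) else -0.1
    return {"first": 0.45 + bias, "second": 0.45 - bias, "tie": 0.10}

print(likelihood_aware_preference(
    mock_judge, "Explain TCP.", "short", "a longer answer"))  # prints "B"
```

Because every pairwise verdict is derived from the same continuous preference probabilities, rank aggregation over a set of responses stays transitive rather than depending on which discrete verdicts the judge happened to emit.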