TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them

Abstract

The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistency: (1) Score-Comparison Inconsistency, where lower-rated responses outperform higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity Inconsistency, manifested through circular preference chains (A > B > C > A) and equivalence contradictions (A = B = C ≠ A). We argue that these issues stem from information loss in discrete rating systems and from ambiguous tie judgments during pairwise evaluation. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations: (1) distribution-sensitive scoring, which computes continuous expectations from discrete rating probabilities and thereby preserves information entropy for more precise scoring, and (2) likelihood-aware aggregation, which resolves transitivity violations using bidirectional preference probabilities or perplexity. We also formalize the theoretical limitations of current LLM-as-a-judge frameworks and demonstrate how TrustJudge's components overcome them. When evaluated on our dataset with Llama-3.1-70B-Instruct as the judge, TrustJudge reduces Score-Comparison Inconsistency by 8.43 percentage points (from 23.32% to 14.89%) and Pairwise Transitivity Inconsistency by 10.82 percentage points (from 15.22% to 4.40%), while maintaining higher evaluation accuracy. Our work provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment. The framework demonstrates consistent improvements across various model architectures and scales, enabling more trustworthy LLM evaluation without requiring additional training or human annotations. The code is available at https://github.com/TrustJudge/TrustJudge.
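The two ideas named in the abstract can be illustrated with a small sketch. It is not taken from the TrustJudge repository: the function names, the 1-to-5 rating scale, the example logits, and the reading of "bidirectional" as averaging preference probabilities over both presentation orders are all assumptions made for illustration. The first part shows how two responses with the same most-likely rating can still receive different expected scores; the second shows how two single-pass pairwise verdicts that disagree can be merged into one decision.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax_score(rating_logits):
    """Discrete score: the single most likely rating (hypothetical baseline)."""
    return max(rating_logits, key=rating_logits.get)


def expected_score(rating_logits):
    """Continuous score: expectation of the rating under the judge's
    probability distribution over rating tokens (assumed formulation)."""
    ratings = sorted(rating_logits)
    probs = softmax([rating_logits[r] for r in ratings])
    return sum(r * p for r, p in zip(ratings, probs))


def aggregate_preference(p_order_ab, p_order_ba):
    """Average the preference probabilities obtained from the two
    presentation orders and return the most likely outcome.
    Both inputs map 'A', 'tie', 'B' to probabilities, already expressed
    in terms of response identity rather than position."""
    combined = {k: (p_order_ab[k] + p_order_ba[k]) / 2 for k in ("A", "tie", "B")}
    return max(combined, key=combined.get), combined


# Made-up judge logits over ratings 1..5 for two responses.
response_x = {1: -5.0, 2: -3.0, 3: 0.0, 4: 2.0, 5: 1.5}
response_y = {1: -5.0, 2: -3.0, 3: 1.5, 4: 2.0, 5: -1.0}

# Made-up preference probabilities for the same pair in both orders.
order_ab = {"A": 0.50, "tie": 0.15, "B": 0.35}
order_ba = {"A": 0.30, "tie": 0.20, "B": 0.50}
winner, combined = aggregate_preference(order_ab, order_ba)
```

In this toy data both responses have 4 as their most likely rating, so a discrete score cannot separate them, while the expected score ranks `response_x` above `response_y`. Likewise, the two single-order verdicts disagree (A in one order, B in the other), and the averaged distribution yields a single answer instead of a contradiction.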