Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory
January 31, 2026
Authors: Junhyuk Choi, Sohhyung Park, Chanhee Cho, Hyeonchu Park, Bugeun Kim
cs.AI
Abstract
While LLM-as-a-Judge is widely used in automated evaluation, existing validation practices primarily operate at the level of observed outputs, offering limited insight into whether LLM judges themselves function as stable and reliable measurement instruments. To address this limitation, we introduce a two-phase diagnostic framework for assessing the reliability of LLM-as-a-Judge, grounded in Item Response Theory (IRT). The framework adopts the Graded Response Model (GRM) of IRT and formalizes reliability along two complementary dimensions: (1) intrinsic consistency, defined as the stability of measurement behavior under prompt variations, and (2) human alignment, capturing the correspondence between model evaluations and human quality assessments. We empirically examine diverse LLM judges with this framework and show that leveraging IRT-GRM yields interpretable signals for systematically diagnosing judgments. These signals provide practical guidance for verifying the reliability of LLM-as-a-Judge and identifying potential causes of unreliability.
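For reference, the Graded Response Model named in the abstract is, in its standard form (Samejima's GRM), specified by the equations below. This is a minimal sketch of that standard parameterization; the reading of θ as the latent quality of a response, and of a_i and b_{ik} as an item's discrimination and category thresholds, is an assumption about how the paper applies the model, not something stated in the abstract.

For graded response categories k = 1, ..., K_i of item i:

P^*_{ik}(\theta) = \Pr(X_i \ge k \mid \theta) = \frac{1}{1 + \exp\{-a_i(\theta - b_{ik})\}}, \quad k = 2, \dots, K_i, \qquad P^*_{i1}(\theta) = 1,

P_{ik}(\theta) = \Pr(X_i = k \mid \theta) = P^*_{ik}(\theta) - P^*_{i,k+1}(\theta), \qquad \text{with } P^*_{i,K_i+1}(\theta) = 0.

Here \theta is the latent trait, a_i the discrimination parameter, and b_{i2} \le \dots \le b_{iK_i} the ordered category thresholds; the category probabilities are differences of adjacent cumulative logistic curves.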