When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity
September 24, 2025
Authors: Benjamin Feuer, Chiung-Yi Tseng, Astitwa Sarthak Lathe, Oussama Elachqar, John P Dickerson
cs.AI
Abstract
LLM-judged benchmarks are increasingly used to evaluate complex model
behaviors, yet their design introduces failure modes absent in conventional
ground-truth based benchmarks. We argue that without tight objectives and
verifiable constructions, such benchmarks can produce high-confidence
rankings that are in fact largely noise. We introduce two mechanisms to
diagnose these issues. Schematic adherence quantifies how much of a judge's
overall verdict is explained by the explicit evaluation schema, revealing
unexplained variance when judges deviate from their own rubric. Psychometric
validity aggregates internal consistency and discriminant validity signals to
quantify irreducible uncertainty in any benchmarking run. Applying these tools
to Arena-Hard Auto, we find severe schema incoherence and factor collapse
across popular judges: for example, unexplained variance exceeding 90 percent
for DeepSeek-R1-32B and factor correlations above 0.93 for most criteria. We
also show that the ELO-style aggregation used by Arena-Hard Auto collapses and
masks genuine ranking uncertainty. Our results highlight design failures that
undermine validity and offer actionable principles for building better-scoped,
reliability-aware LLM-judged benchmarks. We release our code at
https://anonymous.4open.science/r/judgment-to-noise-947D/README.md
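The two diagnostics can be illustrated concretely. A minimal sketch (not the authors' released code, on synthetic data with assumed variable names): schematic adherence is measured as the share of variance in the overall verdict left unexplained by a linear fit on the rubric criteria, and factor collapse shows up as near-1 correlations between criteria.

```python
# Illustrative sketch on synthetic judge scores; all names and data here
# are assumptions, not the paper's implementation.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic judge output: 200 responses scored on 4 rubric criteria plus
# one overall verdict. The criteria are near-duplicates of one latent
# factor (mimicking factor collapse), while the verdict is only loosely
# tied to the rubric (mimicking schema incoherence).
n = 200
latent = rng.normal(size=n)
criteria = np.column_stack(
    [latent + 0.1 * rng.normal(size=n) for _ in range(4)]
)
overall = latent + 1.5 * rng.normal(size=n)

def unexplained_variance(criteria, overall):
    """1 - R^2 of regressing the overall verdict on the rubric criteria:
    the share of the verdict the explicit schema fails to explain."""
    X = np.column_stack([np.ones(len(overall)), criteria])
    beta, *_ = np.linalg.lstsq(X, overall, rcond=None)
    resid = overall - X @ beta
    return resid.var() / overall.var()

def max_factor_correlation(criteria):
    """Largest off-diagonal correlation between criteria; values near 1
    suggest the criteria measure one factor rather than several."""
    corr = np.corrcoef(criteria, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return off_diag.max()

print(f"unexplained variance: {unexplained_variance(criteria, overall):.2f}")
print(f"max factor correlation: {max_factor_correlation(criteria):.2f}")
```

On this synthetic data the rubric explains little of the verdict while the criteria correlate almost perfectly, the same signature the paper reports for judges on Arena-Hard Auto.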