
When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity

September 24, 2025
Authors: Benjamin Feuer, Chiung-Yi Tseng, Astitwa Sarthak Lathe, Oussama Elachqar, John P Dickerson
cs.AI

Abstract

LLM-judged benchmarks are increasingly used to evaluate complex model behaviors, yet their design introduces failure modes absent in conventional ground-truth-based benchmarks. We argue that without tight objectives and verifiable constructions, benchmarks can produce high-confidence rankings that are in fact largely noise. We introduce two mechanisms to diagnose these issues. Schematic adherence quantifies how much of a judge's overall verdict is explained by the explicit evaluation schema, revealing unexplained variance when judges deviate from their own rubric. Psychometric validity aggregates internal-consistency and discriminant-validity signals to quantify the irreducible uncertainty in any benchmarking run. Applying these tools to Arena-Hard Auto, we find severe schema incoherence and factor collapse across popular judges: for example, unexplained variance exceeding 90 percent for DeepSeek-R1-32B and factor correlations above 0.93 for most criteria. We also show that the ELO-style aggregation used by Arena-Hard Auto collapses and masks genuine ranking uncertainty. Our results highlight design failures that undermine validity and offer actionable principles for building better-scoped, reliability-aware LLM-judged benchmarks. We release our code at https://anonymous.4open.science/r/judgment-to-noise-947D/README.md
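
For readers who want to experiment with the two diagnostics described above, here is a minimal sketch of one plausible way to estimate them from a matrix of per-criterion judge scores: the share of the overall verdict left unexplained by a least-squares fit on the rubric criteria, an internal-consistency estimate (Cronbach's alpha), and the largest pairwise criterion correlation as a factor-collapse signal. The function names, the linear-regression formulation, and the toy data are illustrative assumptions, not the paper's exact estimators.

```python
# Hedged sketch of schema-adherence and reliability diagnostics for an LLM judge.
# Assumes per-criterion rubric scores and an overall verdict per judged response;
# the regression formulation and toy data below are illustrative, not the paper's method.
import numpy as np


def unexplained_variance(criteria: np.ndarray, overall: np.ndarray) -> float:
    """Fit overall ~ criteria by least squares and return 1 - R^2,
    i.e. the share of the verdict not explained by the explicit schema."""
    X = np.column_stack([np.ones(len(overall)), criteria])  # add intercept column
    coef, *_ = np.linalg.lstsq(X, overall, rcond=None)
    resid = overall - X @ coef
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((overall - overall.mean()) ** 2))
    return ss_res / ss_tot if ss_tot > 0 else 0.0


def cronbach_alpha(criteria: np.ndarray) -> float:
    """Internal-consistency estimate over the criterion columns."""
    k = criteria.shape[1]
    item_vars = criteria.var(axis=0, ddof=1).sum()
    total_var = criteria.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_vars / total_var)


def max_pairwise_correlation(criteria: np.ndarray) -> float:
    """Largest off-diagonal correlation between criteria; values near 1
    suggest the rubric factors have collapsed onto a single dimension."""
    corr = np.corrcoef(criteria, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(off_diag.max())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: 200 judged responses scored on 4 rubric criteria plus an overall verdict.
    criteria = rng.normal(size=(200, 4))
    overall = criteria @ np.array([0.4, 0.3, 0.2, 0.1]) + rng.normal(scale=2.0, size=200)
    print(f"unexplained variance: {unexplained_variance(criteria, overall):.2f}")
    print(f"Cronbach's alpha:     {cronbach_alpha(criteria):.2f}")
    print(f"max factor corr:      {max_pairwise_correlation(criteria):.2f}")
```

Under this reading, a judge that honors its own rubric should leave little unexplained variance, while pairwise criterion correlations approaching 1 (as with the 0.93+ values the paper reports for most criteria) indicate the rubric is effectively measuring a single factor.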