When Judgment Becomes Noise: How Design Failures in LLM Judge Benchmarks Silently Undermine Validity

Abstract

LLM-judged benchmarks are increasingly used to evaluate complex model behaviors, yet their design introduces failure modes absent from conventional ground-truth-based benchmarks. We argue that, without tight objectives and verifiable constructions, such benchmarks can produce high-confidence rankings that are in fact largely noise. We introduce two mechanisms to diagnose these issues. Schematic adherence quantifies how much of a judge's overall verdict is explained by the explicit evaluation schema, revealing unexplained variance when judges deviate from their own rubric. Psychometric validity aggregates internal consistency and discriminant validity signals to quantify irreducible uncertainty in any benchmarking run. Applying these tools to Arena-Hard Auto, we find severe schema incoherence and factor collapse across popular judges: for example, unexplained variance exceeding 90 percent for DeepSeek-R1-32B and factor correlations above 0.93 for most criteria. We also show that the Elo-style aggregation used by Arena-Hard Auto collapses and masks genuine ranking uncertainty. Our results highlight design failures that undermine validity and offer actionable principles for building better-scoped, reliability-aware LLM-judged benchmarks. We release our code at https://anonymous.4open.science/r/judgment-to-noise-947D/README.md
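
The two diagnostics named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the code below rests on assumptions: schematic adherence is approximated by the R^2 of an ordinary least-squares fit of the judge's overall verdict on its per-criterion rubric scores (unexplained variance is then 1 - R^2), and internal consistency is approximated by Cronbach's alpha. The function names `unexplained_variance` and `cronbach_alpha` are hypothetical and are not taken from the released code.

```python
from statistics import fmean, variance


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("criterion scores are collinear or constant")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def unexplained_variance(criteria, overall):
    """Assumed proxy for schematic adherence: 1 - R^2 of an OLS fit.

    criteria: one row per judged item, one rubric score per criterion.
    overall:  the judge's overall verdict for each item, as a number.
    """
    rows = [[1.0, *map(float, r)] for r in criteria]  # prepend intercept
    p = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, overall)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    mean_y = fmean(overall)
    ss_res = sum((y - f) ** 2 for y, f in zip(overall, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in overall)
    return ss_res / ss_tot


def cronbach_alpha(criteria):
    """Cronbach's alpha across criteria, a standard internal-consistency statistic."""
    k = len(criteria[0])
    item_vars = [variance([r[j] for r in criteria]) for j in range(k)]
    total_var = variance([sum(r) for r in criteria])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


if __name__ == "__main__":
    # Toy data (invented for illustration): the verdict is an exact linear
    # function of the two criterion scores, so nothing is left unexplained.
    scores = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]]
    verdicts = [1 + 2 * a + b for a, b in scores]
    print(unexplained_variance(scores, verdicts))
    print(cronbach_alpha(scores))
```

Under these assumptions, a value of `unexplained_variance` near 1 would mean the rubric scores say almost nothing about the overall verdict, which is the kind of result the abstract reports for DeepSeek-R1-32B. A linear fit is only one possible choice of model; a judge whose verdict depends on the criteria nonlinearly would look less adherent under this proxy than it really is.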

