Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems
December 11, 2025
Author: Eddie Landesberg
cs.AI
Abstract
LLM-as-judge evaluation has become the de facto standard for scaling model assessment, but the practice is statistically unsound: uncalibrated scores can invert preferences, naive confidence intervals on uncalibrated scores achieve near-0% coverage, and importance-weighted estimators collapse under limited overlap despite high effective sample size (ESS). We introduce Causal Judge Evaluation (CJE), a framework that addresses all three failures. On n=4,961 Chatbot Arena prompts (filtered from 5,000), CJE achieves 99% pairwise ranking accuracy at full sample size (94% averaged across configurations), matching oracle quality at 14x lower cost (for ranking 5 policies), by calibrating a 16x cheaper judge on oracle labels for just 5% of prompts (~250 labels). CJE combines three components: (i) AutoCal-R, reward calibration via mean-preserving isotonic regression; (ii) SIMCal-W, weight stabilization via stacking of S-monotone candidates; and (iii) Oracle-Uncertainty Aware (OUA) inference, which propagates calibration uncertainty into confidence intervals. We formalize the Coverage-Limited Efficiency (CLE) diagnostic, which explains why IPS-style estimators fail even when ESS exceeds 90%: the logger rarely visits the regions where target policies concentrate. Key findings: SNIPS inverts rankings even with reward calibration (38% pairwise accuracy, negative Kendall's tau) due to weight instability; calibrated IPS remains near-random (47%) despite weight stabilization, consistent with CLE; and OUA improves coverage from near-0% to ~86% (Direct) and ~96% (stacked-DR), whereas naive intervals severely under-cover.
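To make the calibration step concrete, here is a minimal sketch of judge-score calibration via isotonic regression in the spirit of AutoCal-R as summarized above. The synthetic data, variable names, and the final mean-preservation shift are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch: calibrate cheap-judge scores against a small oracle-labeled slice.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Cheap-judge scores for all prompts; oracle labels for a ~5% slice.
judge_scores = rng.uniform(0, 1, size=5000)
labeled_idx = rng.choice(5000, size=250, replace=False)
oracle_labels = np.clip(
    judge_scores[labeled_idx] ** 2 + rng.normal(0, 0.05, size=250), 0, 1
)  # toy oracle: monotone in the judge score, plus noise

# Fit a monotone map from judge score to oracle label on the slice.
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(judge_scores[labeled_idx], oracle_labels)

# Apply the map to every prompt to obtain calibrated rewards.
rewards = iso.predict(judge_scores)

# Illustrative mean-preservation step (our assumption): shift so the
# calibrated mean on the labeled slice matches that slice's oracle mean.
rewards += oracle_labels.mean() - rewards[labeled_idx].mean()
```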
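The ESS claim is the subtle one: a self-normalized weight diagnostic can look healthy while the logged sample carries no information about where the target policy puts its mass. The following synthetic toy (our construction, not the paper's data or its CLE diagnostic) shows realized ESS above 90% alongside a SNIPS estimate that simply reports the logger's value.

```python
# Sketch: high realized ESS despite a fatal lack of overlap.
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# The logger visits the high-reward region the target concentrates on
# with probability 1e-5; the target puts ~half its mass there, so the
# importance weight on that region is ~5e4 when it is ever observed.
hit = rng.random(n) < 1e-5                    # almost surely all False
rewards = np.where(hit, 0.9, 0.5 + rng.normal(0, 0.05, size=n))
weights = np.where(hit, 5e4, 0.5 * np.exp(rng.normal(0, 0.2, size=n)))

ess_fraction = weights.sum() ** 2 / ((weights ** 2).sum() * n)
snips = (weights * rewards).sum() / weights.sum()   # self-normalized IPS
true_value = 0.5 * 0.9 + 0.5 * 0.5                  # target's actual value

print(f"ESS: {ess_fraction:.0%}  SNIPS: {snips:.2f}  truth: {true_value:.2f}")
# In the typical draw there are no hits: ESS sits near 96%, yet SNIPS
# stays near 0.50 versus a truth of 0.70, because the sample contains
# no draws from the region that determines the target's value.
```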
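Finally, a sketch of how calibration uncertainty might be folded into intervals in the spirit of OUA. The abstract does not specify the OUA algorithm; the bootstrap-over-the-oracle-slice construction below, the `estimate` helper, and the variance-addition step are our simplifying assumptions for illustration.

```python
# Sketch: widen intervals by adding bootstrapped calibration variance
# to the usual sampling variance (an assumed, simplified OUA analogue).
import numpy as np
from sklearn.isotonic import IsotonicRegression

def estimate(judge, lab_idx, lab_oracle):
    """Calibrate on the labeled slice, return the mean calibrated reward."""
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(judge[lab_idx], lab_oracle)
    return iso.predict(judge).mean()

rng = np.random.default_rng(2)
judge = rng.uniform(0, 1, size=5000)
lab_idx = rng.choice(5000, size=250, replace=False)
oracle = np.clip(judge[lab_idx] + rng.normal(0, 0.1, size=250), 0, 1)

point = estimate(judge, lab_idx, oracle)

# Plug-in sampling variance of the mean calibrated reward.
iso = IsotonicRegression(out_of_bounds="clip").fit(judge[lab_idx], oracle)
rewards = iso.predict(judge)
var_sampling = rewards.var(ddof=1) / len(rewards)

# Calibration (oracle) variance via bootstrap over the labeled slice.
boot = []
for _ in range(200):
    b = rng.choice(len(lab_idx), size=len(lab_idx), replace=True)
    boot.append(estimate(judge, lab_idx[b], oracle[b]))
var_oracle = np.var(boot, ddof=1)

half = 1.96 * np.sqrt(var_sampling + var_oracle)
print(f"OUA-style 95% CI: [{point - half:.3f}, {point + half:.3f}]")
```

A naive interval would use `var_sampling` alone; the extra `var_oracle` term is what keeps coverage honest when the calibrator itself is fit on only ~250 labels.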