Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference

June 3, 2026
Author: Abhishek Divekar
cs.AI

Abstract

With PRECISE, we extend Prediction-Powered Inference (PPI) to produce bias-corrected estimates of ranking evaluation metrics by combining a small human-labeled set with a large LLM-judged set. PPI is provably unbiased regardless of the LLM judge's error profile. We make it applicable to hierarchical metrics such as Precision@K, where annotations are per-document but the metric is per-query, by reducing the output-space computation from O(2^|C|) to O(2^K). On the ESCI benchmark, augmenting 30 human annotations with Claude 3 Sonnet judgments reduces the standard error of Precision@4 estimates from 4.45 to 3.50 (a 21% relative reduction). In a production system, our framework correctly identified the best of three system variants using only 100 human labels and 2 hours of domain-expert annotation; A/B testing confirmed this ranking with a +407 bps lift in daily sales.
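
To make the core idea concrete, the following is a minimal sketch of the generic PPI estimator for a simple mean, not the PRECISE procedure for Precision@K described above. It assumes each item has a numeric LLM score, and that a small subset also has a human label. The function name `ppi_mean` and its argument names are hypothetical, chosen for illustration only.

```python
import math
from statistics import mean, variance


def ppi_mean(human_labels, llm_on_labeled, llm_on_unlabeled):
    """Generic PPI point estimate and standard error for a mean.

    human_labels     : human judgments on the small labeled set (size n)
    llm_on_labeled   : LLM judgments on those same n items
    llm_on_unlabeled : LLM judgments on the large unlabeled set (size N)
    """
    if len(human_labels) != len(llm_on_labeled):
        raise ValueError("labeled sets must be aligned item by item")
    n = len(human_labels)
    big_n = len(llm_on_unlabeled)
    if n < 2 or big_n < 2:
        raise ValueError("need at least two items in each set")

    # Rectifier: per-item gap between the human label and the LLM score.
    rectifier = [y - f for y, f in zip(human_labels, llm_on_labeled)]

    # LLM-only mean on the large set, shifted by the average gap.
    estimate = mean(llm_on_unlabeled) + mean(rectifier)

    # The two terms come from independent samples, so variances add.
    std_err = math.sqrt(
        variance(llm_on_unlabeled) / big_n + variance(rectifier) / n
    )
    return estimate, std_err


# Toy data (made up for illustration): binary relevance judgments.
est, se = ppi_mean(
    human_labels=[1, 0, 1, 1],
    llm_on_labeled=[1, 1, 1, 0],
    llm_on_unlabeled=[1, 0, 1, 1, 0, 1, 1, 1],
)
print(f"PPI estimate: {est:.3f}, standard error: {se:.3f}")
```

The rectifier term is what removes the LLM judge's bias: if the judge systematically over- or under-scores, the average gap on the human-labeled subset corrects the large-sample mean. The standard error shrinks relative to using human labels alone when the LLM scores track the human labels closely, since the rectifier then has low variance.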