Statistically Reliable LLM-Based Ranking Evaluation via Prediction-Powered Inference

Abstract

PRECISE extends Prediction-Powered Inference (PPI) to produce bias-corrected estimates of ranking evaluation metrics by combining a small human-labeled set with a large LLM-judged set. PPI is provably unbiased regardless of the LLM judge's error profile. To make it applicable to hierarchical metrics such as Precision@K, where annotations are per-document but the metric is computed per-query, we reduce the output-space computation from O(2^|C|) to O(2^K). On the ESCI benchmark, augmenting 30 human annotations with Claude 3 Sonnet judgments reduces the standard error of Precision@4 estimates from 4.45 to 3.50 (a 21% relative reduction). In a production system, our framework correctly identified the best of three system variants using only 100 human labels and 2 hours of domain-expert annotation; A/B testing confirmed this ranking with a lift of 407 basis points in daily sales.
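To make the bias-correction idea concrete, the following is a minimal sketch of the basic PPI mean estimator, not the PRECISE procedure itself: it treats each unit as carrying a single scalar score and does not implement the per-query Precision@K reformulation or the O(2^K) output-space reduction. The function name `ppi_mean` and its argument names are hypothetical, chosen for illustration. The estimate is the mean LLM score on the large unlabeled set plus a "rectifier", the mean difference between human and LLM scores on the small labeled set.

```python
import numpy as np


def ppi_mean(human_labeled, llm_on_labeled, llm_on_unlabeled):
    """Basic PPI estimate of a mean, with its standard error.

    human_labeled    : human scores on the small labeled set (length n)
    llm_on_labeled   : LLM scores on the same n items
    llm_on_unlabeled : LLM scores on the large unlabeled set (length N)

    All names here are illustrative assumptions, not the paper's API.
    """
    y = np.asarray(human_labeled, dtype=float)
    f_lab = np.asarray(llm_on_labeled, dtype=float)
    f_unl = np.asarray(llm_on_unlabeled, dtype=float)
    if y.shape != f_lab.shape:
        raise ValueError("human and LLM scores on the labeled set must align")

    n, N = len(y), len(f_unl)

    # Rectifier: how far the LLM judge is from the human labels on average.
    rectifier = y - f_lab

    # LLM-only mean on the large set, corrected by the rectifier mean.
    estimate = f_unl.mean() + rectifier.mean()

    # The two terms come from independent samples, so their variances add.
    se = np.sqrt(f_unl.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    return estimate, se


if __name__ == "__main__":
    # Synthetic illustration only: a judge that agrees with humans 85% of the time.
    rng = np.random.default_rng(0)
    truth = rng.binomial(1, 0.6, size=5030).astype(float)
    flip = rng.random(5030) < 0.15
    llm = np.where(flip, 1.0 - truth, truth)

    est, se = ppi_mean(truth[:30], llm[:30], llm[30:])
    print(f"PPI estimate: {est:.3f} (SE {se:.3f})")
```

When the LLM scores track the human scores closely, the rectifier has low variance and the standard error is dominated by the large-sample term; when the judge is uninformative, the rectifier variance grows and the gain over using the human labels alone shrinks or disappears.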