Generalized Correctness Models: Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns
September 29, 2025
Authors: Hanqi Xiao, Vaidehi Patil, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
cs.AI
Abstract
Generating accurate and calibrated confidence estimates is critical for
deploying LLMs in high-stakes or user-facing applications, and remains an open
challenge. Prior research has often framed confidence as a problem of eliciting
a model's "self-knowledge", i.e., the ability of an LLM to judge whether its
own answers are correct; this approach implicitly assumes that there is some
privileged information about the answer's correctness that is accessible to the
model itself. However, our experiments reveal that an LLM attempting to predict
the correctness of its own outputs generally performs no better than an
unrelated LLM. Moreover, we hypothesize that a key factor in building a
"Correctness Model" (CM) is exposure to a target model's historical
predictions. We propose multiple methods to inject this historical correctness
information, creating a Generalized Correctness Model (GCM). We first show that
GCMs can be trained on the correctness data from many LLMs and learn patterns
for correctness prediction applicable across datasets and models. We then use
CMs as a lens for studying the source of correctness prediction ability and its
generalization, systematically controlling their training data and finding that
answer phrasing is a strong predictor for correctness. We further explore
alternative methods of injecting history without training an LLM, finding that
including history as in-context examples can help improve correctness
prediction, and post-hoc calibration can provide complementary reductions in
calibration error. We evaluate GCMs based on Qwen3-8B across 5 model families
and the MMLU and TriviaQA datasets, as well as on a downstream selective
prediction task, finding that reliable LLM confidence estimation is a
generalizable and model-agnostic skill learned by systematically encoding
correctness history rather than a model-specific skill reliant on
self-introspection.
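To make two of the ingredients mentioned above concrete, here is a minimal sketch (not the authors' code; all function names and prompt formats are hypothetical) of (1) injecting a target model's historical correctness records as in-context examples in a correctness-model prompt, and (2) post-hoc calibration of the resulting correctness probabilities via simple temperature scaling fit on held-out data.

```python
# Hypothetical sketch of two components described in the abstract:
# in-context history injection and post-hoc temperature scaling.
import numpy as np

def build_history_prompt(history, query_question, query_answer, k=4):
    """Format k historical (question, answer, was_correct) records of the
    target model as in-context examples, followed by the new (question,
    answer) pair whose correctness the correctness model should predict."""
    lines = []
    for q, a, correct in history[:k]:
        lines.append(f"Q: {q}\nModel answer: {a}\nCorrect: {'yes' if correct else 'no'}\n")
    lines.append(f"Q: {query_question}\nModel answer: {query_answer}\nCorrect:")
    return "\n".join(lines)

def fit_temperature(confidences, labels, temps=np.linspace(0.25, 4.0, 76)):
    """Fit a single temperature T on held-out (confidence, correctness) pairs
    by minimizing negative log-likelihood in logit space; return a function
    that rescales new confidences. This is complementary to improving the
    raw correctness predictions themselves."""
    eps = 1e-6
    p = np.clip(np.asarray(confidences, dtype=float), eps, 1 - eps)
    y = np.asarray(labels, dtype=float)
    logits = np.log(p / (1 - p))

    def nll(T):
        q = np.clip(1.0 / (1.0 + np.exp(-logits / T)), eps, 1 - eps)
        return -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))

    best_T = min(temps, key=nll)

    def calibrate(c):
        c = np.clip(np.asarray(c, dtype=float), eps, 1 - eps)
        return 1.0 / (1.0 + np.exp(-np.log(c / (1 - c)) / best_T))

    return calibrate

if __name__ == "__main__":
    history = [("2+2?", "4", True), ("Capital of France?", "Lyon", False)]
    print(build_history_prompt(history, "Largest planet?", "Jupiter"))
    calibrate = fit_temperature([0.9, 0.8, 0.6, 0.55], [1, 1, 0, 1])
    print(calibrate([0.7]))
```

In this sketch the correctness model would be queried with the history-augmented prompt to produce a raw confidence, and the fitted `calibrate` function would then be applied post hoc, matching the abstract's claim that in-context history and post-hoc calibration offer complementary benefits.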