Generalized Correctness Models: Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns
September 29, 2025
Authors: Hanqi Xiao, Vaidehi Patil, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
cs.AI
Abstract
Generating accurate and calibrated confidence estimates is critical for
deploying LLMs in high-stakes or user-facing applications, and remains an open
challenge. Prior research has often framed confidence as a problem of eliciting
a model's "self-knowledge", i.e., the ability of an LLM to judge whether its
own answers are correct; this approach implicitly assumes that there is some
privileged information about the answer's correctness that is accessible to the
model itself. However, our experiments reveal that an LLM attempting to predict
the correctness of its own outputs generally performs no better than an
unrelated LLM. Moreover, we hypothesize that a key factor in building a
"Correctness Model" (CM) is exposure to a target model's historical
predictions. We propose multiple methods to inject this historical correctness
information, creating a Generalized Correctness Model (GCM). We first show that
GCMs can be trained on the correctness data from many LLMs and learn patterns
for correctness prediction applicable across datasets and models. We then use
CMs as a lens for studying the source of correctness prediction ability and its
generalization, systematically controlling their training data and finding that
answer phrasing is a strong predictor for correctness. We further explore
alternative methods of injecting history without training an LLM, finding that
including history as in-context examples can help improve correctness
prediction, and post-hoc calibration can provide complementary reductions in
calibration error. We evaluate GCMs based on Qwen3-8B across 5 model families
and the MMLU and TriviaQA datasets, as well as on a downstream selective
prediction task, finding that reliable LLM confidence estimation is a
generalizable and model-agnostic skill learned by systematically encoding
correctness history rather than a model-specific skill reliant on
self-introspection.
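To make two of the ingredients mentioned above concrete, here is a minimal sketch (not the authors' code; all function names and prompt formats are hypothetical) of (1) injecting a target model's historical correctness records as in-context examples in a correctness-model prompt, and (2) post-hoc calibration of the resulting correctness probabilities via simple temperature scaling fit on held-out data.

```python
# Hypothetical sketch of two components described in the abstract:
# in-context history injection and post-hoc temperature scaling.
import numpy as np

def build_history_prompt(history, query_question, query_answer, k=4):
    """Format k historical (question, answer, was_correct) records of the
    target model as in-context examples, followed by the new (question,
    answer) pair whose correctness the correctness model should predict."""
    lines = []
    for q, a, correct in history[:k]:
        lines.append(f"Q: {q}\nModel answer: {a}\nCorrect: {'yes' if correct else 'no'}\n")
    lines.append(f"Q: {query_question}\nModel answer: {query_answer}\nCorrect:")
    return "\n".join(lines)

def fit_temperature(confidences, labels, temps=np.linspace(0.25, 4.0, 76)):
    """Fit a single temperature T on held-out (confidence, correctness) pairs
    by minimizing negative log-likelihood in logit space; return a function
    that rescales new confidences. This is complementary to improving the
    raw correctness predictions themselves."""
    eps = 1e-6
    p = np.clip(np.asarray(confidences, dtype=float), eps, 1 - eps)
    y = np.asarray(labels, dtype=float)
    logits = np.log(p / (1 - p))

    def nll(T):
        q = np.clip(1.0 / (1.0 + np.exp(-logits / T)), eps, 1 - eps)
        return -np.mean(y * np.log(q) + (1 - y) * np.log(1 - q))

    best_T = min(temps, key=nll)

    def calibrate(c):
        c = np.clip(np.asarray(c, dtype=float), eps, 1 - eps)
        return 1.0 / (1.0 + np.exp(-np.log(c / (1 - c)) / best_T))

    return calibrate

if __name__ == "__main__":
    history = [("2+2?", "4", True), ("Capital of France?", "Lyon", False)]
    print(build_history_prompt(history, "Largest planet?", "Jupiter"))
    calibrate = fit_temperature([0.9, 0.8, 0.6, 0.55], [1, 1, 0, 1])
    print(calibrate([0.7]))
```

In this sketch the correctness model would be queried with the history-augmented prompt to produce a raw confidence, and the fitted `calibrate` function would then be applied post hoc, matching the abstract's claim that in-context history and post-hoc calibration offer complementary benefits.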