Generalized Correctness Models: Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns
September 29, 2025
Authors: Hanqi Xiao, Vaidehi Patil, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
cs.AI
Abstract
Generating accurate and calibrated confidence estimates is critical for
deploying LLMs in high-stakes or user-facing applications, and remains an open
challenge. Prior research has often framed confidence as a problem of eliciting
a model's "self-knowledge", i.e., the ability of an LLM to judge whether its
own answers are correct; this approach implicitly assumes that there is some
privileged information about the answer's correctness that is accessible to the
model itself. However, our experiments reveal that an LLM attempting to predict
the correctness of its own outputs generally performs no better than an
unrelated LLM. Moreover, we hypothesize that a key factor in building a
"Correctness Model" (CM) is exposure to a target model's historical
predictions. We propose multiple methods to inject this historical correctness
information, creating a Generalized Correctness Model (GCM). We first show that
GCMs can be trained on the correctness data from many LLMs and learn patterns
for correctness prediction applicable across datasets and models. We then use
CMs as a lens for studying the source of correctness prediction ability and its
generalization, systematically controlling their training data and finding that
answer phrasing is a strong predictor for correctness. We further explore
alternative methods of injecting history without training an LLM, finding that
including history as in-context examples can help improve correctness
prediction, and post-hoc calibration can provide complementary reductions in
calibration error. We evaluate GCMs based on Qwen3-8B across 5 model families
and the MMLU and TriviaQA datasets, as well as on a downstream selective
prediction task, finding that reliable LLM confidence estimation is a
generalizable and model-agnostic skill learned by systematically encoding
correctness history rather than a model-specific skill reliant on
self-introspection.
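To make the ideas above concrete, here is a minimal Python sketch, not the authors' implementation: all names (HistoryRecord, build_cm_prompt, temperature_scale, selective_predict) are hypothetical. It illustrates how a target model's historical correctness records could be supplied as in-context examples to a correctness model, how the resulting probability could be recalibrated post hoc with temperature scaling, and how a calibrated confidence could drive a simple selective-prediction rule.

```python
# Illustrative sketch only (not the paper's released code); names are hypothetical.
#   1) Inject a target model's historical correctness records as in-context
#      examples for a correctness model (CM).
#   2) Apply post-hoc temperature scaling to the CM's probability and use the
#      calibrated confidence for a simple selective-prediction rule.
from dataclasses import dataclass
from typing import List, Optional
import math


@dataclass
class HistoryRecord:
    """One entry of the target model's correctness history."""
    question: str   # a past question posed to the target model
    answer: str     # the answer exactly as the target model phrased it
    correct: bool   # whether that answer turned out to be correct


def build_cm_prompt(history: List[HistoryRecord], question: str, answer: str) -> str:
    """Format (question, answer, correctness) triples as in-context examples,
    then ask the correctness model to judge a new answer from the same model."""
    lines = [
        "Below are a model's past answers and whether each was correct.",
        "Predict whether the final answer is correct (yes/no).",
        "",
    ]
    for rec in history:
        lines += [f"Q: {rec.question}", f"A: {rec.answer}",
                  f"Correct: {'yes' if rec.correct else 'no'}", ""]
    lines += [f"Q: {question}", f"A: {answer}", "Correct:"]
    return "\n".join(lines)


def temperature_scale(p_correct: float, temperature: float) -> float:
    """Post-hoc calibration: rescale a predicted probability with a temperature
    fit on a held-out split (temperature > 1 softens over-confident scores)."""
    p = min(max(p_correct, 1e-6), 1.0 - 1e-6)  # keep the logit finite
    logit = math.log(p) - math.log(1.0 - p)
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def selective_predict(answer: str, confidence: float,
                      threshold: float = 0.75) -> Optional[str]:
    """Downstream selective prediction: return the answer only when the
    calibrated confidence clears the threshold, otherwise abstain (None)."""
    return answer if confidence >= threshold else None
```

In this sketch the temperature and abstention threshold would be fit on held-out data; the paper's GCM instead fine-tunes Qwen3-8B on correctness data from many LLMs, which this snippet does not attempt to reproduce.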