Generalized Correctness Models: Learning Calibrated and Model-Agnostic Correctness Predictors from Historical Patterns
September 29, 2025
Authors: Hanqi Xiao, Vaidehi Patil, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal
cs.AI
Abstract
Generating accurate and calibrated confidence estimates is critical for
deploying LLMs in high-stakes or user-facing applications, and remains an open
challenge. Prior research has often framed confidence as a problem of eliciting
a model's "self-knowledge", i.e., the ability of an LLM to judge whether its
own answers are correct; this approach implicitly assumes that there is some
privileged information about the answer's correctness that is accessible to the
model itself. However, our experiments reveal that an LLM attempting to predict
the correctness of its own outputs generally performs no better than an
unrelated LLM. Moreover, we hypothesize that a key factor in building a
"Correctness Model" (CM) is exposure to a target model's historical
predictions. We propose multiple methods to inject this historical correctness
information, creating a Generalized Correctness Model (GCM). We first show that
GCMs can be trained on the correctness data from many LLMs and learn patterns
for correctness prediction applicable across datasets and models. We then use
CMs as a lens for studying the source of correctness prediction ability and its
generalization, systematically controlling their training data and finding that
answer phrasing is a strong predictor for correctness. We further explore
alternative methods of injecting history without training an LLM, finding that
including history as in-context examples can help improve correctness
prediction, and post-hoc calibration can provide complementary reductions in
calibration error. We evaluate GCMs based on Qwen3-8B across 5 model families
and the MMLU and TriviaQA datasets, as well as on a downstream selective
prediction task, finding that reliable LLM confidence estimation is a
generalizable and model-agnostic skill learned by systematically encoding
correctness history rather than a model-specific skill reliant on
self-introspection.
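To make the ideas above concrete, here is a minimal Python sketch, not the authors' implementation: all names (HistoryRecord, build_cm_prompt, temperature_scale, selective_predict) are hypothetical. It illustrates how a target model's historical correctness records could be supplied as in-context examples to a correctness model, how the resulting probability could be recalibrated post hoc with temperature scaling, and how a calibrated confidence could drive a simple selective-prediction rule.

```python
# Illustrative sketch only (not the paper's released code); names are hypothetical.
#   1) Inject a target model's historical correctness records as in-context
#      examples for a correctness model (CM).
#   2) Apply post-hoc temperature scaling to the CM's probability and use the
#      calibrated confidence for a simple selective-prediction rule.
from dataclasses import dataclass
from typing import List, Optional
import math


@dataclass
class HistoryRecord:
    """One entry of the target model's correctness history."""
    question: str   # a past question posed to the target model
    answer: str     # the answer exactly as the target model phrased it
    correct: bool   # whether that answer turned out to be correct


def build_cm_prompt(history: List[HistoryRecord], question: str, answer: str) -> str:
    """Format (question, answer, correctness) triples as in-context examples,
    then ask the correctness model to judge a new answer from the same model."""
    lines = [
        "Below are a model's past answers and whether each was correct.",
        "Predict whether the final answer is correct (yes/no).",
        "",
    ]
    for rec in history:
        lines += [f"Q: {rec.question}", f"A: {rec.answer}",
                  f"Correct: {'yes' if rec.correct else 'no'}", ""]
    lines += [f"Q: {question}", f"A: {answer}", "Correct:"]
    return "\n".join(lines)


def temperature_scale(p_correct: float, temperature: float) -> float:
    """Post-hoc calibration: rescale a predicted probability with a temperature
    fit on a held-out split (temperature > 1 softens over-confident scores)."""
    p = min(max(p_correct, 1e-6), 1.0 - 1e-6)  # keep the logit finite
    logit = math.log(p) - math.log(1.0 - p)
    return 1.0 / (1.0 + math.exp(-logit / temperature))


def selective_predict(answer: str, confidence: float,
                      threshold: float = 0.75) -> Optional[str]:
    """Downstream selective prediction: return the answer only when the
    calibrated confidence clears the threshold, otherwise abstain (None)."""
    return answer if confidence >= threshold else None
```

In this sketch the temperature and abstention threshold would be fit on held-out data; the paper's GCM instead fine-tunes Qwen3-8B on correctness data from many LLMs, which this snippet does not attempt to reproduce.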