
Are LLM Decisions Faithful to Verbal Confidence?

January 12, 2026
Authors: Jiawei Wang, Yanfei Zhou, Siddartha Devic, Deqing Fu
cs.AI

Abstract

Large Language Models (LLMs) can produce surprisingly sophisticated estimates of their own uncertainty. However, it remains unclear to what extent this expressed confidence is tied to the reasoning, knowledge, or decision making of the model. To test this, we introduce RiskEval: a framework designed to evaluate whether models adjust their abstention policies in response to varying error penalties. Our evaluation of several frontier models reveals a critical dissociation: models are neither cost-aware when articulating their verbal confidence, nor strategically responsive when deciding whether to engage or abstain under high-penalty conditions. Even when extreme penalties render frequent abstention the mathematically optimal strategy, models almost never abstain, resulting in utility collapse. This indicates that calibrated verbal confidence scores may not be sufficient to create trustworthy and interpretable AI systems, as current models lack the strategic agency to convert uncertainty signals into optimal and risk-sensitive decisions.
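The following is a minimal sketch, not the authors' implementation, of the kind of cost-sensitive abstention rule the abstract alludes to. It assumes a hypothetical scoring scheme (+1 for a correct answer, a penalty of -λ for a wrong one, 0 for abstaining); the exact payoffs in RiskEval may differ. It illustrates why, under a large penalty, abstention becomes the mathematically optimal strategy whenever confidence falls below penalty / (reward + penalty).

```python
# Sketch of a risk-sensitive abstention rule under an assumed scoring
# scheme: +1 for a correct answer, -penalty for a wrong one, 0 for
# abstaining. The actual RiskEval payoffs may differ.

def should_abstain(confidence: float, penalty: float, reward: float = 1.0) -> bool:
    """Abstain when the expected utility of answering is negative.

    Expected utility of answering:
        EU = confidence * reward - (1 - confidence) * penalty
    Abstaining yields 0, so the rational threshold is
        confidence < penalty / (reward + penalty).
    """
    expected_utility = confidence * reward - (1.0 - confidence) * penalty
    return expected_utility < 0.0


# Example: with a verbal confidence of 0.7, answering is rational at a
# penalty of 1 (EU = 0.4) but not at a penalty of 9 (EU = -2.0), so a
# risk-sensitive model should abstain in the high-penalty condition.
for penalty in (1.0, 9.0):
    print(f"penalty={penalty}: abstain={should_abstain(0.7, penalty)}")
```

The dissociation reported in the paper is that models expressing a confidence of 0.7 still answer even when the penalty makes the expected utility sharply negative, which is what produces the utility collapse described above.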