大型語言模型對自身回應過於自信

摘要

先前研究表明，指令微调后的大语言模型（LLMs）校准效果不如其基础预训练版本。然而，常用的聊天模板对对话式LLM校准的影响仍鲜有探讨。本研究通过分离后训练算法与聊天格式的效应，深入探究导致校准偏差的机制。我们发现：尽管指令微调本质上损害了校准性能，但聊天模板通过"所有权偏差"进一步加剧了问题——模型对其自身答案的置信度显著高于用户提供的相同内容。在六个最新开源权重LLM、三个基准测试及三种置信度启发方法的大量实验中，模型对自身回答的置信度最高可提升26%。基于这一发现，我们提出一种简单的推理阶段策略：在置信度启发时将模型回答设定为用户输入。该方法无需重新训练即可有效降低过度自信，并将校准性能最多提升26%，从而缩小基础模型与指令微调模型之间的差距。

English

Prior work has shown that instruction-tuned large language models (LLMs) are less well calibrated than their base pre-trained counterparts. However, little is known about the frequently used chat template's effect on the calibration of conversational LLMs. In this work, we investigate the mechanisms driving this miscalibration by decoupling the effects of the post-training algorithm and the chat format. We find that, while instruction tuning fundamentally harms calibration, the chat template aggravates the issue through an "ownership bias" -- models are significantly more confident in their own answers than in identical answers provided by a user. Extensive experiments across six recent open-weight LLMs, three benchmarks, and three confidence elicitation methods show that models assign up to 26% higher confidence to their own responses. Leveraging this insight, we propose a simple inference-time strategy: framing the model's answer as user input during confidence elicitation. This approach significantly reduces overconfidence and improves calibration by up to 26% without the need for retraining, narrowing the gap between base and instruction-tuned models.