大型语言模型对自身回答过度自信

摘要

先前研究表明，指令微调后的大语言模型（LLMs）的校准性能逊于其基础预训练版本。然而，关于对话型LLMs中常用的聊天模板对其校准效果的影响，目前知之甚少。本研究通过解耦后训练算法与聊天格式的影响，探究导致这种校准偏差的机制。我们发现，虽然指令微调从根本上损害了校准性能，但聊天模板通过"所有权偏差"加剧了这一问题——模型对其自身回答的置信度显著高于对用户提供的相同回答。基于六个最新开源权重LLMs、三个基准数据集及三种置信度获取方法的广泛实验表明，模型对其自身回答的置信度赋值高出高达26%。利用这一发现，我们提出一种简单的推理时策略：在置信度获取环节将模型回答伪装为用户输入。该方法无需重新训练即可显著降低过度自信，将校准性能提升高达26%，缩小了基础模型与指令微调模型间的差距。

English

Prior work has shown that instruction-tuned large language models (LLMs) are less well calibrated than their base pre-trained counterparts. However, little is known about the frequently used chat template's effect on the calibration of conversational LLMs. In this work, we investigate the mechanisms driving this miscalibration by decoupling the effects of the post-training algorithm and the chat format. We find that, while instruction tuning fundamentally harms calibration, the chat template aggravates the issue through an "ownership bias" -- models are significantly more confident in their own answers than in identical answers provided by a user. Extensive experiments across six recent open-weight LLMs, three benchmarks, and three confidence elicitation methods show that models assign up to 26% higher confidence to their own responses. Leveraging this insight, we propose a simple inference-time strategy: framing the model's answer as user input during confidence elicitation. This approach significantly reduces overconfidence and improves calibration by up to 26% without the need for retraining, narrowing the gap between base and instruction-tuned models.