大規模言語モデルは自身の応答に過信している

要旨

従来の研究では、指示チューニングされた大規模言語モデル（LLM）は、ベースとなる事前学習済みモデルよりも較正（キャリブレーション）が不十分であることが示されている。しかし、会話型LLMの較正に頻繁に使用されるチャットテンプレートが与える影響については、ほとんど知られていない。本研究では、ポストトレーニングアルゴリズムとチャット形式の効果を分離することで、この較正不良を引き起こすメカニズムを調査する。指示チューニングが本質的に較正を損なう一方で、チャットテンプレートは「所有バイアス（ownership bias）」を通じて問題を悪化させることを発見した。すなわち、モデルはユーザーが提供した同一の回答よりも、自身の回答に対して有意に高い確信度を示すのである。最近の6つのオープンウェイトLLM、3つのベンチマーク、および3つの確信度抽出法にわたる広範な実験により、モデルは自身の応答に対して最大26%高い確信度を割り当てることが示された。この知見を活用し、推論時に確信度を抽出する際にモデルの回答をユーザー入力としてフレーミングするというシンプルな戦略を提案する。このアプローチは、再トレーニングを必要とせずに過信を大幅に低減し、較正を最大26%改善することで、ベースモデルと指示チューニングモデル間のギャップを縮小する。

English

Prior work has shown that instruction-tuned large language models (LLMs) are less well calibrated than their base pre-trained counterparts. However, little is known about the frequently used chat template's effect on the calibration of conversational LLMs. In this work, we investigate the mechanisms driving this miscalibration by decoupling the effects of the post-training algorithm and the chat format. We find that, while instruction tuning fundamentally harms calibration, the chat template aggravates the issue through an "ownership bias" -- models are significantly more confident in their own answers than in identical answers provided by a user. Extensive experiments across six recent open-weight LLMs, three benchmarks, and three confidence elicitation methods show that models assign up to 26% higher confidence to their own responses. Leveraging this insight, we propose a simple inference-time strategy: framing the model's answer as user input during confidence elicitation. This approach significantly reduces overconfidence and improves calibration by up to 26% without the need for retraining, narrowing the gap between base and instruction-tuned models.