대규모 언어 모델은 자신의 응답에 대해 과잉 자신감을 가진다

초록

이전 연구들은 지시 조정된 대규모 언어 모델(LLM)이 기본 사전 학습 모델에 비해 캘리브레이션(보정)이 덜 잘 되어 있음을 보여주었다. 그러나 대화형 LLM의 캘리브레이션에 자주 사용되는 채팅 템플릿이 미치는 영향에 대해서는 알려진 바가 거의 없다. 본 연구에서는 사후 학습 알고리즘과 채팅 형식의 효과를 분리하여 이러한 캘리브레이션 오류를 유발하는 메커니즘을 조사한다. 지시 조정이 본질적으로 캘리브레이션을 해치는 반면, 채팅 템플릿은 '소유 편향(ownership bias)'을 통해 문제를 악화시킨다는 사실을 발견했다. 즉, 모델은 사용자가 제공한 동일한 답변보다 자신의 답변에 대해 현저히 더 높은 신뢰도를 보인다. 최신 오픈 가중치 LLM 6종, 세 가지 벤치마크, 세 가지 신뢰도 도출 방법에 걸친 광범위한 실험 결과, 모델은 자신의 응답에 대해 최대 26% 더 높은 신뢰도를 할당하는 것으로 나타났다. 이 통찰을 활용하여, 신뢰도 도출 중 모델의 답변을 사용자 입력인 것처럼 프레이밍하는 간단한 추론 시점 전략을 제안한다. 이 접근법은 재학습 없이도 과잉신뢰를 크게 줄이고 캘리브레이션을 최대 26% 향상시켜, 기본 모델과 지시 조정 모델 간의 격차를 좁힌다.

English

Prior work has shown that instruction-tuned large language models (LLMs) are less well calibrated than their base pre-trained counterparts. However, little is known about the frequently used chat template's effect on the calibration of conversational LLMs. In this work, we investigate the mechanisms driving this miscalibration by decoupling the effects of the post-training algorithm and the chat format. We find that, while instruction tuning fundamentally harms calibration, the chat template aggravates the issue through an "ownership bias" -- models are significantly more confident in their own answers than in identical answers provided by a user. Extensive experiments across six recent open-weight LLMs, three benchmarks, and three confidence elicitation methods show that models assign up to 26% higher confidence to their own responses. Leveraging this insight, we propose a simple inference-time strategy: framing the model's answer as user input during confidence elicitation. This approach significantly reduces overconfidence and improves calibration by up to 26% without the need for retraining, narrowing the gap between base and instruction-tuned models.