When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction
May 22, 2025
Authors: Yuqing Yang, Robin Jia
cs.AI
Abstract
Can large language models (LLMs) admit their mistakes when they should know
better? In this work, we define the behavior of acknowledging errors in
previously generated answers as "retraction" and aim to understand when and why
LLMs choose to retract. We first construct model-specific datasets to evaluate
whether a model will retract an incorrect answer that contradicts its own
parametric knowledge. While LLMs are capable of retraction, they do so only
infrequently. We demonstrate that retraction is closely tied to previously
identified indicators of models' internal belief: models fail to retract wrong
answers that they "believe" to be factually correct. Steering experiments
further demonstrate that internal belief causally influences model retraction.
In particular, when the model does not believe its answer, this not only
encourages the model to attempt to verify the answer, but also alters attention
behavior during self-verification. Finally, we demonstrate that simple
supervised fine-tuning significantly improves retraction performance by helping
the model learn more accurate internal beliefs. Code and datasets are available
at https://github.com/ayyyq/llm-retraction.