When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction
May 22, 2025
Authors: Yuqing Yang, Robin Jia
cs.AI
Abstract
Can large language models (LLMs) admit their mistakes when they should know
better? In this work, we define the behavior of acknowledging errors in
previously generated answers as "retraction" and aim to understand when and why
LLMs choose to retract. We first construct model-specific datasets to evaluate
whether a model will retract an incorrect answer that contradicts its own
parametric knowledge. While LLMs are capable of retraction, they do so only
infrequently. We demonstrate that retraction is closely tied to previously
identified indicators of models' internal belief: models fail to retract wrong
answers that they "believe" to be factually correct. Steering experiments
further demonstrate that internal belief causally influences model retraction.
In particular, when the model does not believe its answer, this not only
encourages the model to attempt to verify the answer, but also alters attention
behavior during self-verification. Finally, we demonstrate that simple
supervised fine-tuning significantly improves retraction performance by helping
the model learn more accurate internal beliefs. Code and datasets are available
at https://github.com/ayyyq/llm-retraction.
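For readers unfamiliar with the kind of steering experiment the abstract mentions, the sketch below shows one common way to intervene on an "internal belief" direction: build a difference-of-means vector from hidden states of answers the model presumably believes versus disbelieves, then add a scaled copy of that vector to an intermediate layer's output during generation. This is a minimal illustration only; the model checkpoint, layer index, scaling factor, and toy prompt sets are assumed placeholders, not the paper's setup, and the paper's actual code lives in the linked repository.

```python
# Minimal activation-steering sketch (illustrative, not the paper's released code).
# Assumes a Llama-family causal LM from transformers, where decoder layers are
# accessible at model.model.layers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

LAYER = 15  # assumed intermediate layer whose output we steer


def mean_hidden(prompts):
    """Average the last-token hidden state at LAYER over a set of prompts."""
    states = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        states.append(out.hidden_states[LAYER][0, -1])
    return torch.stack(states).mean(dim=0)


# Toy prompt sets standing in for answers the model does / does not "believe".
believed = ["Q: What is the capital of France? A: Paris"]
disbelieved = ["Q: What is the capital of France? A: Lyon"]

# Difference-of-means "belief" direction, a common recipe for steering vectors.
direction = mean_hidden(believed) - mean_hidden(disbelieved)
direction = direction / direction.norm()

ALPHA = -4.0  # negative scale: push activations toward "disbelief"


def steer(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden


handle = model.model.layers[LAYER].register_forward_hook(steer)
prompt = "Q: Who wrote 'The Old Man and the Sea'? A: Mark Twain. Is that correct?"
ids = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=40)
print(tokenizer.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))
handle.remove()  # restore the unsteered model
```

Under this kind of setup, comparing generations with and without the hook (or with opposite signs of ALPHA) is what lets one ask whether the belief direction causally changes self-verification and retraction behavior.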