When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction

Abstract

Can large language models (LLMs) admit their mistakes when they should know better? In this work, we define the behavior of acknowledging errors in previously generated answers as "retraction" and aim to understand when and why LLMs choose to retract. We first construct model-specific datasets to evaluate whether a model will retract an incorrect answer that contradicts its own parametric knowledge. While LLMs are capable of retraction, they do so only infrequently. We demonstrate that retraction is closely tied to previously identified indicators of models' internal belief: models fail to retract wrong answers that they "believe" to be factually correct. Steering experiments further demonstrate that internal belief causally influences model retraction. In particular, when the model does not believe its answer, this not only encourages the model to attempt to verify the answer, but also alters attention behavior during self-verification. Finally, we demonstrate that simple supervised fine-tuning significantly improves retraction performance by helping the model learn more accurate internal beliefs. Code and datasets are available at https://github.com/ayyyq/llm-retraction.

