When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction

Abstract

Can large language models (LLMs) admit their mistakes when they should know better? In this work, we define the behavior of acknowledging errors in previously generated answers as "retraction" and aim to understand when and why LLMs choose to retract. We first construct model-specific datasets to evaluate whether a model will retract an incorrect answer that contradicts its own parametric knowledge. While LLMs are capable of retraction, they do so only infrequently. We show that retraction is closely tied to previously identified indicators of models' internal belief: models fail to retract wrong answers that they "believe" to be factually correct. Steering experiments further show that internal belief causally influences model retraction. In particular, when the model does not believe its answer, this not only encourages the model to attempt to verify the answer, but also alters attention behavior during self-verification. Finally, we demonstrate that simple supervised fine-tuning significantly improves retraction performance by helping the model learn more accurate internal beliefs. Code and datasets are available at https://github.com/ayyyq/llm-retraction.
