Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs

Abstract

Although large language models (LLMs) have become transformative, they still make mistakes and can explore unproductive reasoning paths. Self-correction is an important capability for a trustworthy LLM, particularly an autoregressive one. While LLMs can identify errors in user input, they exhibit a systematic "Self-Correction Blind Spot": they fail to correct identical errors in their own outputs. To study this phenomenon, we introduce Self-Correction Bench, a framework that measures it through controlled error injection at three complexity levels. Testing 14 models, we find an average blind spot rate of 64.5%. We find multiple lines of evidence that this limitation relates to training data composition: human training demonstrations predominantly show error-free responses rather than error-correction sequences, whereas RL-trained models learn error correction through outcome feedback. Remarkably, simply appending "Wait" reduces blind spots by 89.3%, suggesting that the capability exists but requires activation. Our work highlights a critical limitation in current LLMs and offers potential avenues for improving their reliability and trustworthiness.
