セルフコレクションベンチ：LLMにおけるセルフコレクションの盲点の解明と対応

要旨

大規模言語モデル（LLM）は革新的な存在となっているものの、依然として誤りを犯したり、非生産的な推論経路を探索したりすることがある。自己修正は、特に自己回帰型のLLMにとって、信頼性を高める重要な能力である。LLMはユーザー入力の誤りを識別できるが、自身の出力において同じ誤りを修正できないという体系的な「自己修正盲点」を示す。この現象を体系的に研究するため、我々は「Self-Correction Bench」を導入した。これは、3つの複雑度レベルで制御された誤り注入を通じてこの現象を測定する体系的なフレームワークである。14のモデルをテストした結果、平均64.5%の盲点率が確認された。この制限が訓練データの構成に関連していることを示す複数の証拠が見つかった。人間による訓練デモンストレーションでは、誤り修正のシーケンスではなく、誤りのない応答が主に示されており、結果フィードバックを通じて誤り修正を学習する強化学習（RL）訓練モデルとは異なる。注目すべきは、単に「待って」と付け加えるだけで盲点が89.3%減少し、この能力が存在するが活性化が必要であることが示唆された。本研究は、現在のLLMにおける重要な制限を明らかにし、その信頼性と信頼性を向上させるための潜在的な道筋を提供する。

English

Although large language models (LLMs) have become transformative, they still make mistakes and can explore unproductive reasoning paths. Self-correction is an important capability for a trustworthy LLM, particularly an autoregressive LLM. While LLMs can identify error in user input, they exhibit a systematic 'Self-Correction Blind Spot' - failing to correct identical error in their own outputs. To systematically study this phenomenon, we introduce Self-Correction Bench, a systematic framework to measure this phenomenon through controlled error injection at three complexity levels. Testing 14 models, we find an average 64.5% blind spot rate. We find multiple evidences that this limitation relates to training data composition: human training demonstrations predominantly show error-free responses rather than error-correction sequences, unlike RL-trained models that learn error correction through outcome feedback. Remarkably, simply appending "Wait" reduces blind spots by 89.3%, suggesting that the capability exists but requires activation. Our work highlights a critical limitation in current LLMs and offers potential avenues for improving their reliability and trustworthiness.

セルフコレクションベンチ：LLMにおけるセルフコレクションの盲点の解明と対応

Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs

要旨

Support