Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs
July 3, 2025
Author: Ken Tsui
cs.AI
Abstract
Although large language models (LLMs) have become transformative, they still
make mistakes and can explore unproductive reasoning paths. Self-correction is
an important capability for a trustworthy LLM, particularly an autoregressive
LLM. While LLMs can identify errors in user input, they exhibit a systematic
'Self-Correction Blind Spot' - failing to correct identical errors in their own
outputs. To study this phenomenon systematically, we introduce Self-Correction
Bench, a framework that measures it through controlled error injection at
three complexity levels. Testing 14 models, we find an
average 64.5% blind spot rate. Multiple lines of evidence suggest that this
limitation relates to training data composition: human training demonstrations
predominantly show error-free responses rather than error-correction sequences,
unlike RL-trained models that learn error correction through outcome feedback.
Remarkably, simply appending "Wait" reduces blind spots by 89.3%, suggesting
that the capability exists but requires activation. Our work highlights a
critical limitation in current LLMs and offers potential avenues for improving
their reliability and trustworthiness.
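The "Wait" intervention described in the abstract can be pictured as a simple decoding-time prompt manipulation: append "Wait" to the model's own (erroneous) output and let it continue generating. The sketch below is a hedged illustration, not the paper's actual implementation; the helper names (`inject_error`, `build_continuation_prompt`), the chat format, and the string-swap error injection are all assumptions for demonstration.

```python
# Minimal sketch of the decoding-time "Wait" intervention: after the model
# emits a (possibly erroneous) partial response, append "Wait" and re-prompt,
# which the paper reports activates latent self-correction behavior.

def inject_error(correct_step: str) -> str:
    """Stand-in for controlled error injection (assumption: a string swap
    that corrupts one arithmetic step)."""
    return correct_step.replace("= 4", "= 5")

def build_continuation_prompt(question: str, partial_response: str) -> str:
    """Re-prompt the model with its own erroneous output plus 'Wait',
    so the next generated tokens continue after the trigger word."""
    return (
        f"User: {question}\n"
        f"Assistant: {partial_response} Wait"
    )

question = "What is 2 + 2?"
partial = inject_error("2 + 2 = 4.")          # becomes "2 + 2 = 5."
prompt = build_continuation_prompt(question, partial)
print(prompt)
```

In practice the resulting prompt would be fed back to the model for continued generation; the measurement of interest is whether the continuation corrects the injected error, with and without the trailing "Wait".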