Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs

Abstract

Although large language models (LLMs) have become transformative, they still make mistakes and can explore unproductive reasoning paths. Self-correction is an important capability for a trustworthy LLM, particularly an autoregressive one. While LLMs can identify errors in user input, they exhibit a systematic "Self-Correction Blind Spot": they fail to correct identical errors in their own outputs. To study this phenomenon, we introduce Self-Correction Bench, a framework that measures it through controlled error injection at three complexity levels. Testing 14 models, we find an average blind spot rate of 64.5%. We find multiple lines of evidence that this limitation relates to training data composition: human training demonstrations predominantly show error-free responses rather than error-correction sequences, unlike RL-trained models, which learn error correction through outcome feedback. Remarkably, simply appending "Wait" reduces blind spots by 89.3%, suggesting that the capability exists but requires activation. Our work highlights a critical limitation in current LLMs and offers potential avenues for improving their reliability and trustworthiness.
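To make the two headline quantities concrete, here is a minimal Python sketch of how a blind spot rate and its relative reduction could be computed. The abstract does not give the exact formula, so the definition used here (the relative drop in correction accuracy when the same error appears in the model's own output rather than in user input) is an assumption, and the function names and all numbers in the usage lines are hypothetical, not results from the paper.

```python
def blind_spot_rate(external_acc: float, internal_acc: float) -> float:
    """Assumed definition: relative drop in correction accuracy when an
    identical error sits in the model's own output (internal) instead of
    in the user's input (external). Not taken from the paper."""
    if not 0.0 < external_acc <= 1.0:
        raise ValueError("external_acc must be in (0, 1]")
    if not 0.0 <= internal_acc <= 1.0:
        raise ValueError("internal_acc must be in [0, 1]")
    # Clamp at zero so a model that does better on its own errors
    # is reported as having no blind spot rather than a negative one.
    return max(0.0, 1.0 - internal_acc / external_acc)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a rate shrinks, e.g. after appending "Wait"."""
    if before <= 0.0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Hypothetical accuracies for a single model, for illustration only.
baseline = blind_spot_rate(external_acc=0.8, internal_acc=0.2)
with_wait = blind_spot_rate(external_acc=0.8, internal_acc=0.74)
print(f"baseline blind spot rate: {baseline:.3f}")
print(f"with 'Wait' appended:     {with_wait:.3f}")
print(f"relative reduction:       {relative_reduction(baseline, with_wait):.3f}")
```

Under this assumed definition, a rate of 0 means the model corrects its own errors as reliably as it corrects the user's, and a rate of 1 means it never corrects its own.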
