Denoising Reinforcement Learning: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

Abstract

Reinforcement learning has become a central paradigm for advancing reasoning in large language models, yet most existing methods still depend on stronger teacher models or heavily curated difficult datasets, which limits scalable capability improvement. In this paper, we introduce DenoiseRL, a reinforcement learning framework that replaces external supervision with recovery-oriented optimization over failures produced by weak models. DenoiseRL learns directly from incorrect reasoning traces, treating each one as an opportunity for the model to recover, which makes training more scalable and less dependent on external resources. These traces yield a richer and more diverse learning signal and improve exploration efficiency from imperfect model behavior. As a result, DenoiseRL improves reasoning performance and overall training efficiency while reducing the need for expensive data curation or stronger teacher models. Empirically, DenoiseRL consistently outperforms strong on-policy RL baselines across competitive mathematical and general reasoning benchmarks and promotes stronger self-corrective behavior as training difficulty increases, offering an effective and scalable alternative pathway for improving reasoning in large language models.