DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

Abstract

Reinforcement learning has become a central paradigm for advancing reasoning in large language models, yet most existing methods still depend on stronger teacher models or heavily curated difficult datasets, which limits scalable capability improvement. In this paper, we introduce DenoiseRL, a reinforcement learning framework that replaces external supervision with recovery-oriented optimization over failures from weak models. DenoiseRL learns directly from incorrect reasoning traces by converting them into opportunities for improvement, which yields a richer and more diverse learning signal and improves exploration efficiency from imperfect model behavior. As a result, DenoiseRL improves reasoning performance and overall training efficiency while reducing the need for expensive data curation or stronger teacher models. Empirically, DenoiseRL consistently outperforms strong on-policy RL baselines across competitive mathematical and general reasoning benchmarks, and it promotes stronger self-corrective behavior as training difficulty increases, offering an effective and scalable alternative for improving reasoning in large language models.