DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes

Abstract

Reinforcement learning has become a central paradigm for improving reasoning in large language models, yet most existing methods still depend on stronger teacher models or heavily curated sets of difficult problems, which limits scalable capability improvement. In this paper, we introduce DenoiseRL, a reinforcement learning framework that replaces external supervision with recovery-oriented optimization over the failures of weak models. DenoiseRL learns directly from incorrect reasoning traces, treating each one as an opportunity for improvement, which makes training more scalable and less dependent on external resources. This yields a richer and more diverse learning signal and improves exploration efficiency from imperfect model behavior. As a result, DenoiseRL improves reasoning performance and overall training efficiency while reducing the need for expensive data curation or stronger teacher models. Empirically, DenoiseRL consistently outperforms strong on-policy RL baselines across competitive mathematical and general reasoning benchmarks and promotes stronger self-corrective behavior as training difficulty increases, pointing to an effective and scalable alternative pathway for improving reasoning in large language models.
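
To make the idea of recovery from noisy prefixes concrete, the following is a minimal sketch of one possible reading, not the procedure used by DenoiseRL itself, which the abstract does not specify. It assumes that a failed trace from a weak model is truncated to a prefix, that the policy is asked to continue from that prefix, and that the continuation is scored with a binary correctness reward. All names (`Trace`, `noisy_prefix_prompt`, `recovery_reward`, `build_recovery_prompts`) and the `keep_fraction` parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trace:
    """A reasoning trace from a (hypothetical) weak model."""
    question: str
    steps: list[str]    # intermediate reasoning steps
    final_answer: str   # answer the weak model arrived at
    reference: str      # ground-truth answer, used only for scoring


def is_failure(trace: Trace) -> bool:
    # A trace counts as a failure when its final answer is wrong.
    return trace.final_answer.strip() != trace.reference.strip()


def noisy_prefix_prompt(trace: Trace, keep_fraction: float) -> str:
    # Assumption: the "noisy prefix" is the leading part of a failed trace.
    # At least one step is always kept.
    n_keep = max(1, int(len(trace.steps) * keep_fraction))
    prefix = "\n".join(trace.steps[:n_keep])
    return f"{trace.question}\n{prefix}\n"


def recovery_reward(continuation_answer: str, reference: str) -> float:
    # Assumption: binary reward, 1.0 if the policy recovers the right answer.
    return 1.0 if continuation_answer.strip() == reference.strip() else 0.0


def build_recovery_prompts(traces, keep_fraction=0.5):
    # Only failed traces become training prompts; correct ones are skipped.
    return [
        (noisy_prefix_prompt(t, keep_fraction), t.reference)
        for t in traces
        if is_failure(t)
    ]
```

In such a setup, each `(prompt, reference)` pair would be handed to an on-policy RL loop: the policy continues from the flawed prefix, and `recovery_reward` scores whether it corrected course. How prefixes are actually chosen and rewarded in DenoiseRL is not stated in the abstract.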