RLoop: A Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization

Abstract

While Reinforcement Learning for Verifiable Rewards (RLVR) is powerful for training large reasoning models, its training dynamics harbor a critical challenge: RL overfitting, where models gain training rewards but lose generalization. Our analysis reveals that this is driven by policy over-specialization and by catastrophic forgetting of the diverse solutions generated during training. Standard optimization discards this valuable inter-step policy diversity. To address this, we introduce RLoop, a self-improving framework built on iterative policy initialization. RLoop transforms the standard training process into a virtuous cycle: it first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset. Rejection-sampling Fine-Tuning (RFT) on this dataset refines the initial policy, creating a superior starting point for the next iteration. This loop of exploration and exploitation via iterative re-initialization effectively converts transient policy variations into robust performance gains. Our experiments show that RLoop mitigates forgetting and substantially improves generalization, boosting average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.
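
The cycle described in the abstract can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `rloop`, `rl_explore`, `verify`, `rft`, and `toy_demo` are hypothetical, and the toy stand-ins replace real RL training and fine-tuning with dictionary updates so the control flow can be run end to end.

```python
from typing import Callable, Dict, List, Tuple

Trajectory = Tuple[str, str]  # (prompt, response); a hypothetical representation
Policy = Dict[str, str]       # toy policy: prompt -> preferred response


def rloop(
    initial_policy: Policy,
    rl_explore: Callable[[Policy], List[Trajectory]],
    verify: Callable[[Trajectory], bool],
    rft: Callable[[Policy, List[Trajectory]], Policy],
    num_iterations: int,
) -> Policy:
    """Sketch of the explore / filter / re-initialize cycle."""
    policy = initial_policy
    for _ in range(num_iterations):
        # Exploration: run RL from the current starting policy and keep
        # the trajectories produced along the way.
        trajectories = rl_explore(policy)
        # Filtering: retain only trajectories the verifier accepts.
        expert_data = [t for t in trajectories if verify(t)]
        # Exploitation: fine-tune the iteration's starting policy on the
        # expert data; the result seeds the next iteration.
        if expert_data:
            policy = rft(policy, expert_data)
    return policy


def toy_demo() -> Policy:
    """Toy stand-ins (assumptions for illustration only)."""
    truth = {"1+1": "2", "2+3": "5"}

    def rl_explore(policy: Policy) -> List[Trajectory]:
        # Propose the policy's current answer plus two fixed alternatives.
        proposals = []
        for prompt in truth:
            for answer in {policy.get(prompt, "0"), "2", "5"}:
                proposals.append((prompt, answer))
        return proposals

    def verify(trajectory: Trajectory) -> bool:
        prompt, answer = trajectory
        return truth[prompt] == answer

    def rft(policy: Policy, expert_data: List[Trajectory]) -> Policy:
        # "Fine-tuning" here just copies verified answers into the policy.
        updated = dict(policy)
        updated.update(dict(expert_data))
        return updated

    return rloop({}, rl_explore, verify, rft, num_iterations=2)
```

In this sketch, `rft` receives the policy the iteration started from rather than whatever policy RL ended with, which mirrors the abstract's statement that RFT refines the initial policy.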
