Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

Abstract

Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often struggle in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework that addresses these limitations by decomposing each step into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism to control off-policy drift, and a final slow correction. This reposition-before-update design leaves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on average on math reasoning benchmarks. It also achieves up to 4.93× fewer rollouts and a 4.19× reduction in wall-clock time to match GRPO's best accuracy.
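The three stages named in the abstract can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify how the reposition step works, so the linear interpolation back toward the starting point (weight `alpha`) is an assumption. The names `sfpo_step`, `grad_fn`, `fast_steps`, and `alpha` are hypothetical, and a simple quadratic loss stands in for a policy-gradient objective evaluated on one fixed batch.

```python
from typing import Callable, List

Vector = List[float]


def sfpo_step(
    theta: Vector,
    grad_fn: Callable[[Vector], Vector],
    lr: float = 0.1,
    fast_steps: int = 3,
    alpha: float = 0.5,
) -> Vector:
    """One slow-fast step on a fixed batch (hypothetical sketch)."""
    start = list(theta)

    # Stage 1, fast trajectory: several inner gradient steps on the same batch.
    fast = list(theta)
    for _ in range(fast_steps):
        grad = grad_fn(fast)
        fast = [p - lr * g for p, g in zip(fast, grad)]

    # Stage 2, reposition: pull the fast endpoint back toward the start to
    # limit drift. ASSUMPTION: linear interpolation with weight alpha.
    repositioned = [s + alpha * (f - s) for s, f in zip(start, fast)]

    # Stage 3, slow correction: one more gradient step from the repositioned point.
    grad = grad_fn(repositioned)
    return [p - lr * g for p, g in zip(repositioned, grad)]


# Toy stand-in for a batch loss: 0.5 * ||theta - TARGET||^2 (to be minimized).
TARGET = [1.0, -2.0]


def toy_loss(theta: Vector) -> float:
    return 0.5 * sum((p - t) ** 2 for p, t in zip(theta, TARGET))


def toy_grad(theta: Vector) -> Vector:
    return [p - t for p, t in zip(theta, TARGET)]


if __name__ == "__main__":
    theta0 = [0.0, 0.0]
    theta1 = sfpo_step(theta0, toy_grad)
    print("loss before:", toy_loss(theta0))
    print("loss after: ", toy_loss(theta1))
```

With `alpha=0.0` the reposition returns to the starting point, so the sketch reduces to a single plain gradient step. Larger values of `alpha` keep more of the fast trajectory. Because only the parameter update changes, the loss function and the batch are left untouched, which mirrors the plug-compatibility claim in the abstract.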
