Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

Abstract

Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often struggle in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework that addresses these limitations by decomposing each step into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism to control off-policy drift, and a final slow correction. This reposition-before-update design leaves the objective and the rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces the number of rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on average on math reasoning benchmarks. It also needs up to 4.93× fewer rollouts and up to 4.19× less wall-clock time to match GRPO's best accuracy.
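
To make the three-stage decomposition concrete, the following is a minimal sketch on a toy problem, not the authors' implementation. The abstract does not give the exact reposition rule, so this sketch assumes a linear interpolation from the fast endpoint back toward the starting parameters, controlled by a hypothetical coefficient `alpha`. The names `sfpo_style_step` and `toy_grad`, the plain gradient-descent updates, and the values of `lr` and `inner_steps` are likewise illustrative assumptions. A toy quadratic stands in for the policy-gradient objective.

```python
from typing import Callable, List

Vector = List[float]


def sfpo_style_step(
    theta: Vector,
    grad_fn: Callable[[Vector], Vector],
    lr: float = 0.1,
    inner_steps: int = 3,
    alpha: float = 0.5,
) -> Vector:
    """One slow-fast style update on a single batch (hypothetical sketch)."""
    start = list(theta)

    # Stage 1 (fast): several inner gradient steps on the same batch.
    fast = list(theta)
    for _ in range(inner_steps):
        grad = grad_fn(fast)
        fast = [p - lr * g for p, g in zip(fast, grad)]

    # Stage 2 (reposition): pull the fast endpoint back toward the start to
    # limit drift. ASSUMPTION: linear interpolation with coefficient alpha;
    # the abstract does not state the actual rule.
    repositioned = [s + alpha * (f - s) for s, f in zip(start, fast)]

    # Stage 3 (slow): one correction step taken from the repositioned point.
    grad = grad_fn(repositioned)
    return [p - lr * g for p, g in zip(repositioned, grad)]


def toy_grad(theta: Vector) -> Vector:
    """Gradient of the toy objective 0.5 * ||theta||^2 (stand-in for a real loss)."""
    return list(theta)


theta = [1.0, -2.0]
new_theta = sfpo_style_step(theta, toy_grad)
plain_step = sfpo_style_step(theta, toy_grad, alpha=0.0)
```

In this sketch, setting `alpha=0.0` discards the fast trajectory entirely, so the update reduces to a single ordinary gradient step from the starting parameters. Larger values of `alpha` keep more of the progress made by the inner steps, at the cost of moving further from the parameters that generated the batch.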