
Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

October 5, 2025
Authors: Ziyan Wang, Zheng Wang, Jie Fu, Xingwei Qu, Qi Cheng, Shengpu Tang, Minjia Zhang, Xiaoming Huo
cs.AI

Abstract

Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework that addresses these limitations by decomposing each step into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism to control off-policy drift, and a final slow correction. This reposition-before-update design leaves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on average on math reasoning benchmarks. It also achieves up to 4.93× fewer rollouts and a 4.19× reduction in wall-clock time to match GRPO's best accuracy.
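The abstract describes each SFPO step as a short fast trajectory of inner updates on the same batch, a reposition step that pulls parameters back toward the pre-trajectory point to limit off-policy drift, and a final slow correction. The sketch below is one minimal reading of that loop under stated assumptions, not the authors' implementation: the names `policy_loss`, `fast_steps`, `fast_lr`, `beta`, and `slow_lr` are illustrative placeholders, and plain SGD stands in for whatever optimizer the paper actually uses.

```python
# Minimal illustrative sketch of a slow-fast "reposition-before-update" step.
# All hyperparameters and helper names here are assumptions for illustration.
import torch


def sfpo_step(policy, batch, policy_loss, fast_steps=3, fast_lr=1e-2,
              beta=0.5, slow_lr=1e-2):
    """One SFPO-style update: fast inner steps -> reposition -> slow correction."""
    # Snapshot the current (slow) parameters before the fast trajectory.
    slow_params = [p.detach().clone() for p in policy.parameters()]

    # 1) Fast trajectory: a few gradient steps on the same batch.
    fast_opt = torch.optim.SGD(policy.parameters(), lr=fast_lr)
    for _ in range(fast_steps):
        fast_opt.zero_grad()
        policy_loss(policy, batch).backward()
        fast_opt.step()

    # 2) Reposition: interpolate back toward the slow parameters to
    #    control the off-policy drift accumulated by the fast steps.
    with torch.no_grad():
        for p, s in zip(policy.parameters(), slow_params):
            p.copy_(s + beta * (p - s))

    # 3) Slow correction: one final gradient step from the repositioned point.
    slow_opt = torch.optim.SGD(policy.parameters(), lr=slow_lr)
    slow_opt.zero_grad()
    policy_loss(policy, batch).backward()
    slow_opt.step()
```

Because the rollouts and the loss are untouched and only the parameter-update schedule changes, a step like this can wrap an existing GRPO-style training loop, which is what the abstract means by "plug-compatible with existing policy-gradient pipelines."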