Group Sequence Policy Optimization
July 24, 2025
Authors: Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin
cs.AI
Abstract
This paper introduces Group Sequence Policy Optimization (GSPO), our stable,
efficient, and performant reinforcement learning algorithm for training large
language models. Unlike previous algorithms that adopt token-level importance
ratios, GSPO defines the importance ratio based on sequence likelihood and
performs sequence-level clipping, rewarding, and optimization. We demonstrate
that GSPO achieves superior training efficiency and performance compared to the
GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and
has the potential to simplify the design of RL infrastructure. These merits
of GSPO have contributed to the remarkable improvements in the latest Qwen3
models.
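
The abstract's central idea, replacing token-level importance ratios with a single sequence-level ratio that is clipped and optimized per response, can be made concrete with a short sketch. The snippet below illustrates a sequence-level clipped surrogate in the spirit of GSPO, assuming the length-normalized sequence ratio s_i = (pi_theta(y_i|x) / pi_old(y_i|x))^(1/|y_i|); the function name `gspo_loss` and all tensor names are illustrative and not taken from any released implementation.

```python
import torch

def gspo_loss(logp_new, logp_old, advantages, seq_lens, eps=0.2):
    """Sketch of a sequence-level clipped surrogate in the spirit of GSPO.

    logp_new / logp_old: summed token log-probabilities of each sampled
        response under the current and old policies, shape (G,) for a group.
    advantages: group-normalized rewards, shape (G,).
    seq_lens: response lengths |y_i| as floats, shape (G,).
    """
    # Length-normalized sequence importance ratio:
    #   s_i = (pi_theta(y_i | x) / pi_old(y_i | x)) ** (1 / |y_i|)
    ratio = torch.exp((logp_new - logp_old) / seq_lens)
    # Clip at the sequence level, mirroring PPO's clipped objective
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    # Pessimistic surrogate averaged over the group; negated for minimization
    return -torch.min(ratio * advantages, clipped * advantages).mean()
```

Note that in this formulation clipping excludes whole responses, rather than individual tokens, from the gradient update, which is the sequence-level clipping behavior the abstract describes.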