Group Sequence Policy Optimization
July 24, 2025
Authors: Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin
cs.AI
Abstract
This paper introduces Group Sequence Policy Optimization (GSPO), our stable,
efficient, and performant reinforcement learning algorithm for training large
language models. Unlike previous algorithms that adopt token-level importance
ratios, GSPO defines the importance ratio based on sequence likelihood and
performs sequence-level clipping, rewarding, and optimization. We demonstrate
that GSPO achieves superior training efficiency and performance compared to the
GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and
has the potential to simplify the design of RL infrastructure. These merits
of GSPO have contributed to the remarkable improvements in the latest Qwen3
models.
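
To make the sequence-level formulation concrete, the following is a minimal PyTorch sketch of a clipped, sequence-level policy-gradient loss. It assumes (as a reading of the abstract, not code from the paper) that the importance ratio is a length-normalized sequence likelihood ratio and that advantages are group-normalized sequence rewards; `gspo_loss` and all variable names are illustrative.

```python
import torch

def gspo_loss(logp_new, logp_old, advantages, mask, eps=0.2):
    """Sketch of a sequence-level clipped policy-gradient loss.

    Assumption (not taken verbatim from the paper): the importance
    ratio is the length-normalized sequence likelihood ratio
    s_i = (pi_theta(y_i|x) / pi_old(y_i|x)) ** (1/|y_i|).

    logp_new, logp_old: (batch, seq_len) per-token log-probs under
        the current and old policies.
    advantages: (batch,) one advantage per sequence.
    mask: (batch, seq_len) 1.0 for response tokens, 0.0 for padding.
    """
    lengths = mask.sum(dim=-1).clamp(min=1.0)
    # Sequence log-likelihood ratio, averaged over response tokens,
    # i.e. the log of the length-normalized likelihood ratio.
    log_ratio = ((logp_new - logp_old) * mask).sum(dim=-1) / lengths
    s = torch.exp(log_ratio)  # one importance ratio per sequence
    # Clip the whole sequence's ratio, rather than each token's
    # ratio as in token-level methods such as GRPO.
    unclipped = s * advantages
    clipped = torch.clamp(s, 1.0 - eps, 1.0 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Usage: a group of 4 sampled responses to one prompt, with
# group-normalized (zero-mean, unit-variance) sequence rewards.
B, T = 4, 16
logp_old = -torch.rand(B, T) * 2.0              # placeholder old log-probs
logp_new = logp_old + 0.05 * torch.randn(B, T)  # slightly updated policy
mask = torch.ones(B, T)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
loss = gspo_loss(logp_new, logp_old, advantages, mask)
```

In this sketch each sequence contributes a single clipped term to the objective, which is the distinction the abstract draws from GRPO's token-level importance ratios and per-token clipping.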