
Group Sequence Policy Optimization

July 24, 2025
Authors: Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, Junyang Lin
cs.AI

Abstract

This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential for simplifying the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
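The abstract's core idea, replacing per-token importance ratios with a single ratio per sequence, can be illustrated with a short sketch. The PyTorch snippet below is a hypothetical rendering rather than the paper's reference implementation: it assumes a length-normalized sequence-likelihood ratio and GRPO-style group-normalized advantages, and the names (gspo_loss, clip_eps) are illustrative.

```python
import torch

def gspo_loss(new_logps, old_logps, mask, rewards, clip_eps=0.2):
    """Sequence-level clipped policy loss in the spirit of GSPO (sketch).

    new_logps, old_logps: (G, T) per-token log-probs for a group of G
        responses sampled for the same query (T = max length, padded).
    mask: (G, T) 1.0 for real tokens, 0.0 for padding.
    rewards: (G,) scalar rewards, one per response.
    """
    lengths = mask.sum(dim=-1)  # |y_i| for each response

    # Length-normalized sequence likelihood ratio:
    #   s_i = (pi_new(y_i | x) / pi_old(y_i | x)) ** (1 / |y_i|)
    log_ratio = ((new_logps - old_logps.detach()) * mask).sum(dim=-1) / lengths
    seq_ratio = log_ratio.exp()  # (G,), one ratio per sequence

    # Group-normalized advantage, as in GRPO:
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipping, applied at the sequence level:
    unclipped = seq_ratio * adv
    clipped = seq_ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()

# Example with dummy tensors for a group of 4 responses of length 16:
G, T = 4, 16
new_lp = torch.randn(G, T) - 3.0              # stand-in per-token log-probs
old_lp = new_lp + 0.05 * torch.randn(G, T)
mask = torch.ones(G, T)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(gspo_loss(new_lp, old_lp, mask, rewards))
```

Because clipping acts on one ratio per response, an entire sequence is either kept in or excluded from the gradient update, matching the sequence-level clipping, rewarding, and optimization described in the abstract.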