Group Sequence Policy Optimization

Abstract

This paper introduces Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant reinforcement learning algorithm for training large language models. Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization. We demonstrate that GSPO achieves superior training efficiency and performance compared to the GRPO algorithm, notably stabilizes Mixture-of-Experts (MoE) RL training, and has the potential to simplify the design of RL infrastructure. These merits of GSPO have contributed to the remarkable improvements in the latest Qwen3 models.
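To make the contrast with token-level ratios concrete, the following is a minimal sketch of a sequence-level importance ratio with sequence-level clipping. It is an illustration under stated assumptions, not the authors' implementation: the length normalization of the ratio, the group-normalized advantage, the clipping range `eps=0.2`, and all function names are assumptions that the abstract itself does not specify.

```python
import math


def sequence_ratio(new_logps, old_logps):
    """Importance ratio computed from the likelihood of a whole response.

    new_logps / old_logps: per-token log-probabilities of one response under
    the current and the old policy. Dividing by the length (an assumption)
    keeps the ratio on a comparable scale across responses of different length.
    """
    n = len(new_logps)
    return math.exp((sum(new_logps) - sum(old_logps)) / n)


def group_advantages(rewards):
    """Normalize rewards within a group of responses to the same query (assumed form)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + 1e-8) for r in rewards]


def sequence_level_objective(new_logps, old_logps, rewards, eps=0.2):
    """Clipped surrogate objective where clipping acts on whole sequences.

    new_logps / old_logps: one list of token log-probabilities per response.
    rewards: one scalar reward per response.
    """
    advantages = group_advantages(rewards)
    total = 0.0
    for nl, ol, adv in zip(new_logps, old_logps, advantages):
        s = sequence_ratio(nl, ol)
        clipped = min(max(s, 1.0 - eps), 1.0 + eps)
        # One ratio and one advantage per response, rather than one per token.
        total += min(s * adv, clipped * adv)
    return total / len(rewards)


if __name__ == "__main__":
    new = [[-0.5, -0.5], [-1.0, -1.0]]
    old = [[-1.0, -1.0], [-1.0, -1.0]]
    print(sequence_level_objective(new, old, rewards=[1.0, 0.0]))
```

In this sketch every token of a response shares the same ratio, so a response is either clipped as a whole or not at all. A token-level scheme would instead compute and clip a separate ratio for each token.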