Soft Adaptive Policy Optimization
November 25, 2025
Authors: Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, Junyang Lin
cs.AI
Abstract
Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance (a phenomenon exacerbated in Mixture-of-Experts models), leading to unstable policy updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, but struggle to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.
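The abstract describes SAPO's mechanism only at a high level, so the sketch below is a minimal, hypothetical illustration of the contrast it draws: a GRPO/PPO-style hard clip that zeroes gradients for off-policy tokens versus a smooth, temperature-controlled gate that attenuates them instead. The gate shape (a Gaussian of the token log-ratio with temperature `tau`), the `detach`, and the function names are assumptions for illustration only; they are not the paper's actual SAPO objective, which additionally enforces sequence-level coherence not shown here.

```python
import torch

def hard_clip_weight(ratio, adv, eps=0.2):
    # GRPO/PPO-style token objective: min(r * A, clip(r, 1-eps, 1+eps) * A).
    # Tokens whose ratio leaves the [1-eps, 1+eps] band contribute no gradient.
    return torch.minimum(ratio * adv, torch.clamp(ratio, 1 - eps, 1 + eps) * adv)

def soft_gate_weight(ratio, adv, tau=0.05):
    # Hypothetical smooth alternative: a temperature-controlled gate that decays
    # with the token's distance from the on-policy point (log-ratio = 0), so
    # highly off-policy tokens are down-weighted rather than hard-truncated.
    log_ratio = torch.log(ratio)
    gate = torch.exp(-(log_ratio ** 2) / tau)  # in (0, 1]; equals 1 when on-policy
    return gate.detach() * ratio * adv          # detach: gate acts as a pure weight

# Toy usage: three tokens in one group, one of them strongly off-policy.
ratio = torch.tensor([1.02, 0.95, 3.00], requires_grad=True)
adv = torch.tensor([0.7, 0.7, 0.7])
print(hard_clip_weight(ratio, adv))  # off-policy token hits the clip: zero gradient
print(soft_gate_weight(ratio, adv))  # off-policy token is smoothly attenuated
```

In this toy setting the third token sits far outside the clipping band, so the hard-clipped objective passes no gradient through it, while the gated form keeps a small, smoothly decaying contribution and leaves the two near-on-policy tokens essentially untouched; how SAPO actually shapes and aggregates this gate at the sequence level is specified in the paper.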