
Soft Adaptive Policy Optimization

November 25, 2025
Authors: Chang Gao, Chujie Zheng, Xiong-Hui Chen, Kai Dang, Shixuan Liu, Bowen Yu, An Yang, Shuai Bai, Jingren Zhou, Junyang Lin
cs.AI

Abstract

Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains challenging. Token-level importance ratios often exhibit high variance, a phenomenon exacerbated in Mixture-of-Experts models, leading to unstable updates. Existing group-based policy optimization methods, such as GSPO and GRPO, alleviate this problem via hard clipping, but struggle to maintain both stability and effective learning. We propose Soft Adaptive Policy Optimization (SAPO), which replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO. When a sequence contains a few highly off-policy tokens, GSPO suppresses all gradients for that sequence, whereas SAPO selectively down-weights only the offending tokens and preserves the learning signal from the near-on-policy ones, improving sample efficiency. Relative to GRPO, SAPO replaces hard token-level clipping with smooth, temperature-controlled scaling, enabling more informative and stable updates. Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes. Overall, SAPO provides a more reliable, scalable, and effective optimization strategy for RL training of LLMs.
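To make the hard-clipping vs. soft-gating contrast concrete, here is a minimal token-level sketch. The bell-shaped gate over the log importance ratio, the temperature `tau`, the gate detachment, and the function names are illustrative assumptions chosen for exposition; the abstract does not specify SAPO's actual gate, and this toy does not capture SAPO's sequence-level coherence.

```python
import torch

def hard_clip_objective(ratio, advantage, eps=0.2):
    """PPO/GRPO-style objective with a hard clip on the token-level
    importance ratio; once the clipped branch is active, the token
    contributes no gradient."""
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return torch.minimum(unclipped, clipped)

def soft_gate_objective(ratio, advantage, tau=0.5):
    """Illustrative smooth, temperature-controlled gate (an assumption,
    not the exact SAPO gate). The weight is 1 for on-policy tokens
    (ratio == 1) and decays continuously as |log ratio| grows, so
    mildly off-policy tokens keep a reduced but nonzero learning signal."""
    log_ratio = torch.log(ratio)
    gate = torch.exp(-(log_ratio / tau) ** 2)  # bell-shaped soft trust region
    # Detach the gate so it acts purely as a per-token weight on the
    # standard importance-weighted policy-gradient term.
    return gate.detach() * ratio * advantage

# Toy usage: ratios from token log-probs under the new and old policies.
new_logp = torch.tensor([-1.0, -0.9, -0.5], requires_grad=True)
old_logp = torch.tensor([-1.1, -0.9, -1.0])
advantage = torch.tensor([1.0, 1.0, 1.0])
ratio = torch.exp(new_logp - old_logp)
print(hard_clip_objective(ratio, advantage))  # third token hits the clip band: zero gradient
print(soft_gate_objective(ratio, advantage))  # third token is down-weighted but still learns
```

In this toy, the third token's ratio (about 1.65) exceeds the hard clip band, so the clipped objective is a constant and its gradient vanishes, whereas the smooth gate only scales it down; how the real SAPO gate shapes and normalizes this weighting is detailed in the paper itself.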