

Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

February 5, 2026
Authors: Fanfan Liu, Youyang Yin, Peng Shi, Siqi Yang, Zhixiong Zeng, Haibo Qiu
cs.AI

Abstract

Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, we rectify the length bias inherent in Group Sequence Policy Optimization (GSPO), rendering its loss function unbiased with respect to response length and thereby resolving the issue of response length collapse. We conduct extensive experiments across mathematical reasoning benchmarks and multimodal reasoning scenarios, where LUSPO consistently achieves superior performance. Empirical results demonstrate that LUSPO is a novel, state-of-the-art optimization strategy, outperforming existing methods such as GRPO and GSPO.
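
For orientation, the display below reproduces the standard GSPO sequence-level importance ratio and clipped group objective as commonly stated, using generic RLVR notation: x is the prompt, y_i the i-th sampled response of length |y_i|, Â_i its group-normalized advantage, G the group size, and ε the clipping range; none of these symbols are taken from this paper. The 1/|y_i| exponent is one explicit place where response length enters the loss. This is background only; the precise length-unbiased correction that defines LUSPO is given in the paper itself.

s_i(\theta)
  = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\text{old}}}(y_i \mid x)} \right)^{1/|y_i|}
  = \exp\left( \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})} \right)

\mathcal{J}_{\text{GSPO}}(\theta)
  = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \min\!\left( s_i(\theta)\, \hat{A}_i,\ \operatorname{clip}\!\left( s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon \right) \hat{A}_i \right) \right]

Because the per-token log-ratios are averaged over |y_i| before exponentiation, the scale of each sequence's contribution to the policy gradient is coupled to its length; this is the kind of length coupling that a length-unbiased objective must remove.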