

Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

February 5, 2026
Authors: Fanfan Liu, Youyang Yin, Peng Shi, Siqi Yang, Zhixiong Zeng, Haibo Qiu
cs.AI

Abstract

Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, we rectify the length bias inherent in Group Sequence Policy Optimization (GSPO), rendering its loss function unbiased with respect to response length and thereby resolving the issue of response length collapse. We conduct extensive experiments across mathematical reasoning benchmarks and multimodal reasoning scenarios, where LUSPO consistently achieves superior performance. Empirical results demonstrate that LUSPO is a novel, state-of-the-art optimization strategy, outperforming existing methods such as GRPO and GSPO.
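
The abstract does not spell out LUSPO's exact loss, but the contrast it draws with GSPO can be illustrated with a small sketch. The snippet below is a hypothetical PyTorch example, not the authors' implementation: the function name, clipping range, and toy data are illustrative assumptions. It computes a clipped sequence-level surrogate loss in two ways: a GSPO-style ratio that averages the per-token log-ratios over the response length, and a length-unbiased alternative that keeps the raw sequence log-ratio so the update scale does not depend on how long the response is.

```python
# Hypothetical sketch, not the authors' implementation: it contrasts a
# GSPO-style length-normalized sequence importance ratio with one
# illustrative length-unbiased alternative. Function names, the clipping
# range, and the toy data are assumptions made for this example.
import torch

def sequence_surrogate_loss(logp_new, logp_old, advantage, mask, length_unbiased=True):
    """Clipped sequence-level policy surrogate for a single sampled response.

    logp_new, logp_old: (T,) per-token log-probs under current / old policy.
    advantage: scalar group-relative advantage assigned to the whole response.
    mask: (T,) 1.0 for response tokens, 0.0 for padding.
    """
    resp_len = mask.sum()
    log_ratio_sum = ((logp_new - logp_old) * mask).sum()
    if length_unbiased:
        # Length-unbiased branch (assumption): keep the raw sequence
        # log-ratio, so the update scale does not depend on response length.
        seq_ratio = torch.exp(log_ratio_sum)
    else:
        # GSPO-style branch: the geometric mean of token ratios divides the
        # log-ratio by the response length, coupling the update to length.
        seq_ratio = torch.exp(log_ratio_sum / resp_len)
    clipped = torch.clamp(seq_ratio, 0.8, 1.2)  # illustrative clip range
    # Standard PPO-style pessimistic surrogate, negated for minimization.
    return -torch.min(seq_ratio * advantage, clipped * advantage)

# Toy usage: a 6-token response with a positive group-relative advantage.
torch.manual_seed(0)
logp_old = torch.randn(6)
logp_new = logp_old + 0.01 * torch.randn(6)
mask = torch.ones(6)
adv = torch.tensor(1.0)
print(sequence_surrogate_loss(logp_new, logp_old, adv, mask, length_unbiased=True))
print(sequence_surrogate_loss(logp_new, logp_old, adv, mask, length_unbiased=False))
```

In this sketch, the length-normalized branch shrinks the effective update for long responses, which is the kind of length-dependent behavior the paper attributes to GSPO; the unbiased branch is only one plausible way to remove that dependence, not necessarily the correction LUSPO uses.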