LSPO: Length-aware Dynamic Sampling for Policy Optimization in LLM Reasoning

October 1, 2025
Authors: Weizhe Chen, Sven Koenig, Bistra Dilkina
cs.AI

Abstract

Since the release of Deepseek-R1, reinforcement learning with verifiable rewards (RLVR) has become a central approach for training large language models (LLMs) on reasoning tasks. Recent work has largely focused on modifying loss functions to make RLVR more efficient and effective. In this paper, motivated by studies of overthinking in LLMs, we propose Length-aware Sampling for Policy Optimization (LSPO), a novel meta-RLVR algorithm that dynamically selects training data at each step based on the average response length. We evaluate LSPO across multiple base models and datasets, demonstrating that it consistently improves learning effectiveness. In addition, we conduct a detailed ablation study to examine alternative ways of incorporating length signals into dynamic sampling, offering further insights and highlighting promising directions for future research.
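The abstract does not spell out the selection rule, only that training data is chosen per step from the average response length of sampled rollouts. Below is a minimal Python sketch of what such length-aware dynamic sampling could look like in a GRPO-style loop; the `PromptGroup` structure, the all-correct/all-wrong filter, and the `prefer_shorter` ranking direction are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of length-aware dynamic sampling (illustration only;
# the selection rule below is an assumption, not the paper's exact criterion).
from dataclasses import dataclass
from typing import List


@dataclass
class PromptGroup:
    """Rollouts sampled for one prompt at the current RLVR step."""
    prompt: str
    response_lengths: List[int]  # token counts of the sampled responses
    rewards: List[float]         # verifiable rewards for those responses

    @property
    def avg_length(self) -> float:
        return sum(self.response_lengths) / len(self.response_lengths)


def length_aware_select(groups: List[PromptGroup],
                        batch_size: int,
                        prefer_shorter: bool = True) -> List[PromptGroup]:
    """Keep prompts whose rollouts carry a learning signal (neither all
    correct nor all wrong), rank the survivors by average response length,
    and take `batch_size` of them for the policy-gradient update.

    `prefer_shorter` is a hypothetical knob: the abstract does not say
    which end of the length distribution is favored.
    """
    informative = [g for g in groups
                   if 0.0 < sum(g.rewards) / len(g.rewards) < 1.0]
    informative.sort(key=lambda g: g.avg_length, reverse=not prefer_shorter)
    return informative[:batch_size]


if __name__ == "__main__":
    demo = [
        PromptGroup("p1", [120, 150, 140], [1.0, 0.0, 1.0]),
        PromptGroup("p2", [900, 880, 950], [0.0, 0.0, 0.0]),  # no signal, dropped
        PromptGroup("p3", [400, 420, 390], [1.0, 1.0, 0.0]),
    ]
    selected = length_aware_select(demo, batch_size=1)
    print([g.prompt for g in selected])  # -> ['p1'] with prefer_shorter=True
```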