Learn to Reason Efficiently with Adaptive Length-based Reward Shaping

May 21, 2025
Authors: Wei Liu, Ruochen Zhou, Yiyun Deng, Yuzhen Huang, Junteng Liu, Yuntian Deng, Yizhe Zhang, Junxian He
cs.AI

Abstract

Large Reasoning Models (LRMs) have shown remarkable capabilities in solving complex problems through reinforcement learning (RL), particularly by generating long reasoning traces. However, these extended outputs often exhibit substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to promote reasoning efficiency. Specifically, we first present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Building on this perspective, we propose a novel Length-bAsed StEp Reward shaping method (LASER), which employs a step function as the reward, controlled by a target length. LASER surpasses previous methods, achieving a superior Pareto-optimal balance between performance and efficiency. Next, we further extend LASER based on two key intuitions: (1) The reasoning behavior of the model evolves during training, necessitating reward specifications that are also adaptive and dynamic; (2) Rather than uniformly encouraging shorter or longer chains of thought (CoT), we posit that length-based reward shaping should be difficulty-aware, i.e., it should penalize lengthy CoTs more for easy queries. This approach is expected to facilitate a combination of fast and slow thinking, leading to a better overall tradeoff. The resulting method is termed LASER-D (Dynamic and Difficulty-aware). Experiments on DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-32B show that our approach significantly enhances both reasoning performance and response length efficiency. For instance, LASER-D and its variant achieve a +6.1 improvement on AIME2024 while reducing token usage by 63%. Further analysis reveals our RL-based compression produces more concise reasoning patterns with fewer redundant "self-reflections". Resources are at https://github.com/hkust-nlp/Laser.
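
To make the length-based step reward concrete, below is a minimal Python sketch of the general idea described in the abstract: a correctness reward augmented by a step bonus gated on a target length, plus a difficulty-aware variant that tightens the target for easier queries. The exact reward formulation, thresholds, and difficulty schedule used in the paper may differ; all function names and numbers here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a length-based step reward in the spirit of LASER.
# The paper's actual reward may differ; names, thresholds, and the
# difficulty-aware schedule below are assumptions for illustration only.

def step_length_reward(correct: bool, length: int, target_length: int,
                       bonus: float = 0.5) -> float:
    """Correctness reward plus a step bonus for staying under a target length."""
    reward = 1.0 if correct else 0.0
    if correct and length <= target_length:
        reward += bonus  # step function: bonus only when the response is short enough
    return reward


def dynamic_difficulty_aware_reward(correct: bool, length: int,
                                    group_accuracy: float,
                                    base_target: int = 4096,
                                    bonus: float = 0.5) -> float:
    """LASER-D-style idea: treat queries the model already solves reliably
    (high group accuracy) as easy, and shrink their length target so overly
    long chains of thought are penalized more on easy problems."""
    # Assumed schedule: tighter target as the query gets easier.
    target_length = int(base_target * (1.0 - 0.5 * group_accuracy))
    return step_length_reward(correct, length, target_length, bonus)


# Example usage with made-up numbers: a short correct answer to an easy query
# earns the bonus, while an overly long one does not.
print(dynamic_difficulty_aware_reward(correct=True, length=1500, group_accuracy=0.9))  # 1.5
print(dynamic_difficulty_aware_reward(correct=True, length=3500, group_accuracy=0.9))  # 1.0
```

The step shape is what distinguishes this family from rewards that scale continuously with length: responses under the target are treated uniformly, so the model is not pushed toward ever-shorter outputs once it is already efficient.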
