Learn to Reason Efficiently with Adaptive Length-based Reward Shaping

Abstract

Large Reasoning Models (LRMs) trained with reinforcement learning (RL) have shown remarkable capabilities in solving complex problems, particularly by generating long reasoning traces. However, these extended outputs often contain substantial redundancy, which limits the efficiency of LRMs. In this paper, we investigate RL-based approaches to improve reasoning efficiency. Specifically, we first present a unified framework that formulates various efficient reasoning methods through the lens of length-based reward shaping. Building on this perspective, we propose a novel Length-bAsed StEp Reward shaping method (LASER), which uses a step function, controlled by a target length, as the reward. LASER surpasses previous methods, achieving a superior Pareto-optimal balance between performance and efficiency. Next, we extend LASER based on two key intuitions: (1) the reasoning behavior of the model evolves during training, so the reward specification should also be adaptive and dynamic; (2) rather than uniformly encouraging shorter or longer chains of thought (CoT), length-based reward shaping should be difficulty-aware, i.e., it should penalize lengthy CoTs more heavily for easy queries. This approach is expected to facilitate a combination of fast and slow thinking, leading to a better overall tradeoff. The resulting method is termed LASER-D (Dynamic and Difficulty-aware). Experiments on DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, and DeepSeek-R1-Distill-Qwen-32B show that our approach significantly improves both reasoning performance and response length efficiency. For instance, LASER-D and its variant achieve a +6.1 improvement on AIME2024 while reducing token usage by 63%. Further analysis reveals that our RL-based compression produces more concise reasoning patterns with fewer redundant "self-reflections". Resources are available at https://github.com/hkust-nlp/Laser.
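The step-function reward described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the bonus value of 0.5, the choice to pay the bonus only for correct responses, and the difficulty-to-target mapping are all assumptions made for illustration.

```python
def step_length_reward(is_correct, num_tokens, target_length, bonus=0.5):
    """Correctness reward plus a step-function length bonus (illustrative only).

    Assumptions (not taken from the paper): the bonus value, and the choice
    to pay the bonus only when the response is correct.
    """
    correctness = 1.0 if is_correct else 0.0
    # Step function: full bonus at or below the target length, none above it.
    within_target = num_tokens <= target_length
    length_bonus = bonus if (is_correct and within_target) else 0.0
    return correctness + length_bonus


def difficulty_aware_target(difficulty, targets):
    """Pick a target length per difficulty bucket (hypothetical mapping).

    Easier queries get a smaller target, so long responses to easy queries
    lose the length bonus sooner.
    """
    return targets[difficulty]


# Hypothetical targets, in tokens, chosen only for this example.
targets = {"easy": 1000, "medium": 2000, "hard": 4000}
reward_easy = step_length_reward(True, 1500, difficulty_aware_target("easy", targets))
reward_hard = step_length_reward(True, 1500, difficulty_aware_target("hard", targets))
```

In this sketch the same correct 1500-token response earns the length bonus on a hard query but not on an easy one, which is one way to read the difficulty-aware intuition. A dynamic variant would additionally update the targets as training progresses.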
