Optimizing Length Compression in Large Reasoning Models
June 17, 2025
Authors: Zhengxiang Cheng, Dongping Chen, Mingyang Fu, Tianyi Zhou
cs.AI
Abstract
Large Reasoning Models (LRMs) have achieved remarkable success, yet they
often suffer from producing unnecessary and verbose reasoning chains. We
identify a core aspect of this issue as "invalid thinking" -- models tend to
repeatedly double-check their work after having derived the correct answer. To
address this specific inefficiency, we move beyond the general principles of
Efficacy and Efficiency to propose two new, fine-grained principles: Brevity,
which advocates for eliminating redundancy, and Sufficiency, which ensures
critical reasoning steps are preserved. Guided by these principles, we
introduce LC-R1, a post-training method based on Group Relative Policy
Optimization (GRPO). LC-R1 employs a novel combination of a Length Reward for
overall conciseness and a Compress Reward that is specifically designed to
remove the invalid portion of the thinking process. Extensive experiments on
multiple reasoning benchmarks demonstrate that LC-R1 achieves a significant
reduction in sequence length (~50%) with only a marginal (~2%) drop in
accuracy, reaching a favorable point on the Pareto frontier that
prioritizes high compression. Our analysis further validates the robustness of
LC-R1 and provides valuable insights for developing more powerful yet
computationally efficient LRMs. Our code is released at
https://github.com/zxiangx/LC-R1.
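The abstract does not give the exact reward formulas, but the described combination of a Length Reward (overall conciseness) and a Compress Reward (penalizing the invalid "double-checking" tail after the correct answer is reached) can be sketched roughly as follows. All names and the equal weighting here are illustrative assumptions, not the paper's actual formulation:

```python
def lc_r1_reward(response_len: int, max_len: int,
                 valid_len: int, is_correct: bool) -> float:
    """Illustrative sketch of an LC-R1-style combined reward.

    Assumptions (not specified in the abstract):
    - response_len: total tokens in the generated reasoning chain.
    - valid_len: tokens up to the point where the correct answer is
      first derived; the remainder is "invalid thinking".
    - Rewards are only given for correct answers, each term lies in
      [0, 1], and the two terms are equally weighted.
    """
    if not is_correct:
        return 0.0
    # Length Reward: shorter overall responses score higher.
    length_reward = 1.0 - response_len / max_len
    # Compress Reward: a smaller invalid tail (valid_len closer to
    # response_len) scores higher.
    compress_reward = valid_len / response_len
    return 0.5 * length_reward + 0.5 * compress_reward
```

Under this toy formulation, a response that stops right after deriving the answer dominates one of the same accuracy that keeps re-verifying; in the actual method these scalar rewards would feed into GRPO's group-relative advantage estimates during post-training.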