Optimizing Length Compression in Large Reasoning Models

Abstract

Large Reasoning Models (LRMs) have achieved remarkable success, yet they often produce unnecessary and verbose reasoning chains. We identify a core aspect of this issue as "invalid thinking": models tend to repeatedly double-check their work after they have already derived the correct answer. To address this specific inefficiency, we move beyond the general principles of Efficacy and Efficiency and propose two new, fine-grained principles: Brevity, which advocates eliminating redundancy, and Sufficiency, which ensures that critical reasoning steps are preserved. Guided by these principles, we introduce LC-R1, a post-training method based on Group Relative Policy Optimization (GRPO). LC-R1 combines a Length Reward for overall conciseness with a Compress Reward specifically designed to remove the invalid portion of the thinking process. Extensive experiments on multiple reasoning benchmarks demonstrate that LC-R1 reduces sequence length by about 50% with only a marginal (~2%) drop in accuracy, reaching a favorable trade-off point on the Pareto frontier that prioritizes high compression. Our analysis further validates the robustness of LC-R1 and provides valuable insights for developing more powerful yet computationally efficient LRMs. Our code is released at https://github.com/zxiangx/LC-R1.
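To make the idea of "invalid thinking" and the two reward terms more concrete, the following is a minimal Python sketch. It is not the LC-R1 implementation: the function names, the substring test for locating the first step that states the answer, the reward weights, and the exact form of both reward terms are all assumptions made for illustration. Only the group-relative normalization (reward minus group mean, divided by group standard deviation) reflects the general shape of GRPO.

```python
from statistics import mean, pstdev


def valid_prefix_len(steps, answer):
    """Count the steps up to and including the first one that states the answer.

    Assumption: "states the answer" is approximated by a substring match.
    If no step contains the answer, the whole chain is treated as valid.
    """
    for i, step in enumerate(steps):
        if answer in step:
            return i + 1
    return len(steps)


def invalid_fraction(steps, answer):
    """Fraction of steps that come after the answer was first derived."""
    if not steps:
        return 0.0
    return 1.0 - valid_prefix_len(steps, answer) / len(steps)


def toy_reward(steps, answer, correct, max_steps, w_len=0.5, w_cmp=0.5):
    """Hypothetical combined reward: correctness + length term + compress term.

    The weights and functional forms are illustrative, not taken from the paper.
    """
    length_term = 1.0 - len(steps) / max_steps               # shorter is better
    compress_term = 1.0 - invalid_fraction(steps, answer)    # less re-checking is better
    return (1.0 if correct else 0.0) + w_len * length_term + w_cmp * compress_term


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, as GRPO-style methods do."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Two sampled responses to the same prompt: one verbose, one trimmed.
verbose = [
    "compute 2+3",
    "so the answer is 5",
    "wait, let me double-check",
    "yes, the answer is 5",
]
trimmed = verbose[:2]

rewards = [
    toy_reward(verbose, "5", correct=True, max_steps=8),
    toy_reward(trimmed, "5", correct=True, max_steps=8),
]
advantages = group_relative_advantages(rewards)
```

In this toy setup both responses are correct, so the trimmed response receives the higher reward and therefore the positive group-relative advantage. This illustrates how Brevity can be encouraged without removing the steps needed to reach the answer (Sufficiency).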
