Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning

Abstract

As test-time scaling becomes a pivotal research frontier in the development of Large Language Models (LLMs), advanced post-training methods increasingly focus on extending the length of long Chain-of-Thought (CoT) responses to push reasoning capabilities toward DeepSeek R1-like performance. However, recent studies reveal a persistent overthinking phenomenon in state-of-the-art reasoning models, which manifests as excessive redundancy or repetitive thinking patterns in long CoT responses. To address this issue, we propose ConciseR, a simple yet effective two-stage reinforcement learning framework for concise reasoning in LLMs. The first stage, which uses more training steps, incentivizes the model's reasoning capabilities via Group Relative Policy Optimization with clip-higher and dynamic sampling components (GRPO++). The second stage, which uses fewer training steps, explicitly enforces conciseness and improves efficiency via Length-aware Group Relative Policy Optimization (L-GRPO). Notably, ConciseR optimizes response length only once all rollouts of a sample are correct, following the "walk before you run" principle. Extensive experiments show that ConciseR, which generates more concise CoT reasoning responses, outperforms recent state-of-the-art reasoning models trained under the zero RL paradigm on the AIME 2024, MATH-500, AMC 2023, Minerva, and Olympiad benchmarks.
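The "walk before you run" gating can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not give the reward formulas, so the length bonus `1 - length / max_len`, the function names, and the `max_len` parameter are all assumptions made for illustration. The dynamic-sampling filter is written in one commonly used form (discarding groups whose rollouts are all correct or all wrong), which is also an assumption here, and clip-higher is omitted because it acts on the policy-ratio clipping range rather than on rewards.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize each reward within its rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def keep_group_stage1(correct):
    """Assumed dynamic-sampling filter for stage 1.

    A group whose rollouts are all correct or all wrong has identical
    rewards, hence zero advantages and no learning signal, so it is skipped.
    """
    return 0 < sum(correct) < len(correct)


def stage2_rewards(correct, lengths, max_len):
    """Hypothetical length-aware reward for stage 2.

    The length bonus is applied only when every rollout in the group is
    correct; otherwise the reward is accuracy alone.
    """
    base = [1.0 if c else 0.0 for c in correct]
    if not all(correct):
        return base  # "walk": still learning to be correct
    # "run": all rollouts are correct, so shorter responses earn more reward
    return [b + (1.0 - n / max_len) for b, n in zip(base, lengths)]


if __name__ == "__main__":
    # Mixed group: the length of each response is ignored.
    print(stage2_rewards([True, False], [100, 200], max_len=400))
    # All-correct group: the shorter response receives the larger advantage.
    rewards = stage2_rewards([True, True], [100, 300], max_len=400)
    print(group_advantages(rewards))
```

Under these assumptions, an all-correct group that would give zero advantages under an accuracy-only reward instead gives a positive advantage to its shorter responses, which is one way to read the statement that length is optimized only once all rollouts of a sample are correct.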
