Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning

May 27, 2025
作者: Mingyang Song, Mao Zheng
cs.AI

Abstract

As test-time scaling becomes a pivotal research frontier in the development of Large Language Models (LLMs), contemporary post-training methodologies increasingly focus on extending the generation length of long Chain-of-Thought (CoT) responses to enhance reasoning capabilities toward DeepSeek R1-like performance. However, recent studies reveal a persistent overthinking phenomenon in state-of-the-art reasoning models, manifesting as excessive redundancy or repetitive thinking patterns in long CoT responses. To address this issue, we propose a simple yet effective two-stage reinforcement learning framework for concise reasoning in LLMs, named ConciseR. The first stage, using more training steps, incentivizes the model's reasoning capabilities via Group Relative Policy Optimization with clip-higher and dynamic-sampling components (GRPO++); the second stage, using fewer training steps, explicitly enforces conciseness and improves efficiency via Length-aware Group Relative Policy Optimization (L-GRPO). Notably, ConciseR optimizes response length only once all rollouts of a sample are correct, following the "walk before you run" principle. Extensive experimental results demonstrate that ConciseR, while generating more concise CoT reasoning responses, outperforms recent state-of-the-art reasoning models trained under the zero-RL paradigm on the AIME 2024, MATH-500, AMC 2023, Minerva, and Olympiad benchmarks.
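For readers unfamiliar with the objectives named above, a minimal sketch may help. The first-stage GRPO++ objective builds on GRPO's group-normalized advantages; "clip-higher" refers to decoupling the clipping range so that the upper bound exceeds the lower one, leaving more room to up-weight low-probability exploratory tokens. The form below is the standard GRPO objective with asymmetric clipping; the exact loss and hyperparameters used in ConciseR may differ:

```latex
\mathcal{J}(\theta)
= \mathbb{E}\left[
\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\min\!\Big(r_{i,t}(\theta)\,\hat{A}_i,\;
\operatorname{clip}\!\big(r_{i,t}(\theta),\,1-\varepsilon_{\mathrm{low}},\,1+\varepsilon_{\mathrm{high}}\big)\,\hat{A}_i\Big)
\right],
\qquad
\hat{A}_i=\frac{R_i-\operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}
```

where $r_{i,t}(\theta)=\pi_\theta(o_{i,t}\mid q,o_{i,<t})/\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,o_{i,<t})$ is the token-level importance ratio over a group of $G$ rollouts, and the dynamic-sampling component skips groups whose rollouts are all correct or all wrong, since such groups carry zero advantage signal.

A hedged Python sketch of the second-stage "walk before you run" gating follows: length-aware shaping kicks in only when every rollout in the group is already correct, so accuracy is never traded away for brevity. All names (`group_advantages`, `max_len`, `alpha`) and the linear length schedule are illustrative assumptions, not the paper's actual reward design:

```python
from statistics import mean, stdev

def group_advantages(correct_flags, lengths, max_len=8192, alpha=0.1):
    """Toy sketch of a "walk before you run" reward for one group of rollouts.

    correct_flags: per-rollout correctness (bool); lengths: response lengths
    in tokens. A length bonus favoring shorter answers is applied ONLY when
    the whole group is correct; alpha and the linear schedule are guesses.
    """
    rewards = [1.0 if c else 0.0 for c in correct_flags]
    if all(correct_flags):
        # Shorter correct responses earn a larger bonus.
        rewards = [r + alpha * (1.0 - min(l, max_len) / max_len)
                   for r, l in zip(rewards, lengths)]
    mu = mean(rewards)
    sd = stdev(rewards) if len(rewards) > 1 else 0.0
    # Group-relative advantage, as in GRPO.
    return [(r - mu) / (sd + 1e-8) for r in rewards]

# Example: all four rollouts correct -> shorter ones get higher advantage.
print(group_advantages([True, True, True, True], [1200, 800, 3000, 500]))
```

Gating on full-group correctness means the length term can never flip a correct rollout below an incorrect one, which is one simple way to realize the principle the abstract describes.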
