Walk Before You Run! Concise LLM Reasoning via Reinforcement Learning
May 27, 2025
Authors: Mingyang Song, Mao Zheng
cs.AI
Abstract
As test-time scaling becomes a pivotal research frontier in Large Language
Model (LLM) development, contemporary and advanced post-training
methodologies increasingly focus on extending the generation length of long
Chain-of-Thought (CoT) responses to enhance reasoning capabilities toward
DeepSeek R1-like performance. However, recent studies reveal a persistent
overthinking phenomenon in state-of-the-art reasoning models, manifesting as
excessive redundancy or repetitive thinking patterns in long CoT responses. To
address this issue, we propose ConciseR, a simple yet effective two-stage
reinforcement learning framework for concise reasoning in LLMs. Specifically,
the first stage, which uses more training steps, incentivizes the model's
reasoning capabilities via Group Relative Policy Optimization with clip-higher
and dynamic sampling components (GRPO++); the second stage, which uses fewer
training steps, explicitly enforces conciseness and improves efficiency via
Length-aware Group Relative Policy Optimization (L-GRPO). Notably, ConciseR
optimizes response length only once all rollouts of a sample are correct,
following the "walk before you run" principle. Extensive experimental results
demonstrate that our ConciseR model, which generates more concise CoT
reasoning responses, outperforms recent state-of-the-art reasoning models
trained under the zero-RL paradigm across the AIME 2024, MATH-500, AMC 2023,
Minerva, and Olympiad benchmarks.
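To make the first stage concrete: "clip-higher" typically denotes an asymmetric PPO-style clipping range in the GRPO surrogate, as popularized by DAPO. The objective below is a minimal sketch under that assumption; ConciseR's exact objective and hyperparameters may differ.

```latex
J(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}
\min\!\Big(r_{i,t}(\theta)\,\hat{A}_{i,t},\;
\operatorname{clip}\big(r_{i,t}(\theta),\,1-\varepsilon_{\mathrm{low}},\,1+\varepsilon_{\mathrm{high}}\big)\,\hat{A}_{i,t}\Big)\right],
\qquad
\hat{A}_{i,t} = \frac{R_i - \operatorname{mean}(\{R_j\}_{j=1}^{G})}{\operatorname{std}(\{R_j\}_{j=1}^{G})}
```

Here $r_{i,t}(\theta)$ is the token-level importance ratio between the current and old policies and $R_i$ is the scalar reward of rollout $i$ in a group of size $G$. Setting $\varepsilon_{\mathrm{high}} > \varepsilon_{\mathrm{low}}$ gives low-probability tokens more room to increase, which helps counteract entropy collapse; dynamic sampling discards groups whose rollouts are all correct or all wrong, since their group-relative advantages are zero.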
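For the second stage, the "walk before you run" rule can be sketched as a reward gate: a length bonus is applied only when every rollout in the group is already correct. The toy Python below illustrates this; the function names and the linear length-reward shape are assumptions for illustration, not the paper's specification.

```python
import numpy as np

def conciser_rewards(correct, lengths, max_len, stage2=False):
    """Toy group-level reward for G rollouts of one prompt.

    correct : bool array (G,) -- rollout solved the problem
    lengths : int array (G,)  -- response lengths in tokens
    Stage 1 (GRPO++): accuracy-only reward.
    Stage 2 (L-GRPO): add a length bonus, but ONLY when all
    rollouts are correct -- the "walk before you run" gate.
    (Hypothetical shaping; the paper's exact reward may differ.)
    """
    rewards = correct.astype(np.float64)
    if stage2 and correct.all():
        rewards += 1.0 - lengths / max_len  # shorter response => larger bonus
    return rewards

def group_advantages(rewards):
    """Group-relative advantage: z-score rewards within the group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Usage: once all four rollouts are correct, shorter ones get higher advantage.
correct = np.array([True, True, True, True])
lengths = np.array([1200, 800, 1500, 600])
adv = group_advantages(conciser_rewards(correct, lengths, max_len=4096, stage2=True))
```

The gate means length is never traded against correctness: groups with any wrong rollout are optimized purely for accuracy, and conciseness pressure only appears once the problem is reliably solved.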