
The Art of Efficient Reasoning: Data, Reward, and Optimization

February 24, 2026
Authors: Taiqiang Wu, Zenan Zu, Bo Zhou, Ngai Wong
cs.AI

Abstract

Large Language Models (LLMs) consistently benefit from scaled Chain-of-Thought (CoT) reasoning, but also suffer from heavy computational overhead. To address this issue, efficient reasoning aims to incentivize short yet accurate thinking trajectories, typically through reward shaping with Reinforcement Learning (RL). In this paper, we systematically investigate the mechanics of efficient reasoning for LLMs. For comprehensive evaluation, we advocate for more fine-grained metrics, including length distribution conditioned on correctness and performance across a wide spectrum of token budgets, ranging from 2k to 32k. First, we reveal that the training process follows a two-stage paradigm: length adaptation and reasoning refinement. We then conduct extensive experiments (about 0.2 million GPU hours) under a unified protocol, deconstructing training prompts and rollouts, reward shaping, and optimization strategies. In particular, a key finding is that training on relatively easy prompts ensures the density of positive reward signals and thus avoids length collapse. Meanwhile, the learned length bias generalizes across domains. We distill all findings into valuable insights and practical guidelines, and further validate them across the Qwen3 series, ranging from 0.6B to 30B, demonstrating their robustness and generalization.
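The reward shaping idea described above can be illustrated with a minimal sketch. The function name, penalty form, and coefficient below are hypothetical (the paper's exact reward is not reproduced here); the sketch only shows the general pattern of rewarding correctness while linearly penalizing token usage on correct answers, so that short-but-wrong outputs are not encouraged:

```python
# Hypothetical sketch of length-aware reward shaping for efficient
# reasoning. Correctness dominates the signal; a linear penalty,
# proportional to the fraction of the token budget consumed, is applied
# only to correct trajectories so the model is not pushed toward
# short-but-wrong outputs (one way length collapse can arise).

def shaped_reward(correct: bool, length: int, budget: int = 32_000,
                  alpha: float = 0.5) -> float:
    """Return a reward in [-1.0, 1.0] for one rollout.

    correct: whether the final answer is correct
    length:  number of tokens in the thinking trajectory
    budget:  maximum token budget (e.g. 32k, as in the evaluation)
    alpha:   strength of the length penalty (hypothetical value)
    """
    if not correct:
        return -1.0
    penalty = alpha * min(length / budget, 1.0)
    return 1.0 - penalty
```

For example, a correct 16k-token trajectory under a 32k budget receives 1.0 − 0.5 × 0.5 = 0.75, while a correct 2k-token trajectory receives about 0.97, giving RL a consistent incentive toward shorter correct reasoning.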