The Art of Efficient Reasoning: Data, Reward, and Optimization

Abstract

Large Language Models (LLMs) consistently benefit from scaled Chain-of-Thought (CoT) reasoning, but they also incur heavy computational overhead. To address this, efficient reasoning aims to incentivize short yet accurate thinking trajectories, typically through reward shaping with Reinforcement Learning (RL). In this paper, we systematically investigate the mechanics of efficient reasoning for LLMs. For comprehensive evaluation, we advocate more fine-grained metrics, including the thinking-length distribution conditioned on correctness and performance across a wide spectrum of token budgets, from 2k to 32k. We first show that the training process follows a two-stage paradigm: length adaptation followed by reasoning refinement. We then conduct extensive experiments (about 200,000 GPU hours) under a unified protocol, deconstructing training prompts and rollouts, reward shaping, and optimization strategies. A key finding is that training on relatively easier prompts keeps positive reward signals dense and thereby avoids length collapse. In addition, the learned length bias generalizes across domains. We distill all findings into insights and practical guidelines, and further validate them across the Qwen3 series, from 0.6B to 30B, demonstrating their robustness and generalization.
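To make the two fine-grained metrics named in the abstract concrete, the following is a minimal sketch, not the authors' evaluation code. The record format `(thinking_tokens, is_correct)`, the function names, the sample data, and the rule that a response longer than the budget counts as incorrect are all assumptions made for illustration.

```python
from statistics import mean, median


def length_stats_by_correctness(rollouts):
    """Summarize thinking length separately for correct and incorrect rollouts.

    `rollouts` is a list of (thinking_tokens, is_correct) pairs
    (a hypothetical record format, not taken from the paper).
    """
    stats = {}
    for label, flag in (("correct", True), ("incorrect", False)):
        lengths = [n for n, ok in rollouts if ok == flag]
        # Report None when a group is empty instead of raising an error.
        stats[label] = (
            {"count": len(lengths), "mean": mean(lengths), "median": median(lengths)}
            if lengths
            else None
        )
    return stats


def accuracy_under_budget(rollouts, budgets):
    """Accuracy at each token budget.

    Assumption: a rollout counts as correct under budget b only if it is
    correct AND its thinking length fits within b tokens.
    """
    total = len(rollouts)
    return {
        b: sum(1 for n, ok in rollouts if ok and n <= b) / total
        for b in budgets
    }


# Made-up sample data, for illustration only.
sample = [(1500, True), (3000, True), (9000, False), (20000, True)]
budgets = [2048, 4096, 8192, 16384, 32768]  # roughly 2k to 32k tokens

by_correctness = length_stats_by_correctness(sample)
by_budget = accuracy_under_budget(sample, budgets)
```

Reporting length separately for correct and incorrect responses shows whether a model shortens the reasoning that succeeds or merely truncates the reasoning that fails, which a single average length would hide.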
