Done Is Better than Perfect: Unlocking Efficient Reasoning by Structured Multi-Turn Decomposition

Abstract

Large Reasoning Models (LRMs) are criticized for the excessively long Chain-of-Thought (CoT) they produce to derive a final answer, which leads to high first-token and overall latency. Typically, the CoT of an LRM mixes multiple thinking units, each of which attempts to produce a candidate answer to the original query. A natural way to improve efficiency is therefore to reduce the number of units. This is difficult, however, because the thinking units in vanilla CoT cannot be explicitly managed. This paper introduces Multi-Turn Decomposition (MinD), which decodes conventional CoT into a sequence of explicit, structured, turn-wise interactions to bridge this gap. In MinD, the model provides a multi-turn response to the query, where each turn contains one thinking unit and yields a corresponding answer. Subsequent turns can reflect on, verify, revise, or explore alternative approaches to both the thinking and the answer parts of earlier turns. This not only delivers the answer more quickly, but also enables explicit control over the iterative reasoning process (i.e., users may halt or continue at any turn). To realize MinD, we follow a paradigm of supervised fine-tuning (SFT) followed by reinforcement learning (RL). We first rephrase the outputs of an LRM into a multi-turn format by prompting another LLM, and then tune the LRM on this data. Observing that the tuned model tends to consume even more tokens than the original one (probably because the multi-turn format introduces additional answer tokens), we advocate leveraging RL algorithms such as GRPO to prioritize correct outputs with fewer turns. Trained on the MATH dataset using R1-Distill models, MinD achieves up to about 70% reduction in both output token usage and time to first token (TTFT), while maintaining competitive performance on reasoning benchmarks such as MATH-500, AIME24, AMC23, and GPQA-Diamond.
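To make the turn-wise structure concrete, the following is a minimal sketch, not the authors' implementation. It assumes a multi-turn response has already been parsed into (thinking, answer) pairs. The names `Turn`, `answer_after`, and `toy_reward`, as well as the linear per-turn penalty, are hypothetical illustrations of "halt at any turn" and "prioritize correct outputs with fewer turns"; the abstract does not specify the actual reward.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    thinking: str  # one thinking unit
    answer: str    # candidate answer produced by that unit


def answer_after(turns: List[Turn], max_turns: int) -> Optional[str]:
    """Return the answer available if the user halts after max_turns turns."""
    kept = turns[:max_turns]
    return kept[-1].answer if kept else None


def toy_reward(turns: List[Turn], gold: str, turn_penalty: float = 0.1) -> float:
    """Hypothetical reward (an assumption, not the paper's formula):
    1.0 for a correct final answer, minus a small penalty for each
    turn beyond the first; 0.0 for an incorrect or empty response."""
    if not turns or turns[-1].answer != gold:
        return 0.0
    return max(0.0, 1.0 - turn_penalty * (len(turns) - 1))
```

Under these assumptions, a response that reaches the correct answer in one turn scores higher than one that reaches the same answer after an extra verification turn, which is the kind of preference an RL stage could exploit.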
