ARM: Adaptive Reasoning Model

Abstract

Large reasoning models perform strongly on complex tasks, but they cannot adjust how many reasoning tokens they use to match the difficulty of the task. The result is often "overthinking": excessive and unnecessary reasoning. Human intervention to control the token budget can mitigate the problem, but it fundamentally contradicts the goal of fully autonomous AI. In this work, we propose the Adaptive Reasoning Model (ARM), a reasoning model that adaptively selects an appropriate reasoning format for the task at hand. The formats comprise three efficient ones (Direct Answer, Short CoT, and Code) and a more elaborate one, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO) that addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by 30% on average and by up to 70%, while maintaining performance comparable to the model that relies solely on Long CoT. Beyond improving inference efficiency through reduced token generation, it also speeds up training by 2x. In addition to the default Adaptive Mode, ARM supports two further reasoning modes: 1) Instruction-Guided Mode, which lets users explicitly specify the reasoning format via special tokens and is ideal when the appropriate format is known for a batch of tasks; 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and falls back to Long CoT when they disagree, prioritizing performance at the cost of higher token usage.
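
The control flow of the Consensus-Guided Mode described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. The `generate` callback, the format label strings, and the stub model are hypothetical. The sketch also assumes that "agreement" means all three efficient formats return the same final answer, which the abstract does not specify.

```python
from collections.abc import Callable

# Format names follow the abstract; the label strings themselves are illustrative.
EFFICIENT_FORMATS = ("direct_answer", "short_cot", "code")
LONG_COT = "long_cot"


def consensus_guided_answer(
    question: str,
    generate: Callable[[str, str], str],
) -> tuple[str, str]:
    """Return (answer, source), where source is "consensus" or "long_cot".

    `generate(question, fmt)` is a hypothetical hook that asks the model for a
    final answer in the given reasoning format.
    """
    answers = [generate(question, fmt).strip() for fmt in EFFICIENT_FORMATS]
    # Assumption: consensus means the three efficient answers are identical.
    if len(set(answers)) == 1:
        return answers[0], "consensus"
    # Any disagreement triggers the more expensive Long CoT format.
    return generate(question, LONG_COT).strip(), LONG_COT


def stub_generate(question: str, fmt: str) -> str:
    """Canned stand-in for a real model, used only to exercise the control flow."""
    canned = {
        "easy": {"direct_answer": "4", "short_cot": "4", "code": "4", "long_cot": "4"},
        "hard": {"direct_answer": "7", "short_cot": "9", "code": "7", "long_cot": "8"},
    }
    return canned[question][fmt]


if __name__ == "__main__":
    print(consensus_guided_answer("easy", stub_generate))
    print(consensus_guided_answer("hard", stub_generate))
```

In this sketch, Long CoT runs only when the cheaper formats disagree, so questions with easy agreement avoid its cost. Every question still pays for three efficient generations, which is consistent with the abstract's note that this mode trades higher token usage for performance.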
