

ARM: Adaptive Reasoning Model

May 26, 2025
作者: Siye Wu, Jian Xie, Yikai Zhang, Aili Chen, Kai Zhang, Yu Su, Yanghua Xiao
cs.AI

Abstract
While large reasoning models demonstrate strong performance on complex tasks, they lack the ability to adjust reasoning token usage based on task difficulty. This often leads to the "overthinking" problem -- excessive and unnecessary reasoning -- which, although potentially mitigated by human intervention to control the token budget, still fundamentally contradicts the goal of achieving fully autonomous AI. In this work, we propose Adaptive Reasoning Model (ARM), a reasoning model capable of adaptively selecting appropriate reasoning formats based on the task at hand. These formats include three efficient ones -- Direct Answer, Short CoT, and Code -- as well as a more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO), which addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by an average of 30%, and up to 70%, while maintaining performance comparable to the model that relies solely on Long CoT. Furthermore, not only does it improve inference efficiency through reduced token generation, but it also brings a 2x speedup in training. In addition to the default Adaptive Mode, ARM supports two additional reasoning modes: 1) Instruction-Guided Mode, which allows users to explicitly specify the reasoning format via special tokens -- ideal when the appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance with higher token usage.
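The abstract states that Ada-GRPO modifies GRPO to prevent format collapse, but does not spell out the mechanism. The sketch below shows plain group-relative advantages plus one plausible anti-collapse device: scaling each sample's reward by the inverse frequency of its reasoning format within the group, so that rarer formats retain a learning signal. The scaling rule here is an illustrative assumption, not the paper's exact formula.

```python
import statistics

def ada_grpo_advantages(rewards, formats):
    """Hedged sketch of a GRPO-style advantage computation with a
    hypothetical inverse-format-frequency reward scaling (the actual
    Ada-GRPO mechanism may differ).

    rewards: per-sample scalar rewards for one sampled group.
    formats: the reasoning format each sample used (e.g. "short", "long").
    """
    n = len(rewards)
    counts = {f: formats.count(f) for f in set(formats)}
    # Boost rewards of samples whose format is rare in the group,
    # so the policy is not pushed to a single dominant format.
    scaled = [r * n / counts[f] for r, f in zip(rewards, formats)]
    # Standard group-relative normalization: advantage is the
    # deviation from the group mean, divided by the group std.
    mean = statistics.mean(scaled)
    std = statistics.pstdev(scaled) or 1.0  # guard against zero variance
    return [(s - mean) / std for s in scaled]
```

With equal rewards but three "long" samples and one "short" sample, the lone "short" sample receives a positive advantage while the majority format is pushed down, which is the qualitative behavior the abstract attributes to Ada-GRPO.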
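Consensus-Guided Mode, as described in the abstract, can be sketched directly: run the three efficient formats and escalate to Long CoT only when they disagree. The solver callables are hypothetical placeholders, and the unanimity rule below is an assumption about how "disagreement" is decided (a majority-vote rule would also fit the description).

```python
def consensus_guided(question, efficient_solvers, long_cot_solver):
    """Sketch of ARM's Consensus-Guided Mode: aggregate the outputs of
    the three efficient formats (Direct Answer, Short CoT, Code) and
    resort to Long CoT in case of disagreement."""
    answers = [solve(question) for solve in efficient_solvers]
    if len(set(answers)) == 1:
        # All efficient formats agree: accept the cheap answer.
        return answers[0]
    # Disagreement: prioritize performance and pay the extra tokens.
    return long_cot_solver(question)
```

This mode trades tokens for reliability: agreement among cheap formats short-circuits the expensive path, while any disagreement triggers the more elaborate Long CoT reasoning.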

