ARM: Adaptive Reasoning Model

Abstract

While large reasoning models demonstrate strong performance on complex tasks, they lack the ability to adjust their reasoning token usage to task difficulty. This often leads to the "overthinking" problem, that is, excessive and unnecessary reasoning. Human intervention to control the token budget can mitigate it, but doing so fundamentally contradicts the goal of fully autonomous AI. In this work, we propose the Adaptive Reasoning Model (ARM), a reasoning model that adaptively selects an appropriate reasoning format for the task at hand. The formats comprise three efficient ones (Direct Answer, Short CoT, and Code) and one more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO) that addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing token usage by 30% on average and by up to 70%, while maintaining performance comparable to a model that relies solely on Long CoT. Beyond improving inference efficiency through reduced token generation, it also brings a 2x speedup in training. In addition to the default Adaptive Mode, ARM supports two further reasoning modes: (1) Instruction-Guided Mode, which lets users explicitly specify the reasoning format via special tokens and is ideal when the appropriate format is known for a batch of tasks; and (2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance at the cost of higher token usage.
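The control flow of Consensus-Guided Mode can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `consensus_guided_answer`, the format keys, and the `solvers` mapping are hypothetical, and the sketch assumes that "disagreement" means the three efficient formats do not all return the same answer string.

```python
from typing import Callable, Dict

# Hypothetical identifiers: the abstract names the formats but not any keys or tokens.
EFFICIENT_FORMATS = ("direct_answer", "short_cot", "code")
FALLBACK_FORMAT = "long_cot"


def consensus_guided_answer(
    question: str, solvers: Dict[str, Callable[[str], str]]
) -> str:
    """Return the shared answer of the efficient formats, else fall back to Long CoT."""
    # Query each of the three efficient formats once.
    answers = [solvers[name](question).strip() for name in EFFICIENT_FORMATS]
    # Assumption: consensus means all three answers match exactly after stripping.
    if len(set(answers)) == 1:
        return answers[0]
    # On any disagreement, spend the extra tokens on the elaborate format.
    return solvers[FALLBACK_FORMAT](question).strip()


if __name__ == "__main__":
    # Stub solvers standing in for model calls in each reasoning format.
    stubs = {
        "direct_answer": lambda q: "4",
        "short_cot": lambda q: "4",
        "code": lambda q: "3",  # one dissenting format triggers the fallback
        "long_cot": lambda q: "4",
    }
    print(consensus_guided_answer("What is 2 + 2?", stubs))
```

Under these assumptions every query pays for all three efficient formats, and Long CoT is added only when they disagree, which is consistent with the abstract's note that this mode trades higher token usage for performance.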
