ARM: Adaptive Reasoning Model

Abstract

Large reasoning models perform strongly on complex tasks, but they cannot adjust how many reasoning tokens they spend according to task difficulty. This often leads to the "overthinking" problem, that is, excessive and unnecessary reasoning. Human intervention to control the token budget can mitigate the problem, but it fundamentally contradicts the goal of achieving fully autonomous AI. In this work, we propose the Adaptive Reasoning Model (ARM), a reasoning model that adaptively selects an appropriate reasoning format for the task at hand. The formats comprise three efficient ones (Direct Answer, Short CoT, and Code) and one more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO) that addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by 30% on average and by up to 70%, while maintaining performance comparable to the model that relies solely on Long CoT. Beyond improving inference efficiency through reduced token generation, it also brings a 2x speedup in training. In addition to the default Adaptive Mode, ARM supports two further reasoning modes: 1) Instruction-Guided Mode, which lets users explicitly specify the reasoning format via special tokens, ideal when the appropriate format is known for a batch of tasks; and 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance at the cost of higher token usage.
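
The Consensus-Guided Mode described above can be illustrated with a minimal sketch. The abstract does not say how outputs are aggregated, so this sketch assumes that "agreement" means all three efficient formats return the same normalized answer. The format labels and the `generate` callable are hypothetical stand-ins for a model call, not part of ARM's actual interface.

```python
from typing import Callable, Tuple

# Hypothetical format labels; the real special tokens are not given in the abstract.
EFFICIENT_FORMATS = ("direct_answer", "short_cot", "code")
LONG_FORMAT = "long_cot"


def consensus_answer(
    question: str, generate: Callable[[str, str], str]
) -> Tuple[str, str]:
    """Return (answer, source) for one question.

    `generate(question, fmt)` is an assumed callable that returns the final
    answer produced with the given reasoning format.
    """
    # Query the three efficient formats and normalize their answers.
    answers = [generate(question, fmt).strip().lower() for fmt in EFFICIENT_FORMATS]
    # Assumption: consensus means unanimous agreement.
    if len(set(answers)) == 1:
        return answers[0], "consensus"
    # On disagreement, fall back to the more expensive Long CoT format.
    return generate(question, LONG_FORMAT).strip().lower(), LONG_FORMAT


# Stub generators standing in for a model, for illustration only.
def agree(question: str, fmt: str) -> str:
    return "4"


def disagree(question: str, fmt: str) -> str:
    return {"direct_answer": "5", "short_cot": "4", "code": "4", "long_cot": "4"}[fmt]


result_agree = consensus_answer("2 + 2?", agree)
result_disagree = consensus_answer("2 + 2?", disagree)
```

With the stubs above, the unanimous case never calls Long CoT, while the split case pays for three efficient generations plus one Long CoT generation. This matches the abstract's note that the mode trades higher token usage for performance.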
