ARM: Adaptive Reasoning Model

Abstract

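Large reasoning models perform strongly on complex tasks, but they lack the ability to adjust reasoning token usage to task difficulty. This often causes the "overthinking" problem, meaning excessive and unnecessary reasoning. Human intervention to control the token budget can mitigate it, but that remains fundamentally at odds with the goal of fully autonomous AI. In this work, we propose the Adaptive Reasoning Model (ARM), which can adaptively select an appropriate reasoning format for each task. The formats comprise three efficient ones (Direct Answer, Short CoT, and Code) and a more elaborate one, Long CoT. To train ARM, we introduce Ada-GRPO, which addresses the format collapse problem of standard Group Relative Policy Optimization (GRPO). Ada-GRPO lets ARM achieve high token efficiency, reducing token usage by 30% on average and by up to 70%, while maintaining performance comparable to a model that relies solely on Long CoT. Beyond improving inference efficiency through reduced token generation, it also speeds up training by 2x. In addition to the default Adaptive Mode, ARM supports two further reasoning modes: 1) Instruction-Guided Mode, which lets users explicitly specify the reasoning format via special tokens and is ideal when the appropriate format for a batch of tasks is known; 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and falls back to Long CoT when they disagree, prioritizing performance at the cost of higher token usage.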

English

While large reasoning models demonstrate strong performance on complex tasks, they lack the ability to adjust reasoning token usage based on task difficulty. This often leads to the "overthinking" problem -- excessive and unnecessary reasoning -- which, although potentially mitigated by human intervention to control the token budget, still fundamentally contradicts the goal of achieving fully autonomous AI. In this work, we propose Adaptive Reasoning Model (ARM), a reasoning model capable of adaptively selecting appropriate reasoning formats based on the task at hand. These formats include three efficient ones -- Direct Answer, Short CoT, and Code -- as well as a more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO), which addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by an average of 30%, and up to 70%, while maintaining performance comparable to the model that relies solely on Long CoT. Furthermore, not only does it improve inference efficiency through reduced token generation, but it also brings a 2x speedup in training. In addition to the default Adaptive Mode, ARM supports two additional reasoning modes: 1) Instruction-Guided Mode, which allows users to explicitly specify the reasoning format via special tokens -- ideal when the appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance with higher token usage.
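
The Consensus-Guided Mode described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The format labels, the `generate` callable (a stand-in for one model call constrained to a given reasoning format), and the reading of "disagreement" as "the three answers are not all identical" are assumptions made for illustration only; the abstract specifies none of these details.

```python
from typing import Callable, Dict

# Hypothetical labels: the abstract names the formats but not the special tokens.
EFFICIENT_FORMATS = ("direct_answer", "short_cot", "code")
FALLBACK_FORMAT = "long_cot"


def consensus_guided_answer(
    question: str,
    generate: Callable[[str, str], str],
) -> Dict[str, str]:
    """Answer with the three efficient formats; fall back to Long CoT on disagreement.

    `generate(question, fmt)` is an assumed interface that returns the final
    answer string produced when the model is forced to use format `fmt`.
    """
    # One answer per efficient format, normalized only by stripping whitespace.
    answers = [generate(question, fmt).strip() for fmt in EFFICIENT_FORMATS]

    # Assumption: consensus means all three answers are identical.
    if len(set(answers)) == 1:
        return {"answer": answers[0], "source": "consensus"}

    # Otherwise spend more tokens on the more elaborate format.
    fallback = generate(question, FALLBACK_FORMAT).strip()
    return {"answer": fallback, "source": FALLBACK_FORMAT}
```

Under these assumptions, easy questions on which the three efficient formats agree never trigger a Long CoT call, while contested questions cost four calls in total. This matches the abstract's note that the mode prioritizes performance at the cost of higher token usage.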
