When to Continue Thinking: Adaptive Thinking Mode Switching for Efficient Reasoning

Abstract

Large reasoning models (LRMs) achieve remarkable performance via long reasoning chains, but often incur excessive computational overhead from redundant reasoning, especially on simple tasks. In this work, we systematically quantify the upper bounds of LRMs under both Long-Thinking and No-Thinking modes, and uncover an "Internal Self-Recovery Mechanism" in which models implicitly supplement their reasoning during answer generation. Building on this insight, we propose Adaptive Self-Recovery Reasoning (ASRR), a framework that suppresses unnecessary reasoning and enables implicit recovery. By introducing accuracy-aware length reward regulation, ASRR adaptively allocates reasoning effort according to problem difficulty, achieving high efficiency with minimal performance loss. Experiments across multiple benchmarks and models show that, compared with GRPO, ASRR reduces the reasoning budget by up to 32.5% (1.5B) and 25.7% (7B) with minimal accuracy loss (1.2% and 0.6% pass@1, respectively), and significantly boosts harmless rates on safety benchmarks (up to +21.7%). Our results highlight the potential of ASRR for enabling efficient, adaptive, and safer reasoning in LRMs.
