AdaptThink: Reasoning Models Can Learn When to Think

Abstract

Recently, large reasoning models have achieved impressive performance on various tasks by employing human-like deep thinking. However, the lengthy thinking process substantially increases inference overhead, making efficiency a critical bottleneck. In this work, we first demonstrate that NoThinking, which prompts the reasoning model to skip thinking and directly generate the final solution, is a better choice for relatively simple tasks in terms of both performance and efficiency. Motivated by this, we propose AdaptThink, a novel reinforcement learning (RL) algorithm that teaches reasoning models to choose the optimal thinking mode adaptively based on problem difficulty. Specifically, AdaptThink features two core components: (1) a constrained optimization objective that encourages the model to choose NoThinking while maintaining overall performance; (2) an importance sampling strategy that balances Thinking and NoThinking samples during on-policy training, thereby enabling cold start and allowing the model to explore and exploit both thinking modes throughout training. Our experiments indicate that AdaptThink significantly reduces inference cost while further improving performance. Notably, on three math datasets, AdaptThink reduces the average response length of DeepSeek-R1-Distill-Qwen-1.5B by 53% and improves its accuracy by 2.4%, highlighting the promise of adaptive thinking-mode selection for optimizing the balance between reasoning quality and efficiency. Our code and models are available at https://github.com/THU-KEG/AdaptThink.
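The abstract names the two components but gives no formulas, so the following is only an illustrative sketch of how they could fit together, not the authors' implementation. It assumes the constraint is folded into a penalty-style advantage (a small bonus for NoThinking plus correctness measured against a reference accuracy) and that rollouts are forced into an even split between the two modes. The names `Sample`, `advantage`, `sample_modes`, `importance_weight`, and the parameters `delta` and `ref_accuracy` are all hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    no_thinking: bool  # True if the response skipped the thinking phase
    correct: bool      # True if the final answer was judged correct


def advantage(sample: Sample, ref_accuracy: float, delta: float = 0.05) -> float:
    """Assumed penalty form of the constrained objective.

    A NoThinking response earns a small bonus `delta`; every response is
    scored against `ref_accuracy`, the reference model's accuracy on the
    same problem, so dropping below the reference is penalized.
    """
    bonus = delta if sample.no_thinking else 0.0
    return bonus + float(sample.correct) - ref_accuracy


def sample_modes(k: int, rng: random.Random) -> list[bool]:
    """Force half of k rollouts into NoThinking (True) and the rest into Thinking.

    This stands in for the balanced sampling described in the text: both
    modes appear in every batch even if the policy initially never picks
    NoThinking on its own (the cold-start case).
    """
    modes = [True] * (k // 2) + [False] * (k - k // 2)
    rng.shuffle(modes)
    return modes


def importance_weight(policy_prob_of_mode: float, proposal_prob: float = 0.5) -> float:
    """Ratio correcting for sampling the mode from the 50/50 proposal."""
    return policy_prob_of_mode / proposal_prob


if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_modes(8, rng))
    # Easy problem: both modes are correct, so NoThinking scores higher.
    print(advantage(Sample(True, True), ref_accuracy=0.8))
    print(advantage(Sample(False, True), ref_accuracy=0.8))
    # Hard problem: NoThinking fails, so Thinking scores higher.
    print(advantage(Sample(True, False), ref_accuracy=0.8))
```

Under these assumptions, the bonus only tips the balance toward NoThinking when both modes are equally accurate; a wrong NoThinking answer still scores well below a correct Thinking answer as long as `delta` is smaller than 1.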
