Control-R: Towards Controllable Test-Time Scaling

Abstract

This paper addresses the problems of underthinking and overthinking in long chain-of-thought (CoT) reasoning for Large Reasoning Models (LRMs) by introducing Reasoning Control Fields (RCF), a novel test-time approach that injects structured control signals to guide reasoning from a tree search perspective. RCF enables models to adjust their reasoning effort according to given control conditions when solving complex tasks. Additionally, we present the Control-R-4K dataset, which consists of challenging problems annotated with detailed reasoning processes and their corresponding control fields. To further enhance reasoning control, we propose Conditional Distillation Finetuning (CDF), a method that trains models, particularly Control-R-32B, to adjust reasoning effort effectively at test time. Experimental results on benchmarks such as AIME2024 and MATH500 demonstrate that our approach achieves state-of-the-art performance at the 32B scale while enabling a controllable long CoT (L-CoT) reasoning process. Overall, this work introduces an effective paradigm for controllable test-time scaling of reasoning.