START: Self-taught Reasoner with Tools

Abstract

Large reasoning models (LRMs) such as OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities on complex reasoning tasks by using long chain-of-thought (CoT). However, because they rely solely on internal reasoning, these models often suffer from hallucinations and inefficiency. In this paper, we introduce START (Self-Taught Reasoner with Tools), a novel tool-integrated long CoT reasoning LLM that significantly enhances reasoning by leveraging external tools. Through code execution, START can perform complex computations, check its own work, explore diverse methods, and debug itself, thereby addressing the limitations of LRMs. The core innovation of START is its self-learning framework, which comprises two key techniques: 1) Hint-infer: We demonstrate that inserting artificially designed hints (e.g., "Wait, maybe using Python here is a good idea.") during the inference process of an LRM effectively stimulates its ability to use external tools without any demonstration data. Hint-infer can also serve as a simple and effective sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and modifying the reasoning trajectories with tool invocation that an LRM generates via Hint-infer, and then fine-tuning the LRM on them. Using this framework, we fine-tuned the QwQ-32B model to obtain START. On PhD-level science QA (GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the competition-level code benchmark (LiveCodeBench), START achieves accuracy rates of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3%, respectively. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary model o1-Preview.
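As a rough illustration of the two techniques named in the abstract, the following Python sketch shows one way hint insertion and rejection filtering could be wired together. It is not the authors' implementation: the `generate` callable, the `rounds` parameter, the `<tool_call>` marker, and the `extract_answer` helper are all hypothetical names introduced here, and the abstract does not specify where hints are inserted or how trajectories are scored or modified.

```python
from typing import Callable, List

# The example hint quoted in the abstract.
HINT = "Wait, maybe using Python here is a good idea."


def hint_infer(prompt: str,
               generate: Callable[[str], str],
               hint: str = HINT,
               rounds: int = 1) -> str:
    """Alternate model continuation and hint insertion (assumed scheme).

    `generate` is a hypothetical stand-in for an LRM call that takes the
    text so far and returns a continuation. More `rounds` means more
    sequential inference after each inserted hint.
    """
    text = prompt
    for _ in range(rounds):
        text += generate(text)       # the model reasons on its own
        text += "\n" + hint + "\n"   # insert the hand-written hint
    text += generate(text)           # let the model continue after the last hint
    return text


def rejection_sample(trajectories: List[str],
                     reference: str,
                     extract_answer: Callable[[str], str],
                     tool_marker: str = "<tool_call>") -> List[str]:
    """Keep trajectories that call a tool and reach the reference answer.

    The marker and the exact-match criterion are assumptions; this covers
    only the filtering step, not scoring or modification.
    """
    return [t for t in trajectories
            if tool_marker in t and extract_answer(t) == reference]
```

In this sketch the surviving trajectories would form the fine-tuning set; the fine-tuning step itself is omitted.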
