ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

Abstract

While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL) excel in textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning and has two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training uses task outcomes as rewards to iteratively refine the model's tool-use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging math olympiad benchmark AIME demonstrate ReTool's superiority: our 32B model achieves 67% accuracy with 400 training steps, outperforming the text-based RL baseline (40% accuracy, 1080 steps) in both efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9 percentage points. Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.
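The two features named in the abstract, interleaved code execution during a rollout and a reward based only on the final outcome, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the abstract: the `<code>`, `<interpreter>`, and `<answer>` tag names, the helpers `run_code`, `rollout`, and `outcome_reward`, the 1.0/0.0 reward values, and the scripted `toy_policy` that stands in for a language model. The use of plain `exec` is also only a placeholder for a real sandbox.

```python
import contextlib
import io
import re
from typing import Callable

# Hypothetical tag conventions, chosen for this sketch only.
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
OUTPUT_RE = re.compile(r"<interpreter>(.*?)</interpreter>", re.DOTALL)


def run_code(source: str) -> str:
    """Run a snippet and return its stdout, or the error text on failure.

    Plain exec() is NOT a sandbox; a real system would isolate execution.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})
    except Exception as exc:
        # Errors are returned as text so the policy can see and correct them.
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def rollout(policy: Callable[[str], str], prompt: str, max_turns: int = 4) -> str:
    """Alternate between policy text and interpreter feedback."""
    context = prompt
    for _ in range(max_turns):
        segment = policy(context)
        context += segment
        if ANSWER_RE.search(segment):
            break  # the policy committed to a final answer
        match = CODE_RE.search(segment)
        if match is None:
            break  # no tool call and no answer: stop the rollout
        # Feed the execution result back into the context for the next turn.
        context += f"<interpreter>{run_code(match.group(1))}</interpreter>"
    return context


def outcome_reward(trajectory: str, ground_truth: str) -> float:
    """Reward depends only on the final answer (values are an assumption)."""
    match = ANSWER_RE.search(trajectory)
    return 1.0 if match and match.group(1).strip() == ground_truth else 0.0


def toy_policy(context: str) -> str:
    """Scripted stand-in for an LLM: buggy code, then a fix, then an answer."""
    turns = context.count("<interpreter>")
    if turns == 0:
        return "Let me compute this. <code>print(sum(range(1, 101)) / zero)</code>"
    if turns == 1:
        return "That failed; retrying. <code>print(sum(range(1, 101)))</code>"
    return f"<answer>{OUTPUT_RE.findall(context)[-1]}</answer>"


trajectory = rollout(toy_policy, "What is 1 + 2 + ... + 100? ")
print(outcome_reward(trajectory, "5050"))  # prints 1.0
```

In an actual RL setup the scalar returned by `outcome_reward` would drive a policy-gradient update; the scripted policy here only mimics the self-correction pattern the abstract describes, in which an interpreter error is fed back and the next code attempt fixes it.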

