ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

Abstract

Reasoning models trained with reinforcement learning (RL), such as DeepSeek R1, excel at textual reasoning but struggle in scenarios that require structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving. These are areas where computational tools like code interpreters (CI) have distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning and has two key features: (1) dynamic interleaving of real-time code execution within natural language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework that begins with synthetic cold-start data generation, producing code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training uses task outcomes as rewards to iteratively refine the model's tool-use strategy, enabling autonomous discovery of optimal tool invocation patterns without human priors. Experiments on the challenging math olympiad benchmark AIME demonstrate ReTool's superiority: our 32B model achieves 67% accuracy with 400 training steps, outperforming the text-based RL baseline (40% accuracy, 1080 steps) in both efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9 percentage points. Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.
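The two features described in the abstract, interleaving code execution with natural language reasoning and rewarding only the final outcome, can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, since the abstract does not give these details: the `<code>`, `<interpreter>`, and `<answer>` tags, the +1/-1 reward values, the `max_turns` limit, and the scripted `toy_policy` that stands in for a language model. The `exec` call is also not a real sandbox.

```python
import contextlib
import io
import re

# Hypothetical tag format; the abstract does not specify one.
CODE_BLOCK = re.compile(r"<code>(.*?)</code>", re.DOTALL)
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def run_code(snippet: str) -> str:
    """Execute a snippet and return its stdout, or the error text on failure."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(snippet, {})  # illustration only, not a safe sandbox
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def rollout(policy, prompt: str, max_turns: int = 4) -> str:
    """Alternate between model text and interpreter feedback."""
    context = prompt
    for _ in range(max_turns):
        segment = policy(context)
        context += segment
        match = CODE_BLOCK.search(segment)
        if match is None:  # no tool call, so the model has finished
            break
        result = run_code(match.group(1))
        # Feed the execution result back so the next turn can use it.
        context += f"\n<interpreter>{result}</interpreter>\n"
    return context


def outcome_reward(trajectory: str, reference: str) -> float:
    """Reward depends only on the final answer (assumed values: +1 / -1)."""
    match = ANSWER.search(trajectory)
    predicted = match.group(1).strip() if match else None
    return 1.0 if predicted == reference else -1.0


def toy_policy(context: str) -> str:
    """Scripted stand-in for an LLM: call the tool once, then answer."""
    if "<interpreter>" not in context:
        return "Let me compute this.\n<code>print(sum(range(1, 101)))</code>"
    value = re.findall(r"<interpreter>(.*?)</interpreter>", context, re.DOTALL)[-1]
    return f"The sum is <answer>{value}</answer>"


trajectory = rollout(toy_policy, "What is 1 + 2 + ... + 100? ")
print(outcome_reward(trajectory, reference="5050"))  # 1.0
```

In an actual RL setup the scalar returned by `outcome_reward` would drive a policy-gradient update; that step is omitted because the abstract does not name the optimization algorithm.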

