ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

April 15, 2025
Authors: Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, Wanjun Zhong
cs.AI

Abstract

While reasoning models (e.g., DeepSeek R1) trained with reinforcement learning (RL) excel at textual reasoning, they struggle in scenarios requiring structured problem-solving, such as geometric reasoning, concise computation, or complex equation solving, areas where computational tools like code interpreters (CI) demonstrate distinct advantages. To bridge this gap, we propose ReTool, which enhances long-form reasoning with tool-integrated learning and includes two key features: (1) dynamic interleaving of real-time code execution within natural-language reasoning processes, and (2) an automated RL paradigm that allows policy rollouts with multi-turn real-time code execution and teaches the model when and how to invoke tools based on outcome feedback. ReTool employs a systematic training framework, beginning with synthetic cold-start data generation to produce code-augmented long-form reasoning traces for fine-tuning base models. Subsequent RL training leverages task outcomes as rewards to iteratively refine the model's tool-use strategy, enabling autonomous discovery of optimal tool-invocation patterns without human priors. Experiments on the challenging MATH Olympiad benchmark AIME demonstrate ReTool's superiority: our 32B model achieves 67% accuracy with 400 training steps, outperforming the text-based RL baseline (40% accuracy, 1080 steps) in both efficiency and performance. Remarkably, ReTool-32B attains 72.5% accuracy in extended settings, surpassing OpenAI's o1-preview by 27.9%. Further analysis reveals emergent behaviors such as code self-correction, signaling an "aha moment" in which the model autonomously masters adaptive tool use. These findings highlight the promise of outcome-driven tool integration for advancing complex mathematical reasoning and offer new insights into hybrid neuro-symbolic systems.
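The abstract's two key features, interleaving real-time code execution with natural-language generation and rewarding rollouts only by task outcome, can be made concrete with a short sketch. The Python below is a minimal illustration under stated assumptions: the model.generate interface, the <code>/<interpreter> tags, and the \boxed{} answer convention are hypothetical stand-ins, not the paper's published protocol.

```python
import re
import io
import contextlib

# Minimal sketch of a ReTool-style tool-integrated rollout. The policy
# interface (model.generate), the <code>/<interpreter> tags, and the
# \boxed{} answer format are illustrative assumptions, not the paper's
# actual implementation.

CODE_TAG = re.compile(r"<code>(.*?)</code>", re.DOTALL)

def run_sandboxed(code: str) -> str:
    """Execute a snippet and capture stdout (toy sandbox; not secure)."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as e:
        # Errors are returned as feedback; seeing them is what would let
        # the model exhibit the code self-correction the abstract describes.
        return f"Error: {e}"
    return buf.getvalue()

def rollout(model, prompt: str, max_turns: int = 8) -> str:
    """Interleave natural-language reasoning with real-time code execution."""
    context = prompt
    for _ in range(max_turns):
        segment = model.generate(context)  # stops after </code> or at EOS
        context += segment
        match = CODE_TAG.search(segment)
        if match is None:
            return context                 # no tool call: reasoning finished
        result = run_sandboxed(match.group(1))
        # Feed interpreter output back so the next turn can build on it.
        context += f"<interpreter>{result}</interpreter>\n"
    return context

def outcome_reward(trace: str, gold_answer: str) -> float:
    """Sparse, outcome-only reward: 1.0 iff the final answer is correct."""
    match = re.search(r"\\boxed\{(.+?)\}", trace)
    return 1.0 if match and match.group(1).strip() == gold_answer else 0.0
```

During RL training, outcome_reward would score each completed rollout and a standard policy-gradient update (e.g., PPO) would reinforce trajectories that solved the task, which is how the policy could discover when and how to call the interpreter without hand-written tool-use rules.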
