Understanding Tool-Integrated Reasoning

August 26, 2025
Authors: Heng Lin, Zhongwen Xu
cs.AI

Abstract

We study why Tool-Integrated Reasoning (TIR) makes Large Language Models (LLMs) more capable. While LLMs integrated with tools like Python code interpreters show great promise, a principled theory explaining why this paradigm is effective has been missing. This work provides the first formal proof that TIR fundamentally expands an LLM's capabilities. We demonstrate that tools enable a strict expansion of the model's empirical and feasible support, breaking the capability ceiling of pure-text models by unlocking problem-solving strategies that are otherwise impossible or intractably verbose. To guide model behavior without compromising training stability and performance, we also introduce Advantage Shaping Policy Optimization (ASPO), a novel algorithm that directly modifies the advantage function to guide the policy behavior. We conduct comprehensive experiments on challenging mathematical benchmarks, using a Python interpreter as the external tool. Our results show that the TIR model decisively outperforms its pure-text counterpart on the pass@k metric. Crucially, this advantage is not confined to computationally intensive problems but extends to those requiring significant abstract insight. We further identify the emergent cognitive patterns that illustrate how models learn to think with tools. Finally, we report that ASPO improves tool-usage behavior, yielding earlier code invocation and many more interactive turns. Overall, our work provides the first principled explanation for TIR's success, shifting the focus from the mere fact that tools work to why and how they enable more powerful reasoning.
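The abstract reports results on the pass@k metric without saying how it is computed. The sketch below shows one common way to estimate it, the unbiased combinatorial estimator widely used in code-generation and math evaluation; that the paper uses this exact estimator is an assumption. The function names `pass_at_k` and `mean_pass_at_k` are hypothetical, as is the input layout of (samples drawn, samples correct) pairs per problem.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which are correct.

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k):
    the probability that a random size-k subset of the n samples
    contains at least one correct sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset has a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems given (n, c) pairs (assumed layout)."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must contain at least one problem")
    return sum(scores) / len(scores)
```

Under these assumptions, comparing a TIR model with a pure-text model would mean sampling n responses per problem from each and comparing `mean_pass_at_k` across a range of k values. A gap that persists at large k would suggest the tool-using model solves problems the pure-text model rarely or never reaches.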