Understanding Tool-Integrated Reasoning
August 26, 2025
Authors: Heng Lin, Zhongwen Xu
cs.AI
Abstract
We study why Tool-Integrated Reasoning (TIR) makes Large Language Models
(LLMs) more capable. While LLMs integrated with tools like Python code
interpreters show great promise, a principled theory explaining why this
paradigm is effective has been missing. This work provides the first formal
proof that TIR fundamentally expands an LLM's capabilities. We demonstrate that
tools enable a strict expansion of the model's empirical and feasible support,
breaking the capability ceiling of pure-text models by unlocking
problem-solving strategies that are otherwise impossible or intractably
verbose. To guide model behavior without compromising training stability and
performance, we also introduce Advantage Shaping Policy Optimization (ASPO), a
novel algorithm that directly modifies the advantage function to guide the
policy behavior. We conduct comprehensive experiments on challenging
mathematical benchmarks, leveraging a Python interpreter as the external tool.
Our results show that the TIR model decisively outperforms its pure-text
counterpart on the pass@k metric. Crucially, this advantage is not confined to
computationally intensive problems but extends to those requiring significant
abstract insight. We further identify the emergent cognitive patterns that
illustrate how models learn to think with tools. Finally, we report that ASPO
improves tool-usage behavior, yielding earlier code invocation and many more
interactive turns. Overall, our work provides the first principled explanation for
TIR's success, shifting the focus from the mere fact that tools work to why and
how they enable more powerful reasoning.
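
For readers unfamiliar with the pass@k metric mentioned above, the following is a minimal sketch of the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), computed from n sampled solutions of which c are correct. The abstract does not state how pass@k is computed in this work, so the estimator choice and the function name `pass_at_k` are assumptions made for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total number of sampled solutions for a problem
    c: number of those samples that are correct
    k: sampling budget being evaluated (1 <= k <= n)

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples are incorrect, every size-k subset
    # must contain at least one correct sample.
    if n - c < k:
        return 1.0
    # Otherwise subtract the fraction of size-k subsets with no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under this estimator, a problem with 1 correct solution among 4 samples has pass@2 equal to 0.5, since half of the six possible pairs contain the correct sample. A benchmark-level score would typically average this quantity over problems.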