
Understanding Tool-Integrated Reasoning

August 26, 2025
Authors: Heng Lin, Zhongwen Xu
cs.AI

Abstract

We study why Tool-Integrated Reasoning (TIR) makes Large Language Models (LLMs) more capable. While LLMs integrated with tools such as Python code interpreters show great promise, a principled theory explaining why this paradigm is effective has been missing. This work provides the first formal proof that TIR fundamentally expands an LLM's capabilities. We demonstrate that tools enable a strict expansion of the model's empirical and feasible support, breaking the capability ceiling of pure-text models by unlocking problem-solving strategies that are otherwise impossible or intractably verbose. To guide model behavior without compromising training stability or performance, we also introduce Advantage Shaping Policy Optimization (ASPO), a novel algorithm that directly modifies the advantage function to steer the policy. We conduct comprehensive experiments on challenging mathematical benchmarks, using a Python interpreter as the external tool. Our results show that the TIR model decisively outperforms its pure-text counterpart on the pass@k metric. Crucially, this advantage is not confined to computationally intensive problems but extends to those requiring significant abstract insight. We further identify the emergent cognitive patterns that illustrate how models learn to think with tools. Finally, we report that ASPO improves tool-use behavior: models invoke code earlier and engage in many more interactive turns. Overall, our work provides the first principled explanation for TIR's success, shifting the focus from the mere fact that tools work to why and how they enable more powerful reasoning.
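The abstract reports results on pass@k without stating how the metric is computed. As a minimal sketch, assuming the commonly used unbiased estimator (draw n samples per problem, count the c correct ones, and estimate the probability that at least one of k randomly chosen samples is correct), it could look as follows. The function names `pass_at_k` and `mean_pass_at_k` are illustrative and do not come from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples drawn for the problem
    c: number of those samples that are correct
    k: budget of attempts being evaluated (1 <= k <= n)

    Assumption: this is the standard unbiased estimator
    1 - C(n - c, k) / C(n, k); the abstract does not specify
    which estimator the authors use.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so every size-k subset contains a correct sample and the result is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark given (n, c) pairs per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this estimator, comparing a TIR model with a pure-text model at large k probes which problems each model can solve at all, while small k reflects how reliably it solves them.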