Understanding Tool-Integrated Reasoning

Abstract

We study why Tool-Integrated Reasoning (TIR) makes Large Language Models (LLMs) more capable. LLMs integrated with tools such as Python code interpreters show great promise, but a principled theory explaining why this paradigm is effective has been missing. This work provides the first formal proof that TIR fundamentally expands an LLM's capabilities. We demonstrate that tools strictly expand the model's empirical and feasible support, breaking the capability ceiling of pure-text models by unlocking problem-solving strategies that are otherwise impossible or intractably verbose. To guide model behavior without compromising training stability and performance, we also introduce Advantage Shaping Policy Optimization (ASPO), a novel algorithm that directly modifies the advantage function to steer the policy. We conduct comprehensive experiments on challenging mathematical benchmarks, using a Python interpreter as the external tool. Our results show that the TIR model decisively outperforms its pure-text counterpart on the pass@k metric. Crucially, this advantage is not confined to computationally intensive problems but extends to those requiring significant abstract insight. We further identify emergent cognitive patterns that illustrate how models learn to think with tools. Finally, we report improved tool-usage behavior with ASPO: code is invoked earlier and the model takes many more interactive turns. Overall, our work provides the first principled explanation for TIR's success, shifting the focus from the mere fact that tools work to why and how they enable more powerful reasoning.
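The abstract reports results on pass@k but does not say how the metric is computed. A minimal sketch follows, assuming the widely used unbiased estimator: draw n samples per problem, count the c correct ones, and estimate the probability that at least one of k randomly chosen samples is correct. The function name `pass_at_k` and the sample counts in the usage lines are illustrative assumptions, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples drawn for the problem (assumed setup)
    c: number of those samples that are correct
    k: budget of attempts being evaluated
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample,
    # subtracted from 1. comb(n - c, k) is 0 when fewer than k samples are wrong.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts for a single problem: 4 samples, 1 correct.
print(pass_at_k(n=4, c=1, k=1))
print(pass_at_k(n=4, c=1, k=2))
```

Averaging this value over all problems in a benchmark gives the benchmark-level pass@k; comparing the resulting curves as k grows is one way to contrast a TIR model with a pure-text model.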
