PORTool: Tool-Use LLM Training with Rewarded Tree

Abstract

Current tool-use large language models (LLMs) are trained on static datasets, which enables them to interact with external tools and perform multi-step, tool-integrated reasoning that produces tool-call trajectories. However, these models only imitate how a query is resolved in a generic tool-call routine; they therefore fail to explore alternative solutions and show limited performance in an evolved, dynamic tool-call environment. In this work, we propose PORTool, a reinforcement learning (RL) method that encourages a tool-use LLM to explore diverse trajectories that yield the correct answer. Specifically, the method first generates multiple rollouts for a given query, some of which share their first few tool-call steps and thus form a tree-like structure. Next, it assigns a reward to each step based on the step's ability to produce a correct answer and to make successful tool calls. A step shared across different trajectories receives the same reward, while different steps under the same fork receive different rewards. Finally, these step-wise rewards are used to compute fork-relative advantages, which are blended with trajectory-relative advantages to train the LLM for tool use. The experiments use 17 tools to address user queries covering both time-sensitive and time-invariant topics. Ablation studies systematically justify the necessity and the design robustness of the step-wise rewards. Furthermore, we compare PORTool with other training approaches and demonstrate significant improvements in final accuracy and in the number of tool-call steps.
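As a rough illustration of the advantage computation described above, the following minimal sketch shows one way fork-relative and trajectory-relative advantages could be blended over a tree of rollouts. The abstract does not give the formulas, so everything concrete here is an assumption: the function name `blended_advantages`, the step reward (mean final outcome of the rollouts passing through a step plus a small `tool_bonus` for a successful call), the mean baselines, and the `blend` weight are hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def blended_advantages(trajectories, outcomes, tool_ok, blend=0.5, tool_bonus=0.1):
    """Hypothetical sketch, not the paper's exact method.

    trajectories: list of rollouts, each a list of step ids. Rollouts that
        share a prefix reuse the same step ids, which forms the tree.
    outcomes: final reward per rollout (e.g. 1.0 if the answer is correct).
    tool_ok: dict mapping step id -> whether the tool call succeeded.
    Returns one list of per-step advantages for each rollout.
    """
    # Assumed step reward: average outcome of every rollout that passes
    # through the step, plus a bonus when the tool call succeeded.
    # Keying by step id means a shared step gets a single, shared reward.
    passing = defaultdict(list)
    for steps, outcome in zip(trajectories, outcomes):
        for step in steps:
            passing[step].append(outcome)
    step_reward = {
        step: mean(vals) + (tool_bonus if tool_ok[step] else 0.0)
        for step, vals in passing.items()
    }

    # Group sibling steps under their common parent (the fork).
    children = defaultdict(set)
    for steps in trajectories:
        parent = None  # None stands for the root (the query itself)
        for step in steps:
            children[parent].add(step)
            parent = step

    # Fork-relative advantage: step reward minus the mean over its siblings.
    fork_adv = {}
    for siblings in children.values():
        baseline = mean(step_reward[s] for s in siblings)
        for s in siblings:
            fork_adv[s] = step_reward[s] - baseline

    # Trajectory-relative advantage: outcome minus the mean over rollouts.
    traj_baseline = mean(outcomes)
    result = []
    for steps, outcome in zip(trajectories, outcomes):
        traj_adv = outcome - traj_baseline
        result.append([blend * fork_adv[s] + (1 - blend) * traj_adv for s in steps])
    return result


if __name__ == "__main__":
    # Two rollouts share step "a" and then fork into "b1" and "b2".
    rollouts = [["a", "b1"], ["a", "b2"]]
    print(blended_advantages(rollouts, [1.0, 0.0], {"a": True, "b1": True, "b2": True}))
```

In this toy tree the shared step "a" has no siblings, so its fork-relative term is zero and only the trajectory-relative term distinguishes the two rollouts at that step; the fork-relative term matters at "b1" and "b2", where the rollouts diverge.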
