PORTool: Tool-Use LLM Training with Rewarded Tree
October 29, 2025
Authors: Feijie Wu, Weiwu Zhu, Yuxiang Zhang, Soumya Chatterjee, Jiarong Zhu, Fan Mo, Rodin Luo, Jing Gao
cs.AI
Abstract
Current tool-use large language models (LLMs) are trained on static datasets,
enabling them to interact with external tools and perform multi-step,
tool-integrated reasoning, which produces tool-call trajectories. However,
these models imitate how a query is resolved in a generic tool-call routine,
thereby failing to explore possible solutions and demonstrating limited
performance in an evolving, dynamic tool-call environment. In this work, we
propose PORTool, a reinforcement learning (RL) method that encourages a
tool-use LLM to explore various trajectories yielding the correct answer.
Specifically, this method starts with generating multiple rollouts for a given
query, and some of them share the first few tool-call steps, thereby forming a
tree-like structure. Next, we assign rewards to each step, based on its ability
to produce a correct answer and make successful tool calls. A shared step
across different trajectories receives the same reward, while different steps
under the same fork receive different rewards. Finally, these step-wise rewards
are used to calculate fork-relative advantages, which are blended with
trajectory-relative advantages to train the LLM for tool use. The experiments
utilize 17 tools to address user queries, covering both time-sensitive and
time-invariant topics. We conduct ablation studies to systematically justify
the necessity and the design robustness of step-wise rewards. Furthermore, we
compare the proposed PORTool with other training approaches and demonstrate
significant improvements in final accuracy and the number of tool-call steps.
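
The training recipe described above (tree-structured rollouts, step-wise rewards shared across trajectories with a common prefix, and fork-relative advantages blended with trajectory-relative advantages) can be pictured with a minimal sketch. The sketch below assumes a GRPO-style group normalization for both advantage terms and an illustrative mixing weight alpha; the names Step, fork_relative_advantages, trajectory_relative_advantages, and blended_advantage are hypothetical and do not reflect the paper's exact formulation.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev


@dataclass
class Step:
    """One tool-call step in a rollout tree; children are alternative next steps under this fork."""
    reward: float = 0.0                       # step-wise reward: answer correctness + tool-call success
    children: list["Step"] = field(default_factory=list)


def fork_relative_advantages(siblings: list["Step"]) -> list[float]:
    """Advantage of each sibling step relative to the other steps under the same fork.
    A step shared by several trajectories appears once in the tree, so it keeps a single reward."""
    rewards = [s.reward for s in siblings]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]


def trajectory_relative_advantages(trajectory_rewards: list[float]) -> list[float]:
    """GRPO-style advantage of each complete rollout relative to the whole group of rollouts."""
    mu, sigma = mean(trajectory_rewards), pstdev(trajectory_rewards)
    return [(r - mu) / (sigma + 1e-8) for r in trajectory_rewards]


def blended_advantage(fork_adv: float, traj_adv: float, alpha: float = 0.5) -> float:
    """Blend the two signals for a given step; alpha is an illustrative mixing weight, not a value from the paper."""
    return alpha * fork_adv + (1.0 - alpha) * traj_adv


# Toy example: two rollouts share the first tool-call step, then fork into two branches.
branch_a, branch_b = Step(reward=1.0), Step(reward=0.2)
root = Step(reward=0.8, children=[branch_a, branch_b])

fork_adv = fork_relative_advantages(root.children)      # per-branch, within the fork
traj_adv = trajectory_relative_advantages([1.8, 1.0])   # per-rollout, e.g. summed step rewards
print([blended_advantage(f, t) for f, t in zip(fork_adv, traj_adv)])
```

In this toy setup the shared root step keeps a single reward, while the two branches under the same fork are normalized against each other; how PORTool actually weights and combines the two advantage terms is specified in the paper, not in this sketch.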