How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on τ-bench
August 28, 2025
Authors: Venkatesh Mishra, Amir Saeidi, Satyam Raj, Mutsumi Nakamura, Jayanth Srinivasa, Gaowen Liu, Ali Payani, Chitta Baral
cs.AI
Abstract
Recent advances in the reasoning and planning capabilities of large language models (LLMs) have enabled their potential as autonomous agents capable of tool use in dynamic environments. However, in multi-turn conversational environments like τ-bench, these agents often struggle with consistent reasoning, adherence to domain-specific policies, and extracting correct information over a long horizon of tool calls and conversation. To capture and mitigate these failures, we conduct a comprehensive manual analysis of the common errors occurring in conversation trajectories. We then experiment with reformulations of the inputs to the tool-calling agent to improve agent decision making. Finally, we propose the Input-Reformulation Multi-Agent (IRMA) framework, which automatically reformulates user queries, augmenting them with relevant domain rules and tool suggestions for the tool-calling agent to focus on. The results show that IRMA significantly outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores. These findings highlight the superior reliability and consistency of IRMA compared to other methods in dynamic environments.
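The abstract gives no implementation details, but the core idea (a reformulation step that sits in front of the tool-calling agent and enriches the raw user query with domain rules and tool suggestions) can be sketched as follows. This is a minimal illustration under assumed interfaces, not the authors' code: the names `call_llm`, `RULES`, `TOOL_HINTS`, and `reformulate_query` are all hypothetical.

```python
# Hypothetical sketch of an input-reformulation step in front of a
# tool-calling agent, in the spirit of IRMA. Not the authors' implementation.

RULES = [
    "Verify the user's identity before modifying any reservation.",
    "Never issue a refund without a confirmed order ID.",
]

TOOL_HINTS = {
    "refund": "consider `process_refund(order_id, amount)`",
    "reservation": "consider `update_reservation(reservation_id, changes)`",
}


def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion call (assumed interface)."""
    raise NotImplementedError


def reformulate_query(user_query: str) -> str:
    """Rewrite the raw user query, augmented with the domain rules and
    tool suggestions the tool-calling agent should focus on."""
    hints = [h for key, h in TOOL_HINTS.items() if key in user_query.lower()]
    prompt = (
        "Rewrite the user query below into a precise instruction for a "
        "tool-calling agent. Preserve every stated constraint.\n"
        "Domain rules:\n- " + "\n- ".join(RULES) + "\n"
        "Suggested tools:\n- " + "\n- ".join(hints or ["(none matched)"]) + "\n"
        f"User query: {user_query}"
    )
    return call_llm(prompt)
```

At each turn, the tool-calling agent would then receive `reformulate_query(raw_query)` instead of the raw query, keeping the relevant policies and tools in focus over long conversations.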
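The pass^5 score follows the pass^k metric defined in the τ-bench paper, which measures reliability: the probability that an agent succeeds on a task in all k of k i.i.d. trials, averaged over tasks. Assuming that definition, it can be estimated without bias from n ≥ k trials per task as C(c, k) / C(n, k), where c is the number of successful trials; a small sketch of that estimator:

```python
from math import comb


def pass_hat_k(num_success: int, num_trials: int, k: int) -> float:
    """Estimate pass^k: the probability that k i.i.d. trials of a task
    all succeed, given num_success successes out of num_trials."""
    if num_trials < k:
        raise ValueError("need at least k trials per task")
    return comb(num_success, k) / comb(num_trials, k)


# Example: a task solved in 6 of 8 trials has
# pass^5 = C(6,5) / C(8,5) = 6 / 56 ≈ 0.107; the benchmark score
# averages this quantity over all tasks.
```

Because a single failure among the k trials counts against the agent, pass^k rewards exactly the consistency that the abstract claims IRMA improves.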