How Can Input Reformulation Improve Tool Usage Accuracy in a Complex Dynamic Environment? A Study on τ-bench
August 28, 2025
Authors: Venkatesh Mishra, Amir Saeidi, Satyam Raj, Mutsumi Nakamura, Jayanth Srinivasa, Gaowen Liu, Ali Payani, Chitta Baral
cs.AI
Abstract
Recent advances in the reasoning and planning capabilities of large language models (LLMs) have enabled their potential as autonomous agents capable of tool use in dynamic environments. However, in multi-turn conversational environments like τ-bench, these agents often struggle with consistent reasoning, adherence to domain-specific policies, and extracting correct information over a long horizon of tool calls and conversation. To capture and mitigate these failures, we conduct a comprehensive manual analysis of the common errors occurring in conversation trajectories. We then experiment with reformulations of the inputs to the tool-calling agent to improve agent decision making. Finally, we propose the Input-Reformulation Multi-Agent (IRMA) framework, which automatically reformulates user queries, augmenting them with relevant domain rules and tool suggestions for the tool-calling agent to focus on. The results show that IRMA significantly outperforms ReAct, Function Calling, and Self-Reflection by 16.1%, 12.7%, and 19.1%, respectively, in overall pass^5 scores. These findings highlight the superior reliability and consistency of IRMA compared to other methods in dynamic environments.
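The abstract gives no implementation details, but the core idea (a reformulation step that sits in front of the tool-calling agent and enriches the raw user query with domain rules and tool suggestions) can be sketched as follows. This is a minimal illustration under assumed interfaces, not the authors' code: the names `call_llm`, `RULES`, `TOOL_HINTS`, and `reformulate_query` are all hypothetical.

```python
# Hypothetical sketch of an input-reformulation step in front of a
# tool-calling agent, in the spirit of IRMA. Not the authors' implementation.

RULES = [
    "Verify the user's identity before modifying any reservation.",
    "Never issue a refund without a confirmed order ID.",
]

TOOL_HINTS = {
    "refund": "consider `process_refund(order_id, amount)`",
    "reservation": "consider `update_reservation(reservation_id, changes)`",
}


def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion call (assumed interface)."""
    raise NotImplementedError


def reformulate_query(user_query: str) -> str:
    """Rewrite the raw user query, augmented with the domain rules and
    tool suggestions the tool-calling agent should focus on."""
    hints = [h for key, h in TOOL_HINTS.items() if key in user_query.lower()]
    prompt = (
        "Rewrite the user query below into a precise instruction for a "
        "tool-calling agent. Preserve every stated constraint.\n"
        "Domain rules:\n- " + "\n- ".join(RULES) + "\n"
        "Suggested tools:\n- " + "\n- ".join(hints or ["(none matched)"]) + "\n"
        f"User query: {user_query}"
    )
    return call_llm(prompt)
```

At each turn, the tool-calling agent would then receive `reformulate_query(raw_query)` instead of the raw query, keeping the relevant policies and tools in focus over long conversations.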
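The pass^5 score follows the pass^k metric defined in the τ-bench paper, which measures reliability: the probability that an agent succeeds on a task in all k of k i.i.d. trials, averaged over tasks. Assuming that definition, it can be estimated without bias from n ≥ k trials per task as C(c, k) / C(n, k), where c is the number of successful trials; a small sketch of that estimator:

```python
from math import comb


def pass_hat_k(num_success: int, num_trials: int, k: int) -> float:
    """Estimate pass^k: the probability that k i.i.d. trials of a task
    all succeed, given num_success successes out of num_trials."""
    if num_trials < k:
        raise ValueError("need at least k trials per task")
    return comb(num_success, k) / comb(num_trials, k)


# Example: a task solved in 6 of 8 trials has
# pass^5 = C(6,5) / C(8,5) = 6 / 56 ≈ 0.107; the benchmark score
# averages this quantity over all tasks.
```

Because a single failure among the k trials counts against the agent, pass^k rewards exactly the consistency that the abstract claims IRMA improves.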