τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
June 17, 2024
Authors: Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan
cs.AI
Abstract
Existing benchmarks do not test language agents on their interaction with
human users or ability to follow domain-specific rules, both of which are vital
for deploying them in real world applications. We propose tau-bench, a
benchmark emulating dynamic conversations between a user (simulated by language
models) and a language agent provided with domain-specific API tools and policy
guidelines. We employ an efficient and faithful evaluation process that
compares the database state at the end of a conversation with the annotated
goal state. We also propose a new metric (pass^k) to evaluate the reliability
of agent behavior over multiple trials. Our experiments show that even
state-of-the-art function calling agents (like gpt-4o) succeed on <50% of the
tasks, and are quite inconsistent (pass^8 <25% in retail). Our findings point
to the need for methods that can improve the ability of agents to act
consistently and follow rules reliably.
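For readers unfamiliar with pass^k: it measures the probability that an agent solves a task in all of k i.i.d. trials, in contrast to pass@k, which requires only one success. The following is a minimal sketch of the standard unbiased combinatorial estimator, assuming n trials per task with c successes; the helper name and example numbers are illustrative and may differ in detail from the paper's exact formulation.

```python
from math import comb

def pass_hat_k(num_trials: int, num_successes: int, k: int) -> float:
    """Estimate pass^k for one task: the probability that all k
    independently sampled trials succeed, based on num_trials runs
    of which num_successes succeeded. (Hypothetical helper, not the
    paper's reference implementation.)"""
    if k > num_trials:
        raise ValueError("k cannot exceed the number of trials")
    # C(c, k) / C(n, k): fraction of k-subsets of trials that are all successes.
    return comb(num_successes, k) / comb(num_trials, k)

# Example: a task solved in 5 of 8 trials.
print(pass_hat_k(8, 5, 1))  # 0.625 -- ordinary per-trial success rate
print(pass_hat_k(8, 5, 4))  # ~0.071 -- all 4 of 4 trials succeeding is much rarer
print(pass_hat_k(8, 5, 8))  # 0.0   -- at least one of the 8 trials failed
```

The per-task estimates are averaged over all tasks in a domain; because pass^k decreases as k grows, a large gap between pass^1 and pass^8 (as reported for the retail domain) indicates inconsistent agent behavior across repeated attempts at the same task.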