τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
June 17, 2024
Authors: Shunyu Yao, Noah Shinn, Pedram Razavi, Karthik Narasimhan
cs.AI
Abstract
Existing benchmarks do not test language agents on their interaction with
human users or ability to follow domain-specific rules, both of which are vital
for deploying them in real-world applications. We propose tau-bench, a
benchmark emulating dynamic conversations between a user (simulated by language
models) and a language agent provided with domain-specific API tools and policy
guidelines. We employ an efficient and faithful evaluation process that
compares the database state at the end of a conversation with the annotated
goal state. We also propose a new metric (pass^k) to evaluate the reliability
of agent behavior over multiple trials. Our experiments show that even
state-of-the-art function calling agents (like gpt-4o) succeed on <50% of the
tasks, and are quite inconsistent (pass^8 <25% in retail). Our findings point
to the need for methods that can improve the ability of agents to act
consistently and follow rules reliably.
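The pass^k metric mentioned above measures how reliably an agent solves the same task across repeated trials. As a rough, unofficial sketch (not the authors' reference implementation), the snippet below assumes pass^k is the probability that k independently drawn trials of a task all succeed, estimated per task as C(c, k) / C(n, k) for c successful trials out of n, then averaged over tasks; all names here are illustrative.

```python
# Hypothetical sketch of a pass^k-style reliability estimate.
# Assumption: pass^k = average over tasks of the probability that k trials,
# sampled without replacement from n recorded trials, are all successful,
# i.e. C(c, k) / C(n, k) where c is the task's number of successful trials.
from math import comb

def pass_hat_k(successes_per_task: list[int], n_trials: int, k: int) -> float:
    """Average over tasks of the chance that k sampled trials all succeed."""
    assert 1 <= k <= n_trials
    per_task = [comb(c, k) / comb(n_trials, k) for c in successes_per_task]
    return sum(per_task) / len(per_task)

# Example: 3 tasks, 8 trials each, with 8, 5, and 2 successful trials.
print(pass_hat_k([8, 5, 2], n_trials=8, k=1))  # k=1 reduces to average success rate
print(pass_hat_k([8, 5, 2], n_trials=8, k=8))  # only tasks solved in all 8 trials count
```

Under this reading, pass^1 matches the ordinary per-task success rate, while larger k penalizes inconsistency, which is why pass^8 can fall well below pass^1 for the same agent.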