τ^2-Bench: Evaluating Conversational Agents in a Dual-Control Environment
June 9, 2025
Authors: Victor Barres, Honghua Dong, Soham Ray, Xujie Si, Karthik Narasimhan
cs.AI
Abstract
Existing benchmarks for conversational AI agents simulate single-control
environments, where only the AI agent can use tools to interact with the world,
while the user remains a passive information provider. This differs from
real-world scenarios like technical support, where users need to actively
participate in modifying the state of the (shared) world. In order to address
this gap, we introduce τ^2-Bench, with four key contributions:
1) A novel Telecom dual-control domain modeled as a Dec-POMDP, where both
agent and user make use of tools to act in a shared, dynamic environment that
tests both agent coordination and communication,
2) A compositional task generator that programmatically creates diverse,
verifiable tasks from atomic components, ensuring domain coverage and
controlled complexity,
3) A reliable user simulator tightly coupled with the environment, whose
behavior is constrained by tools and observable states, improving simulation
fidelity,
4) Fine-grained analysis of agent performance through multiple ablations,
including separating errors arising from reasoning vs.
communication/coordination.
In particular, our experiments show significant performance drops when agents
shift from no-user to dual-control, highlighting the challenges of guiding
users. Overall, τ^2-Bench provides a controlled testbed for agents that
must both reason effectively and guide user actions.
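
To make the dual-control and compositional-task ideas above more concrete, here is a minimal Python sketch. It is not the benchmark's actual code or API: the names `SharedWorld`, `AtomicIssue`, `USER_TOOLS`, `AGENT_TOOLS`, `compose_tasks`, and the two toy telecom faults are all hypothetical, chosen only for illustration. The sketch assumes a shared state in which some fields can only be changed by user-side tools and others only by agent-side tools, and it builds tasks by combining atomic faults.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, Dict, List, Tuple


@dataclass
class SharedWorld:
    """Hypothetical shared state for a toy telecom scenario."""
    airplane_mode: bool = False   # device-side: only a user tool can change it
    line_suspended: bool = False  # account-side: only an agent tool can change it

    def service_works(self) -> bool:
        return not self.airplane_mode and not self.line_suspended


# Tools are split by who controls them (the "dual control" assumption).
def disable_airplane_mode(world: SharedWorld) -> None:
    world.airplane_mode = False


def resume_line(world: SharedWorld) -> None:
    world.line_suspended = False


USER_TOOLS: Dict[str, Callable[[SharedWorld], None]] = {
    "disable_airplane_mode": disable_airplane_mode,
}
AGENT_TOOLS: Dict[str, Callable[[SharedWorld], None]] = {
    "resume_line": resume_line,
}


def _inject_airplane_mode(world: SharedWorld) -> None:
    world.airplane_mode = True


def _inject_suspension(world: SharedWorld) -> None:
    world.line_suspended = True


@dataclass(frozen=True)
class AtomicIssue:
    """One atomic fault plus the role and tool needed to fix it."""
    name: str
    inject: Callable[[SharedWorld], None]
    fixer: str      # "user" or "agent"
    fix_tool: str   # key into USER_TOOLS or AGENT_TOOLS


ISSUES: List[AtomicIssue] = [
    AtomicIssue("airplane_mode_on", _inject_airplane_mode, "user", "disable_airplane_mode"),
    AtomicIssue("line_suspended", _inject_suspension, "agent", "resume_line"),
]

Task = Tuple[AtomicIssue, ...]


def compose_tasks(issues: List[AtomicIssue], max_size: int) -> List[Task]:
    """Enumerate every non-empty combination of atomic issues up to max_size."""
    tasks: List[Task] = []
    for size in range(1, max_size + 1):
        tasks.extend(combinations(issues, size))
    return tasks


def build_world(task: Task) -> SharedWorld:
    """Create the initial (broken) world for a task by injecting its issues."""
    world = SharedWorld()
    for issue in task:
        issue.inject(world)
    return world


def apply_fixes(world: SharedWorld, task: Task, roles: Tuple[str, ...]) -> None:
    """Apply the fix for each issue, but only for the roles allowed to act."""
    for issue in task:
        if issue.fixer not in roles:
            continue
        tools = USER_TOOLS if issue.fixer == "user" else AGENT_TOOLS
        tools[issue.fix_tool](world)
```

In this toy setup, a task that combines both faults cannot be resolved by the agent's tools alone: the agent has to get the user to act on the device-side state, which is the kind of coordination the abstract describes. Because each task is defined by the faults it injects, success can be checked programmatically with `service_works()`.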