τ^2-Bench: Evaluating Conversational Agents in a Dual-Control Environment

June 9, 2025
Authors: Victor Barres, Honghua Dong, Soham Ray, Xujie Si, Karthik Narasimhan
cs.AI

Abstract

Existing benchmarks for conversational AI agents simulate single-control environments, where only the AI agent can use tools to interact with the world, while the user remains a passive information provider. This differs from real-world scenarios such as technical support, where users must actively participate in modifying the state of the (shared) world. To address this gap, we introduce τ^2-bench, with four key contributions: 1) a novel Telecom dual-control domain modeled as a Dec-POMDP (decentralized partially observable Markov decision process), where both agent and user use tools to act in a shared, dynamic environment that tests both agent coordination and communication; 2) a compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity; 3) a reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity; and 4) fine-grained analysis of agent performance through multiple ablations, including separating errors arising from reasoning from those arising from communication and coordination. In particular, our experiments show significant performance drops when agents shift from the no-user setting to the dual-control setting, highlighting the challenges of guiding users. Overall, τ^2-bench provides a controlled testbed for agents that must both reason effectively and guide user actions.