τ^2-Bench: Evaluating Conversational Agents in a Dual-Control Environment

Abstract

Existing benchmarks for conversational AI agents simulate single-control environments, in which only the AI agent can use tools to interact with the world while the user remains a passive information provider. This differs from real-world scenarios such as technical support, where users must actively participate in modifying the state of the shared world. To address this gap, we introduce τ^2-Bench, with four key contributions: 1) a novel Telecom dual-control domain modeled as a Dec-POMDP, in which both agent and user use tools to act in a shared, dynamic environment that tests agent coordination and communication; 2) a compositional task generator that programmatically creates diverse, verifiable tasks from atomic components, ensuring domain coverage and controlled complexity; 3) a reliable user simulator tightly coupled with the environment, whose behavior is constrained by tools and observable states, improving simulation fidelity; and 4) fine-grained analysis of agent performance through multiple ablations, including separating errors that arise from reasoning from those that arise from communication and coordination. In particular, our experiments show significant performance drops when agents shift from the no-user setting to the dual-control setting, highlighting the challenge of guiding users. Overall, τ^2-Bench provides a controlled testbed for agents that must both reason effectively and guide user actions.
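To make the dual-control idea and the compositional task generator concrete, the following is a minimal illustrative sketch, not the benchmark's actual implementation. Every name in it (`World`, `ATOMIC_FAULTS`, `generate_tasks`, `run_task`, and the two example faults) is a hypothetical stand-in chosen for illustration. It assumes a shared state in which some settings can be changed only by the user and others only by the agent, composes atomic faults into tasks, and verifies each task by checking the final state.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class World:
    """Hypothetical shared state; field names are illustrative only."""
    airplane_mode: bool = False   # device setting, changeable only by the user
    line_suspended: bool = False  # account setting, changeable only by the agent

    def service_ok(self) -> bool:
        # Service works only when neither fault is present.
        return not self.airplane_mode and not self.line_suspended


# Fault initializers (assumed examples, not taken from the source).
def break_airplane(w: World) -> None:
    w.airplane_mode = True


def break_line(w: World) -> None:
    w.line_suspended = True


# Tools: one available to the user, one available to the agent.
def user_disable_airplane_mode(w: World) -> None:
    w.airplane_mode = False


def agent_resume_line(w: World) -> None:
    w.line_suspended = False


# Atomic components: (how to inject the fault, who can fix it, the fixing tool).
ATOMIC_FAULTS = {
    "airplane_mode": (break_airplane, "user", user_disable_airplane_mode),
    "line_suspended": (break_line, "agent", agent_resume_line),
}


def generate_tasks():
    """Compose every non-empty subset of atomic faults into one task."""
    names = sorted(ATOMIC_FAULTS)
    return [
        combo
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    ]


def run_task(task, allowed_actors):
    """Inject the task's faults, let only the allowed actors apply their
    tools, then verify the outcome from the final world state."""
    world = World()
    for name in task:
        ATOMIC_FAULTS[name][0](world)
    for name in task:
        _, actor, fix = ATOMIC_FAULTS[name]
        if actor in allowed_actors:
            fix(world)
    return world.service_ok()


if __name__ == "__main__":
    for task in generate_tasks():
        print(task,
              "agent only:", run_task(task, {"agent"}),
              "agent + user:", run_task(task, {"agent", "user"}))
```

In this toy setup, any task that includes a user-side fault cannot be solved by the agent's tools alone, which is the property that forces the agent to guide the user rather than act unilaterally.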
