τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

Abstract

Existing benchmarks do not test language agents on their interaction with human users or on their ability to follow domain-specific rules, both of which are vital for deploying them in real-world applications. We propose τ-bench, a benchmark emulating dynamic conversations between a user (simulated by language models) and a language agent provided with domain-specific API tools and policy guidelines. We employ an efficient and faithful evaluation process that compares the database state at the end of a conversation with the annotated goal state. We also propose a new metric (pass^k) to evaluate the reliability of agent behavior over multiple trials. Our experiments show that even state-of-the-art function calling agents (like gpt-4o) succeed on fewer than 50% of the tasks and are quite inconsistent (pass^8 < 25% in the retail domain). Our findings point to the need for methods that can improve the ability of agents to act consistently and follow rules reliably.
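The abstract does not spell out how pass^k is computed, so the following is only a minimal sketch under stated assumptions: each task is run for `num_trials` independent trials, `c` of which succeed, and pass^k is estimated per task as C(c, k) / C(n, k) (the fraction of size-k subsets of trials that all succeeded), then averaged over tasks. The function name `pass_hat_k` and its parameters are hypothetical and chosen for illustration.

```python
from math import comb


def pass_hat_k(successes_per_task, num_trials, k):
    """Estimate pass^k, assuming the per-task estimator C(c, k) / C(n, k).

    successes_per_task: number of successful trials for each task (hypothetical input format).
    num_trials: trials run per task (n).
    k: number of trials that must all succeed.
    """
    if not 1 <= k <= num_trials:
        raise ValueError("k must satisfy 1 <= k <= num_trials")
    if not successes_per_task:
        raise ValueError("need at least one task")

    total = 0.0
    for c in successes_per_task:
        if not 0 <= c <= num_trials:
            raise ValueError("success count must lie in [0, num_trials]")
        # math.comb(c, k) is 0 when c < k, so tasks with too few successes contribute 0.
        total += comb(c, k) / comb(num_trials, k)
    return total / len(successes_per_task)


# Illustrative (made-up) success counts for three tasks, 8 trials each.
counts = [8, 4, 0]
scores = {k: pass_hat_k(counts, num_trials=8, k=k) for k in (1, 2, 8)}
```

Under this assumed estimator, k = 1 reduces to the ordinary average success rate, while larger k rewards only tasks the agent solves consistently, which is why a pass^8 score can sit far below the single-trial success rate.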
