EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commerce

Abstract

Foundation agents have advanced rapidly in their ability to reason and interact with real environments, making the evaluation of their core capabilities increasingly important. Many benchmarks have been developed to assess agent performance, but most concentrate on academic settings or artificially designed scenarios and overlook the challenges that arise in real applications. To address this issue, we focus on a highly practical real-world setting, the e-commerce domain, which involves a large volume of diverse user interactions, dynamic market conditions, and tasks directly tied to real decision-making processes. To this end, we introduce EcomBench, a holistic E-commerce Benchmark designed to evaluate agent performance in realistic e-commerce environments. EcomBench is built from genuine user demands embedded in leading global e-commerce ecosystems and is carefully curated and annotated by human experts to ensure clarity, accuracy, and domain relevance. It covers multiple task categories within e-commerce scenarios and defines three difficulty levels that evaluate agents on key capabilities such as deep information retrieval, multi-step reasoning, and cross-source knowledge integration. By grounding evaluation in real e-commerce contexts, EcomBench provides a rigorous and dynamic testbed for measuring the practical capabilities of agents in modern e-commerce.
