EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commerce
December 9, 2025
Authors: Rui Min, Zile Qiao, Ze Xu, Jiawen Zhai, Wenyu Gao, Xuanzhong Chen, Haozhen Sun, Zhen Zhang, Xinyu Wang, Hong Zhou, Wenbiao Yin, Xuan Zhou, Yong Jiang, Haicheng Liu, Liang Ding, Ling Zou, Yi R. Fung, Yalong Li, Pengjun Xie
cs.AI
Abstract
Foundation agents have rapidly advanced in their ability to reason and interact with real environments, making the evaluation of their core capabilities increasingly important. Many benchmarks have been developed to assess agent performance, but most concentrate on academic settings or artificially designed scenarios and overlook the challenges that arise in real applications. To address this issue, we focus on a highly practical real-world setting, the e-commerce domain, which involves a large volume of diverse user interactions, dynamic market conditions, and tasks directly tied to real decision-making processes. We introduce EcomBench, a holistic E-commerce Benchmark designed to evaluate agent performance in realistic e-commerce environments. EcomBench is built from genuine user demands embedded in leading global e-commerce ecosystems and is carefully curated and annotated by human experts to ensure clarity, accuracy, and domain relevance. It covers multiple task categories within e-commerce scenarios and defines three difficulty levels that evaluate agents on key capabilities such as deep information retrieval, multi-step reasoning, and cross-source knowledge integration. By grounding evaluation in real e-commerce contexts, EcomBench provides a rigorous and dynamic testbed for measuring the practical capabilities of agents in modern e-commerce.