VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications

September 30, 2025
作者: Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, Man Gao, Xi Su, Xiaodong Cai, Xunliang Cai, Yu Yang, Yunke Zhao
cs.AI

Abstract

As LLM-based agents are increasingly deployed in real-life scenarios, existing benchmarks fail to capture the inherent complexity of handling extensive information, leveraging diverse resources, and managing dynamic user interactions. To address this gap, we introduce VitaBench, a challenging benchmark that evaluates agents on versatile interactive tasks grounded in real-world settings. Drawing from daily applications in food delivery, in-store consumption, and online travel services, VitaBench presents agents with the most complex life-serving simulation environment to date, comprising 66 tools. Through a framework that eliminates domain-specific policies, we enable flexible composition of these scenarios and tools, yielding 100 cross-scenario tasks (main results) and 300 single-scenario tasks. Each task is derived from multiple real user requests and requires agents to reason across temporal and spatial dimensions, utilize complex tool sets, proactively clarify ambiguous instructions, and track shifting user intent throughout multi-turn conversations. Moreover, we propose a rubric-based sliding window evaluator, enabling robust assessment of diverse solution pathways in complex environments and stochastic interactions. Our comprehensive evaluation reveals that even the most advanced models achieve only a 30% success rate on cross-scenario tasks and less than 50% on the others. Overall, we believe VitaBench will serve as a valuable resource for advancing the development of AI agents in practical real-world applications. The code, dataset, and leaderboard are available at https://vitabench.github.io/
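To make the evaluation idea concrete, the snippet below is a minimal, hypothetical sketch of rubric-based sliding-window scoring. The names `RubricItem` and `sliding_window_score`, and the keyword-based checks, are illustrative assumptions rather than VitaBench's actual evaluator, which presumably applies rubric criteria over bounded windows of the agent-user interaction using an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class RubricItem:
    """One checkable criterion, e.g. 'agent confirmed the delivery address'."""
    name: str
    check: Callable[[Sequence[str]], bool]  # judge applied to a window of turns


def sliding_window_score(
    turns: Sequence[str],
    rubric: List[RubricItem],
    window_size: int = 4,
    stride: int = 1,
) -> float:
    """Hypothetical scorer: a rubric item counts as satisfied if its check
    passes in any window of consecutive turns; return the satisfied fraction."""
    if not turns:
        return 0.0
    satisfied = set()
    for start in range(0, max(1, len(turns) - window_size + 1), stride):
        window = turns[start : start + window_size]
        for item in rubric:
            if item.name not in satisfied and item.check(window):
                satisfied.add(item.name)
    return len(satisfied) / len(rubric) if rubric else 1.0


# Toy usage with keyword checks standing in for an LLM judge.
if __name__ == "__main__":
    conversation = [
        "user: I want dinner delivered and a hotel for Friday.",
        "agent: Which city should I search hotels in?",
        "user: Shanghai, near the Bund.",
        "agent: Booked the hotel; your noodles arrive at 19:30.",
    ]
    rubric = [
        RubricItem("clarified_city", lambda w: any("which city" in t.lower() for t in w)),
        RubricItem("confirmed_booking", lambda w: any("booked" in t.lower() for t in w)),
    ]
    print(sliding_window_score(conversation, rubric, window_size=2))
```

In a real setup the per-item check would likely be an LLM judge rather than keyword matching; the sliding window keeps each judgment local to a bounded span of turns, so long, stochastic multi-turn conversations remain tractable to score while still crediting diverse solution paths.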