VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
September 30, 2025
Authors: Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, Man Gao, Xi Su, Xiaodong Cai, Xunliang Cai, Yu Yang, Yunke Zhao
cs.AI
Abstract
As LLM-based agents are increasingly deployed in real-life scenarios,
existing benchmarks fail to capture the inherent complexity of handling
extensive information, leveraging diverse resources, and managing dynamic user
interactions. To address this gap, we introduce VitaBench, a challenging
benchmark that evaluates agents on versatile interactive tasks grounded in
real-world settings. Drawing from daily applications in food delivery, in-store
consumption, and online travel services, VitaBench presents agents with the
most complex life-serving simulation environment to date, comprising 66 tools.
Through a framework that eliminates domain-specific policies, we enable
flexible composition of these scenarios and tools, yielding 100 cross-scenario
tasks (main results) and 300 single-scenario tasks. Each task is derived from
multiple real user requests and requires agents to reason across temporal and
spatial dimensions, utilize complex tool sets, proactively clarify ambiguous
instructions, and track shifting user intent throughout multi-turn
conversations. Moreover, we propose a rubric-based sliding window evaluator,
enabling robust assessment of diverse solution pathways in complex environments
and stochastic interactions. Our comprehensive evaluation reveals that even the
most advanced models achieve only a 30% success rate on cross-scenario tasks and
less than 50% on single-scenario tasks. Overall, we believe VitaBench will serve
as a valuable resource for advancing the development of AI agents in practical
real-world applications. The code, dataset, and leaderboard are available at
https://vitabench.github.io/