Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

Abstract

Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like BigQuery, dbt, and Airbyte. As vision language models (VLMs) advance in multimodal understanding and code generation, VLM-based agents could potentially automate these workflows by generating SQL queries, Python code, and GUI operations. This automation can improve the productivity of experts while democratizing access to large-scale data analysis. In this paper, we introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering workflows, featuring 494 real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications. These tasks, derived from real-world use cases, evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems. To balance realistic simulation with evaluation simplicity, we devote significant effort to developing automatic configurations for task setup and carefully crafting evaluation metrics for each task. Furthermore, we supplement multimodal agents with comprehensive documentation of these enterprise data software systems. Our empirical evaluation reveals that existing state-of-the-art LLM/VLM-based agents do not reliably automate full data workflows (14.0% success). Even with step-by-step guidance, these agents still underperform in tasks that require fine-grained, knowledge-intensive GUI actions (16.2%) and in tasks that involve remote cloud-hosted workspaces (10.6%). We hope that Spider2-V paves the way for autonomous multimodal agents to transform the automation of data science and engineering workflows. Our code and data are available at https://spider2-v.github.io.
