Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

Abstract

Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools such as BigQuery, dbt, and Airbyte. As vision language models (VLMs) advance in multimodal understanding and code generation, VLM-based agents could potentially automate these workflows by generating SQL queries, Python code, and GUI operations. This automation can improve the productivity of experts while democratizing access to large-scale data analysis. In this paper, we introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering workflows, featuring 494 real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications. These tasks, derived from real-world use cases, evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems. To balance realistic simulation with evaluation simplicity, we devote significant effort to developing automatic configurations for task setup and carefully crafting evaluation metrics for each task. Furthermore, we supplement multimodal agents with comprehensive documentation of these enterprise data software systems. Our empirical evaluation reveals that existing state-of-the-art LLM/VLM-based agents do not reliably automate full data workflows (14.0% success). Even with step-by-step guidance, these agents still underperform in tasks that require fine-grained, knowledge-intensive GUI actions (16.2%) and in tasks that involve remote cloud-hosted workspaces (10.6%). We hope that Spider2-V paves the way for autonomous multimodal agents to transform the automation of data science and engineering workflows. Our code and data are available at https://spider2-v.github.io.
