Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?
July 15, 2024
Authors: Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongshen Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, Tao Yu
cs.AI
Abstract
Data science and engineering workflows often span multiple stages, from
warehousing to orchestration, using tools like BigQuery, dbt, and Airbyte. As
vision language models (VLMs) advance in multimodal understanding and code
generation, VLM-based agents could potentially automate these workflows by
generating SQL queries, Python code, and GUI operations. This automation can
improve the productivity of experts while democratizing access to large-scale
data analysis. In this paper, we introduce Spider2-V, the first multimodal
agent benchmark focusing on professional data science and engineering
workflows, featuring 494 real-world tasks in authentic computer environments
and incorporating 20 enterprise-level professional applications. These tasks,
derived from real-world use cases, evaluate the ability of a multimodal agent
to perform data-related tasks by writing code and managing the GUI in
enterprise data software systems. To balance realistic simulation with
evaluation simplicity, we devote significant effort to developing automatic
configurations for task setup and carefully crafting evaluation metrics for
each task. Furthermore, we supplement multimodal agents with comprehensive
documents of these enterprise data software systems. Our empirical evaluation
reveals that existing state-of-the-art LLM/VLM-based agents do not reliably
automate full data workflows (14.0% success). Even with step-by-step guidance,
these agents still underperform in tasks that require fine-grained,
knowledge-intensive GUI actions (16.2%) and involve remote cloud-hosted
workspaces (10.6%). We hope that Spider2-V paves the way for autonomous
multimodal agents to transform the automation of data science and engineering
workflows. Our code and data are available at https://spider2-v.github.io.