Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

Abstract

Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like BigQuery, dbt, and Airbyte. As vision language models (VLMs) advance in multimodal understanding and code generation, VLM-based agents could potentially automate these workflows by generating SQL queries, Python code, and GUI operations. This automation can improve the productivity of experts while democratizing access to large-scale data analysis. In this paper, we introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering workflows, featuring 494 real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications. These tasks, derived from real-world use cases, evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems. To balance realistic simulation with evaluation simplicity, we devote significant effort to developing automatic configurations for task setup and carefully crafting evaluation metrics for each task. Furthermore, we supplement multimodal agents with comprehensive documentation for these enterprise data software systems. Our empirical evaluation reveals that existing state-of-the-art LLM/VLM-based agents do not reliably automate full data workflows (14.0% success). Even with step-by-step guidance, these agents still underperform on tasks that require fine-grained, knowledge-intensive GUI actions (16.2%) or involve remote cloud-hosted workspaces (10.6%). We hope that Spider2-V paves the way for autonomous multimodal agents to transform the automation of data science and engineering workflows. Our code and data are available at https://spider2-v.github.io.
