DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?

Abstract

Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have demonstrated impressive language and vision reasoning abilities, igniting a recent trend of building agents for targeted applications such as shopping assistants or AI software engineers. Many data science benchmarks have recently been proposed to investigate their performance in the data science domain. However, existing benchmarks still fall short of real-world data science applications because of their simplified settings. To bridge this gap, we introduce DSBench, a comprehensive benchmark designed to evaluate data science agents on realistic tasks. The benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. DSBench offers a realistic setting by encompassing long contexts, multimodal task backgrounds, reasoning over large data files and multi-table structures, and end-to-end data modeling tasks. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks: the best agent solves only 34.12% of the data analysis tasks and achieves a 34.74% Relative Performance Gap (RPG). These findings underscore the need for further advancements in developing more practical, intelligent, and autonomous data science agents.
