DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?

Abstract

Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have demonstrated impressive language/vision reasoning abilities, igniting the recent trend of building agents for targeted applications such as shopping assistants or AI software engineers. Recently, many data science benchmarks have been proposed to investigate their performance in the data science domain. However, existing data science benchmarks still fall short when compared to real-world data science applications due to their simplified settings. To bridge this gap, we introduce DSBench, a comprehensive benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. DSBench offers a realistic setting by encompassing long contexts, multimodal task backgrounds, reasoning with large data files and multi-table structures, and performing end-to-end data modeling tasks. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG). These findings underscore the need for further advancements in developing more practical, intelligent, and autonomous data science agents.
