
DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

January 20, 2026
作者: Maojun Sun, Yifei Xie, Yue Wu, Ruijian Han, Binyan Jiang, Defeng Sun, Yancheng Yuan, Jian Huang
cs.AI

Abstract

Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this, we introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., vision and text). DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and results. We systematically evaluate 11 advanced agentic LLMs using DSAEval. Our results show that Claude-Sonnet-4.5 achieves the strongest overall performance, GPT-5.2 is the most efficient, and MiMo-V2-Flash is the most cost-effective. We further demonstrate that multimodal perception consistently improves performance on vision-related tasks, with gains ranging from 2.04% to 11.30%. Overall, while current data science agents perform well on structured data and routine data analysis workflows, substantial challenges remain in unstructured domains. Finally, we offer critical insights and outline future research directions to advance the development of data science agents.
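The abstract does not describe how the three evaluation dimensions are combined into an overall score. Purely as an illustrative sketch (the `DimensionScores` schema, the 0-1 scale, and the equal weights below are assumptions for exposition, not DSAEval's actual scoring scheme), a multi-dimensional assessment over reasoning, code, and results could be aggregated as a weighted mean per problem:

```python
from dataclasses import dataclass

@dataclass
class DimensionScores:
    """Per-problem judgments on a 0-1 scale (hypothetical schema)."""
    reasoning: float  # quality of the agent's analysis plan
    code: float       # correctness/executability of the generated code
    result: float     # agreement of the final output with the reference

def overall_score(s: DimensionScores,
                  weights: tuple[float, float, float] = (1/3, 1/3, 1/3)) -> float:
    """Weighted aggregate across the three dimensions (illustrative only)."""
    w_reasoning, w_code, w_result = weights
    return w_reasoning * s.reasoning + w_code * s.code + w_result * s.result

# Example: a run that reasons well but produces a partially incorrect result.
print(overall_score(DimensionScores(reasoning=0.9, code=0.8, result=0.5)))
```

A per-dimension breakdown like this is what allows the benchmark to distinguish, for instance, an agent that writes correct code from one that also reaches the right final answer.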