
DSAEval: Evaluating Data Science Agents on a Wide Range of Real-World Data Science Problems

January 20, 2026
Authors: Maojun Sun, Yifei Xie, Yue Wu, Ruijian Han, Binyan Jiang, Defeng Sun, Yancheng Yuan, Jian Huang
cs.AI

Abstract

Recent LLM-based data agents aim to automate data science tasks ranging from data analysis to deep learning. However, the open-ended nature of real-world data science problems, which often span multiple taxonomies and lack standard answers, poses a significant challenge for evaluation. To address this, we introduce DSAEval, a benchmark comprising 641 real-world data science problems grounded in 285 diverse datasets, covering both structured and unstructured data (e.g., vision and text). DSAEval incorporates three distinctive features: (1) Multimodal Environment Perception, which enables agents to interpret observations from multiple modalities including text and vision; (2) Multi-Query Interactions, which mirror the iterative and cumulative nature of real-world data science projects; and (3) Multi-Dimensional Evaluation, which provides a holistic assessment across reasoning, code, and results. We systematically evaluate 11 advanced agentic LLMs using DSAEval. Our results show that Claude-Sonnet-4.5 achieves the strongest overall performance, GPT-5.2 is the most efficient, and MiMo-V2-Flash is the most cost-effective. We further demonstrate that multimodal perception consistently improves performance on vision-related tasks, with gains ranging from 2.04% to 11.30%. Overall, while current data science agents perform well on structured data and routine data analysis workflows, substantial challenges remain in unstructured domains. Finally, we offer critical insights and outline future research directions to advance the development of data science agents.