

DSGym: A Holistic Framework for Evaluating and Training Data Science Agents

January 22, 2026
作者: Fan Nie, Junlin Wang, Harper Hua, Federico Bianchi, Yongchan Kwon, Zhenting Qi, Owen Queen, Shang Zhu, James Zou
cs.AI

Abstract

Data science agents promise to accelerate discovery and insight generation by turning data into executable analyses and findings. Yet existing data science benchmarks fall short due to fragmented evaluation interfaces that make cross-benchmark comparison difficult, narrow task coverage, and a lack of rigorous data grounding. In particular, we show that a substantial portion of tasks in current benchmarks can be solved without using the actual data. To address these limitations, we introduce DSGym, a standardized framework for evaluating and training data science agents in self-contained execution environments. Unlike static benchmarks, DSGym provides a modular architecture that makes it easy to add tasks, agent scaffolds, and tools, positioning it as a live, extensible testbed. We curate DSGym-Tasks, a holistic task suite that standardizes and refines existing benchmarks via quality and shortcut-solvability filtering. We further expand coverage with (1) DSBio: expert-derived bioinformatics tasks grounded in literature, and (2) DSPredict: challenging prediction tasks spanning domains such as computer vision, molecular prediction, and single-cell perturbation. Beyond evaluation, DSGym enables agent training via an execution-verified data synthesis pipeline. As a case study, we build a 2,000-example training set and train a 4B model in DSGym that outperforms GPT-4o on standardized analysis benchmarks. Overall, DSGym enables rigorous end-to-end measurement of whether agents can plan, implement, and validate data analyses in realistic scientific contexts.
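To make the "modular architecture" claim concrete, below is a minimal sketch, in Python, of what a task/agent registration interface of this kind could look like. This is not the actual DSGym API; all names (`Task`, `register_task`, `evaluate`, and the field names) are illustrative assumptions based only on the abstract's description of pluggable tasks, agent scaffolds, and sandboxed execution.

```python
# Hypothetical sketch of a modular evaluation testbed in the spirit of the
# abstract. None of these names are confirmed by the paper.
from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class Task:
    """A self-contained data science task: sandboxed data, a prompt, a grader."""
    name: str
    data_dir: str                    # data the agent must actually load and use
    prompt: str                      # natural-language task description
    grade: Callable[[str], float]    # scores the agent's final answer in [0, 1]

TASK_REGISTRY: Dict[str, Task] = {}

def register_task(task: Task) -> None:
    """New benchmarks plug in as Task objects without changing the interface."""
    TASK_REGISTRY[task.name] = task

def evaluate(agent: Callable[[Task], str]) -> Dict[str, float]:
    """Run an agent scaffold over every registered task and collect scores."""
    return {name: task.grade(agent(task)) for name, task in TASK_REGISTRY.items()}
```

A shortcut-solvability filter of the kind the abstract describes could then be expressed against the same interface: run an agent on each task with `data_dir` withheld, and discard tasks it still solves, since those do not test genuine data grounding.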