Scaling Generalist Data-Analytic Agents
September 29, 2025
Authors: Shuofei Qiao, Yanqiu Zhao, Zhisong Qiu, Xiaobin Wang, Jintian Zhang, Zhao Bin, Ningyu Zhang, Yong Jiang, Pengjun Xie, Fei Huang, Huajun Chen
cs.AI
Abstract
Data-analytic agents are emerging as a key catalyst for automated scientific
discovery and for the vision of Innovating AI. Current approaches, however,
rely heavily on prompt engineering over proprietary models, while open-source
models struggle with diverse-format, large-scale data files and the
long-horizon, multi-step reasoning that real-world analytics demands. This
paper introduces DataMind, a scalable data synthesis and agent training recipe
designed to build generalist data-analytic agents. DataMind tackles three key
challenges in building open-source data-analytic agents: insufficient
data resources, improper training strategies, and unstable code-based multi-turn
rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a
recursive easy-to-hard task composition mechanism to increase the diversity and
difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling
strategy followed by model-based and rule-based filtering; 3) a dynamically
adjustable training objective combining both SFT and RL losses; 4) a
memory-frugal and stable code-based multi-turn rollout framework. Built on
DataMind, we curate DataMind-12K, a high-quality trajectory set spanning
diverse domains, task categories, and data file formats for data-analytic
tasks. Trained on DataMind-12K, our DataMind-14B achieves state-of-the-art
performance, with an average score of 71.16% across multiple data analysis
benchmarks, outperforming
the strongest proprietary baselines DeepSeek-V3.1 and GPT-5. Our DataMind-7B
also performs best among all open-source models, with a score of 68.10%. We also
incorporate empirical findings from our exploratory trials into the
analysis experiments, aiming to give the community actionable insights about
agentic training. We will release DataMind-12K, DataMind-7B, and DataMind-14B
to support the community's future research.
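
The abstract names a dynamically adjustable training objective combining SFT and RL losses but does not give its form. As a minimal sketch only, one common way to realize such an objective is a convex combination whose SFT weight is annealed over training. The linear schedule, the function names `sft_weight` and `combined_loss`, and the default weights 0.9 and 0.1 are all assumptions made for illustration, not the authors' method.

```python
def sft_weight(step: int, total_steps: int,
               start: float = 0.9, end: float = 0.1) -> float:
    """Hypothetical schedule: linearly anneal the SFT weight from `start` to `end`."""
    if total_steps <= 1:
        return end
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * frac


def combined_loss(sft_loss: float, rl_loss: float,
                  step: int, total_steps: int) -> float:
    """Assumed objective: gamma * L_SFT + (1 - gamma) * L_RL."""
    gamma = sft_weight(step, total_steps)
    return gamma * sft_loss + (1.0 - gamma) * rl_loss


if __name__ == "__main__":
    # Print how the weight and the blended loss move across 11 steps.
    for step in (0, 5, 10):
        gamma = sft_weight(step, 11)
        loss = combined_loss(2.0, 1.0, step, 11)
        print(f"step={step} gamma={gamma:.2f} loss={loss:.2f}")
```

Under this assumed schedule, early training is dominated by imitation of the curated trajectories and later training by the RL signal; the actual weighting rule used for DataMind may differ.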