Scaling Generalist Data-Analytic Agents

Abstract

Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle to handle the diverse-format, large-scale data files and the long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe for building generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents: insufficient data resources, improper training strategies, and unstable code-based multi-turn rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective combining SFT and RL losses; and 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DataMind, we curate DataMind-12K, a high-quality trajectory set for data-analytic tasks that spans diverse domains, task categories, and data file formats. Trained on DataMind-12K, DataMind-14B achieves a state-of-the-art average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines, DeepSeek-V3.1 and GPT-5. DataMind-7B also performs best among all open-source models, with a score of 68.10%. We further incorporate empirical insights from our exploratory trials into the analysis experiments, aiming to give the community actionable guidance on agentic training. We will release DataMind-12K, DataMind-7B, and DataMind-14B to the community for future research.
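As an illustration of what a dynamically adjustable objective combining SFT and RL losses could look like, the sketch below blends the two losses with a weight that changes over training. The abstract does not specify the schedule or the form of the combination, so the cosine decay, the `(1 - w)` coefficient on the RL term, and the names `sft_weight` and `combined_loss` are all assumptions made for this example, not the authors' method.

```python
import math


def sft_weight(step: int, total_steps: int,
               w_start: float = 1.0, w_end: float = 0.0) -> float:
    """Hypothetical schedule: cosine-anneal the SFT weight from w_start to w_end."""
    # Clamp training progress to [0, 1] so out-of-range steps are safe.
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    return w_end + 0.5 * (w_start - w_end) * (1.0 + math.cos(math.pi * progress))


def combined_loss(sft_loss: float, rl_loss: float,
                  step: int, total_steps: int) -> float:
    """Assumed form: a convex combination of the SFT and RL losses."""
    w = sft_weight(step, total_steps)
    return w * sft_loss + (1.0 - w) * rl_loss
```

Under these assumptions, training starts dominated by the SFT term and shifts toward the RL term as the step count approaches `total_steps`. In a real training loop the two losses would be tensors rather than floats, but the weighting logic would be the same.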