Scaling Generalist Data-Analytic Agents

Abstract

Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle to handle the diverse-format, large-scale data files and the long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe for building generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents: insufficient data resources, improper training strategy, and unstable code-based multi-turn rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective combining both SFT and RL losses; 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DataMind, we curate DataMind-12K, a high-quality trajectory set spanning diverse domains, task categories, and data file formats for data-analytic tasks. Trained on DataMind-12K, our DataMind-14B achieves state-of-the-art performance with an average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines DeepSeek-V3.1 and GPT-5. Our DataMind-7B also performs best among all open-source models with a score of 68.10%. We also incorporate empirical insights gained from our exploratory trials into the analysis experiments, aiming to provide the community with actionable insights about agentic training. We will release DataMind-12K, DataMind-7B, and DataMind-14B for the community's future research.
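The abstract does not say how the training objective in item 3 is adjusted, so the following is only an illustrative sketch under stated assumptions: the combined loss is taken to be the RL loss plus a weighted SFT loss, and the weight is assumed to decay linearly over training. The names `sft_weight` and `combined_loss`, the additive form, and the linear schedule are all hypothetical and are not taken from DataMind.

```python
def sft_weight(step: int, total_steps: int,
               start: float = 1.0, end: float = 0.0) -> float:
    """Hypothetical schedule: linearly move the SFT weight from `start` to `end`."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp training progress to [0, 1] so out-of-range steps stay well defined.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def combined_loss(sft_loss: float, rl_loss: float,
                  step: int, total_steps: int) -> float:
    """Assumed form of the objective: RL loss plus a step-dependent SFT term."""
    gamma = sft_weight(step, total_steps)
    return rl_loss + gamma * sft_loss
```

Under these assumptions the SFT term dominates early in training and vanishes at the end, leaving the pure RL loss; any other schedule could be substituted in `sft_weight` without changing `combined_loss`.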
