Scaling Generalist Data-Analytic Agents

Abstract

Data-analytic agents are emerging as a key catalyst for automated scientific discovery and for the vision of Innovating AI. Current approaches, however, rely heavily on prompt engineering over proprietary models, while open-source models struggle to handle the diverse-format, large-scale data files and the long-horizon, multi-step reasoning that real-world analytics demands. This paper introduces DataMind, a scalable data synthesis and agent training recipe designed to build generalist data-analytic agents. DataMind tackles three key challenges in building open-source data-analytic agents: insufficient data resources, improper training strategies, and unstable code-based multi-turn rollout. Concretely, DataMind applies 1) a fine-grained task taxonomy and a recursive easy-to-hard task composition mechanism to increase the diversity and difficulty of synthesized queries; 2) a knowledge-augmented trajectory sampling strategy followed by model-based and rule-based filtering; 3) a dynamically adjustable training objective that combines SFT and RL losses; and 4) a memory-frugal and stable code-based multi-turn rollout framework. Built on DataMind, we curate DataMind-12K, a high-quality trajectory set for data-analytic tasks that spans diverse domains, task categories, and data file formats. Trained on DataMind-12K, our DataMind-14B achieves a state-of-the-art average score of 71.16% on multiple data analysis benchmarks, outperforming the strongest proprietary baselines, DeepSeek-V3.1 and GPT-5. Our DataMind-7B also performs best among all open-source models, with a score of 68.10%. We further incorporate empirical insights gained from our exploratory trials into the analysis experiments, aiming to provide the community with actionable insights about agentic training. We will release DataMind-12K, DataMind-7B, and DataMind-14B for the community's future research.