DeepAnalyze: Agentic Large Language Models for Autonomous Data Science
October 19, 2025
Authors: Shaolei Zhang, Ju Fan, Meihao Fan, Guoliang Li, Xiaoyong Du
cs.AI
Abstract
Autonomous data science, from raw data sources to analyst-grade deep research
reports, has been a long-standing challenge, and is now becoming feasible with
the emergence of powerful large language models (LLMs). Recent workflow-based
data agents have shown promising results on specific data tasks but remain
fundamentally limited in achieving fully autonomous data science due to their
reliance on predefined workflows. In this paper, we introduce DeepAnalyze-8B,
the first agentic LLM designed for autonomous data science, capable of
automatically completing the end-to-end pipeline from data sources to
analyst-grade deep research reports. To tackle high-complexity data science
tasks, we propose a curriculum-based agentic training paradigm that emulates
the learning trajectory of human data scientists, enabling LLMs to
progressively acquire and integrate multiple capabilities in real-world
environments. We also introduce a data-grounded trajectory synthesis framework
that constructs high-quality training data. Through agentic training,
DeepAnalyze learns to perform a broad spectrum of data tasks, ranging from data
question answering and specialized analytical tasks to open-ended data
research. Experiments demonstrate that, with only 8B parameters, DeepAnalyze
outperforms previous workflow-based agents built on the most advanced proprietary
LLMs. The model, code, and training data of DeepAnalyze are open-sourced,
paving the way toward autonomous data science.