
DeepAnalyze: Agentic Large Language Models for Autonomous Data Science

October 19, 2025
Authors: Shaolei Zhang, Ju Fan, Meihao Fan, Guoliang Li, Xiaoyong Du
cs.AI

Abstract

Autonomous data science, from raw data sources to analyst-grade deep research reports, has been a long-standing challenge, and is now becoming feasible with the emergence of powerful large language models (LLMs). Recent workflow-based data agents have shown promising results on specific data tasks but remain fundamentally limited in achieving fully autonomous data science due to their reliance on predefined workflows. In this paper, we introduce DeepAnalyze-8B, the first agentic LLM designed for autonomous data science, capable of automatically completing the end-to-end pipeline from data sources to analyst-grade deep research reports. To tackle high-complexity data science tasks, we propose a curriculum-based agentic training paradigm that emulates the learning trajectory of human data scientists, enabling LLMs to progressively acquire and integrate multiple capabilities in real-world environments. We also introduce a data-grounded trajectory synthesis framework that constructs high-quality training data. Through agentic training, DeepAnalyze learns to perform a broad spectrum of data tasks, ranging from data question answering and specialized analytical tasks to open-ended data research. Experiments demonstrate that, with only 8B parameters, DeepAnalyze outperforms previous workflow-based agents built on the most advanced proprietary LLMs. The model, code, and training data of DeepAnalyze are open-sourced, paving the way toward autonomous data science.