

DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

December 18, 2025
Authors: Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, Meiyi Qiang, Yalin Feng, Tianyi Bai, Zewei Pan, Ziyi Guo, Yizhen Jiang, Jingwen Deng, Qijie You, Peichao Lai, Tianyu Guo, Chi Hsu Tsai, Hengyi Feng, Rui Hu, Wenkai Yu, Junbo Niu, Bohan Zeng, Ruichuan An, Lu Ma, Jihao Huang, Yaowei Zheng, Conghui He, Linpeng Tang, Bin Cui, Weinan E, Wentao Zhang
cs.AI

Abstract

The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3% execution accuracy in Text-to-SQL over SynSQL, +7% average improvements on code benchmarks, and 1-3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.
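To make the "PyTorch-style pipeline construction" idea concrete, the sketch below shows how composable data-preparation operators might be chained in that style. This is a minimal illustration only: the class and method names (`Operator`, `Pipeline`, `Deduplicate`, `MinLengthFilter`) are invented for this example and are not DataFlow's actual API.

```python
# Hypothetical sketch of composable data-preparation operators in a
# PyTorch-style API (cf. nn.Module / nn.Sequential). Names are
# illustrative assumptions, not DataFlow's real interface.

class Operator:
    """A reusable, composable transformation over a list of records."""
    def __call__(self, records):
        raise NotImplementedError

class Deduplicate(Operator):
    """Drop records whose text has already been seen."""
    def __call__(self, records):
        seen, out = set(), []
        for r in records:
            if r["text"] not in seen:
                seen.add(r["text"])
                out.append(r)
        return out

class MinLengthFilter(Operator):
    """Keep only records with at least `min_chars` characters."""
    def __init__(self, min_chars):
        self.min_chars = min_chars
    def __call__(self, records):
        return [r for r in records if len(r["text"]) >= self.min_chars]

class Pipeline(Operator):
    """Sequential composition of operators, analogous to nn.Sequential."""
    def __init__(self, *ops):
        self.ops = ops
    def __call__(self, records):
        for op in self.ops:  # each stage can be inspected and debugged alone
            records = op(records)
        return records

pipe = Pipeline(Deduplicate(), MinLengthFilter(min_chars=10))
data = [{"text": "short"},
        {"text": "a sufficiently long sample"},
        {"text": "a sufficiently long sample"}]
result = pipe(data)  # one record survives dedup + length filtering
```

Because every stage shares the same call signature, operators can be reordered, unit-tested in isolation, or swapped out, which is the debuggability property the abstract attributes to this style of API.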