
DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

December 18, 2025
Authors: Hao Liang, Xiaochen Ma, Zhou Liu, Zhen Hao Wong, Zhengyang Zhao, Zimo Meng, Runming He, Chengyu Shen, Qifeng Cai, Zhaoyang Han, Meiyi Qiang, Yalin Feng, Tianyi Bai, Zewei Pan, Ziyi Guo, Yizhen Jiang, Jingwen Deng, Qijie You, Peichao Lai, Tianyu Guo, Chi Hsu Tsai, Hengyi Feng, Rui Hu, Wenkai Yu, Junbo Niu, Bohan Zeng, Ruichuan An, Lu Ma, Jihao Huang, Yaowei Zheng, Conghui He, Linpeng Tang, Bin Cui, Weinan E, Wentao Zhang
cs.AI

Abstract

The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3% execution accuracy in Text-to-SQL over SynSQL, +7% average improvements on code benchmarks, and 1–3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.
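
To make the operator/pipeline abstraction described above concrete, the following is a minimal, self-contained sketch of what a PyTorch-style pipeline construction API could look like. The class and method names here (Operator, Pipeline, LengthFilter, Dedup) are hypothetical illustrations for this summary, not the actual DataFlow API.

```python
# Hypothetical sketch of a PyTorch-style operator/pipeline abstraction;
# class names and signatures are illustrative, not the real DataFlow API.
from abc import ABC, abstractmethod
from typing import Iterable


class Operator(ABC):
    """A reusable, composable data transformation (analogous to an nn.Module)."""

    @abstractmethod
    def forward(self, records: Iterable[dict]) -> Iterable[dict]:
        ...

    def __call__(self, records: Iterable[dict]) -> Iterable[dict]:
        return self.forward(records)


class LengthFilter(Operator):
    """Drop records whose text falls outside a character-length window."""

    def __init__(self, min_chars: int = 10, max_chars: int = 4000):
        self.min_chars = min_chars
        self.max_chars = max_chars

    def forward(self, records):
        return [r for r in records
                if self.min_chars <= len(r["text"]) <= self.max_chars]


class Dedup(Operator):
    """Remove exact-duplicate texts while preserving order."""

    def forward(self, records):
        seen, out = set(), []
        for r in records:
            if r["text"] not in seen:
                seen.add(r["text"])
                out.append(r)
        return out


class Pipeline(Operator):
    """PyTorch-style container that chains operators sequentially."""

    def __init__(self, *operators: Operator):
        self.operators = operators

    def forward(self, records):
        for op in self.operators:
            records = op(records)  # each stage can be inspected and debugged in isolation
        return records


if __name__ == "__main__":
    pipeline = Pipeline(LengthFilter(min_chars=5), Dedup())
    data = [{"text": "hi"}, {"text": "a longer sample"}, {"text": "a longer sample"}]
    print(pipeline(data))  # -> [{'text': 'a longer sample'}]
```

Under this kind of design, model-in-the-loop operators (e.g., LLM-based rewriting or scoring) would simply be additional Operator subclasses, which is what allows pipelines to stay modular, reusable, and debuggable stage by stage.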