Exploring Autonomous Agentic Data Engineering for Model Specialization
May 28, 2026
Authors: Yujie Luo, Xiangyuan Ru, Jingsheng Zheng, Jingjing Wang, Yuqi Zhu, Jintian Zhang, Runnan Fang, Kewei Xu, Ye Liu, Zheng Wei, Jiang Bian, Zang Li, Shumin Deng
cs.AI
Abstract
Large Language Models (LLMs) perform strongly on general tasks but often struggle to adapt to specialized domains without high-quality domain-specific data. Existing LLM-based data curation methods rely primarily on human-designed workflows, leaving unexamined whether LLMs can autonomously execute an end-to-end data engineering pipeline for model specialization. We formalize Autonomous Agentic Data Engineering, a novel task designed to evaluate LLMs as autonomous data engineers that drive model specialization through end-to-end data curation. We frame data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post-training performance improvement. Experiments show that autonomous LLM data engineers yield substantial gains: GPT-5.2 constructs a training curriculum that improves a student model by 57.29%, entirely through iterative, agent-driven data adaptation. By illuminating both the potential and the bottlenecks, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent-driven model specialization. Code will be released at https://github.com/zjunlp/DataAgent.
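The abstract describes agents that plan, generate, and iteratively optimize training data, guided by post-training performance. As a rough illustration only, that feedback loop might be sketched as below. This is not the authors' implementation: the function names (`agentic_data_loop`, `propose`, `train_and_eval`), the greedy keep-if-better rule, and the toy stand-ins are all assumptions made for this example. In a real setting `propose` would call an LLM agent and `train_and_eval` would fine-tune and benchmark a student model.

```python
from typing import Callable, List, Tuple

Dataset = List[str]  # hypothetical: a dataset is just a list of text examples


def agentic_data_loop(
    propose: Callable[[Dataset, float], Dataset],
    train_and_eval: Callable[[Dataset], float],
    seed_data: Dataset,
    rounds: int = 3,
) -> Tuple[Dataset, float]:
    """Greedy loop: keep a proposed dataset only if the post-training score improves.

    The acceptance rule is an assumption for illustration, not the paper's method.
    """
    best_data = list(seed_data)
    best_score = train_and_eval(best_data)
    for _ in range(rounds):
        # The "agent" sees the current data and its score, then proposes a revision.
        candidate = propose(best_data, best_score)
        score = train_and_eval(candidate)
        if score > best_score:
            best_data, best_score = candidate, score
    return best_data, best_score


def toy_propose(data: Dataset, score: float) -> Dataset:
    """Toy stand-in for an LLM agent: append one new synthetic example."""
    return data + [f"example-{len(data)}"]


def toy_train_and_eval(data: Dataset) -> float:
    """Toy stand-in for training plus evaluation: score is the count of unique examples."""
    return float(len(set(data)))


if __name__ == "__main__":
    data, score = agentic_data_loop(
        toy_propose, toy_train_and_eval, ["example-0"], rounds=3
    )
    print(len(data), score)
```

The point of the sketch is the direction of the signal: the dataset is the thing being optimized, and the only feedback is the measured score after training on it.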