Exploring Autonomous Agentic Data Engineering for Model Specialization

Abstract

Large Language Models (LLMs) perform strongly on general tasks but often struggle to adapt to specialized domains without high-quality domain-specific data. Existing LLM-based data curation methods rely primarily on human-designed workflows, leaving it unexamined whether LLMs can autonomously execute an end-to-end data engineering pipeline for model specialization. We formalize Autonomous Agentic Data Engineering, a novel task that evaluates LLMs as autonomous data engineers driving model specialization through end-to-end data curation. We frame data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post-training performance improvement. Experiments show that autonomous LLM data engineers yield substantial gains: GPT-5.2 constructs a training curriculum that improves a student model by 57.29%, entirely through iterative, agent-driven data adaptation. By illuminating both the potential and the bottlenecks, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent-driven model specialization. Code will be released at https://github.com/zjunlp/DataAgent.
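
The plan, generate, and optimize loop described above can be pictured with a minimal sketch. This is not the released DataAgent code: the names `optimize_data`, `propose`, and `train_and_eval`, the greedy acceptance rule, and the toy scoring function are all assumptions made for illustration. A real setup would replace the toy functions with an LLM agent and an actual fine-tuning and evaluation run.

```python
from typing import Callable, List, Tuple

Dataset = List[str]


def optimize_data(
    propose: Callable[[Dataset, float], Dataset],
    train_and_eval: Callable[[Dataset], float],
    seed_data: Dataset,
    rounds: int = 5,
) -> Tuple[Dataset, float, List[float]]:
    """Greedy loop (an assumed strategy): keep a proposed dataset only if
    the post-training score of the student model improves."""
    best_data = list(seed_data)
    best_score = train_and_eval(best_data)
    history = [best_score]
    for _ in range(rounds):
        # Agent step: plan and generate a revised dataset from current feedback.
        candidate = propose(best_data, best_score)
        # Feedback step: train the student on the candidate, then evaluate it.
        score = train_and_eval(candidate)
        if score > best_score:
            best_data, best_score = candidate, score
        history.append(best_score)
    return best_data, best_score, history


# Hypothetical stand-ins so the sketch runs without an LLM or a trainer.
def toy_propose(data: Dataset, score: float) -> Dataset:
    # Pretend the agent adds one new training example per round.
    return data + [f"example-{len(data)}"]


def toy_train_and_eval(data: Dataset) -> float:
    # Pretend score that saturates as the number of distinct examples grows.
    n = len(set(data))
    return n / (n + 4)


data, score, history = optimize_data(
    toy_propose, toy_train_and_eval, ["example-0"], rounds=3
)
print(len(data), score, history)
```

Accepting a candidate only when the measured score rises is the simplest way to make post-training performance the optimization signal; other acceptance rules are equally compatible with the task description.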