Exploring Autonomous Agentic Data Engineering for Model Specialization

Abstract

Large Language Models (LLMs) have demonstrated strong performance on general tasks, but they often struggle to adapt to specialized domains without high-quality domain-specific data. Existing LLM-based data curation methods rely primarily on human-designed workflows, leaving it unexamined whether LLMs can autonomously execute an end-to-end data engineering pipeline for model specialization. We formalize Autonomous Agentic Data Engineering, a novel task designed to evaluate LLMs as autonomous data engineers that drive model specialization through end-to-end data curation. We frame data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post-training performance improvement. Experiments show that autonomous LLM data engineers yield substantial gains: for example, GPT-5.2 constructs a training curriculum that improves a student model's performance by 57.29%, entirely through iterative, agent-driven data adaptation. By illuminating both the potential and the bottlenecks, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent-driven model specialization. Code will be released at https://github.com/zjunlp/DataAgent.
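
The feedback loop described above (propose training data, retrain the student, keep what improves post-training performance) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `run_data_engineering_loop`, `make_toy_agent`, and `toy_train_and_eval` are hypothetical, the "agent" is a seeded random generator rather than an LLM, and the "student score" is a toy coverage metric standing in for real training and evaluation.

```python
import random
from typing import Callable, List, Tuple

Dataset = List[Tuple[int, int]]  # toy (input, label) pairs


def run_data_engineering_loop(
    propose: Callable[[Dataset, float], Dataset],
    train_and_eval: Callable[[Dataset], float],
    rounds: int = 5,
) -> Tuple[Dataset, float, List[float]]:
    """Keep a candidate dataset only if the retrained student scores higher."""
    best_data: Dataset = []
    best_score = train_and_eval(best_data)  # baseline before any curation
    history = [best_score]
    for _ in range(rounds):
        # The agent sees the current data and score, then proposes a revision.
        candidate = propose(best_data, best_score)
        score = train_and_eval(candidate)
        if score > best_score:  # post-training improvement guides acceptance
            best_data, best_score = candidate, score
        history.append(best_score)
    return best_data, best_score, history


def make_toy_agent(seed: int = 0) -> Callable[[Dataset, float], Dataset]:
    """Hypothetical stand-in for an LLM agent: appends random examples."""
    rng = random.Random(seed)

    def propose(current: Dataset, score: float) -> Dataset:
        new_items = [(x, x % 2) for x in (rng.randrange(40) for _ in range(4))]
        return current + new_items

    return propose


def toy_train_and_eval(data: Dataset) -> float:
    """Hypothetical stand-in for training + evaluation: domain coverage."""
    target = set(range(20))  # inputs that belong to the "target domain"
    covered = {x for x, _ in data if x in target}
    return len(covered) / len(target)


if __name__ == "__main__":
    data, score, history = run_data_engineering_loop(
        make_toy_agent(seed=0), toy_train_and_eval, rounds=6
    )
    print(f"final score: {score:.2f}, examples kept: {len(data)}")
    print("score per round:", history)
```

Accepting a candidate only when the measured score rises is a simple greedy choice made for this sketch. A real agent could instead plan several revisions ahead or reason over error analyses between rounds.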