Exploring Autonomous Agentic Data Engineering for Model Specialization

Abstract

Large Language Models (LLMs) perform strongly on general tasks but often struggle to adapt to specialized domains without high-quality domain-specific data. Existing LLM-based data curation methods rely primarily on human-designed workflows, so it remains unexamined whether LLMs can autonomously execute an end-to-end data engineering pipeline for model specialization. We formalize Autonomous Agentic Data Engineering, a novel task that evaluates LLMs as autonomous data engineers driving model specialization through end-to-end data curation. We treat data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post-training performance improvement. Experiments show that autonomous LLM data engineers yield substantial gains: GPT-5.2 constructs a training curriculum that improves a student model by 57.29%, entirely through iterative, agent-driven data adaptation. By illuminating both the potential and the bottlenecks, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent-driven model specialization. Code will be released at https://github.com/zjunlp/DataAgent.
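The iterative loop described in the abstract (propose data, post-train, measure, keep what helps) can be illustrated with a minimal toy sketch. This is not the paper's method or code: the "student model" here is a one-parameter least-squares fit, the "agent" is a random generator of noisy examples, and every name (`train_and_evaluate`, `propose_batch`, `data_engineering_loop`) is hypothetical and chosen only for illustration.

```python
import random
from typing import List, Tuple

Example = Tuple[float, float]  # (input, target) pair for a toy regression task


def train_and_evaluate(train: List[Example], eval_set: List[Example]) -> float:
    """Toy stand-in for post-training plus evaluation (an assumption).

    Fits y = w * x by least squares on `train` and returns the negative
    mean squared error on `eval_set`, so that higher is better.
    """
    num = sum(x * y for x, y in train)
    den = sum(x * x for x, _ in train)
    w = num / den if den > 0 else 0.0  # untrained model predicts 0
    mse = sum((w * x - y) ** 2 for x, y in eval_set) / len(eval_set)
    return -mse


def propose_batch(rng: random.Random, size: int, noise: float) -> List[Example]:
    """Hypothetical agent step: generate noisy candidate training examples."""
    batch = []
    for _ in range(size):
        x = rng.uniform(0.5, 2.0)
        batch.append((x, 2.0 * x + rng.gauss(0.0, noise)))
    return batch


def data_engineering_loop(
    eval_set: List[Example], rounds: int = 5, batch_size: int = 4, seed: int = 0
) -> Tuple[List[Example], List[float]]:
    """Keep a proposed batch only if it improves the post-training score."""
    rng = random.Random(seed)
    data: List[Example] = []
    best = train_and_evaluate(data, eval_set)
    history = [best]
    for _ in range(rounds):
        candidate = data + propose_batch(rng, batch_size, noise=0.5)
        score = train_and_evaluate(candidate, eval_set)
        if score > best:  # performance-guided acceptance
            data, best = candidate, score
        history.append(best)
    return data, history


if __name__ == "__main__":
    eval_set = [(float(x), 2.0 * x) for x in range(1, 6)]
    data, history = data_engineering_loop(eval_set)
    print(f"kept {len(data)} examples; score {history[0]:.3f} -> {history[-1]:.3f}")
```

The greedy accept-if-better rule is only one possible design; it makes the recorded score non-decreasing by construction, which keeps the sketch easy to reason about but says nothing about how an actual LLM agent would plan or revise its data.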