Can Generalist Agents Automate Data Curation?

Abstract

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds that require each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes, without human design input, a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.