Can Generalist Agents Automate Data Curation?
June 2, 2026
Authors: Feiyang Kang, Hanze Li, Adam Nguyen, Mahavir Dabas, Jiaqi W. Ma, Frederic Sala, Dawn Song, Ruoxi Jia
cs.AI
Abstract
Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds that require each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes, without human design input, a data-selection policy that outperforms strong published baselines at one-tenth of their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.
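To make the propose, implement, evaluate, revise loop concrete, here is a minimal toy sketch in Python. It is not the Curation-Bench API: the names `make_pool`, `top_fraction_policy`, `fixed_pipeline`, and `curation_loop`, the single `quality` field, and the scoring formula are all hypothetical stand-ins chosen for illustration. The only elements taken from the abstract are that the pipeline is fixed, its feedback is noisy, and the agent gets a limited number of iterations (ten here).

```python
import random
from typing import Callable, Dict, List, Tuple

Sample = Dict[str, float]
Policy = Callable[[List[Sample]], List[Sample]]


def make_pool(n: int = 1000, seed: int = 0) -> List[Sample]:
    """Toy data pool: each sample carries one hypothetical 'quality' score."""
    rng = random.Random(seed)
    return [{"quality": rng.random()} for _ in range(n)]


def top_fraction_policy(fraction: float) -> Policy:
    """One policy family: keep the highest-quality `fraction` of the pool."""
    def policy(pool: List[Sample]) -> List[Sample]:
        k = max(1, int(len(pool) * fraction))
        return sorted(pool, key=lambda s: s["quality"], reverse=True)[:k]
    return policy


def fixed_pipeline(subset: List[Sample], rng: random.Random) -> float:
    """Stand-in for the fixed train/eval pipeline (assumed): a noisy score."""
    mean_quality = sum(s["quality"] for s in subset) / len(subset)
    coverage = min(len(subset), 300) / 300  # toy reward for having enough data
    return 0.5 * mean_quality + 0.5 * coverage + rng.gauss(0.0, 0.01)


def curation_loop(
    pool: List[Sample],
    fractions: List[float],
    max_iterations: int = 10,
    seed: int = 0,
) -> List[Tuple[float, int, float]]:
    """Run up to `max_iterations` policies; return (fraction, kept, score) rows."""
    rng = random.Random(seed)
    history = []
    for fraction in fractions[:max_iterations]:
        subset = top_fraction_policy(fraction)(pool)     # implement the policy
        score = fixed_pipeline(subset, rng)              # submit to the pipeline
        history.append((fraction, len(subset), score))   # feedback for revision
    return history


if __name__ == "__main__":
    pool = make_pool()
    history = curation_loop(pool, [0.05, 0.1, 0.2, 0.3, 0.5, 1.0])
    best = max(history, key=lambda row: row[2])
    print(f"best fraction={best[0]}, kept={best[1]}, score={best[2]:.3f}")
```

Note that this loop only sweeps one parameter of a single policy family, which is the kind of local variant tuning the abstract contrasts with exploring new policy families. A scaffolded variant would instead pick each candidate by adapting a different prior method.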