Can Generalist Agents Automate Data Curation?

Abstract

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents match strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local variants of a policy rather than explore new policy families, even when given strategy guides and paper references. Scaffolds that require each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes, without human design input, a data-selection policy that outperforms strong published baselines while using one-tenth of their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.
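
To make the loop concrete, the following is a minimal sketch of a propose/evaluate/revise cycle with a scaffold check, under stated assumptions. It is not the Curation-Bench API: `Proposal`, `scaffold_ok`, `run_loop`, and the toy `pool`, `toy_propose`, and `toy_evaluate` are hypothetical names introduced only for illustration, and the "evaluation" is a stand-in for the fixed training/evaluation pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Proposal:
    """One iteration's data policy (hypothetical structure, not from the paper)."""
    policy: Callable[[dict], bool]   # keep/drop predicate over a single example
    cited_method: Optional[str]      # prior method the proposal claims to adapt
    adaptation: Optional[str]        # how that method was changed for this task


def scaffold_ok(p: Proposal) -> bool:
    """Scaffold rule (assumed): a proposal must cite and adapt a prior method."""
    return bool(p.cited_method) and bool(p.adaptation)


def run_loop(pool, propose, evaluate, budget, iterations=10, scaffolded=True):
    """Run a fixed number of propose -> select -> evaluate -> record steps."""
    history = []   # (step, score or None, note) fed back to the proposer
    best = None    # (score, proposal) of the best accepted iteration
    for step in range(iterations):
        p = propose(history)
        if scaffolded and not scaffold_ok(p):
            history.append((step, None, "rejected: no cited/adapted method"))
            continue
        subset = [ex for ex in pool if p.policy(ex)][:budget]  # enforce data budget
        score = evaluate(subset)  # stand-in for the fixed train/eval pipeline
        history.append((step, score, p.cited_method))
        if best is None or score > best[0]:
            best = (score, p)
    return best, history


# Toy usage: examples carry an integer "quality"; evaluation is mean quality.
pool = [{"id": i, "quality": i} for i in range(10)]


def toy_evaluate(subset):
    return sum(ex["quality"] for ex in subset) / len(subset) if subset else 0.0


def toy_propose(history):
    threshold = 2 * len(history)  # raise the filter threshold each iteration
    return Proposal(
        policy=lambda ex, t=threshold: ex["quality"] >= t,
        cited_method="hypothetical-quality-filter",
        adaptation=f"threshold={threshold}",
    )


best, history = run_loop(pool, toy_propose, toy_evaluate, budget=5, iterations=3)
print(best[0], best[1].adaptation)
```

In this sketch the scaffold is only a gate on proposal metadata; how the actual scaffolds verify that a cited method was instantiated and adapted is not specified in the abstract.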