Can Generalist Agents Automate Data Curation?

June 2, 2026
Authors: Feiyang Kang, Hanze Li, Adam Nguyen, Mahavir Dabas, Jiaqi W. Ma, Frederic Sala, Dawn Song, Ruoxi Jia
cs.AI

Abstract

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds requiring each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. The scaffolded agent autonomously composes -- without human design input -- a data-selection policy that outperforms strong published baselines at one-tenth their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.