Can Generalist Agents Automate Data Curation?

Abstract

Curating training data is among the most consequential yet labor-intensive parts of modern AI development: practitioners iteratively propose, implement, evaluate, and revise data policies against noisy benchmark feedback. We ask whether generalist coding agents can automate this data-curation loop. We introduce *Curation-Bench*, an agent-centric benchmark that fixes the model, training recipe, and evaluation suite while giving agents command-line access to inspect data, implement policies, submit them to a fixed training/evaluation pipeline, and revise. In a vision-language instruction-tuning instantiation, out-of-the-box agents reach strong published data-selection baselines within ten iterations. However, trajectory analysis reveals a persistent *execution-research gap*: agents mainly tune local policy variants rather than explore new policy families, even when given strategy guides and paper references. Scaffolds that require each iteration to cite, instantiate, and adapt a prior method shift agents toward method-guided exploration. Without human design input, the scaffolded agent autonomously composes a data-selection policy that outperforms strong published baselines at one-tenth of their data budget. Overall, current agents can run the curation loop, but reliable data research requires scaffolded method adaptation, not open-ended prompting alone. Code and benchmark are open-sourced.