Autodata: An Agentic Data Scientist for Generating High-Quality Synthetic Data

Abstract

We introduce Autodata, a general method that enables AI agents to act as data scientists who build high-quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent so that it learns to create even stronger data. We describe the overall formulation and a specific practical implementation, Agentic Self-Instruct. We conduct experiments on computer science research tasks, legal reasoning tasks, and reasoning with mathematical objects, where we obtain improved results compared to classical synthetic dataset creation methods. Further, meta-optimizing the data scientist agent itself delivers an even larger performance uplift. Agentic data creation provides a way to convert increased inference compute into higher-quality model training. Overall, we believe this direction has the potential to change the way we build AI data.