Autodata: An Agentic Data Scientist for Creating High-Quality Synthetic Data

Abstract

We introduce Autodata, a general method that enables AI agents to act as data scientists who build high-quality training and evaluation data. We show how to train (meta-optimize) such a data scientist agent so that it learns to create even stronger data. We describe the overall formulation and a specific practical implementation, Agentic Self-Instruct. We conduct experiments on computer science research tasks, legal reasoning tasks, and tasks that require reasoning with mathematical objects, and obtain improved results compared to classical synthetic dataset creation methods. Further, meta-optimizing the data scientist agent itself delivers an even larger performance uplift. Agentic data creation provides a way to convert increased inference compute into higher-quality model training. Overall, we believe this direction has the potential to change the way we build AI data.