AI Scientist via Synthetic Task Scaling
March 17, 2026
Authors: Ziyang Cai, Harkirat Behl
cs.AI
Abstract
With the advent of AI agents, automated scientific discovery has become a tenable goal. Many recent works scaffold agentic systems that can perform machine learning research, but they do not offer a principled way to train such agents, and current LLMs often generate plausible-looking yet ineffective ideas. To make progress on training agents that learn from doing, we present a novel synthetic-environment generation pipeline targeting machine learning agents. The pipeline automatically synthesizes machine learning challenges compatible with the SWE-agent framework, covering three stages: topic sampling, dataset proposal, and code generation. The resulting synthetic tasks are 1) grounded in real machine learning datasets, with each proposed dataset verified against the Huggingface API, and 2) refined for higher quality through a self-debugging loop. To validate the effectiveness of the synthetic tasks, we evaluate on MLGym, a benchmark of machine learning tasks: we sample trajectories from a teacher model (GPT-5) on the synthetic tasks, then use those trajectories to train student models (Qwen3-4B and Qwen3-8B). The students trained on our synthetic tasks achieve improved performance on MLGym, raising the AUP metric by 9% for Qwen3-4B and 12% for Qwen3-8B.
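The three-stage pipeline with its verification and self-debugging loop could be sketched as follows. This is a minimal illustrative sketch only: all function names, the topic/dataset catalog, and the retry logic are assumptions, not the authors' implementation; the real pipeline drives an LLM at each stage and checks proposals against the live Huggingface API rather than a local set of known IDs.

```python
import random

# Hypothetical sketch of the synthetic-task generation pipeline:
# topic sampling -> dataset proposal -> verification, wrapped in a
# retry loop standing in for the paper's self-debugging loop.

TOPICS = ["text classification", "tabular regression", "image classification"]

def sample_topic(rng):
    # Stage 1: topic sampling -- pick a machine learning subject area.
    return rng.choice(TOPICS)

def propose_dataset(topic):
    # Stage 2: dataset proposal -- an LLM would propose a real dataset
    # for the topic; a fixed lookup is a stand-in here.
    catalog = {
        "text classification": "imdb",
        "tabular regression": "wine_quality",
        "image classification": "cifar10",
    }
    return catalog[topic]

def verify_dataset(dataset_id, known_ids):
    # Grounding step: the real pipeline queries the Huggingface API to
    # confirm the dataset exists; here we check a known-id set instead.
    return dataset_id in known_ids

def generate_task(rng, known_ids, max_retries=3):
    # Self-debugging loop (simplified): regenerate until the proposed
    # dataset verifies; the real loop also regenerates and debugs the
    # task's starter code (stage 3) before emitting a challenge.
    for _ in range(max_retries):
        topic = sample_topic(rng)
        dataset = propose_dataset(topic)
        if verify_dataset(dataset, known_ids):
            return {"topic": topic, "dataset": dataset}
    return None

rng = random.Random(0)
task = generate_task(rng, known_ids={"imdb", "wine_quality", "cifar10"})
print(task)
```

In the paper's setting, each verified task is then packaged as a SWE-agent-compatible environment, teacher (GPT-5) trajectories are sampled on it, and those trajectories supervise the student models.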