AI Scientist via Synthetic Task Scaling
March 17, 2026
Authors: Ziyang Cai, Harkirat Behl
cs.AI
Abstract
With the advent of AI agents, automatic scientific discovery has become a tenable goal. Many recent works scaffold agentic systems that can perform machine learning research, but they do not offer a principled way to train such agents -- current LLMs often generate plausible-looking but ineffective ideas. To make progress on training agents that learn from doing, we present a novel synthetic environment generation pipeline targeting machine learning agents. The pipeline automatically synthesizes machine learning challenges compatible with the SWE-agent framework, covering three stages: topic sampling, dataset proposal, and code generation. The resulting synthetic tasks are 1) grounded in real machine learning datasets, since every proposed dataset is verified against the Huggingface API, and 2) checked for quality with a self-debugging loop. To validate the effectiveness of the synthetic tasks, we evaluate on MLGym, a benchmark for machine learning tasks. From the synthetic tasks, we sample trajectories with a teacher model (GPT-5) and use them to train student models (Qwen3-4B and Qwen3-8B). The students trained on our synthetic tasks achieve improved performance on MLGym, raising the AUP metric by 9% for Qwen3-4B and 12% for Qwen3-8B.
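The abstract's self-debugging loop can be pictured as a retry cycle: generate candidate task code, execute it, and feed any traceback back into the generator until the code runs cleanly. The sketch below is a minimal, hypothetical illustration of that idea, not the paper's implementation; the `generate` callable stands in for the LLM code-generation step, and all names here are assumptions.

```python
import os
import subprocess
import sys
import tempfile


def self_debug_loop(generate, max_attempts=3):
    """Regenerate candidate task code until it executes without error.

    `generate` is a stand-in for the LLM code-generation step: it takes
    the stderr of the previous failed attempt (or None on the first
    attempt) and returns a new candidate Python script as a string.
    Returns (code, attempts) on success, or (None, max_attempts) if no
    candidate ran cleanly.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = generate(feedback)
        # Write the candidate to a temp file and run it in a subprocess.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True, text=True, timeout=30,
            )
        finally:
            os.unlink(path)
        if result.returncode == 0:
            return code, attempt  # candidate ran cleanly; keep the task
        feedback = result.stderr  # pass the traceback back to the generator
    return None, max_attempts
```

In the real pipeline the dataset-proposal stage would additionally be grounded against the Huggingface API before any code is generated; that check is omitted here to keep the sketch self-contained.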