Synthetic Sandbox for Training Machine Learning Engineering Agents
April 6, 2026
Authors: Yuhang Zhou, Lizhu Zhang, Yifan Wu, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao, Hong Yan
cs.AI
Abstract
As large language model (LLM) agents advance beyond software engineering (SWE) tasks toward machine learning engineering (MLE), verifying agent behavior becomes orders of magnitude more expensive: while SWE tasks can be verified via fast-executing unit tests, MLE verification requires running a full ML pipeline (data preprocessing, model training, and metric evaluation) on large datasets at each rollout step, rendering trajectory-wise on-policy reinforcement learning (RL) prohibitively slow. Existing approaches retreat to supervised fine-tuning (SFT) or offline proxy rewards, sacrificing the exploration and generalization benefits of on-policy RL. We observe that sandbox data size is the primary source of this bottleneck. Based on this insight, we introduce SandMLE, a multi-agent framework that generates diverse, verifiable synthetic MLE environments from a small number of seed tasks, preserving the structural and technical complexity of real-world problems while constraining datasets to micro scale (each task is paired with only 50-200 training samples). Extensive experiments show that SandMLE cuts execution time by more than 13x, enabling large-scale, on-policy, trajectory-wise RL for the first time in the MLE domain. On MLE-bench-lite, SandMLE yields significant gains over SFT baselines across Qwen3-8B, 14B, and 30B-A3B, with relative medal-rate improvements of 20.3% to 66.9%. Furthermore, the trained policy generalizes to unseen agentic scaffolds, improving HumanRank score on MLE-Dojo by up to 32.4%.
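The central claim, that verification can remain a full ML pipeline (preprocessing, training, metric evaluation) while becoming fast once the dataset is micro-scale, can be illustrated with a minimal sketch. Everything below is hypothetical: the task-generation recipe, the function names (`make_micro_task`, `run_pipeline`), and the nearest-centroid model are stand-ins for whatever the SandMLE framework and the agent actually produce, not the paper's implementation.

```python
import numpy as np

def make_micro_task(n_samples=200, n_features=8, seed=0):
    """Generate a micro-scale binary-classification task in the spirit of
    the abstract's 50-200 training samples per task. The generation recipe
    here is illustrative only."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n_samples)
    # Class-dependent means keep the task learnable but non-trivial.
    X = rng.normal(loc=y[:, None] * 0.8, scale=1.0, size=(n_samples, n_features))
    return X, y

def run_pipeline(X, y):
    """Run a complete (but tiny) ML pipeline and return a scalar metric,
    which plays the role of the verifiable reward for a rollout."""
    # Preprocessing: standardize features.
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
    # Train/test split.
    n_train = int(0.8 * len(y))
    Xtr, ytr, Xte, yte = X[:n_train], y[:n_train], X[n_train:], y[n_train:]
    # "Training": a nearest-centroid classifier stands in for the agent's model.
    centroids = np.stack([Xtr[ytr == c].mean(axis=0) for c in (0, 1)])
    # Evaluation: accuracy on the held-out split is the reward signal.
    dists = np.linalg.norm(Xte[:, None, :] - centroids[None, :, :], axis=2)
    preds = dists.argmin(axis=1)
    return float((preds == yte).mean())

X, y = make_micro_task()
reward = run_pipeline(X, y)
```

Because every stage touches at most a few hundred samples, the whole verification runs in milliseconds, which is what makes per-rollout rewards, and hence on-policy trajectory-wise RL, affordable at scale.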