MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering
May 12, 2025
Authors: Rushi Qiang, Yuchen Zhuang, Yinghao Li, Dingu Sagar V K, Rongzhi Zhang, Changhao Li, Ian Shu-Hei Wong, Sherry Yang, Percy Liang, Chao Zhang, Bo Dai
cs.AI
Abstract
We introduce MLE-Dojo, a Gym-style framework for the systematic reinforcement
learning, evaluation, and improvement of autonomous large language model (LLM)
agents in iterative machine learning engineering (MLE) workflows. Unlike
existing benchmarks that primarily rely on static datasets or single-attempt
evaluations, MLE-Dojo provides an interactive environment enabling agents to
iteratively experiment, debug, and refine solutions through structured feedback
loops. Built upon 200+ real-world Kaggle challenges, MLE-Dojo covers diverse,
open-ended MLE tasks carefully curated to reflect realistic engineering
scenarios such as data processing, architecture search, hyperparameter tuning,
and code debugging. Its fully executable environment supports comprehensive
agent training via both supervised fine-tuning and reinforcement learning,
facilitating iterative experimentation, realistic data sampling, and real-time
outcome verification. Extensive evaluations of eight frontier LLMs reveal that
while current models achieve meaningful iterative improvements, they still
exhibit significant limitations in autonomously generating long-horizon
solutions and efficiently resolving complex errors. Furthermore, MLE-Dojo's
flexible and extensible architecture seamlessly integrates diverse data
sources, tools, and evaluation protocols, uniquely enabling model-based agent
tuning and promoting interoperability, scalability, and reproducibility. We
open-source our framework and benchmarks to foster community-driven innovation
towards next-generation MLE agents.
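To make the "Gym-style" interaction pattern in the abstract concrete, the following is a minimal sketch of a reset/step loop with structured feedback. It is not the MLE-Dojo API: `ToyMLEEnv`, `run_episode`, the learning-rate task, and the improvement-based reward are all hypothetical stand-ins chosen for illustration, and the scoring function merely imitates executing a submitted solution.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMLEEnv:
    """Hypothetical Gym-style environment (illustrative only, not MLE-Dojo's API)."""

    max_steps: int = 5
    best_score: float = field(default=0.0, init=False)
    steps: int = field(default=0, init=False)

    def reset(self) -> dict:
        # Start a new episode and return the initial observation.
        self.best_score = 0.0
        self.steps = 0
        return {"feedback": "task: pick a learning rate in (0, 1]", "best_score": 0.0}

    def _evaluate(self, learning_rate: float) -> float:
        # Stand-in for executing submitted code; the toy score peaks at 0.1.
        if not 0.0 < learning_rate <= 1.0:
            raise ValueError("learning rate out of range")
        return max(0.0, 1.0 - abs(learning_rate - 0.1))

    def step(self, action: float):
        # Score the proposal, turn failures into textual feedback, and
        # reward only improvement over the best score seen so far.
        self.steps += 1
        try:
            score = self._evaluate(action)
            feedback = f"score={score:.3f}"
        except ValueError as err:
            score = 0.0
            feedback = f"error: {err}"
        reward = max(0.0, score - self.best_score)
        self.best_score = max(self.best_score, score)
        done = self.steps >= self.max_steps
        obs = {"feedback": feedback, "best_score": self.best_score}
        return obs, reward, done, {}


def run_episode(env: ToyMLEEnv, proposals):
    # Feed a fixed list of proposals; a real agent would read obs["feedback"]
    # and decide its next action from it.
    obs = env.reset()
    total = 0.0
    for action in proposals:
        obs, reward, done, _ = env.step(action)
        total += reward
        if done:
            break
    return obs, total


if __name__ == "__main__":
    final_obs, total_reward = run_episode(ToyMLEEnv(max_steps=3), [2.0, 0.5, 0.1])
    print(final_obs, total_reward)
```

In this sketch the first proposal fails and yields an error message rather than a crash, which mirrors the debug-and-refine loop the abstract describes; the actual observation, action, and reward definitions used by MLE-Dojo are not specified here.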