
MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering

May 12, 2025
Authors: Rushi Qiang, Yuchen Zhuang, Yinghao Li, Dingu Sagar V K, Rongzhi Zhang, Changhao Li, Ian Shu-Hei Wong, Sherry Yang, Percy Liang, Chao Zhang, Bo Dai
cs.AI

Abstract

We introduce MLE-Dojo, a Gym-style framework for the systematic reinforcement learning, evaluation, and improvement of autonomous large language model (LLM) agents in iterative machine learning engineering (MLE) workflows. Unlike existing benchmarks, which rely primarily on static datasets or single-attempt evaluations, MLE-Dojo provides an interactive environment in which agents iteratively experiment, debug, and refine solutions through structured feedback loops. Built upon 200+ real-world Kaggle challenges, MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios such as data processing, architecture search, hyperparameter tuning, and code debugging. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning, facilitating iterative experimentation, realistic data sampling, and real-time outcome verification. Extensive evaluations of eight frontier LLMs reveal that while current models achieve meaningful iterative improvements, they still exhibit significant limitations in autonomously generating long-horizon solutions and in efficiently resolving complex errors. Furthermore, MLE-Dojo's flexible and extensible architecture integrates diverse data sources, tools, and evaluation protocols, uniquely enabling model-based agent tuning and promoting interoperability, scalability, and reproducibility. We open-source our framework and benchmarks to foster community-driven innovation towards next-generation MLE agents.
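
To make the "Gym-style" feedback loop in the abstract concrete, the following is a minimal sketch of how an agent might interact with such an environment. It is an illustration under stated assumptions, not MLE-Dojo's actual API: the class `ToyMLEEnv`, the function `run_episode`, the constant `ERROR_PENALTY`, the toy regression task, and the reward definition (negative mean squared error) are all hypothetical. The agent here is just a scripted list of code submissions that moves from a runtime error to a partial solution to a correct one, mirroring the experiment, debug, refine cycle.

```python
from typing import Dict, List, Optional, Tuple

ERROR_PENALTY = -10.0  # assumed reward for a submission that fails to run


class ToyMLEEnv:
    """Hypothetical Gym-style environment (NOT the MLE-Dojo API)."""

    def __init__(self, max_steps: int = 5) -> None:
        self.max_steps = max_steps
        # Hidden evaluation data for the toy task y = 2x + 1.
        self._data: List[Tuple[float, float]] = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]
        self._steps = 0

    def reset(self) -> Dict[str, Optional[str]]:
        """Start a new episode and return the initial observation."""
        self._steps = 0
        return {"task": "Define predict(x) that fits the hidden data.", "feedback": None}

    def step(self, action: str) -> Tuple[Dict[str, Optional[str]], float, bool]:
        """Run the submitted source, score it, and return (obs, reward, done)."""
        self._steps += 1
        namespace: dict = {}
        solved = False
        try:
            # exec is unsafe on untrusted code; a real environment would sandbox it.
            exec(action, namespace)
            predict = namespace["predict"]
            mse = sum((predict(x) - y) ** 2 for x, y in self._data) / len(self._data)
            reward = 0.0 - mse  # higher is better; 0.0 means a perfect fit
            feedback = f"ok: mse={mse:.3f}"
            solved = mse < 1e-9
        except Exception as exc:
            # Errors are returned as structured feedback so the agent can debug.
            reward = ERROR_PENALTY
            feedback = f"error: {type(exc).__name__}: {exc}"
        done = solved or self._steps >= self.max_steps
        return {"task": None, "feedback": feedback}, reward, done


def run_episode(env: ToyMLEEnv, attempts: List[str]) -> List[Tuple[float, Optional[str]]]:
    """Submit scripted attempts in order until the episode ends."""
    env.reset()
    history: List[Tuple[float, Optional[str]]] = []
    for source in attempts:
        obs, reward, done = env.step(source)
        history.append((reward, obs["feedback"]))
        if done:
            break
    return history


# A scripted "agent": buggy code, then a weak model, then the correct one.
ATTEMPTS = [
    "def predict(x):\n    return slope * x\n",  # NameError: slope is undefined
    "def predict(x):\n    return 2 * x\n",      # runs, but misses the intercept
    "def predict(x):\n    return 2 * x + 1\n",  # fits the hidden data exactly
]

if __name__ == "__main__":
    for reward, feedback in run_episode(ToyMLEEnv(), ATTEMPTS):
        print(reward, feedback)
```

In this sketch the environment, rather than the agent, owns execution and scoring, so every submission yields either a score or an error message that the next attempt can act on. An LLM agent would replace the scripted list by generating each new submission from the previous feedback.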
