
MLE-Smith: Scaling MLE Tasks with Automated Multi-Agent Pipeline

October 8, 2025
作者: Rushi Qiang, Yuchen Zhuang, Anikait Singh, Percy Liang, Chao Zhang, Sherry Yang, Bo Dai
cs.AI

Abstract

While Language Models (LMs) have made significant progress in automating machine learning engineering (MLE), the acquisition of high-quality MLE training data remains significantly constrained. Current MLE benchmarks suffer from low scalability and limited applicability because they rely on static, manually curated tasks that demand extensive time and manual effort to produce. We introduce MLE-Smith, a fully automated multi-agent pipeline that transforms raw datasets into competition-style MLE challenges through an efficient generate-verify-execute paradigm, scaling MLE tasks with verifiable quality, real-world usability, and rich diversity. The multi-agent pipeline in MLE-Smith drives structured task design and standardized refactoring, coupled with a hybrid verification mechanism that enforces strict structural rules and high-level semantic soundness. It further validates empirical solvability and real-world fidelity through interactive execution. We apply MLE-Smith to 224 real-world datasets and generate 606 tasks spanning multiple categories, objectives, and modalities, demonstrating that MLE-Smith can work effectively across a wide range of real-world datasets. Evaluation on the generated tasks shows that the performance of eight mainstream and cutting-edge LLMs on MLE-Smith tasks is strongly correlated with their performance on carefully human-designed tasks, highlighting the effectiveness of MLE-Smith in scaling up MLE tasks while maintaining task quality.
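The generate-verify-execute paradigm described above can be sketched as a simple pipeline skeleton. This is an illustrative outline only, not the actual MLE-Smith implementation: all names (`Task`, `generate`, `verify`, `execute`, `smith`) and the stub checks are hypothetical stand-ins for the paper's generator agent, hybrid verifier, and interactive execution stage.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """A competition-style MLE task derived from a raw dataset (hypothetical schema)."""
    dataset: str
    description: str

def generate(dataset: str) -> Task:
    # Stand-in for the generator agent: drafts a task spec from a raw dataset.
    return Task(dataset=dataset, description=f"Predict the target column of {dataset}")

def verify(task: Task) -> bool:
    # Stand-in for hybrid verification: strict structural rules plus a
    # higher-level semantic check (in the paper, an LM-based judgment).
    structurally_valid = bool(task.dataset) and bool(task.description)
    semantically_sound = "Predict" in task.description  # toy semantic rule
    return structurally_valid and semantically_sound

def execute(task: Task) -> bool:
    # Stand-in for interactive execution: attempt a baseline solution to
    # confirm the task is empirically solvable; here a stub that succeeds.
    return True

def smith(datasets: list[str]) -> list[Task]:
    """Run the generate-verify-execute loop over raw datasets, keeping
    only tasks that pass both verification and execution."""
    tasks = []
    for ds in datasets:
        task = generate(ds)
        if verify(task) and execute(task):
            tasks.append(task)
    return tasks

tasks = smith(["housing.csv", "reviews.csv"])
print(len(tasks))  # 2
```

The key design point the sketch mirrors is that verification and execution act as filters after generation, so scaling the number of input datasets scales the task pool while quality gates remain fixed.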