
MLE-Smith: Scaling MLE Tasks with Automated Multi-Agent Pipeline

October 8, 2025
Authors: Rushi Qiang, Yuchen Zhuang, Anikait Singh, Percy Liang, Chao Zhang, Sherry Yang, Bo Dai
cs.AI

Abstract

While Language Models (LMs) have made significant progress in automating machine learning engineering (MLE), the acquisition of high-quality MLE training data remains significantly constrained. Current MLE benchmarks suffer from low scalability and limited applicability because they rely on static, manually curated tasks that demand extensive time and manual effort to produce. We introduce MLE-Smith, a fully automated multi-agent pipeline that transforms raw datasets into competition-style MLE challenges through an efficient generate-verify-execute paradigm, scaling MLE tasks with verifiable quality, real-world usability, and rich diversity. The multi-agent pipeline in MLE-Smith drives structured task design and standardized refactoring, coupled with a hybrid verification mechanism that enforces strict structural rules and high-level semantic soundness; it further validates empirical solvability and real-world fidelity through interactive execution. We apply MLE-Smith to 224 real-world datasets and generate 606 tasks spanning multiple categories, objectives, and modalities, demonstrating that MLE-Smith works effectively across a wide range of real-world datasets. Evaluation on the generated tasks shows that the performance of eight mainstream and cutting-edge LLMs on MLE-Smith tasks is strongly correlated with their performance on carefully human-designed tasks, highlighting the effectiveness of MLE-Smith in scaling up MLE tasks while maintaining task quality.
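
To make the generate-verify-execute paradigm concrete, the sketch below shows how such a pipeline could be wired together in Python. This is a minimal illustration under assumed names: TaskSpec, generate_task, verify_task, execute_task, and the stubbed checks are all hypothetical placeholders, not the authors' implementation or any MLE-Smith code release.

```python
# Hypothetical sketch of a generate-verify-execute loop in the spirit of the
# abstract. All names below (TaskSpec, generate_task, verify_task,
# execute_task) are illustrative placeholders, not the authors' code.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskSpec:
    """A competition-style MLE task derived from a raw dataset."""
    dataset_name: str
    objective: str    # e.g. "binary classification"
    metric: str       # e.g. "AUC"
    description: str  # task statement handed to a solver agent

def generate_task(dataset_name: str) -> TaskSpec:
    # Stand-in for the agents that design and refactor a task from raw data.
    return TaskSpec(
        dataset_name=dataset_name,
        objective="binary classification",
        metric="AUC",
        description=f"Predict the target column of {dataset_name}.",
    )

def verify_task(task: TaskSpec) -> bool:
    # Hybrid verification: cheap structural rules plus a (stubbed) semantic check.
    structurally_valid = bool(task.description) and task.metric in {"AUC", "RMSE", "F1"}
    semantically_sound = True  # stand-in for an LM-based soundness judgment
    return structurally_valid and semantically_sound

def execute_task(task: TaskSpec) -> Optional[float]:
    # Interactive execution: run a baseline solver and report its score,
    # or None if the task proves unsolvable as specified.
    return 0.5  # stubbed baseline score

def build_tasks(dataset_names: list[str]) -> list[TaskSpec]:
    accepted = []
    for name in dataset_names:
        task = generate_task(name)
        if not verify_task(task):
            continue  # reject on failed verification
        if execute_task(task) is None:
            continue  # reject if empirically unsolvable
        accepted.append(task)
    return accepted

if __name__ == "__main__":
    print(build_tasks(["uci-adult", "titanic"]))
```

In the actual system, each stage would be backed by LM agents rather than these stubs; the sketch only mirrors the reject-on-failure structure of the verification and execution gates the abstract describes.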