MLE-Smith: Scaling MLE Tasks with Automated Multi-Agent Pipeline

Abstract

While Language Models (LMs) have made significant progress in automating machine learning engineering (MLE), the acquisition of high-quality MLE training data remains heavily constrained. Current MLE benchmarks suffer from low scalability and limited applicability because they rely on static, manually curated tasks that demand extensive time and manual effort to produce. We introduce MLE-Smith, a fully automated multi-agent pipeline that transforms raw datasets into competition-style MLE challenges through an efficient generate-verify-execute paradigm, with the goal of scaling MLE tasks with verifiable quality, real-world usability, and rich diversity. The multi-agent pipeline in MLE-Smith drives structured task design and standardized refactoring, coupled with a hybrid verification mechanism that enforces strict structural rules and high-level semantic soundness. It further validates empirical solvability and real-world fidelity through interactive execution. We apply MLE-Smith to 224 real-world datasets and generate 606 tasks spanning multiple categories, objectives, and modalities, demonstrating that MLE-Smith works effectively across a wide range of real-world datasets. Evaluation on the generated tasks shows that the performance of eight mainstream and cutting-edge LLMs on MLE-Smith tasks is strongly correlated with their performance on carefully human-designed tasks, highlighting the effectiveness of MLE-Smith in scaling up MLE tasks while maintaining task quality.
