MLE-Smith: Scaling MLE Tasks with Automated Multi-Agent Pipeline

Abstract

While Language Models (LMs) have made significant progress in automating machine learning engineering (MLE), the acquisition of high-quality MLE training data remains significantly constrained. Current MLE benchmarks suffer from low scalability and limited applicability because they rely on static, manually curated tasks that demand extensive time and manual effort to produce. We introduce MLE-Smith, a fully automated multi-agent pipeline that transforms raw datasets into competition-style MLE challenges through an efficient generate-verify-execute paradigm, scaling MLE tasks with verifiable quality, real-world usability, and rich diversity. The multi-agent pipeline in MLE-Smith drives structured task design and standardized refactoring, coupled with a hybrid verification mechanism that enforces strict structural rules and high-level semantic soundness. It further validates empirical solvability and real-world fidelity through interactive execution. We apply MLE-Smith to 224 real-world datasets and generate 606 tasks spanning multiple categories, objectives, and modalities, demonstrating that MLE-Smith works effectively across a wide range of real-world datasets. Evaluation on the generated tasks shows that the performance of eight mainstream and cutting-edge LLMs on MLE-Smith tasks is strongly correlated with their performance on carefully human-designed tasks, highlighting the effectiveness of MLE-Smith in scaling up MLE tasks while maintaining task quality.
