MLE-Smith: Scaling MLE Tasks with Automated Multi-Agent Pipeline

Abstract

Language models (LMs) have made significant progress in automating machine learning engineering (MLE), but the acquisition of high-quality MLE training data remains significantly constrained. Current MLE benchmarks suffer from low scalability and limited applicability because they rely on static, manually curated tasks that demand extensive time and manual effort to produce. We introduce MLE-Smith, a fully automated multi-agent pipeline that transforms raw datasets into competition-style MLE challenges through an efficient generate-verify-execute paradigm, scaling MLE tasks with verifiable quality, real-world usability, and rich diversity. The multi-agent pipeline in MLE-Smith drives structured task design and standardized refactoring, coupled with a hybrid verification mechanism that enforces strict structural rules and high-level semantic soundness. It further validates empirical solvability and real-world fidelity through interactive execution. We apply MLE-Smith to 224 real-world datasets and generate 606 tasks spanning multiple categories, objectives, and modalities, demonstrating that MLE-Smith works effectively across a wide range of real-world datasets. Evaluation on the generated tasks shows that the performance of eight mainstream and cutting-edge LLMs on MLE-Smith tasks is strongly correlated with their performance on carefully human-designed tasks, highlighting the effectiveness of MLE-Smith in scaling up MLE tasks while maintaining task quality.

