Synthetic Sandbox for Training Machine Learning Engineering Agents

Abstract

As large language model agents advance beyond software engineering (SWE) tasks toward machine learning engineering (MLE), verifying agent behavior becomes orders of magnitude more expensive. SWE tasks can be verified with fast-executing unit tests, whereas MLE verification requires running a full ML pipeline (data preprocessing, model training, and metric evaluation) on large datasets at each rollout step, which makes trajectory-wise on-policy reinforcement learning (RL) prohibitively slow. Existing approaches retreat to supervised fine-tuning (SFT) or offline proxy rewards, sacrificing the exploration and generalization benefits of on-policy RL. We observe that sandbox data size is the primary source of this bottleneck. Based on this insight, we introduce SandMLE, a multi-agent framework that generates diverse, verifiable synthetic MLE environments from a small number of seed tasks, preserving the structural and technical complexity of real-world problems while constraining datasets to micro-scale (each task is paired with only 50-200 training samples). Through extensive experiments, we show that SandMLE reduces execution time by over 13 times, enabling large-scale, on-policy trajectory-wise RL for the first time in the MLE domain. On MLE-bench-lite, SandMLE yields significant gains over SFT baselines across Qwen3-8B, 14B, and 30B-A3B, with relative medal rate improvements ranging from 20.3% to 66.9%. Furthermore, the trained policy generalizes to unseen agentic scaffolds, improving the HumanRank score on MLE-Dojo by up to 32.4%.
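As a rough illustration of what a micro-scale, verifiable environment could look like, the sketch below builds a tiny synthetic classification task and scores a submitted predictor on a held-out split. It is not SandMLE's implementation: the names `MicroTask`, `make_micro_task`, and `verify`, the linear labeling rule, and the accuracy metric are all hypothetical choices made for this example. Only the 50-200 training-sample range comes from the abstract.

```python
import random
from dataclasses import dataclass


@dataclass
class MicroTask:
    """Hypothetical container for one micro-scale task."""
    train: list  # (features, label) pairs visible to the agent
    test: list   # held-out pairs used only by the verifier


def make_micro_task(seed: int, n_train: int = 100, n_test: int = 50,
                    n_features: int = 4) -> MicroTask:
    """Build a tiny synthetic binary-classification task (illustrative only)."""
    # The 50-200 range mirrors the micro-scale constraint stated in the abstract.
    if not 50 <= n_train <= 200:
        raise ValueError("micro-scale tasks use 50-200 training samples")
    rng = random.Random(seed)  # seeded so the environment is reproducible
    weights = [rng.uniform(-1, 1) for _ in range(n_features)]

    def sample():
        x = [rng.gauss(0, 1) for _ in range(n_features)]
        # Assumed labeling rule: sign of a random linear function of x.
        y = int(sum(w * v for w, v in zip(weights, x)) > 0)
        return x, y

    train = [sample() for _ in range(n_train)]
    test = [sample() for _ in range(n_test)]
    return MicroTask(train=train, test=test)


def verify(task: MicroTask, predict) -> float:
    """Score a submitted predictor by accuracy on the held-out split."""
    correct = sum(predict(x) == y for x, y in task.test)
    return correct / len(task.test)
```

Because both the data and the evaluation are this small, a call such as `verify(make_micro_task(seed=0), my_predictor)` returns almost instantly, which is the property that makes a per-rollout reward affordable.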

