AI Scientist via Synthetic Task Scaling

Abstract

With the advent of AI agents, automatic scientific discovery has become a tenable goal. Many recent works scaffold agentic systems that can perform machine learning research but do not offer a principled way to train such agents, and current LLMs often generate plausible-looking but ineffective ideas. To make progress on training agents that can learn from doing, we provide a novel synthetic environment generation pipeline targeting machine learning agents. Our pipeline automatically synthesizes machine learning challenges compatible with the SWE-agent framework, covering topic sampling, dataset proposal, and code generation. The resulting synthetic tasks are 1) grounded in real machine learning datasets, because the proposed datasets are verified against the Hugging Face API, and 2) verified for higher quality with a self-debugging loop. To validate the effectiveness of our synthetic tasks, we tackle MLGym, a benchmark for machine learning tasks. We sample trajectories from a teacher model (GPT-5) on the synthetic tasks, then use these trajectories to train student models (Qwen3-4B and Qwen3-8B). The student models trained with our synthetic tasks achieve improved performance on MLGym, raising the AUP metric by 9% for Qwen3-4B and 12% for Qwen3-8B.
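The pipeline stages named above (topic sampling, dataset proposal, dataset verification, code generation, and a self-debugging loop) can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: every name here (`SyntheticTask`, `synthesize_task`, and the injected callables) is hypothetical, the LLM calls and the dataset lookup are passed in as plain functions, and the retry budget of three rounds is an arbitrary choice.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SyntheticTask:
    """Hypothetical container for one synthesized ML challenge."""
    topic: str
    dataset_id: str
    code: str


def synthesize_task(
    sample_topic: Callable[[], str],
    propose_dataset: Callable[[str], str],
    dataset_exists: Callable[[str], bool],
    generate_code: Callable[[str, str, Optional[str]], str],
    run_code: Callable[[str], Optional[str]],
    max_debug_rounds: int = 3,
) -> Optional[SyntheticTask]:
    """Return a verified task, or None if any verification step fails.

    All callables are assumptions standing in for LLM calls, a dataset
    registry lookup, and a sandboxed executor.
    """
    topic = sample_topic()
    dataset_id = propose_dataset(topic)

    # Grounding check: drop proposals naming a dataset that does not exist.
    if not dataset_exists(dataset_id):
        return None

    error: Optional[str] = None
    # One initial attempt plus up to max_debug_rounds repair attempts.
    for _ in range(max_debug_rounds + 1):
        # The previous error (if any) is fed back so the generator can repair it.
        code = generate_code(topic, dataset_id, error)
        error = run_code(code)  # None means the code ran without error
        if error is None:
            return SyntheticTask(topic, dataset_id, code)
    return None
```

Passing the collaborators in as functions keeps the sketch independent of any particular LLM client or dataset registry; a real system would plug in its own implementations for each callable.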
