EnvFactory: Scaling Tool-Use Agents via Executable Environment Synthesis and Robust Reinforcement Learning

Abstract

Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments, and the scarcity of realistic training data that captures implicit human reasoning. Existing approaches rely on costly real-world APIs, on hallucination-prone LLM simulators, or on synthetic environments that are often single-turn or tied to pre-collected documents. Moreover, synthetic trajectories are frequently over-specified: they resemble instruction sequences rather than natural human intents, which reduces their effectiveness for RL training. We introduce EnvFactory, a fully automated framework that addresses both challenges. EnvFactory autonomously explores and verifies stateful, executable tool environments built from authentic resources, and it synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories. Despite using far fewer environments than prior work, which often uses five times as many, EnvFactory achieves superior training efficiency and downstream performance, improving Qwen3-series models by up to 15% on BFCLv3, 8.6% on MCP-Atlas, and 6% on conversational benchmarks including τ²-Bench and VitaBench. By fully automating both environment construction and trajectory synthesis, EnvFactory provides a scalable, extensible, and robust foundation for Agentic RL.