EnvFactory: Scaling Tool-Use Agents through Executable Environment Synthesis and Robust Reinforcement Learning

Abstract

Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments and the scarcity of realistic training data that captures implicit human reasoning. Existing approaches depend on costly real-world APIs, hallucination-prone LLM simulators, or synthetic environments that are often single-turn or reliant on pre-collected documents. Moreover, synthetic trajectories are frequently over-specified: they read like instruction sequences rather than natural human intents, which reduces their effectiveness for RL training. We introduce EnvFactory, a fully automated framework that addresses both challenges. EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories. Despite using far fewer environments than prior work, which often uses five times as many, EnvFactory achieves superior training efficiency and downstream performance, improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including τ²-Bench and VitaBench. By fully automating both environment construction and trajectory synthesis, EnvFactory provides a scalable, extensible, and robust foundation for Agentic RL.