
EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

May 18, 2026
Authors: Minrui Xu, Zilin Wang, Mengyi DENG, Zhiwei Li, Zhicheng Yang, Xiao Zhu, Yinhong Liu, Boyu Zhu, Baiyu Huang, Chao Chen, Heyuan Deng, Fei Mi, Lifeng Shang, Xingshan Zeng, Zhijiang Guo
cs.AI

Abstract

Equipping LLMs with tool-use capabilities via Agentic Reinforcement Learning (Agentic RL) is bottlenecked by two challenges: the lack of scalable, robust execution environments and the scarcity of realistic training data that captures implicit human reasoning. Existing approaches depend on costly real-world APIs, hallucination-prone LLM simulators, or synthetic environments that are often single-turn or rely on pre-collected documents. Moreover, synthetic trajectories are frequently over-specified, resembling instruction sequences rather than natural human intents, which reduces their effectiveness for RL training. We introduce EnvFactory, a fully automated framework that addresses both challenges. EnvFactory autonomously explores and verifies stateful, executable tool environments from authentic resources, and synthesizes natural multi-turn trajectories through topology-aware sampling and calibrated refinement, producing grounded queries with implicit intents. Using only 85 verified environments across 7 domains, EnvFactory generates 2,575 SFT and RL trajectories. Despite using far fewer environments than prior work, which often relies on five times as many, EnvFactory achieves superior training efficiency and downstream performance, improving Qwen3-series models by up to +15% on BFCLv3, +8.6% on MCP-Atlas, and +6% on conversational benchmarks including τ²-Bench and VitaBench. By fully automating both environment construction and trajectory synthesis, EnvFactory provides a scalable, extensible, and robust foundation for Agentic RL.
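
The abstract does not spell out how topology-aware sampling works. One plausible reading, offered here only as an illustrative sketch and not as the authors' method, is that tool sequences are drawn by walking a tool dependency graph, so that each sampled call can consume something an earlier call produced. The graph, the tool names, and the function `sample_tool_chain` below are all hypothetical.

```python
import random

# Hypothetical tool dependency graph (an assumption, not from the paper):
# an edge u -> v means tool v can consume something produced by tool u.
TOOL_GRAPH = {
    "search_flights": ["book_flight"],
    "book_flight": ["add_baggage", "cancel_booking"],
    "add_baggage": [],
    "cancel_booking": [],
}


def sample_tool_chain(graph, max_len, rng):
    """Random walk along dependency edges, starting from a tool with no prerequisites."""
    # Tools that appear as someone's child have at least one prerequisite.
    has_parent = {child for children in graph.values() for child in children}
    roots = sorted(tool for tool in graph if tool not in has_parent)
    chain = [rng.choice(roots)]
    # Extend the chain until it is long enough or the current tool has no successors.
    while len(chain) < max_len and graph[chain[-1]]:
        chain.append(rng.choice(graph[chain[-1]]))
    return chain


chain = sample_tool_chain(TOOL_GRAPH, max_len=3, rng=random.Random(0))
print(chain)
```

A chain sampled this way always respects the dependency order, which is one way a synthesized multi-turn trajectory could stay executable against a stateful environment. How EnvFactory actually builds the graph and refines the resulting queries is not described in the abstract.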