Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning
February 10, 2026
Authors: Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He
cs.AI
Abstract
Recent advances in large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction than collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.
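To make the idea of a code-driven, database-backed environment with a state-based reward more concrete, here is a minimal sketch. It is not taken from the AWM codebase: the class `ToyTodoEnv`, its two tools, the `todos` table, and the `reward` function are all hypothetical names chosen for illustration, and an in-memory SQLite database is assumed as the backing store.

```python
import sqlite3


class ToyTodoEnv:
    """Hypothetical toy environment: tools are plain functions over a SQLite database."""

    def __init__(self):
        # In-memory database, so every environment instance starts from the same state.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE todos ("
            "id INTEGER PRIMARY KEY, title TEXT NOT NULL, done INTEGER NOT NULL DEFAULT 0)"
        )
        self.db.executemany(
            "INSERT INTO todos (title) VALUES (?)", [("buy milk",), ("file taxes",)]
        )
        self.db.commit()

    def list_todos(self):
        """Tool: return all todo items as a list of dicts (the observation)."""
        rows = self.db.execute("SELECT id, title, done FROM todos ORDER BY id").fetchall()
        return [{"id": r[0], "title": r[1], "done": bool(r[2])} for r in rows]

    def complete_todo(self, todo_id):
        """Tool: mark one todo item as done; report an error if the id is unknown."""
        cur = self.db.execute("UPDATE todos SET done = 1 WHERE id = ?", (todo_id,))
        self.db.commit()
        if cur.rowcount == 0:
            return {"ok": False, "error": f"no todo with id {todo_id}"}
        return {"ok": True}

    def step(self, tool_name, **kwargs):
        """Dispatch a tool call by name; the state transition is deterministic code."""
        tools = {"list_todos": self.list_todos, "complete_todo": self.complete_todo}
        if tool_name not in tools:
            return {"ok": False, "error": f"unknown tool {tool_name}"}
        return tools[tool_name](**kwargs)


def reward(env, title):
    """Hypothetical reward: 1.0 if the named todo is marked done in the database."""
    row = env.db.execute("SELECT done FROM todos WHERE title = ?", (title,)).fetchone()
    return 1.0 if row is not None and row[0] == 1 else 0.0


# Example episode: the agent lists items, then completes the target one.
env = ToyTodoEnv()
observation = env.step("list_todos")
target_id = next(t["id"] for t in observation if t["title"] == "file taxes")
env.step("complete_todo", todo_id=target_id)
print(reward(env, "file taxes"))
```

Because the reward reads the final database state rather than parsing the agent's messages, it does not depend on how the agent phrases its answer. This toy version only illustrates that general design; the actual pipeline and reward design are described in the paper and repository.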