
Scaling Agent Learning via Experience Synthesis

November 5, 2025
Authors: Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, Dat Huynh
cs.AI

Abstract

While reinforcement learning (RL) can empower large language model (LLM) agents by enabling self-improvement through interaction, its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity, all of which obstruct the collection of scalable experience data. To address these challenges, we introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind to enable effective online RL training for autonomous agents. Rather than relying on expensive real-environment rollouts, DreamGym distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning, enabling scalable agent rollout collection for RL. To improve the stability and quality of transitions, DreamGym leverages an experience replay buffer initialized with offline real-world data and continuously enriched with fresh interactions to actively support agent training. To improve knowledge acquisition, DreamGym adaptively generates new tasks that challenge the current agent policy, enabling more effective online curriculum learning. Experiments across diverse environments and agent backbones demonstrate that DreamGym substantially improves RL training, both in fully synthetic settings and in sim-to-real transfer scenarios. On non-RL-ready tasks like WebArena, DreamGym outperforms all baselines by over 30%; in RL-ready but costly settings, it matches GRPO and PPO performance using only synthetic interactions. When transferring a policy trained purely on synthetic experiences to real-environment RL, DreamGym yields significant additional performance gains while requiring far fewer real-world interactions, providing a scalable warm-start strategy for general-purpose RL.
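
The abstract describes DreamGym's core loop at a high level: a reasoning-based experience model synthesizes state transitions and feedback, a replay buffer seeded with offline real-world data stabilizes those transitions, and adaptive task generation drives an online curriculum. The following is a minimal, hypothetical Python sketch of such a loop; every name here (`ExperienceModel`, `TaskGenerator`, `Transition`, `update_policy`) is an illustrative placeholder rather than the paper's actual API, and the transition, reward, and policy-update logic are stubbed out.

```python
# Hypothetical sketch of a DreamGym-style synthetic-experience RL loop.
# All class and function names are illustrative placeholders, not the paper's API.

import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Transition:
    state: str
    action: str
    next_state: str
    reward: float


class ExperienceModel:
    """Stands in for the reasoning-based experience model: given a state and an
    action, it derives a next state and a feedback signal (trivially stubbed here)."""

    def step(self, state: str, action: str) -> tuple[str, float]:
        next_state = f"{state} -> {action}"          # placeholder synthetic transition
        reward = 1.0 if "goal" in action else 0.0    # placeholder feedback signal
        return next_state, reward


class TaskGenerator:
    """Stands in for adaptive task generation: proposes tasks meant to challenge
    the current policy (here, simply sampled from a fixed pool)."""

    def __init__(self, seed_tasks: list[str]):
        self.pool = list(seed_tasks)

    def propose(self) -> str:
        return random.choice(self.pool)


def update_policy(policy: dict, batch: list[Transition]) -> None:
    """Placeholder for a PPO/GRPO-style policy update on synthetic rollouts."""
    for t in batch:
        policy[t.state] = t.action if t.reward > 0 else policy.get(t.state, "explore")


def train(num_iters: int = 3, horizon: int = 4) -> dict:
    experience_model = ExperienceModel()
    task_gen = TaskGenerator(["navigate to page", "fill form", "reach goal page"])
    buffer: deque = deque(maxlen=1000)   # replay buffer; offline real data would seed it
    policy: dict = {}

    for _ in range(num_iters):
        task = task_gen.propose()                     # adaptive task proposal
        state = f"task: {task}"
        for _ in range(horizon):
            action = policy.get(state, "try goal action")       # placeholder policy
            next_state, reward = experience_model.step(state, action)
            buffer.append(Transition(state, action, next_state, reward))
            state = next_state
        update_policy(policy, list(buffer))           # update on replayed synthetic data
    return policy


if __name__ == "__main__":
    print(train())
```

In the paper's actual setting, the experience model and policy would be LLMs, the replay buffer would mix offline real-environment trajectories with fresh synthetic ones, and the trained policy could then be transferred to real-environment RL as a warm start.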