
Scaling Agent Learning via Experience Synthesis

November 5, 2025
Authors: Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, Dat Huynh
cs.AI

Abstract

While reinforcement learning (RL) can empower large language model (LLM) agents by enabling self-improvement through interaction, its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity, all of which obstruct the collection of scalable experience data. To address these challenges, we introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind to enable effective online RL training for autonomous agents. Rather than relying on expensive real-environment rollouts, DreamGym distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning, enabling scalable agent rollout collection for RL. To improve the stability and quality of transitions, DreamGym leverages an experience replay buffer initialized with offline real-world data and continuously enriched with fresh interactions to actively support agent training. To improve knowledge acquisition, DreamGym adaptively generates new tasks that challenge the current agent policy, enabling more effective online curriculum learning. Experiments across diverse environments and agent backbones demonstrate that DreamGym substantially improves RL training, both in fully synthetic settings and in sim-to-real transfer scenarios. On non-RL-ready tasks like WebArena, DreamGym outperforms all baselines by over 30%, and in RL-ready but costly settings it matches GRPO and PPO performance using only synthetic interactions. When transferring a policy trained purely on synthetic experiences to real-environment RL, DreamGym yields significant additional performance gains while requiring far fewer real-world interactions, providing a scalable warm-start strategy for general-purpose RL.
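
To make the loop described in the abstract concrete, below is a minimal, self-contained Python sketch of the overall architecture: a reasoning-based experience model that synthesizes state transitions and rewards, a replay buffer seeded with offline data and enriched with fresh synthetic rollouts, and an adaptive task generator for curriculum learning. All class and function names (`ExperienceModel`, `ReplayBuffer`, `TaskGenerator`, `Policy`) are illustrative assumptions, not the authors' actual API, and the "reasoning" and policy update are stubbed with toy placeholders.

```python
# Hypothetical sketch of a DreamGym-style synthetic-experience training loop.
# All names and internals are illustrative placeholders, not the paper's code.
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Transition:
    """One synthetic interaction step: state, action, next state, reward."""
    state: str
    action: str
    next_state: str
    reward: float


class ReplayBuffer:
    """Buffer seeded with offline real-world data and continuously enriched
    with fresh synthetic interactions, as described in the abstract."""
    def __init__(self, offline_data: List[Transition]):
        self.data = list(offline_data)

    def add(self, transitions: List[Transition]) -> None:
        self.data.extend(transitions)

    def sample(self, k: int) -> List[Transition]:
        return random.sample(self.data, min(k, len(self.data)))


class ExperienceModel:
    """Stand-in for the reasoning-based experience model: given a task, state,
    and agent action, it derives the next state and a feedback signal.
    The step-by-step reasoning is faked here with a toy heuristic."""
    def __init__(self, buffer: ReplayBuffer):
        self.buffer = buffer  # retrieved transitions could ground the reasoning

    def step(self, task: str, state: str, action: str) -> Transition:
        next_state = f"{state} -> after({action})"
        reward = 1.0 if "correct" in action else 0.0  # placeholder feedback
        return Transition(state, action, next_state, reward)


class TaskGenerator:
    """Adaptive curriculum: propose tasks that challenge the current policy,
    e.g. variations of tasks on which recent rewards were low."""
    def propose(self, seed_tasks: List[str], recent_rewards: List[float]) -> str:
        hard = [t for t, r in zip(seed_tasks, recent_rewards) if r < 0.5]
        return random.choice(hard or seed_tasks) + " (variation)"


class Policy:
    """Toy agent policy; a real agent would be an LLM proposing actions."""
    def act(self, state: str) -> str:
        return random.choice(["correct step", "wrong step"])

    def update(self, batch: List[Transition]) -> None:
        pass  # an actual RL update (e.g. a PPO/GRPO-style step) would go here


def train(policy: Policy, seed_tasks: List[str], offline_data: List[Transition],
          iterations: int = 10, horizon: int = 4) -> None:
    buffer = ReplayBuffer(offline_data)
    experience_model = ExperienceModel(buffer)
    task_gen = TaskGenerator()
    recent_rewards = [0.0] * len(seed_tasks)

    for _ in range(iterations):
        task = task_gen.propose(seed_tasks, recent_rewards)
        state, rollout = f"start({task})", []
        for _ in range(horizon):                 # synthetic rollout, no real env
            action = policy.act(state)
            tr = experience_model.step(task, state, action)
            rollout.append(tr)
            state = tr.next_state
        buffer.add(rollout)                      # enrich buffer with fresh data
        policy.update(buffer.sample(8))          # policy improvement step


if __name__ == "__main__":
    offline_seed = [Transition("s0", "a0", "s1", 1.0)]
    train(Policy(), ["book a flight", "search for a product"], offline_seed)
```

In this sketch the real environment never appears in the training loop; a sim-to-real warm start, as described in the abstract, would simply continue RL on the same policy against the real environment after this synthetic phase.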