AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning
September 10, 2025
Authors: Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, Wei He, Yiwen Ding, Guanyu Li, Zehui Chen, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang
cs.AI
Abstract
Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier. Like human cognitive development, agents are expected to acquire knowledge and skills through exploration and interaction with the environment. Despite advances, the community still lacks a unified, interactive reinforcement learning (RL) framework that can effectively train such agents from scratch -- without relying on supervised fine-tuning (SFT) -- across diverse and realistic environments. To bridge this gap, we introduce AgentGym-RL, a new framework to train LLM agents for multi-turn interactive decision-making through RL. The framework features a modular and decoupled architecture, ensuring high flexibility and extensibility. It encompasses a wide variety of real-world scenarios and supports mainstream RL algorithms. Furthermore, we propose ScalingInter-RL, a training approach designed for exploration-exploitation balance and stable RL optimization. In early stages, it emphasizes exploitation by restricting the number of interactions, and gradually shifts towards exploration with larger horizons to encourage diverse problem-solving strategies. In this way, the agent develops more diverse behaviors and is less prone to collapse under long horizons. We perform extensive experiments to validate the stability and effectiveness of both the AgentGym-RL framework and the ScalingInter-RL approach. Our agents match or surpass commercial models on 27 tasks across diverse environments. We offer key insights and will open-source the complete AgentGym-RL framework -- including code and datasets -- to empower the research community in developing the next generation of intelligent agents.
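
To make the "modular and decoupled architecture" concrete, the sketch below shows what a uniform multi-turn environment interface could look like. The names here (MultiTurnEnv, reset, step) are assumptions for illustration, not AgentGym-RL's actual API:

```python
# Hypothetical sketch of a decoupled agent-environment interface. The class
# and method names are illustrative assumptions, not AgentGym-RL's actual
# API: the point is that trainers only see a uniform text-in/text-out
# interface, so environments and RL algorithms can be swapped independently.
from abc import ABC, abstractmethod


class MultiTurnEnv(ABC):
    """Uniform interface an environment exposes to any trainer."""

    @abstractmethod
    def reset(self, task_id: str) -> str:
        """Start a task and return the initial observation as text."""

    @abstractmethod
    def step(self, action: str) -> tuple[str, float, bool]:
        """Apply an agent's action; return (observation, reward, done)."""
```

Under this kind of decoupling, a web-navigation environment, a text game, or a tool-use sandbox can sit behind the same interface without changes to the RL algorithm on the other side.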
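
The ScalingInter-RL recipe described in the abstract (restrict interactions early, then widen the horizon) can also be illustrated with a minimal sketch. Everything below is a hedged illustration under assumed names (scaled_horizon, collect_episode, PHASES), not the paper's actual implementation:

```python
# Minimal sketch of a ScalingInter-RL-style horizon schedule, assuming a
# simple staged schedule. The interaction budget starts small (favoring
# exploitation on short rollouts) and is raised in stages so the agent can
# explore longer horizons later in training.

def scaled_horizon(step: int, phases: list[tuple[int, int]]) -> int:
    """Return the interaction budget (max turns) in effect at `step`.

    `phases` is a list of (start_step, max_turns) pairs sorted by
    start_step; the last pair whose start_step has been reached applies.
    """
    budget = phases[0][1]
    for start, max_turns in phases:
        if step >= start:
            budget = max_turns
    return budget


def collect_episode(agent, env, task_id: str, max_turns: int) -> list:
    """Roll out one multi-turn episode, truncated at the current budget."""
    obs = env.reset(task_id)
    trajectory = []
    for _ in range(max_turns):
        action = agent.act(obs)            # agent proposes the next action
        obs, reward, done = env.step(action)
        trajectory.append((action, obs, reward))
        if done:                           # task solved or failed early
            break
    return trajectory


# Example: exploit on short 5-turn rollouts first, then widen the horizon.
PHASES = [(0, 5), (2_000, 10), (5_000, 20)]
# for step in range(8_000):
#     traj = collect_episode(agent, env, task, scaled_horizon(step, PHASES))
#     update_policy(traj)  # any mainstream RL update, e.g., PPO; omitted here
```

A staged budget is just one simple way to realize the schedule; the paper may anneal the horizon differently, but the exploitation-to-exploration progression is the same idea.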