AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning
September 10, 2025
Authors: Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, Wei He, Yiwen Ding, Guanyu Li, Zehui Chen, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang
cs.AI
Abstract
Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier. Like human cognitive development, agents are expected to acquire knowledge and skills through exploration and interaction with the environment. Despite advances, the community still lacks a unified, interactive reinforcement learning (RL) framework that can effectively train such agents from scratch -- without relying on supervised fine-tuning (SFT) -- across diverse and realistic environments. To bridge this gap, we introduce AgentGym-RL, a new framework to train LLM agents for multi-turn interactive decision-making through RL. The framework features a modular and decoupled architecture, ensuring high flexibility and extensibility. It encompasses a wide variety of real-world scenarios and supports mainstream RL algorithms. Furthermore, we propose ScalingInter-RL, a training approach designed for exploration-exploitation balance and stable RL optimization. In early stages, it emphasizes exploitation by restricting the number of interactions, and gradually shifts towards exploration with larger horizons to encourage diverse problem-solving strategies. In this way, the agent develops more diverse behaviors and is less prone to collapse under long horizons. We perform extensive experiments to validate the stability and effectiveness of both the AgentGym-RL framework and the ScalingInter-RL approach. Our agents match or surpass commercial models on 27 tasks across diverse environments. We offer key insights and will open-source the complete AgentGym-RL framework -- including code and datasets -- to empower the research community in developing the next generation of intelligent agents.
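
To make the "modular and decoupled architecture" concrete, the sketch below shows what a uniform multi-turn environment interface could look like. The names here (MultiTurnEnv, reset, step) are assumptions for illustration, not AgentGym-RL's actual API:

```python
# Hypothetical sketch of a decoupled agent-environment interface. The class
# and method names are illustrative assumptions, not AgentGym-RL's actual
# API: the point is that trainers only see a uniform text-in/text-out
# interface, so environments and RL algorithms can be swapped independently.
from abc import ABC, abstractmethod


class MultiTurnEnv(ABC):
    """Uniform interface an environment exposes to any trainer."""

    @abstractmethod
    def reset(self, task_id: str) -> str:
        """Start a task and return the initial observation as text."""

    @abstractmethod
    def step(self, action: str) -> tuple[str, float, bool]:
        """Apply an agent's action; return (observation, reward, done)."""
```

Under this kind of decoupling, a web-navigation environment, a text game, or a tool-use sandbox can sit behind the same interface without changes to the RL algorithm on the other side.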
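
The ScalingInter-RL recipe described in the abstract (restrict interactions early, then widen the horizon) can also be illustrated with a minimal sketch. Everything below is a hedged illustration under assumed names (scaled_horizon, collect_episode, PHASES), not the paper's actual implementation:

```python
# Minimal sketch of a ScalingInter-RL-style horizon schedule, assuming a
# simple staged schedule. The interaction budget starts small (favoring
# exploitation on short rollouts) and is raised in stages so the agent can
# explore longer horizons later in training.

def scaled_horizon(step: int, phases: list[tuple[int, int]]) -> int:
    """Return the interaction budget (max turns) in effect at `step`.

    `phases` is a list of (start_step, max_turns) pairs sorted by
    start_step; the last pair whose start_step has been reached applies.
    """
    budget = phases[0][1]
    for start, max_turns in phases:
        if step >= start:
            budget = max_turns
    return budget


def collect_episode(agent, env, task_id: str, max_turns: int) -> list:
    """Roll out one multi-turn episode, truncated at the current budget."""
    obs = env.reset(task_id)
    trajectory = []
    for _ in range(max_turns):
        action = agent.act(obs)            # agent proposes the next action
        obs, reward, done = env.step(action)
        trajectory.append((action, obs, reward))
        if done:                           # task solved or failed early
            break
    return trajectory


# Example: exploit on short 5-turn rollouts first, then widen the horizon.
PHASES = [(0, 5), (2_000, 10), (5_000, 20)]
# for step in range(8_000):
#     traj = collect_episode(agent, env, task, scaled_horizon(step, PHASES))
#     update_policy(traj)  # any mainstream RL update, e.g., PPO; omitted here
```

A staged budget is just one simple way to realize the schedule; the paper may anneal the horizon differently, but the exploitation-to-exploration progression is the same idea.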