AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning

September 10, 2025
Authors: Zhiheng Xi, Jixuan Huang, Chenyang Liao, Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye, Jiazheng Zhang, Wenxiang Chen, Wei He, Yiwen Ding, Guanyu Li, Zehui Chen, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang
cs.AI

Abstract

Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier. As in human cognitive development, these agents are expected to acquire knowledge and skills through exploration and interaction with the environment. Despite advances, the community still lacks a unified, interactive reinforcement learning (RL) framework that can effectively train such agents from scratch -- without relying on supervised fine-tuning (SFT) -- across diverse and realistic environments. To bridge this gap, we introduce AgentGym-RL, a new framework for training LLM agents in multi-turn interactive decision-making through RL. The framework features a modular and decoupled architecture, ensuring high flexibility and extensibility. It encompasses a wide variety of real-world scenarios and supports mainstream RL algorithms. Furthermore, we propose ScalingInter-RL, a training approach designed for exploration-exploitation balance and stable RL optimization. In early stages, it emphasizes exploitation by restricting the number of interactions, and gradually shifts towards exploration with larger horizons to encourage diverse problem-solving strategies. In this way, the agent develops more diverse behaviors and is less prone to collapse over long horizons. We perform extensive experiments to validate the stability and effectiveness of both the AgentGym-RL framework and the ScalingInter-RL approach. Our agents match or surpass commercial models on 27 tasks across diverse environments. We offer key insights and will open-source the complete AgentGym-RL framework -- including code and datasets -- to empower the research community in developing the next generation of intelligent agents.
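
The ScalingInter-RL schedule described in the abstract can be pictured as a cap on agent-environment interaction turns that starts small (favoring exploitation) and grows stagewise toward a longer horizon (favoring exploration). Below is a minimal, hypothetical Python sketch of that idea; the function name, stage length, and cap values are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a ScalingInter-style interaction schedule.
# All parameter names and values below are illustrative assumptions.

def interaction_cap(step: int, stage_len: int = 1000,
                    start_cap: int = 5, cap_growth: int = 5,
                    max_cap: int = 30) -> int:
    """Max environment turns allowed per episode at training step `step`.

    Early stages keep the horizon short (exploitation); each new stage
    raises the cap until it saturates at `max_cap` (exploration).
    """
    stage = step // stage_len
    return min(start_cap + cap_growth * stage, max_cap)


# Example: the allowed horizon grows stage by stage until it saturates.
for step in (0, 1000, 2000, 5000, 10000):
    print(step, interaction_cap(step))  # -> 5, 10, 15, 30, 30
```

During rollout collection, each episode would then be truncated at `interaction_cap(step)` turns, so early training trajectories stay short and stable while later ones permit longer, more exploratory behavior.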