AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning

Abstract

Developing autonomous LLM agents that can make a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier. As in human cognitive development, agents are expected to acquire knowledge and skills through exploration and interaction with the environment. Despite recent advances, the community still lacks a unified, interactive reinforcement learning (RL) framework that can effectively train such agents from scratch, without relying on supervised fine-tuning (SFT), across diverse and realistic environments. To bridge this gap, we introduce AgentGym-RL, a new framework for training LLM agents for multi-turn interactive decision-making through RL. The framework features a modular, decoupled architecture that ensures high flexibility and extensibility. It covers a wide variety of real-world scenarios and supports mainstream RL algorithms. Furthermore, we propose ScalingInter-RL, a training approach designed to balance exploration and exploitation and to stabilize RL optimization. In the early stages it emphasizes exploitation by restricting the number of interactions; it then gradually shifts towards exploration with larger horizons to encourage diverse problem-solving strategies. In this way, the agent develops more diverse behaviors and is less prone to collapse under long horizons. We perform extensive experiments to validate the stability and effectiveness of both the AgentGym-RL framework and the ScalingInter-RL approach. Our agents match or surpass commercial models on 27 tasks across diverse environments. We offer key insights and will open-source the complete AgentGym-RL framework, including code and datasets, to empower the research community in developing the next generation of intelligent agents.
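The horizon schedule behind ScalingInter-RL can be illustrated with a minimal sketch: episodes are capped at a small number of turns early in training, and the cap grows in stages. This is not the paper's implementation. The function names (`max_turns_for_step`, `rollout`), the step-wise schedule shape, and every numeric default below are hypothetical choices made only for illustration, since the abstract does not specify them.

```python
from typing import Callable, List, Tuple


def max_turns_for_step(
    step: int,
    start_turns: int = 5,        # hypothetical initial interaction budget
    turns_increment: int = 5,    # hypothetical growth per stage
    steps_per_stage: int = 100,  # hypothetical stage length in training steps
    turn_cap: int = 30,          # hypothetical upper bound on the horizon
) -> int:
    """Return the maximum number of agent-environment turns allowed at `step`."""
    if step < 0:
        raise ValueError("step must be non-negative")
    stage = step // steps_per_stage
    return min(start_turns + stage * turns_increment, turn_cap)


def rollout(
    act: Callable[[str], str],
    env_step: Callable[[str], Tuple[str, float, bool]],
    first_observation: str,
    max_turns: int,
) -> List[Tuple[str, str, float]]:
    """Collect one episode, truncating it once the turn budget is spent."""
    trajectory = []
    observation = first_observation
    for _ in range(max_turns):
        action = act(observation)
        next_observation, reward, done = env_step(action)
        trajectory.append((observation, action, reward))
        observation = next_observation
        if done:
            break
    return trajectory


if __name__ == "__main__":
    # Early steps get a short horizon; later steps get a longer one.
    for step in (0, 99, 100, 250, 1000):
        print(step, max_turns_for_step(step))
```

In a training loop, the budget returned by `max_turns_for_step` would be passed to `rollout` for each collected episode, so that short early episodes favor exploiting what already works while later, longer episodes leave room for exploration.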
