AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning

Abstract

Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier. Like human cognitive development, agents are expected to acquire knowledge and skills through exploration and interaction with the environment. Despite advances, the community still lacks a unified, interactive reinforcement learning (RL) framework that can effectively train such agents from scratch, without relying on supervised fine-tuning (SFT), across diverse and realistic environments. To bridge this gap, we introduce AgentGym-RL, a new framework to train LLM agents for multi-turn interactive decision-making through RL. The framework features a modular and decoupled architecture, ensuring high flexibility and extensibility. It encompasses a wide variety of real-world scenarios and supports mainstream RL algorithms. Furthermore, we propose ScalingInter-RL, a training approach designed for exploration-exploitation balance and stable RL optimization. In early stages, it emphasizes exploitation by restricting the number of interactions, and gradually shifts towards exploration with larger horizons to encourage diverse problem-solving strategies. In this way, the agent develops more diverse behaviors and is less prone to collapse under long horizons. We perform extensive experiments to validate the stability and effectiveness of both the AgentGym-RL framework and the ScalingInter-RL approach. Our agents match or surpass commercial models on 27 tasks across diverse environments. We offer key insights and will open-source the complete AgentGym-RL framework, including code and datasets, to empower the research community in developing the next generation of intelligent agents.
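The ScalingInter-RL idea described in the abstract (a small interaction budget early in training that grows toward longer horizons) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `HorizonSchedule`, the staged linear increase, and every numeric default are hypothetical choices made only for illustration, since the abstract does not specify the schedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HorizonSchedule:
    """Hypothetical staged schedule for the per-episode interaction cap.

    All defaults are illustrative assumptions, not values from the paper.
    """
    initial_turns: int = 5      # assumed cap during the first stage
    turns_per_stage: int = 5    # assumed increase at each stage boundary
    steps_per_stage: int = 100  # assumed number of training steps per stage
    max_turns: int = 30         # assumed upper bound on the cap

    def cap_at(self, training_step: int) -> int:
        """Return the maximum number of agent-environment turns at this step."""
        if training_step < 0:
            raise ValueError("training_step must be non-negative")
        stage = training_step // self.steps_per_stage
        return min(self.initial_turns + stage * self.turns_per_stage,
                   self.max_turns)


def truncate_episode(turns: list, cap: int) -> list:
    """Keep only the first `cap` turns of a rollout (short horizon early on)."""
    return turns[:cap]


schedule = HorizonSchedule()
# Caps at a few training steps: short horizons first, longer ones later.
caps = [schedule.cap_at(step) for step in (0, 99, 100, 250, 10_000)]
```

Under these assumed defaults, rollouts early in training are cut off after a few turns, and the cap grows stage by stage until it reaches the assumed maximum. A smoother or performance-triggered schedule would fit the abstract's description equally well.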

