RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

Abstract

Training large language models (LLMs) as interactive agents presents unique challenges, including long-horizon decision making and interaction with stochastic environment feedback. While reinforcement learning (RL) has enabled progress on static tasks, multi-turn agent RL training remains underexplored. We propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents. Our study of three stylized environments reveals three core findings. First, agent RL training shows a recurring Echo Trap mode, characterized by reward variance cliffs and gradient spikes; we address this with StarPO-S, a stabilized variant with trajectory filtering, critic incorporation, and decoupled clipping. Second, the shaping of RL rollouts benefits from diverse initial states, medium interaction granularity, and more frequent sampling. Third, without fine-grained, reasoning-aware reward signals, agent reasoning hardly emerges through multi-turn RL, and agents may instead show shallow strategies or hallucinated thoughts. Code and environments are available at https://github.com/RAGEN-AI/RAGEN.

