RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning
April 24, 2025
Authors: Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Monica Lam, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li
cs.AI
Abstract
Training large language models (LLMs) as interactive agents presents unique challenges, including long-horizon decision making and interaction with stochastic environment feedback. While reinforcement learning (RL) has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents. Our study on three stylized environments reveals three core findings. First, agent RL training exhibits a recurring failure mode, the Echo Trap, in which reward variance collapses and gradients spike; we address this with StarPO-S, a stabilized variant that adds trajectory filtering, critic incorporation, and decoupled clipping. Second, we find that RL rollouts benefit from diverse initial states, medium interaction granularity, and more frequent sampling. Third, we show that without fine-grained, reasoning-aware reward signals, agent reasoning rarely emerges through multi-turn RL, and agents may instead exhibit shallow strategies or hallucinated thoughts. Code and environments are available at https://github.com/RAGEN-AI/RAGEN.
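
To make the trajectory-filtering idea concrete, below is a minimal, hypothetical sketch of uncertainty-based rollout filtering for multi-turn agent RL: rollouts are grouped by the prompt (initial state) they were sampled from, and only prompts with high reward variance are kept for the policy update. The `Trajectory` class, `filter_by_reward_variance` function, and `keep_ratio` parameter are illustrative assumptions for this sketch, not names from the RAGEN codebase or the paper's exact StarPO-S procedure.

```python
# Illustrative sketch only: one plausible form of the trajectory filtering
# mentioned in the abstract. All names here are hypothetical, not RAGEN APIs.
from dataclasses import dataclass
from statistics import pvariance
from typing import Dict, List


@dataclass
class Trajectory:
    prompt_id: str       # initial state / task instance the rollout started from
    total_reward: float  # trajectory-level return accumulated over all turns


def filter_by_reward_variance(
    trajectories: List[Trajectory],
    keep_ratio: float = 0.25,
) -> List[Trajectory]:
    """Keep rollouts only from prompts whose reward variance is highest.

    Low-variance prompts carry little learning signal (all rollouts succeed or
    all fail), so dropping them is one way to counteract the reward-variance
    collapse ("Echo Trap") described in the abstract.
    """
    # Group trajectory returns by the prompt they were sampled from.
    by_prompt: Dict[str, List[float]] = {}
    for t in trajectories:
        by_prompt.setdefault(t.prompt_id, []).append(t.total_reward)

    # Rank prompts by the variance of their returns and keep the top fraction.
    ranked = sorted(by_prompt, key=lambda p: pvariance(by_prompt[p]), reverse=True)
    kept = set(ranked[: max(1, int(len(ranked) * keep_ratio))])
    return [t for t in trajectories if t.prompt_id in kept]
```

In a StarPO-S-style loop, a filter of this kind would run on each freshly sampled batch of rollouts before the policy-gradient update, so that gradients are estimated only from task instances the current policy is still uncertain about.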