RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning
April 24, 2025
Authors: Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Monica Lam, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li
cs.AI
Abstract
Training large language models (LLMs) as interactive agents presents unique challenges, including long-horizon decision making and interaction with stochastic environment feedback. While reinforcement learning (RL) has enabled progress in static tasks, multi-turn agent RL training remains underexplored. We propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents. Our study on three stylized environments reveals three core findings. First, agent RL training exhibits a recurring failure mode, the Echo Trap, in which reward variance collapses and gradients spike; we address this with StarPO-S, a stabilized variant that adds trajectory filtering, critic incorporation, and decoupled clipping. Second, we find that RL rollouts benefit from diverse initial states, medium interaction granularity, and more frequent sampling. Third, we show that without fine-grained, reasoning-aware reward signals, agent reasoning rarely emerges through multi-turn RL, and agents may instead exhibit shallow strategies or hallucinated thoughts. Code and environments are available at https://github.com/RAGEN-AI/RAGEN.
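
To make the trajectory-filtering idea concrete, below is a minimal, hypothetical sketch of uncertainty-based rollout filtering for multi-turn agent RL: rollouts are grouped by the prompt (initial state) they were sampled from, and only prompts with high reward variance are kept for the policy update. The `Trajectory` class, `filter_by_reward_variance` function, and `keep_ratio` parameter are illustrative assumptions for this sketch, not names from the RAGEN codebase or the paper's exact StarPO-S procedure.

```python
# Illustrative sketch only: one plausible form of the trajectory filtering
# mentioned in the abstract. All names here are hypothetical, not RAGEN APIs.
from dataclasses import dataclass
from statistics import pvariance
from typing import Dict, List


@dataclass
class Trajectory:
    prompt_id: str       # initial state / task instance the rollout started from
    total_reward: float  # trajectory-level return accumulated over all turns


def filter_by_reward_variance(
    trajectories: List[Trajectory],
    keep_ratio: float = 0.25,
) -> List[Trajectory]:
    """Keep rollouts only from prompts whose reward variance is highest.

    Low-variance prompts carry little learning signal (all rollouts succeed or
    all fail), so dropping them is one way to counteract the reward-variance
    collapse ("Echo Trap") described in the abstract.
    """
    # Group trajectory returns by the prompt they were sampled from.
    by_prompt: Dict[str, List[float]] = {}
    for t in trajectories:
        by_prompt.setdefault(t.prompt_id, []).append(t.total_reward)

    # Rank prompts by the variance of their returns and keep the top fraction.
    ranked = sorted(by_prompt, key=lambda p: pvariance(by_prompt[p]), reverse=True)
    kept = set(ranked[: max(1, int(len(ranked) * keep_ratio))])
    return [t for t in trajectories if t.prompt_id in kept]
```

In a StarPO-S-style loop, a filter of this kind would run on each freshly sampled batch of rollouts before the policy-gradient update, so that gradients are estimated only from task instances the current policy is still uncertain about.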