RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning
April 24, 2025
Authors: Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Monica Lam, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, Manling Li
cs.AI
Abstract
Training large language models (LLMs) as interactive agents presents unique challenges, including long-horizon decision making and interaction with stochastic environment feedback. While reinforcement learning (RL) has enabled progress on static tasks, multi-turn agent RL training remains underexplored. We propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents. Our study on three stylized environments reveals three core findings. First, agent RL training exhibits a recurring failure mode, the Echo Trap, in which reward variance collapses and gradients spike; we address this with StarPO-S, a stabilized variant with trajectory filtering, critic incorporation, and decoupled clipping. Second, we find that RL rollouts benefit from diverse initial states, medium interaction granularity, and more frequent sampling. Third, we show that without fine-grained, reasoning-aware reward signals, agent reasoning hardly emerges through multi-turn RL, and agents may exhibit shallow strategies or hallucinated thoughts. Code and environments are available at https://github.com/RAGEN-AI/RAGEN.
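
To make two of the StarPO-S stabilizers named above concrete, the sketch below illustrates (a) trajectory filtering, keeping only prompts whose rollout groups show high reward variance, and (b) decoupled clipping, a PPO-style surrogate with separate lower and upper clip ranges. This is a minimal illustration only, not the authors' implementation: the function names, the keep_ratio and clip thresholds, and the tensor shapes are all assumed for the example.

# Minimal sketch (assumed, not the RAGEN codebase) of trajectory filtering
# and decoupled clipping as described in the abstract.
import torch

def filter_trajectories(rewards, keep_ratio=0.25):
    # rewards: [num_prompts, rollouts_per_prompt] total reward per trajectory.
    # Keep the prompts whose rollout groups have the highest reward variance;
    # near-zero-variance groups contribute little learning signal.
    variance = rewards.var(dim=1)
    k = max(1, int(keep_ratio * rewards.size(0)))
    return torch.topk(variance, k).indices

def decoupled_clip_loss(logp_new, logp_old, advantages,
                        clip_low=0.2, clip_high=0.28):
    # PPO-style clipped surrogate with asymmetric clip ranges (values are
    # illustrative): a looser upper bound lets high-advantage tokens push harder.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

if __name__ == "__main__":
    rewards = torch.randn(8, 4)  # 8 prompts x 4 rollouts each
    print("kept prompts:", filter_trajectories(rewards).tolist())
    logp_new, logp_old, adv = torch.randn(32), torch.randn(32), torch.randn(32)
    print("loss:", decoupled_clip_loss(logp_new, logp_old, adv).item())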