RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

Abstract

Training large language models (LLMs) as interactive agents presents unique challenges, including long-horizon decision making and interaction with stochastic environment feedback. While reinforcement learning (RL) has enabled progress on static tasks, multi-turn agent RL training remains underexplored. We propose StarPO (State-Thinking-Actions-Reward Policy Optimization), a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents. Our study on three stylized environments reveals three core findings. First, agent RL training shows a recurring failure mode, the Echo Trap, marked by reward variance cliffs and gradient spikes; we address this with StarPO-S, a stabilized variant with trajectory filtering, critic incorporation, and decoupled clipping. Second, the shaping of RL rollouts benefits from diverse initial states, medium interaction granularity, and more frequent sampling. Third, without fine-grained, reasoning-aware reward signals, agent reasoning hardly emerges through multi-turn RL, and agents may exhibit shallow strategies or hallucinated thoughts. Code and environments are available at https://github.com/RAGEN-AI/RAGEN.
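
The abstract names trajectory filtering and decoupled clipping without defining them. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that filtering keeps the prompts whose rollout rewards vary most, and that decoupled clipping means separate lower and upper bounds on a PPO-style probability ratio. The function names, `keep_fraction`, `eps_low`, and `eps_high` are hypothetical and do not come from the RAGEN codebase.

```python
import math
import statistics


def filter_by_reward_variance(groups, keep_fraction=0.25):
    """Return the prompt ids whose rollout rewards vary the most.

    `groups` maps a prompt id to the list of rewards from its rollouts.
    The ranking rule and `keep_fraction` are illustrative assumptions.
    """
    # Rank prompts by the population standard deviation of their rewards.
    ranked = sorted(
        groups,
        key=lambda pid: statistics.pstdev(groups[pid]),
        reverse=True,
    )
    if not ranked:
        return []
    # Always keep at least one prompt when any are available.
    n_keep = max(1, math.ceil(len(ranked) * keep_fraction))
    return ranked[:n_keep]


def decoupled_clip_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped surrogate with separate lower and upper bounds.

    `ratio` is the new-to-old policy probability ratio for one token or action.
    The default bounds are assumed values chosen for illustration.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic choice between the unclipped and clipped terms.
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    rewards = {
        "a": [0, 0, 0, 0],  # always fails: no learning signal
        "b": [0, 1, 0, 1],  # mixed outcomes: highest variance
        "c": [1, 1, 1, 1],  # always succeeds: no learning signal
        "d": [0, 0, 0, 1],
    }
    print(filter_by_reward_variance(rewards, keep_fraction=0.5))
    print(decoupled_clip_objective(1.5, 1.0))
```

Under these assumptions, prompts on which every rollout receives the same reward are dropped first, since they give no signal for telling good trajectories from bad ones. The looser upper bound lets actions with positive advantage gain more probability mass than a symmetric clip would allow.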
