EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

Abstract

Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30 or more turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. The cascade begins with premature policy convergence in the early stage, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Agents then enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis shows that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
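
The abstract names three mechanisms (an entropy term, a smoothing regularizer tied to historical entropy, and phase-based weighting) without giving formulas. The sketch below is one possible reading of how they might combine into a single loss, and it is not the paper's formulation. Every function name, the band ratios `low`/`high`, the linear schedule in `phase_weight`, and the coefficient `ent_coef` are hypothetical choices made only for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def smoothing_penalty(entropy, history, low=0.8, high=1.2):
    """Distance by which `entropy` leaves a band around the historical mean.

    Assumption: the band is [low * mean, high * mean]; the abstract only says
    entropy is bounded "within historical averages".
    """
    if not history:
        return 0.0
    mean = sum(history) / len(history)
    lower, upper = low * mean, high * mean
    if entropy < lower:
        return lower - entropy
    if entropy > upper:
        return entropy - upper
    return 0.0


def phase_weight(step, total_steps, start=0.5, end=2.0):
    """Hypothetical linear schedule for the smoothing weight over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def regularized_loss(policy_loss, entropy, history, step, total_steps,
                     ent_coef=0.01):
    """Policy loss minus an entropy bonus, plus a weighted smoothing penalty."""
    penalty = smoothing_penalty(entropy, history)
    beta = phase_weight(step, total_steps)
    return policy_loss - ent_coef * (entropy - beta * penalty)
```

In this reading, the penalty is zero while the current entropy stays near its running average, so the plain entropy bonus drives exploration; once entropy drifts outside the band, the penalty grows, and the schedule makes that penalty count more as training progresses.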

