EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

September 26, 2025
Authors: Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, Dimitris Metaxas
cs.AI

Abstract

Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
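The three mechanisms described in the abstract can be pictured as terms in a single training loss: an entropy bonus for exploration, a smoothing penalty that keeps policy entropy close to its historical average, and a phase-dependent weight that shifts from exploration to exploitation. The following is a minimal sketch under those assumptions; the function names, the linear schedule, and the squared-deviation smoothing term are illustrative choices, not the paper's actual formulation.

```python
# Illustrative sketch of an entropy-regularized loss with smoothing and
# phase-based weighting. Names (epo_loss, alpha_schedule, beta) are
# hypothetical; they are not taken from the paper or its code.
import torch


def alpha_schedule(step: int, total_steps: int,
                   alpha_early: float = 0.02, alpha_late: float = 0.002) -> float:
    """Adaptive phase-based weight: stronger entropy bonus early in training,
    weaker late, interpolated linearly as one possible schedule."""
    progress = step / max(total_steps, 1)
    return alpha_early * (1.0 - progress) + alpha_late * progress


def epo_loss(policy_loss: torch.Tensor,
             entropy: torch.Tensor,
             entropy_history: list,
             step: int, total_steps: int,
             beta: float = 0.1) -> torch.Tensor:
    """Combine a standard policy-gradient loss with (1) an entropy bonus,
    (2) a smoothing penalty tying current entropy to its running average,
    and (3) a phase-dependent entropy weight."""
    alpha = alpha_schedule(step, total_steps)

    # (2) Entropy smoothing: penalize deviation from the historical average
    # to prevent abrupt entropy fluctuations.
    if entropy_history:
        hist_mean = sum(entropy_history) / len(entropy_history)
        smoothing = (entropy - hist_mean) ** 2
    else:
        smoothing = torch.zeros_like(entropy)

    # (1) + (3): entropy bonus with adaptive weight, plus smoothing penalty.
    return policy_loss - alpha * entropy + beta * smoothing
```

In a training loop, one would append each step's measured policy entropy to `entropy_history` and minimize `epo_loss` in place of the plain policy-gradient loss; the relative weights `alpha` and `beta` control how aggressively exploration is encouraged versus stabilized.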