EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

September 26, 2025
Authors: Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, Dimitris Metaxas
cs.AI

Abstract

Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
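The three mechanisms described in the abstract can be pictured as terms in a single training loss: an entropy bonus for exploration, a smoothing penalty that keeps policy entropy close to its historical average, and a phase-dependent weight that shifts from exploration to exploitation. The following is a minimal sketch under those assumptions; the function names, the linear schedule, and the squared-deviation smoothing term are illustrative choices, not the paper's actual formulation.

```python
# Illustrative sketch of an entropy-regularized loss with smoothing and
# phase-based weighting. Names (epo_loss, alpha_schedule, beta) are
# hypothetical; they are not taken from the paper or its code.
import torch


def alpha_schedule(step: int, total_steps: int,
                   alpha_early: float = 0.02, alpha_late: float = 0.002) -> float:
    """Adaptive phase-based weight: stronger entropy bonus early in training,
    weaker late, interpolated linearly as one possible schedule."""
    progress = step / max(total_steps, 1)
    return alpha_early * (1.0 - progress) + alpha_late * progress


def epo_loss(policy_loss: torch.Tensor,
             entropy: torch.Tensor,
             entropy_history: list,
             step: int, total_steps: int,
             beta: float = 0.1) -> torch.Tensor:
    """Combine a standard policy-gradient loss with (1) an entropy bonus,
    (2) a smoothing penalty tying current entropy to its running average,
    and (3) a phase-dependent entropy weight."""
    alpha = alpha_schedule(step, total_steps)

    # (2) Entropy smoothing: penalize deviation from the historical average
    # to prevent abrupt entropy fluctuations.
    if entropy_history:
        hist_mean = sum(entropy_history) / len(entropy_history)
        smoothing = (entropy - hist_mean) ** 2
    else:
        smoothing = torch.zeros_like(entropy)

    # (1) + (3): entropy bonus with adaptive weight, plus smoothing penalty.
    return policy_loss - alpha * entropy + beta * smoothing
```

In a training loop, one would append each step's measured policy entropy to `entropy_history` and minimize `epo_loss` in place of the plain policy-gradient loss; the relative weights `alpha` and `beta` control how aggressively exploration is encouraged versus stabilized.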