EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
September 26, 2025
Authors: Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, Dimitris Metaxas
cs.AI
Abstract
Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
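The abstract gives no formulas, so the following is a minimal, hypothetical Python sketch of how the described entropy control could be wired into a policy-gradient loss: an entropy bonus scaled by a phase-dependent weight, plus a smoothing penalty that keeps the current policy entropy near its running historical average. The names (EntropySmoother, phase_weight, epo_loss), the window size, the linear weight schedule, and the quadratic penalty are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the entropy control described in the EPO abstract.
# All structure below (window size, schedule, quadratic penalty) is assumed
# for illustration and is not taken from the paper.
from collections import deque

import torch


class EntropySmoother:
    """Penalize deviation of the current policy entropy from its running
    historical average, discouraging both premature collapse and sudden spikes."""

    def __init__(self, window: int = 50):
        self.history = deque(maxlen=window)  # recent per-update mean entropies

    def penalty(self, entropy: torch.Tensor) -> torch.Tensor:
        if not self.history:
            self.history.append(entropy.detach())
            return torch.zeros_like(entropy)
        hist_mean = torch.stack(list(self.history)).mean()
        self.history.append(entropy.detach())
        # Quadratic penalty keeps entropy near its historical average.
        return (entropy - hist_mean) ** 2


def phase_weight(step: int, total_steps: int,
                 w_early: float = 1e-2, w_late: float = 1e-3) -> float:
    """Adaptive phase-based weighting: a stronger entropy bonus early in
    training (against premature convergence), weaker later (against
    destabilizing, chaotic exploration)."""
    frac = step / max(total_steps, 1)
    return w_early * (1.0 - frac) + w_late * frac


def epo_loss(pg_loss: torch.Tensor, entropy: torch.Tensor,
             smoother: EntropySmoother, step: int, total_steps: int,
             smooth_coef: float = 0.1) -> torch.Tensor:
    """Combine the policy-gradient loss with (i) a phase-weighted entropy
    bonus and (ii) the history-anchored smoothing penalty."""
    alpha = phase_weight(step, total_steps)
    return pg_loss - alpha * entropy + smooth_coef * smoother.penalty(entropy)
```

In practice the weight schedule and smoothing coefficient would need to be tuned per environment; the sketch is only meant to show the split between a phase-weighted exploration bonus and a smoothing term anchored to historical entropy.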