EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

September 26, 2025
Authors: Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, Dimitris Metaxas
cs.AI

Abstract

Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
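
The abstract outlines EPO's three mechanisms: a multi-turn entropy bonus, an entropy smoothing regularizer tied to historical averages, and adaptive phase-based weighting. Below is a minimal NumPy sketch of how such an entropy term could be assembled; the function names, the quadratic smoothing penalty, and the linear weight schedule are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Shannon entropy of categorical action distributions (per step)."""
    p = np.clip(probs, eps, 1.0)
    return -np.sum(p * np.log(p), axis=-1)

def epo_entropy_term(step_probs, entropy_history, step, total_steps,
                     alpha_start=0.02, alpha_end=0.001, beta=0.5, window=50):
    """
    Sketch of an EPO-style entropy term:
      * an entropy bonus encourages exploration,
      * a smoothing penalty keeps current entropy near its historical average,
      * the bonus weight is annealed across training phases
        (explore early, exploit late).
    All hyperparameter names and the linear schedule are assumptions.
    """
    h_now = float(np.mean(entropy(step_probs)))   # mean policy entropy this update
    history = entropy_history[-window:]           # recent entropy values
    h_ref = float(np.mean(history)) if history else h_now

    # Phase-based weight: decay the exploration bonus over training.
    frac = step / max(total_steps, 1)
    alpha = alpha_start + frac * (alpha_end - alpha_start)

    bonus = alpha * h_now                          # standard entropy bonus
    smooth_penalty = beta * (h_now - h_ref) ** 2   # discourage abrupt entropy swings

    entropy_history.append(h_now)
    # This term is added to the policy objective (or subtracted from the loss).
    return bonus - smooth_penalty, h_now

# Toy usage: random per-token action distributions over a small action set.
rng = np.random.default_rng(0)
history = []
for step in range(5):
    logits = rng.normal(size=(8, 16))              # 8 "tokens", 16 actions
    probs = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
    term, h = epo_entropy_term(probs, history, step, total_steps=5)
    print(f"step {step}: entropy={h:.3f}, entropy term={term:.4f}")
```

In practice, a term like this would be folded into the policy-gradient objective of the underlying RL algorithm (e.g., PPO), with the smoothing penalty keeping per-update entropy changes gradual rather than abrupt.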