EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning
September 26, 2025
Authors: Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, Dimitris Metaxas
cs.AI
Abstract
Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
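The abstract gives no formulas, so the following is a minimal, hypothetical Python sketch of how the described entropy control could be wired into a policy-gradient loss: an entropy bonus scaled by a phase-dependent weight, plus a smoothing penalty that keeps the current policy entropy near its running historical average. The names (EntropySmoother, phase_weight, epo_loss), the window size, the linear weight schedule, and the quadratic penalty are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the entropy control described in the EPO abstract.
# All structure below (window size, schedule, quadratic penalty) is assumed
# for illustration and is not taken from the paper.
from collections import deque

import torch


class EntropySmoother:
    """Penalize deviation of the current policy entropy from its running
    historical average, discouraging both premature collapse and sudden spikes."""

    def __init__(self, window: int = 50):
        self.history = deque(maxlen=window)  # recent per-update mean entropies

    def penalty(self, entropy: torch.Tensor) -> torch.Tensor:
        if not self.history:
            self.history.append(entropy.detach())
            return torch.zeros_like(entropy)
        hist_mean = torch.stack(list(self.history)).mean()
        self.history.append(entropy.detach())
        # Quadratic penalty keeps entropy near its historical average.
        return (entropy - hist_mean) ** 2


def phase_weight(step: int, total_steps: int,
                 w_early: float = 1e-2, w_late: float = 1e-3) -> float:
    """Adaptive phase-based weighting: a stronger entropy bonus early in
    training (against premature convergence), weaker later (against
    destabilizing, chaotic exploration)."""
    frac = step / max(total_steps, 1)
    return w_early * (1.0 - frac) + w_late * frac


def epo_loss(pg_loss: torch.Tensor, entropy: torch.Tensor,
             smoother: EntropySmoother, step: int, total_steps: int,
             smooth_coef: float = 0.1) -> torch.Tensor:
    """Combine the policy-gradient loss with (i) a phase-weighted entropy
    bonus and (ii) the history-anchored smoothing penalty."""
    alpha = phase_weight(step, total_steps)
    return pg_loss - alpha * entropy + smooth_coef * smoother.penalty(entropy)
```

In practice the weight schedule and smoothing coefficient would need to be tuned per environment; the sketch is only meant to show the split between a phase-weighted exploration bonus and a smoothing term anchored to historical entropy.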