Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents
September 11, 2025
Authors: Jiawei Wang, Jiacai Liu, Yuqian Fu, Yingru Li, Xintao Wang, Yuan Lin, Yu Yue, Lin Zhang, Yang Wang, Ke Wang
cs.AI
Abstract
In long-horizon tasks, agents based on Large Language Models (LLMs) face a
significant challenge: sparse, outcome-based rewards make it difficult to
assign credit to intermediate steps. Previous methods mainly focus on creating
dense reward signals to guide learning, either through traditional
reinforcement learning techniques such as inverse reinforcement learning or by
using Process Reward Models for step-by-step feedback. In this paper, we
identify a fundamental problem in the learning dynamics of LLMs: the magnitude
of the policy gradient is inherently coupled with the entropy, leading to
inefficiently small updates for confident correct actions and potentially
destabilizing large updates for uncertain ones. To resolve this, we propose
Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the
learning signal based on step-wise uncertainty and the final task outcome. EMPG
amplifies updates for confident correct actions, penalizes confident errors,
and attenuates updates from uncertain steps to stabilize exploration. We
further introduce a future-clarity bonus term that encourages agents to find
more predictable solution paths. Through comprehensive experiments on three
challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate
that EMPG achieves substantial performance gains and significantly outperforms
strong policy gradient baselines. The project page is at
https://empgseed-seed.github.io/
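The core idea — scaling each step's learning signal by its confidence relative to the final outcome — can be illustrated with a minimal sketch. This is not the paper's exact formula: the function `empg_weight`, the modulation strength `alpha`, and the exponential weighting are illustrative assumptions chosen only to reproduce the qualitative behavior described in the abstract (amplify confident steps, attenuate uncertain ones, sign set by the task outcome).

```python
import math


def empg_weight(step_entropy: float, mean_entropy: float, outcome: float,
                alpha: float = 1.0) -> float:
    """Hypothetical entropy modulation of a step's advantage.

    Steps with entropy below the batch mean (confident) receive a weight
    with magnitude > |outcome|: amplified credit if the outcome is positive,
    amplified penalty if it is negative. Steps with entropy above the mean
    (uncertain) are attenuated, stabilizing exploration.
    """
    confidence = math.exp(-alpha * (step_entropy - mean_entropy))
    return confidence * outcome


def modulated_advantages(step_entropies, outcome, clarity_bonus=0.0):
    """Re-weight a sparse terminal outcome across steps, plus an optional
    bonus (a stand-in for the future-clarity term)."""
    mean_h = sum(step_entropies) / len(step_entropies)
    return [empg_weight(h, mean_h, outcome) + clarity_bonus
            for h in step_entropies]


# A trajectory with one confident, one average, and one uncertain step.
weights = modulated_advantages([0.1, 0.5, 0.9], outcome=1.0)
```

With a positive outcome, the confident step (entropy 0.1) gets a weight above 1 while the uncertain step (entropy 0.9) is scaled below 1; flipping the outcome to -1 turns the confident step's amplification into a stronger penalty, matching the three behaviors the abstract attributes to EMPG.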