Harnessing Uncertainty: Entropy-Modulated Policy Gradients for Long-Horizon LLM Agents
September 11, 2025
Authors: Jiawei Wang, Jiacai Liu, Yuqian Fu, Yingru Li, Xintao Wang, Yuan Lin, Yu Yue, Lin Zhang, Yang Wang, Ke Wang
cs.AI
Abstract
In long-horizon tasks, agents based on Large Language Models (LLMs) face a significant challenge: sparse, outcome-based rewards make it difficult to assign credit to intermediate steps. Previous methods mainly focus on creating dense reward signals to guide learning, either through traditional reinforcement learning techniques such as inverse reinforcement learning, or by using Process Reward Models for step-by-step feedback. In this paper, we identify a fundamental problem in the learning dynamics of LLMs: the magnitude of the policy gradient is inherently coupled with the entropy, which leads to inefficiently small updates for confident correct actions and potentially destabilizing large updates for uncertain ones. To resolve this, we propose Entropy-Modulated Policy Gradients (EMPG), a framework that re-calibrates the learning signal based on step-wise uncertainty and the final task outcome. EMPG amplifies updates for confident correct actions, penalizes confident errors, and attenuates updates from uncertain steps to stabilize exploration. We further introduce a future-clarity bonus term that encourages agents to find more predictable solution paths. Through comprehensive experiments on three challenging agent tasks, WebShop, ALFWorld, and Deep Search, we demonstrate that EMPG achieves substantial performance gains and significantly outperforms strong policy-gradient baselines. The project page is at https://empgseed-seed.github.io/
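The abstract names the mechanism but not its functional form, so the following is a minimal sketch of one way such a re-calibration could look: per-step advantages derived from the sparse final outcome are re-weighted by a normalized, decreasing function of step entropy, plus a small bonus for low-entropy (predictable) next steps. The function empg_loss, the hyperparameters ALPHA and BETA, and the exponential weighting are illustrative assumptions, not the paper's exact formulation.

```python
import torch

# Illustrative hyperparameters (names and values are assumptions, not the paper's):
ALPHA = 1.0  # strength of the entropy modulation
BETA = 0.1   # weight of the future-clarity bonus

def empg_loss(log_probs, step_entropies, outcome_advantage):
    """Sketch of an entropy-modulated policy-gradient surrogate loss.

    log_probs:         (T,) log-probabilities of the chosen actions
    step_entropies:    (T,) policy entropy H_t at each step
    outcome_advantage: scalar advantage from the sparse final reward
                       (positive on success, negative on failure)
    """
    # Confidence weight, decreasing in entropy; normalized to mean 1 so
    # confident (low-entropy) steps get weight > 1 and uncertain steps < 1.
    confidence = torch.exp(-ALPHA * step_entropies)
    confidence = confidence / confidence.mean()

    # One multiplicative re-weighting realizes all three behaviors named in
    # the abstract: confident + correct -> amplified positive update,
    # confident + wrong -> amplified penalty, uncertain -> attenuated update.
    modulated_adv = confidence * outcome_advantage

    # Future-clarity bonus: favor steps whose *next* step is predictable
    # (low entropy), approximated here by the shifted entropy sequence.
    next_entropy = torch.cat([step_entropies[1:], step_entropies[-1:]])
    clarity_bonus = BETA * torch.exp(-next_entropy)

    # REINFORCE-style surrogate with the re-calibrated, non-differentiable
    # learning signal.
    return -((modulated_adv + clarity_bonus).detach() * log_probs).mean()

# Toy usage: a 5-step successful episode (outcome advantage = +1).
log_probs = torch.randn(5, requires_grad=True)   # stand-in for model outputs
entropies = torch.rand(5)
loss = empg_loss(log_probs, entropies, outcome_advantage=1.0)
loss.backward()
```

Normalizing the weights to mean 1 keeps the overall gradient scale roughly unchanged while redistributing it from uncertain steps to confident ones, which is one way to counteract the entropy-gradient coupling the abstract describes.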