Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization

August 4, 2023
作者: Weiran Yao, Shelby Heinecke, Juan Carlos Niebles, Zhiwei Liu, Yihao Feng, Le Xue, Rithesh Murthy, Zeyuan Chen, Jianguo Zhang, Devansh Arpit, Ran Xu, Phil Mui, Huan Wang, Caiming Xiong, Silvio Savarese
cs.AI

Abstract

Recent months have seen the emergence of a powerful new trend in which large language models (LLMs) are augmented to become autonomous language agents capable of performing objective-oriented multi-step tasks on their own, rather than merely responding to queries from human users. Most existing language agents, however, are not optimized using environment-specific rewards. Although some agents enable iterative refinement through verbal feedback, they do not reason and plan in ways that are compatible with gradient-based learning from rewards. This paper introduces a principled framework for reinforcing large language agents by learning a retrospective model, which automatically tunes the language agent prompts from environment feedback through policy gradient. Specifically, our proposed agent architecture learns from rewards across multiple environments and tasks in order to fine-tune a pre-trained language model that refines the language agent prompt by summarizing the root cause of prior failed attempts and proposing action plans. Experimental results on various tasks demonstrate that the language agents improve over time and that our approach considerably outperforms baselines that do not properly leverage gradients from the environment. This suggests that policy gradient optimization is a promising way to improve language agents, of which we believe our work is one of the first instances, and that it can be applied to optimize other models in the agent architecture to enhance agent performance over time.