Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization
February 26, 2026
Authors: Zeyuan Liu, Jeonghye Kim, Xufang Luo, Dongsheng Li, Yuqing Yang
cs.AI
Abstract
Exploration remains the key bottleneck for large language model agents trained with reinforcement learning. While prior methods exploit pretrained knowledge, they fail in environments requiring the discovery of novel states. We propose Exploratory Memory-Augmented On- and Off-Policy Optimization (EMPO²), a hybrid RL framework that leverages memory for exploration and combines on- and off-policy updates to make LLMs perform well with memory while also ensuring robustness without it. On ScienceWorld and WebShop, EMPO² achieves 128.6% and 11.3% improvements over GRPO, respectively. Moreover, in out-of-distribution tests, EMPO² demonstrates superior adaptability to new tasks, requiring only a few trials with memory and no parameter updates. These results highlight EMPO² as a promising framework for building more exploratory and generalizable LLM-based agents.
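To make the hybrid on-/off-policy idea concrete, the sketch below shows one generic way such an objective can be formed: a PPO-style clipped surrogate applied to a mixed batch of freshly sampled (on-policy) trajectories and replayed memory (off-policy) trajectories, with the off-policy terms importance-weighted by the ratio of current to behavior log-probabilities and down-weighted by a mixing coefficient. The function name, the `off_weight` coefficient, and the toy batch are illustrative assumptions; the abstract does not specify EMPO²'s actual loss, so this is a minimal sketch of the general technique, not the paper's method.

```python
import torch

def hybrid_policy_loss(logp_new, logp_behavior, advantages, is_on_policy,
                       off_weight=0.5, clip_eps=0.2):
    """Illustrative hybrid objective (assumed form, not EMPO²'s exact loss).

    logp_new:      log-probs of actions under the current policy
    logp_behavior: log-probs under the policy that generated the data
                   (current policy for on-policy samples, an older
                   policy for memory/off-policy samples)
    advantages:    per-sample advantage estimates
    is_on_policy:  boolean mask; False marks replayed memory samples
    """
    # Importance ratio; equals 1 for on-policy samples by construction.
    ratio = torch.exp(logp_new - logp_behavior)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Standard clipped surrogate, computed per sample.
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    # Down-weight the off-policy (memory) contribution.
    weights = torch.where(is_on_policy,
                          torch.ones_like(surrogate),
                          torch.full_like(surrogate, off_weight))
    return -(weights * surrogate).mean()

# Toy batch: two on-policy samples and two memory (off-policy) samples.
logp_new      = torch.tensor([-1.0, -0.5, -2.0, -1.5])
logp_behavior = torch.tensor([-1.0, -0.5, -1.0, -2.0])
advantages    = torch.tensor([ 1.0, -0.5,  0.8,  0.3])
is_on_policy  = torch.tensor([True, True, False, False])

loss = hybrid_policy_loss(logp_new, logp_behavior, advantages, is_on_policy)
```

The clipping bounds how far a single update can move the policy on replayed data, which is the usual safeguard when mixing stale off-policy trajectories into an on-policy update.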