

Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

February 26, 2026
Authors: Zeyuan Liu, Jeonghye Kim, Xufang Luo, Dongsheng Li, Yuqing Yang
cs.AI

Abstract

Exploration remains the key bottleneck for large language model agents trained with reinforcement learning. While prior methods exploit pretrained knowledge, they fail in environments requiring the discovery of novel states. We propose Exploratory Memory-Augmented On- and Off-Policy Optimization (EMPO²), a hybrid RL framework that leverages memory for exploration and combines on- and off-policy updates to make LLMs perform well with memory while also ensuring robustness without it. On ScienceWorld and WebShop, EMPO² achieves 128.6% and 11.3% improvements over GRPO, respectively. Moreover, in out-of-distribution tests, EMPO² demonstrates superior adaptability to new tasks, requiring only a few trials with memory and no parameter updates. These results highlight EMPO² as a promising framework for building more exploratory and generalizable LLM-based agents.
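The abstract's core mechanism is mixing on-policy updates (fresh rollouts) with off-policy updates (memory-derived trajectories). The paper does not expose implementation details here, so the following is only a minimal sketch of one common way such a hybrid gradient estimate can be formed: averaging on-policy advantages with importance-weighted advantages from a memory buffer. All names (`mix_coef`, the tuple layouts) are illustrative assumptions, not the authors' actual method.

```python
def hybrid_pg_estimate(on_policy, off_policy, mix_coef=0.5):
    """Toy hybrid policy-gradient estimate (a sketch, not EMPO² itself).

    on_policy:  list of (advantage,) tuples from current-policy rollouts.
    off_policy: list of (advantage, pi_new, pi_old) tuples replayed from
                a memory buffer, where pi_new/pi_old is the probability
                the current vs. behavior policy assigns to the action.
    mix_coef:   weight on the on-policy term (assumed hyperparameter).
    """
    # Plain Monte Carlo average over on-policy samples.
    on_term = sum(a for (a,) in on_policy) / max(len(on_policy), 1)
    # Importance-weight memory samples by the policy ratio pi_new / pi_old
    # to correct for the distribution mismatch of off-policy data.
    off_term = sum(a * (p_new / p_old) for (a, p_new, p_old) in off_policy)
    off_term /= max(len(off_policy), 1)
    return mix_coef * on_term + (1.0 - mix_coef) * off_term
```

With `mix_coef = 1.0` this reduces to a purely on-policy estimate (as in GRPO-style training); with `mix_coef = 0.0` it relies entirely on replayed memory, so the coefficient controls the on/off-policy trade-off the abstract alludes to.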