Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

Abstract

Exploration remains the key bottleneck for large language model (LLM) agents trained with reinforcement learning (RL). Prior methods exploit pretrained knowledge, but they fail in environments that require discovering novel states. We propose Exploratory Memory-Augmented On- and Off-Policy Optimization (EMPO²), a hybrid RL framework that uses memory for exploration and combines on- and off-policy updates so that the LLM performs well with memory while remaining robust without it. On ScienceWorld and WebShop, EMPO² improves over GRPO by 128.6% and 11.3%, respectively. In out-of-distribution tests, EMPO² also adapts better to new tasks, requiring only a few trials with memory and no parameter updates. These results highlight EMPO² as a promising framework for building more exploratory and generalizable LLM-based agents.
