Exploratory Memory-Augmented LLM Agent via Hybrid On- and Off-Policy Optimization

Abstract

Exploration remains the key bottleneck for large language model agents trained with reinforcement learning. While prior methods exploit pretrained knowledge, they fail in environments that require discovering novel states. We propose Exploratory Memory-Augmented On- and Off-Policy Optimization (EMPO²), a hybrid RL framework that uses memory for exploration and combines on- and off-policy updates so that the LLM performs well with memory while remaining robust without it. On ScienceWorld and WebShop, EMPO² achieves 128.6% and 11.3% improvements over GRPO, respectively. Moreover, in out-of-distribution tests, EMPO² adapts better to new tasks, requiring only a few trials with memory and no parameter updates. These results highlight EMPO² as a promising framework for building more exploratory and generalizable LLM-based agents.
