PokeRL: Reinforcement Learning for Pokemon Red

Abstract

Pokemon Red is a long-horizon JRPG whose sparse rewards, partial observability, and quirky control mechanics make it a challenging benchmark for reinforcement learning. Recent work has shown that PPO agents can clear the first two gyms using heavy reward shaping and engineered observations, but training remains brittle in practice: agents often degenerate into action loops, menu spam, or unproductive wandering. In this paper, we present PokeRL, a modular system that trains deep reinforcement learning agents to complete early-game tasks in Pokemon Red, including exiting the player's house, exploring Pallet Town to reach tall grass, and winning the first rival battle. Our main contributions are a loop-aware environment wrapper around the PyBoy emulator with map masking, a multi-layer anti-loop and anti-spam mechanism, and a dense hierarchical reward design. We argue that practical systems like PokeRL, which explicitly model failure modes such as loops and spam, are a necessary intermediate step between toy benchmarks and full Pokemon League champion agents. Code is available at https://github.com/reddheeraj/PokemonRL.
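
To make the idea of penalizing loops and button spam concrete, the following is a minimal sketch of one way such a shaping term could look. It is not taken from the PokeRL code base: the class name LoopSpamPenalty, the sliding-window visit count, the same-action streak counter, and all thresholds and penalty values are assumptions chosen for illustration. The state passed in is assumed to be any hashable summary of the agent's situation, for example a (map_id, x, y) tuple.

```python
from collections import Counter, deque
from typing import Deque, Hashable, Optional


class LoopSpamPenalty:
    """Hypothetical shaping term (illustrative only, not the PokeRL implementation)."""

    def __init__(self, window: int = 32, max_visits: int = 4,
                 max_repeat: int = 6, penalty: float = 0.1) -> None:
        self.window = window          # number of recent steps remembered
        self.max_visits = max_visits  # visits to one state tolerated inside the window
        self.max_repeat = max_repeat  # identical consecutive actions tolerated
        self.penalty = penalty        # magnitude subtracted per violation
        self.recent: Deque[Hashable] = deque()
        self.counts: Counter = Counter()
        self.last_action: Optional[int] = None
        self.streak = 0

    def step(self, state: Hashable, action: int) -> float:
        """Return a shaping reward (zero or negative) for this step."""
        # Slide the window forward and keep per-state visit counts in sync.
        self.recent.append(state)
        self.counts[state] += 1
        if len(self.recent) > self.window:
            old = self.recent.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

        # Track how many times in a row the same action has been issued.
        if action == self.last_action:
            self.streak += 1
        else:
            self.last_action = action
            self.streak = 1

        reward = 0.0
        if self.counts[state] > self.max_visits:  # revisiting: likely a loop
            reward -= self.penalty
        if self.streak > self.max_repeat:         # same button held: likely spam
            reward -= self.penalty
        return reward


if __name__ == "__main__":
    shaper = LoopSpamPenalty(window=8, max_visits=2, max_repeat=3, penalty=0.5)
    # The agent paces between two tiles, alternating two movement actions.
    trace = [((0, 1, 1), 0), ((0, 1, 2), 1)] * 3
    print([shaper.step(s, a) for s, a in trace])
```

In an environment wrapper, a term like this would typically be added to the task reward at every step. Keeping the two checks separate means an agent that mashes one button while the game state keeps changing is only charged for spam, and an agent that paces between two tiles with varied inputs is only charged for looping.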
