PokeRL: Reinforcement Learning for Pokemon Red

Abstract

Pokemon Red is a long-horizon JRPG with sparse rewards, partial observability, and quirky control mechanics that make it a challenging benchmark for reinforcement learning. While recent work has shown that PPO agents can clear the first two gyms using heavy reward shaping and engineered observations, training remains brittle in practice, with agents often degenerating into action loops, menu spam, or unproductive wandering. In this paper, we present PokeRL, a modular system that trains deep reinforcement learning agents to complete early-game tasks in Pokemon Red, including exiting the player's house, exploring Pallet Town to reach tall grass, and winning the first rival battle. Our main contributions are a loop-aware environment wrapper around the PyBoy emulator with map masking, a multi-layer anti-loop and anti-spam mechanism, and a dense hierarchical reward design. We argue that practical systems like PokeRL, which explicitly model failure modes such as loops and spam, are a necessary intermediate step between toy benchmarks and full Pokemon League champion agents. Code is available at https://github.com/reddheeraj/PokemonRL
