PokeRL: Reinforcement Learning for Pokemon Red
April 12, 2026
Authors: Dheeraj Mudireddy, Sai Patibandla
cs.AI
Abstract
Pokemon Red is a long-horizon JRPG with sparse rewards, partial observability, and quirky control mechanics that make it a challenging benchmark for reinforcement learning. While recent work has shown that PPO agents can clear the first two gyms using heavy reward shaping and engineered observations, training remains brittle in practice, with agents often degenerating into action loops, menu spam, or unproductive wandering. In this paper, we present PokeRL, a modular system that trains deep reinforcement learning agents to complete early game tasks in Pokemon Red, including exiting the player's house, exploring Pallet Town to reach tall grass, and winning the first rival battle. Our main contributions are a loop-aware environment wrapper around the PyBoy emulator with map masking, a multi-layer anti-loop and anti-spam mechanism, and a dense hierarchical reward design. We argue that practical systems like PokeRL, which explicitly model failure modes such as loops and spam, are a necessary intermediate step between toy benchmarks and full Pokemon League champion agents. Code is available at https://github.com/reddheeraj/PokemonRL
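The abstract does not specify how the anti-loop mechanism is implemented, but one common approach it suggests is penalizing states the agent revisits too often within a short window. The sketch below is a minimal, hypothetical illustration of that idea (the class name `LoopPenalty` and all parameters are assumptions, not PokeRL's actual API): recent state hashes are kept in a sliding window, and a small negative reward is returned once a hash repeats beyond a threshold.

```python
from collections import deque


class LoopPenalty:
    """Toy sketch of an anti-loop reward term (hypothetical; not PokeRL's
    actual mechanism). Penalizes revisiting the same state hash too often
    within a sliding window of recent steps."""

    def __init__(self, window=32, max_repeats=4, penalty=-0.1):
        self.recent = deque(maxlen=window)   # last `window` state hashes
        self.max_repeats = max_repeats       # allowed repeats before penalty
        self.penalty = penalty               # negative reward on a loop

    def update(self, state_hash):
        """Record a state hash; return a penalty if it repeats too often."""
        self.recent.append(state_hash)
        if self.recent.count(state_hash) > self.max_repeats:
            return self.penalty
        return 0.0


# Example: an agent pacing between two tiles soon triggers the penalty.
lp = LoopPenalty(window=8, max_repeats=3)
rewards = [lp.update(h) for h in ["a", "b", "a", "b", "a", "b", "a", "b"]]
```

In a full system, `state_hash` might be derived from the player's map ID and coordinates read from emulator RAM, and the same window idea extends to penalizing repeated menu-open actions (the "menu spam" failure mode).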