The PokeAgent Challenge: Competitive and Long-Context Learning at Scale
March 16, 2026
Authors: Seth Karten, Jake Grigsby, Tersoo Upaa, Junik Bae, Seonghun Hong, Hyunyoung Jeong, Jaeyoon Jung, Kun Kerdthaisong, Gyungbo Kim, Hyeokgi Kim, Yujin Kim, Eunju Kwon, Dongyu Liu, Patrick Mariglia, Sangyeon Park, Benedikt Schink, Xianwei Shi, Anthony Sistilli, Joseph Twin, Arian Urdu, Matin Urdu, Qiao Wang, Ling Wu, Wenli Zhang, Kunsheng Zhou, Stephanie Milani, Kiran Vodrahalli, Amy Zhang, Fei Fang, Yuke Zhu, Chi Jin
cs.AI
Abstract
We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: a Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and a Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. The Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. The Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons of harness-based LLM approaches. Our NeurIPS 2025 competition validates both the quality of these resources and the research community's interest in Pokemon, with over 100 teams competing across the two tracks; the winning solutions are detailed in the paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. We are transitioning PokeAgent to a living benchmark, with a live leaderboard for Battling and self-contained evaluation for Speedrunning, at https://pokeagentchallenge.com.