The PokeAgent Challenge: Competitive and Long-Context Learning at Scale

Abstract

We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and our Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. Our Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons of harness-based LLM approaches. Our NeurIPS 2025 competition validates both the quality of our resources and the research community's interest in Pokemon, with over 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. We transition to a living benchmark with a live leaderboard for Battling and self-contained evaluation for Speedrunning at https://pokeagentchallenge.com.
