The PokeAgent Challenge: Competitive and Long-Context Learning at Scale

Abstract

We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent addresses this gap at scale through two complementary tracks: the Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and the Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. The Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. The Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons of harness-based LLM approaches. Our NeurIPS 2025 competition validates both the quality of our resources and the research community's interest in Pokemon, with over 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks: it measures capabilities not captured by existing suites, which positions Pokemon as an unsolved benchmark that can drive RL and LLM research forward. The challenge now continues as a living benchmark, with a live leaderboard for Battling and self-contained evaluation for Speedrunning, at https://pokeagentchallenge.com.
