PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models

Abstract

While Vision-Language Models (VLMs) have achieved remarkable progress in static visual understanding, their deployment in complex 3D embodied environments remains severely limited. Existing benchmarks suffer from four critical deficiencies: (1) passive perception tasks circumvent interactive dynamics; (2) simplified 2D environments fail to assess depth perception; (3) privileged state leakage bypasses genuine visual processing; and (4) human evaluation is prohibitively expensive and does not scale. We introduce PokeGym, a visually driven long-horizon benchmark instantiated within Pokemon Legends: Z-A, a visually complex 3D open-world role-playing game. PokeGym enforces strict code-level isolation: agents operate solely on raw RGB observations, while an independent evaluator verifies success via memory scanning. This ensures purely vision-based decision-making and automated, scalable assessment. The benchmark comprises 30 tasks (30 to 220 steps) spanning navigation, interaction, and mixed scenarios, with three instruction granularities (Visual-Guided, Step-Guided, Goal-Only) to systematically deconstruct visual grounding, semantic reasoning, and autonomous exploration capabilities. Our evaluation reveals a key limitation of current VLMs: physical deadlock recovery, rather than high-level planning, constitutes the primary bottleneck, with deadlocks showing a strong negative correlation with task success. Furthermore, we uncover a metacognitive divergence: weaker models predominantly suffer from Unaware Deadlocks (they do not recognize that they are stuck), whereas advanced models exhibit Aware Deadlocks (they recognize that they are stuck yet fail to recover). These findings highlight the need to integrate explicit spatial intuition into VLM architectures. The code and benchmark will be available on GitHub.
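
To make the isolation idea concrete, here is a minimal Python sketch of one way an agent could be restricted to pixels while a separate evaluator reads game memory. It is not PokeGym's implementation: `ToyGame`, `AgentView`, `Evaluator`, `run_episode`, and the one-row "screen" are all hypothetical names invented for illustration, and a toy class stands in for the real game. Python itself cannot enforce this boundary strictly, so a real harness would presumably rely on stronger separation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# An RGB frame as rows of (r, g, b) pixels. Hypothetical representation.
Frame = List[List[Tuple[int, int, int]]]


class ToyGame:
    """Stand-in for the game: holds hidden state the agent must not see."""

    def __init__(self, goal_x: int = 3) -> None:
        self._player_x = 0  # privileged state (would live in game memory)
        self._goal_x = goal_x

    def render(self) -> Frame:
        # One-row "screen": the player's pixel is red, everything else black.
        width = self._goal_x + 1
        return [[(255, 0, 0) if x == self._player_x else (0, 0, 0)
                 for x in range(width)]]

    def press(self, button: str) -> None:
        if button == "right":
            self._player_x = min(self._player_x + 1, self._goal_x)
        elif button == "left":
            self._player_x = max(self._player_x - 1, 0)

    def read_memory(self) -> dict:
        # Only the evaluator is handed this method.
        return {"player_x": self._player_x, "goal_x": self._goal_x}


@dataclass(frozen=True)
class AgentView:
    """The only handle the agent receives: pixels in, button presses out."""
    observe: Callable[[], Frame]
    act: Callable[[str], None]


class Evaluator:
    """Checks success from memory, independently of the agent."""

    def __init__(self, read_memory: Callable[[], dict]) -> None:
        self._read_memory = read_memory

    def succeeded(self) -> bool:
        mem = self._read_memory()
        return mem["player_x"] == mem["goal_x"]


def run_episode(policy: Callable[[Frame], str],
                max_steps: int = 10) -> Tuple[bool, int]:
    """Run one episode; return (success, number of steps taken)."""
    game = ToyGame()
    view = AgentView(observe=game.render, act=game.press)
    evaluator = Evaluator(game.read_memory)
    for step in range(1, max_steps + 1):
        view.act(policy(view.observe()))  # the policy sees the frame only
        if evaluator.succeeded():
            return True, step
    return False, max_steps


def always_right(frame: Frame) -> str:
    """Trivial policy that ignores the frame and always moves right."""
    return "right"
```

Calling `run_episode(always_right)` drives the toy player to the goal, and the success verdict comes from the evaluator's memory read rather than from anything the policy reports. A policy that never moves right would exhaust `max_steps`, a crude analogue of the deadlocks described above.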
