

PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models

April 9, 2026
Authors: Ruizhi Zhang, Ye Huang, Yuangang Pan, Chuanfu Shen, Zhilin Liu, Ting Xie, Wen Li, Lixin Duan
cs.AI

Abstract

While Vision-Language Models (VLMs) have achieved remarkable progress in static visual understanding, their deployment in complex 3D embodied environments remains severely limited. Existing benchmarks suffer from four critical deficiencies: (1) passive perception tasks circumvent interactive dynamics; (2) simplified 2D environments fail to assess depth perception; (3) privileged state leakage bypasses genuine visual processing; and (4) human evaluation is prohibitively expensive and unscalable. We introduce PokeGym, a visually-driven long-horizon benchmark instantiated within Pokémon Legends: Z-A, a visually complex 3D open-world role-playing game. PokeGym enforces strict code-level isolation: agents operate solely on raw RGB observations while an independent evaluator verifies success via memory scanning, ensuring pure vision-based decision-making and automated, scalable assessment. The benchmark comprises 30 tasks (30-220 steps) spanning navigation, interaction, and mixed scenarios, with three instruction granularities (Visual-Guided, Step-Guided, Goal-Only) to systematically deconstruct visual grounding, semantic reasoning, and autonomous exploration capabilities. Our evaluation reveals a key limitation of current VLMs: physical deadlock recovery, rather than high-level planning, constitutes the primary bottleneck, with deadlocks showing a strong negative correlation with task success. Furthermore, we uncover a metacognitive divergence: weaker models predominantly suffer from Unaware Deadlocks (oblivious to entrapment), whereas advanced models exhibit Aware Deadlocks (recognizing entrapment yet failing to recover). These findings highlight the need to integrate explicit spatial intuition into VLM architectures. The code and benchmark will be available on GitHub.
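The code-level isolation described above can be sketched as a minimal agent-environment loop: the agent receives only rendered RGB frames, while a separate evaluator reads emulator memory to verify success. This is an illustrative sketch only; `MockEmulator`, `VisionOnlyAgent`, `MemoryEvaluator`, and `run_episode` are hypothetical names, not the actual PokeGym API.

```python
from dataclasses import dataclass


@dataclass
class MockEmulator:
    """Stands in for the game process: exposes rendered frames to the
    agent and raw memory only to the evaluator (a hypothetical stub)."""
    player_x: int = 0
    goal_x: int = 5

    def render_rgb(self):
        # A real emulator would return an HxWx3 frame; a tiny stub here.
        return [[self.player_x, 0, 0]]

    def step(self, action: str) -> None:
        if action == "right":
            self.player_x += 1

    def read_memory(self, addr: str) -> int:
        # Only the evaluator calls this; the agent never sees memory.
        return {"player_x": self.player_x}[addr]


class VisionOnlyAgent:
    """Receives nothing but RGB observations."""

    def act(self, rgb_frame) -> str:
        return "right"  # trivial fixed policy, just for the sketch


class MemoryEvaluator:
    """Verifies task success via memory scanning, isolated from the agent."""

    def __init__(self, emulator: MockEmulator):
        self._emu = emulator

    def success(self) -> bool:
        return self._emu.read_memory("player_x") >= self._emu.goal_x


def run_episode(max_steps: int = 30) -> bool:
    emu = MockEmulator()
    agent = VisionOnlyAgent()
    evaluator = MemoryEvaluator(emu)
    for _ in range(max_steps):
        frame = emu.render_rgb()    # agent input: pixels only
        emu.step(agent.act(frame))
        if evaluator.success():     # ground-truth check: memory only
            return True
    return False
```

The key design choice mirrored here is that `VisionOnlyAgent.act` is never handed a reference to the emulator, so privileged state cannot leak, while the evaluator's memory read gives an automated, human-free success signal.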