

PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models

April 9, 2026
Authors: Ruizhi Zhang, Ye Huang, Yuangang Pan, Chuanfu Shen, Zhilin Liu, Ting Xie, Wen Li, Lixin Duan
cs.AI

Abstract

While Vision-Language Models (VLMs) have achieved remarkable progress in static visual understanding, their deployment in complex 3D embodied environments remains severely limited. Existing benchmarks suffer from four critical deficiencies: (1) passive perception tasks circumvent interactive dynamics; (2) simplified 2D environments fail to assess depth perception; (3) privileged state leakage bypasses genuine visual processing; and (4) human evaluation is prohibitively expensive and unscalable. We introduce PokeGym, a visually-driven long-horizon benchmark instantiated within Pokemon Legends: Z-A, a visually complex 3D open-world Role-Playing Game. PokeGym enforces strict code-level isolation: agents operate solely on raw RGB observations while an independent evaluator verifies success via memory scanning, ensuring pure vision-based decision-making and automated, scalable assessment. The benchmark comprises 30 tasks (30-220 steps) spanning navigation, interaction, and mixed scenarios, with three instruction granularities (Visual-Guided, Step-Guided, Goal-Only) to systematically deconstruct visual grounding, semantic reasoning, and autonomous exploration capabilities. Our evaluation reveals a key limitation of current VLMs: physical deadlock recovery, rather than high-level planning, constitutes the primary bottleneck, with deadlocks showing a strong negative correlation with task success. Furthermore, we uncover a metacognitive divergence: weaker models predominantly suffer from Unaware Deadlocks (oblivious to entrapment), whereas advanced models exhibit Aware Deadlocks (recognizing entrapment yet failing to recover). These findings highlight the need to integrate explicit spatial intuition into VLM architectures. The code and benchmark will be available on GitHub.
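The abstract's central design choice is code-level isolation: the agent receives only raw RGB frames, while a separate evaluator judges success by scanning emulator memory that the agent never sees. A minimal sketch of that separation, assuming hypothetical names throughout (`VisionOnlyAgent`, `MemoryEvaluator`, the memory address, and the action set are all illustrative, not the actual PokeGym API):

```python
from dataclasses import dataclass

@dataclass
class Observation:
    rgb: list  # raw RGB frame (e.g. H x W x 3); no game state attached

class VisionOnlyAgent:
    """Chooses actions from pixels alone -- no privileged state access."""
    def act(self, obs: Observation) -> str:
        # A real VLM agent would run the frame through the model here;
        # this stub always presses "A" to keep the sketch self-contained.
        return "A"

class MemoryEvaluator:
    """Verifies success by scanning (simulated) game memory,
    which is never exposed to the agent."""
    def __init__(self, goal_addr: int, goal_value: int):
        self.goal_addr = goal_addr
        self.goal_value = goal_value

    def is_success(self, memory: dict) -> bool:
        return memory.get(self.goal_addr) == self.goal_value

def run_episode(agent, evaluator, env_memory, max_steps=220):
    """Agent sees only Observation.rgb; evaluator sees only env_memory."""
    for step in range(max_steps):
        obs = Observation(rgb=[[0, 0, 0]])   # placeholder frame
        _action = agent.act(obs)             # vision-only decision
        env_memory[0x1000] = 1               # toy environment transition
        if evaluator.is_success(env_memory): # automated, scalable check
            return True, step + 1
    return False, max_steps

ok, steps = run_episode(VisionOnlyAgent(), MemoryEvaluator(0x1000, 1), {})
print(ok, steps)
```

Keeping the two interfaces disjoint is what makes the benchmark both leak-free (no privileged state reaches the policy) and automatically gradable (no human in the loop).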
PDF · April 11, 2026