

SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation

May 11, 2026
作者: Niyati Rawal, Sushant Ravva, Shah Alam Abir, Saksham Jain, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das
cs.AI

Abstract

Vision-Language Models (VLMs) have advanced rapidly in multimodal perception and language understanding, yet it remains unclear whether they can reliably ground language into spatially coherent, plausibly executable actions in 3D digital environments. We introduce SleepWalk, a benchmark for evaluating instruction-grounded trajectory prediction in single-scene 3D worlds generated from textual scene descriptions and filtered for navigability. Unlike prior navigation benchmarks centered on long-range exploration across rooms, SleepWalk targets localized, interaction-centric embodied reasoning: given rendered visual observations and a natural-language instruction, a model must predict a trajectory that respects scene geometry, avoids collisions, and terminates at an action-compatible location. The benchmark covers diverse indoor and outdoor environments and organizes tasks into three tiers of spatial and temporal difficulty, enabling fine-grained analysis of grounding under increasing compositional complexity. Using a standardized pointwise judge-based evaluation protocol, we evaluate three frontier VLMs on 2,472 curated 3D environments with nine instructions per scene. Results reveal systematic failures in grounded spatial reasoning, especially under occlusion, interaction constraints, and multi-step instructions: performance drops as task difficulty increases. In general, current VLMs only partially succeed at producing trajectories that are simultaneously spatially coherent, plausibly executable, and aligned with intended actions. By exposing failures in a controlled yet scalable setting, SleepWalk provides a critical benchmark for advancing grounded multimodal reasoning, embodied planning, vision-language navigation, and action-capable agents in 3D environments.
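The abstract describes a pointwise judge-based evaluation: each predicted trajectory is scored point by point against the scene (e.g., staying in free space, avoiding collisions) and the scores are aggregated. The sketch below illustrates that general idea; the criterion names, the grid-based scene representation, and the `judge_waypoint` function are illustrative assumptions, not the paper's actual protocol.

```python
# Minimal sketch of pointwise trajectory scoring (assumed, not the
# paper's exact judge): each waypoint is scored on illustrative
# criteria, then scores are averaged over the trajectory.
from statistics import mean

def judge_waypoint(waypoint, scene):
    """Placeholder pointwise judge returning per-criterion scores in [0, 1].

    A real judge would inspect 3D scene geometry and the instruction;
    here the scene is a toy grid with free cells and obstacle cells.
    """
    return {
        "spatially_coherent": 1.0 if waypoint in scene["free_space"] else 0.0,
        "collision_free": 1.0 if waypoint not in scene["obstacles"] else 0.0,
    }

def score_trajectory(trajectory, scene):
    """Aggregate pointwise judge scores by averaging over all waypoints."""
    per_point = [judge_waypoint(w, scene) for w in trajectory]
    return {crit: mean(p[crit] for p in per_point) for crit in per_point[0]}

# Toy scene: three free cells and one obstacle cell on a 2x2 grid.
scene = {"free_space": {(0, 0), (0, 1), (1, 1)}, "obstacles": {(1, 0)}}
print(score_trajectory([(0, 0), (0, 1), (1, 1)], scene))
# -> {'spatially_coherent': 1.0, 'collision_free': 1.0}
```

Per-scene scores could then be averaged within each difficulty tier to produce the tier-level comparisons the benchmark reports.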