SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation
May 11, 2026
Authors: Niyati Rawal, Sushant Ravva, Shah Alam Abir, Saksham Jain, Aman Chadha, Vinija Jain, Suranjana Trivedy, Amitava Das
cs.AI
Abstract
Vision-Language Models (VLMs) have advanced rapidly in multimodal perception and language understanding, yet it remains unclear whether they can reliably ground language into spatially coherent, plausibly executable actions in 3D digital environments. We introduce SleepWalk, a benchmark for evaluating instruction-grounded trajectory prediction in single-scene 3D worlds generated from textual scene descriptions and filtered for navigability. Unlike prior navigation benchmarks centered on long-range exploration across rooms, SleepWalk targets localized, interaction-centric embodied reasoning: given rendered visual observations and a natural-language instruction, a model must predict a trajectory that respects scene geometry, avoids collisions, and terminates at an action-compatible location. The benchmark covers diverse indoor and outdoor environments and organizes tasks into three tiers of spatial and temporal difficulty, enabling fine-grained analysis of grounding under increasing compositional complexity. Using a standardized pointwise judge-based evaluation protocol, we evaluate three frontier VLMs on 2,472 curated 3D environments with nine instructions per scene. Results reveal systematic failures in grounded spatial reasoning, especially under occlusion, interaction constraints, and multi-step instructions: performance drops as task difficulty increases. Overall, current VLMs only partially succeed at producing trajectories that are simultaneously spatially coherent, plausibly executable, and aligned with the intended actions. By exposing these failures in a controlled yet scalable setting, SleepWalk provides a critical benchmark for advancing grounded multimodal reasoning, embodied planning, vision-language navigation, and action-capable agents in 3D environments.
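To make the pointwise evaluation concrete, the sketch below shows one plausible form of per-point trajectory scoring: each predicted waypoint is checked against scene geometry (bounds and obstacle clearance), and the endpoint is checked against the goal location. All names, the disc-obstacle scene representation, and the scoring weights are illustrative assumptions, not the paper's actual protocol.

```python
# Hypothetical pointwise trajectory scorer; the scene model (axis-aligned
# bounds plus circular obstacles) and the 0.5 endpoint penalty are
# illustrative assumptions, not SleepWalk's published metric.
from dataclasses import dataclass

Point = tuple[float, float]

@dataclass
class Scene:
    bounds: tuple[float, float, float, float]   # xmin, ymin, xmax, ymax
    obstacles: list[tuple[Point, float]]        # (center, radius) discs

def point_valid(scene: Scene, p: Point) -> bool:
    """A waypoint is valid if it lies inside the scene and outside obstacles."""
    x, y = p
    xmin, ymin, xmax, ymax = scene.bounds
    if not (xmin <= x <= xmax and ymin <= y <= ymax):
        return False
    return all((x - cx) ** 2 + (y - cy) ** 2 >= r * r
               for (cx, cy), r in scene.obstacles)

def pointwise_score(scene: Scene, trajectory: list[Point],
                    goal: Point, goal_tol: float = 0.5) -> float:
    """Fraction of valid waypoints, discounted if the endpoint misses the goal."""
    if not trajectory:
        return 0.0
    valid = sum(point_valid(scene, p) for p in trajectory) / len(trajectory)
    (gx, gy), (fx, fy) = goal, trajectory[-1]
    reaches = ((fx - gx) ** 2 + (fy - gy) ** 2) ** 0.5 <= goal_tol
    return valid * (1.0 if reaches else 0.5)
```

A collision-free trajectory that ends at the goal scores 1.0; waypoints inside obstacles or out of bounds reduce the score pointwise, which mirrors the abstract's emphasis on coherence, executability, and action compatibility as separate failure modes.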