SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation

Abstract

Vision-Language Models (VLMs) have advanced rapidly in multimodal perception and language understanding, yet it remains unclear whether they can reliably ground language into spatially coherent, plausibly executable actions in 3D digital environments. We introduce SleepWalk, a benchmark for evaluating instruction-grounded trajectory prediction in single-scene 3D worlds generated from textual scene descriptions and filtered for navigability. Unlike prior navigation benchmarks centered on long-range exploration across rooms, SleepWalk targets localized, interaction-centric embodied reasoning: given rendered visual observations and a natural-language instruction, a model must predict a trajectory that respects scene geometry, avoids collisions, and terminates at an action-compatible location. The benchmark covers diverse indoor and outdoor environments and organizes tasks into three tiers of spatial and temporal difficulty, enabling fine-grained analysis of grounding under increasing compositional complexity. Using a standardized pointwise judge-based evaluation protocol, we evaluate three frontier VLMs on 2,472 curated 3D environments with nine instructions per scene. Results reveal systematic failures in grounded spatial reasoning, especially under occlusion, interaction constraints, and multi-step instructions, and performance drops as task difficulty increases. Overall, current VLMs only partially succeed at producing trajectories that are simultaneously spatially coherent, plausibly executable, and aligned with the intended actions. By exposing these failures in a controlled yet scalable setting, SleepWalk provides a critical benchmark for advancing grounded multimodal reasoning, embodied planning, vision-language navigation, and action-capable agents in 3D environments.
