SleepWalk: A Three-Tier Benchmark for Stress-Testing Instruction-Guided Vision-Language Navigation

Abstract

Vision-Language Models (VLMs) have advanced rapidly in multimodal perception and language understanding, yet it remains unclear whether they can reliably ground language in spatially coherent, plausibly executable actions in 3D digital environments. We introduce SleepWalk, a benchmark for evaluating instruction-grounded trajectory prediction in single-scene 3D worlds generated from textual scene descriptions and filtered for navigability. Unlike prior navigation benchmarks centered on long-range exploration across rooms, SleepWalk targets localized, interaction-centric embodied reasoning: given rendered visual observations and a natural-language instruction, a model must predict a trajectory that respects scene geometry, avoids collisions, and terminates at an action-compatible location. The benchmark covers diverse indoor and outdoor environments and organizes tasks into three tiers of spatial and temporal difficulty, enabling fine-grained analysis of grounding under increasing compositional complexity. Using a standardized pointwise judge-based evaluation protocol, we evaluate three frontier VLMs on 2,472 curated 3D environments with nine instructions per scene. The results reveal systematic failures in grounded spatial reasoning, especially under occlusion, interaction constraints, and multi-step instructions, and performance drops as task difficulty increases. Overall, current VLMs are only partly able to produce trajectories that are simultaneously spatially coherent, plausibly executable, and aligned with the intended actions. By exposing these failures in a controlled yet scalable setting, SleepWalk provides a critical benchmark for advancing grounded multimodal reasoning, embodied planning, vision-language navigation, and action-capable agents in 3D environments.

