SpatialAct: Probing the Spatial Reasoning-to-Action Ability of VLM Agents in 3D Scenes

Abstract

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate that reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build a coherent spatial understanding, act on it, and refine their actions through multi-turn feedback. To study this question, we introduce SpatialAct, a simulator-grounded benchmark for probing action-conditioned spatial reasoning in 3D scenes. Starting from the most challenging setting, Multi-turn Interactive Refinement, we further design its decomposed counterpart, Single-step Error Detection and Fix, together with five fundamental spatial ability tasks that diagnose the underlying causes of model failures. Experiments reveal a clear reasoning-to-action gap: current VLMs perform well on isolated spatial reasoning tasks, but struggle to maintain coherent spatial beliefs and produce reliable actions during multi-turn feedback, and they substantially underperform humans. These results suggest that current VLM agents still lack robust spatial state tracking under action-induced environment changes, even when low-level control is abstracted away.