SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

May 29, 2026
Authors: Tianhui Liu, Jie Feng, Zhiheng Zheng, Shengyuan Wang, Yiming Guo, Yanxin Xi, Hangyu Fan, Yong Li, Pan Hui
cs.AI

Abstract

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build coherent spatial understanding, act upon it, and refine their actions through multi-turn feedback. To study this problem, we introduce SpatialAct, a simulator-grounded benchmark for probing action-conditioned spatial reasoning in 3D scenes. Starting from the most challenging setting, Multi-turn Interactive Refinement, we further design its decomposed counterpart, Single-step Error Detection and Fix, together with five fundamental spatial ability tasks to diagnose the underlying causes of model failures. Experiments reveal a clear reasoning-to-action gap: current VLMs can perform well on isolated spatial reasoning tasks, but struggle to maintain coherent spatial beliefs and produce reliable actions during multi-turn feedback, substantially underperforming humans. These results suggest that current VLM agents still lack robust spatial state tracking under action-induced environment changes, even when low-level control is abstracted away.