SpatialAct: Probing Spatial Reasoning-to-Action Abilities of VLM Agents in 3D Scenes

Abstract

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build coherent spatial understanding, act upon it, and refine their actions through multi-turn feedback. To study this problem, we introduce SpatialAct, a simulator-grounded benchmark for probing action-conditioned spatial reasoning in 3D scenes. Starting from the most challenging setting, Multi-turn Interactive Refinement, we further design its decomposed counterpart, Single-step Error Detection and Fix, together with five fundamental spatial ability tasks to diagnose the underlying causes of model failures. Experiments reveal a clear reasoning-to-action gap: current VLMs can perform well on isolated spatial reasoning tasks, but struggle to maintain coherent spatial beliefs and produce reliable actions during multi-turn feedback, substantially underperforming humans. These results suggest that current VLM agents still lack robust spatial state tracking under action-induced environment changes, even when low-level control is abstracted away.