SpatialAct: Probing the Spatial Reasoning-to-Action Abilities of VLM Agents in 3D Scenes

Abstract

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build a coherent spatial understanding, act upon it, and refine their actions through multi-turn feedback. To study this problem, we introduce SpatialAct, a simulator-grounded benchmark for probing action-conditioned spatial reasoning in 3D scenes. Starting from the most challenging setting, Multi-turn Interactive Refinement, we further design its decomposed counterpart, Single-step Error Detection and Fix, together with five fundamental spatial ability tasks to diagnose the underlying causes of model failures. Experiments reveal a clear reasoning-to-action gap: current VLMs can perform well on isolated spatial reasoning tasks, but struggle to maintain coherent spatial beliefs and produce reliable actions during multi-turn feedback, substantially underperforming humans. These results suggest that current VLM agents still lack robust spatial state tracking under action-induced environment changes, even when low-level control is abstracted away.