

How Far are VLMs from Visual Spatial Intelligence? A Benchmark-Driven Perspective

September 23, 2025
作者: Songsong Yu, Yuxin Chen, Hao Ju, Lianjie Jia, Fuxi Zhang, Shaofei Huang, Yuhan Wu, Rundi Cui, Binghao Ran, Zaibin Zhang, Zhedong Zheng, Zhipeng Zhang, Yifan Wang, Lin Song, Lijun Wang, Yanwei Li, Ying Shan, Huchuan Lu
cs.AI

Abstract

Visual Spatial Reasoning (VSR) is a core human cognitive ability and a critical requirement for advancing embodied intelligence and autonomous systems. Despite recent progress in Vision-Language Models (VLMs), achieving human-level VSR remains highly challenging due to the complexity of representing and reasoning over three-dimensional space. In this paper, we present a systematic investigation of VSR in VLMs, encompassing a review of existing methodologies across input modalities, model architectures, training strategies, and reasoning mechanisms. Furthermore, we categorize spatial intelligence into three levels of capability, i.e., basic perception, spatial understanding, and spatial planning, and curate SIBench, a spatial intelligence benchmark encompassing nearly 20 open-source datasets across 23 task settings. Experiments with state-of-the-art VLMs reveal a pronounced gap between perception and reasoning: models show competence in basic perceptual tasks but consistently underperform in understanding and planning tasks, particularly in numerical estimation, multi-view reasoning, temporal dynamics, and spatial imagination. These findings underscore the substantial challenges that remain in achieving spatial intelligence, while providing both a systematic roadmap and a comprehensive benchmark to drive future research in the field. The related resources of this study are accessible at https://sibench.github.io/Awesome-Visual-Spatial-Reasoning/.
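The per-level evaluation described in the abstract can be sketched as a small scoring loop: task results are grouped by the three capability levels (basic perception, spatial understanding, spatial planning) and accuracy is reported per level. The level names follow the abstract; the function, task results, and numbers below are purely illustrative, not the actual SIBench protocol.

```python
from collections import defaultdict

# The three capability levels named in the paper's taxonomy.
LEVELS = ("basic perception", "spatial understanding", "spatial planning")

def accuracy_by_level(results):
    """Aggregate (level, correct) pairs into per-level accuracy.

    results: iterable of (level_name, correct: bool) tuples.
    Returns a dict mapping level_name -> fraction of correct answers.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for level, correct in results:
        totals[level] += 1
        hits[level] += int(correct)
    return {level: hits[level] / totals[level] for level in totals}

# Toy results echoing the reported trend: strong basic perception,
# weaker spatial understanding and planning (hypothetical values).
toy_results = [
    ("basic perception", True), ("basic perception", True),
    ("spatial understanding", True), ("spatial understanding", False),
    ("spatial planning", False), ("spatial planning", False),
]

scores = accuracy_by_level(toy_results)
```

A real harness would replace `toy_results` with model predictions compared against ground-truth answers for each of the 23 task settings, but the aggregation logic is the same.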