SpatiaLab: Can Vision-Language Models Perform Spatial Reasoning in the Wild?

February 3, 2026
Authors: Azmine Toushik Wasi, Wahid Faisal, Abdur Rahman, Mahfuz Ahmed Anik, Munem Shahriar, Mohsin Mahmud Topu, Sadia Tasnim Meem, Rahatun Nesa Priti, Sabrina Afroz Mitu, Md. Iqramul Hoque, Shahriyar Zaman Ridoy, Mohammed Eunus Ali, Majd Hawasly, Mohammad Raza, Md Rizwan Parvez
cs.AI

Abstract

Spatial reasoning is a fundamental aspect of human cognition, yet it remains a major challenge for contemporary vision-language models (VLMs). Prior work has largely relied on synthetic or LLM-generated environments with limited task designs and puzzle-like setups, failing to capture the real-world complexity, visual noise, and diverse spatial relationships that VLMs encounter in practice. To address this, we introduce SpatiaLab, a comprehensive benchmark for evaluating VLMs' spatial reasoning in realistic, unconstrained contexts. SpatiaLab comprises 1,400 visual question-answer pairs across six major categories (Relative Positioning, Depth & Occlusion, Orientation, Size & Scale, Spatial Navigation, and 3D Geometry), each with five subcategories, yielding 30 distinct task types. Each subcategory contains at least 25 questions and each main category at least 200, supporting both multiple-choice and open-ended evaluation. Experiments across diverse state-of-the-art VLMs, spanning open- and closed-source, reasoning-focused, and specialized spatial-reasoning models, reveal a substantial gap in spatial reasoning capability compared with humans. In the multiple-choice setup, InternVL3.5-72B achieves 54.93% accuracy versus 87.57% for humans. In the open-ended setting, all models show a performance drop of around 10-25%, with GPT-5-mini scoring highest at 40.93% versus 64.93% for humans. These results highlight key limitations in handling complex spatial relationships, depth perception, navigation, and 3D geometry. By providing a diverse, real-world evaluation framework, SpatiaLab exposes critical challenges and opportunities for advancing VLMs' spatial reasoning, offering a benchmark to guide future research toward robust, human-aligned spatial understanding. SpatiaLab is available at: https://spatialab-reasoning.github.io/.
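
To make the two evaluation protocols concrete, below is a minimal sketch of an evaluation harness for a SpatiaLab-style benchmark. The JSON schema, the field names (`image`, `question`, `choices`, `answer`, `category`), and the `query_vlm` stub are illustrative assumptions, not the authors' released code; the official data and evaluation scripts are linked from the project page above.

```python
import json
from collections import defaultdict

# The six major categories named in the abstract.
CATEGORIES = [
    "Relative Positioning", "Depth & Occlusion", "Orientation",
    "Size & Scale", "Spatial Navigation", "3D Geometry",
]

def query_vlm(image_path, question, choices=None):
    """Placeholder for a call to a vision-language model client.

    In the multiple-choice setup, `choices` holds the candidate answers
    and the model should return one of them; in the open-ended setup,
    `choices` is None and the model returns free-form text.
    """
    raise NotImplementedError("wire up a VLM client here")

def evaluate(benchmark_path, multiple_choice=True):
    # Assumed schema: a JSON list of dicts with "image", "question",
    # "choices", "answer", and "category" fields.
    with open(benchmark_path) as f:
        items = json.load(f)

    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        choices = item["choices"] if multiple_choice else None
        pred = query_vlm(item["image"], item["question"], choices)
        total[item["category"]] += 1
        # Exact-match scoring is a simplification that suits multiple
        # choice; open-ended answers usually need a more forgiving
        # matcher or an LLM judge.
        if pred.strip().lower() == item["answer"].strip().lower():
            correct[item["category"]] += 1

    # Report per-category accuracy, mirroring the category-level
    # breakdown the benchmark is designed to support.
    for cat in CATEGORIES:
        if total[cat]:
            acc = 100.0 * correct[cat] / total[cat]
            print(f"{cat}: {acc:.2f}% ({correct[cat]}/{total[cat]})")
```

Aggregating per category rather than only overall is the point of the benchmark's structure: it localizes failures to specific skills (e.g., depth and occlusion versus 3D geometry) instead of reporting a single accuracy number.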