Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?
May 28, 2026
Authors: Yue Zhang, Zun Wang, Han Lin, Yonatan Bitton, Idan Szpektor, Mohit Bansal
cs.AI
Abstract
Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable, focusing on whether models produce correct answers rather than whether they recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering, attempting to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which views would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.
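The scoring idea described in the abstract, where a question that is answerable under clean observation has abstention as its correct response once occlusion or perspective ambiguity is introduced, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the record fields (`condition`, `gold`, `pred`), the `ABSTAIN` label, and the toy records are all hypothetical and exist only to show how per-condition accuracy could be computed.

```python
from collections import defaultdict

# Hypothetical abstention label; the benchmark's actual wording is not given here.
ABSTAIN = "cannot be determined"


def accuracy_by_condition(records):
    """Return accuracy per observation condition.

    Each record is a dict with (assumed) keys:
      'condition': e.g. 'clean', 'occlusion', 'perspective'
      'gold':      the expected response (ABSTAIN under a challenge condition)
      'pred':      the model's response
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["condition"]] += 1
        correct[r["condition"]] += int(r["pred"] == r["gold"])
    return {c: correct[c] / total[c] for c in total}


# Toy, made-up records (not results from the paper).
records = [
    {"condition": "clean", "gold": "left", "pred": "left"},
    {"condition": "occlusion", "gold": ABSTAIN, "pred": "left"},
    {"condition": "occlusion", "gold": ABSTAIN, "pred": ABSTAIN},
    {"condition": "perspective", "gold": ABSTAIN, "pred": "taller"},
]
scores = accuracy_by_condition(records)
```

Under this scheme, a model that always commits to a concrete answer scores zero on the challenge conditions regardless of how good its spatial reasoning is under clean observation, which is the overconfidence failure the abstract describes.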