Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?
May 28, 2026
Authors: Yue Zhang, Zun Wang, Han Lin, Yonatan Bitton, Idan Szpektor, Mohit Bansal
cs.AI
Abstract
Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable, focusing on whether models produce correct answers rather than on whether they recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, that introduces two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering: they attempt to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which views would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.
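To make the evaluation idea concrete, the following is a minimal sketch of how abstention-aware scoring could be computed. It is not the authors' implementation: the `Item` record, the `ABSTAIN` label, the condition names, and the toy predictions are all hypothetical, and the sketch assumes that the gold response for a question under occlusion or perspective ambiguity is to abstain.

```python
from dataclasses import dataclass

# Hypothetical label a model is expected to output when it declines to answer.
ABSTAIN = "cannot be determined"


@dataclass
class Item:
    condition: str   # assumed names: "clean", "occlusion", "perspective"
    gold: str        # gold answer; ABSTAIN when the observation is challenged
    prediction: str  # the model's answer


def accuracy_by_condition(items):
    """Fraction of items per condition where the prediction matches the gold."""
    totals, correct = {}, {}
    for it in items:
        totals[it.condition] = totals.get(it.condition, 0) + 1
        correct[it.condition] = correct.get(it.condition, 0) + int(
            it.prediction == it.gold
        )
    return {c: correct[c] / totals[c] for c in totals}


def overconfidence_rate(items):
    """Fraction of abstention-required items that the model answered anyway."""
    challenged = [it for it in items if it.gold == ABSTAIN]
    if not challenged:
        return 0.0
    return sum(it.prediction != ABSTAIN for it in challenged) / len(challenged)


# Toy data for illustration only; these are not results from the paper.
items = [
    Item("clean", "left", "left"),
    Item("clean", "behind", "behind"),
    Item("occlusion", ABSTAIN, "left"),
    Item("occlusion", ABSTAIN, ABSTAIN),
    Item("perspective", ABSTAIN, "taller"),
    Item("perspective", ABSTAIN, "shorter"),
]

print(accuracy_by_condition(items))
print(overconfidence_rate(items))
```

Reporting accuracy per condition alongside a separate overconfidence rate keeps the two questions apart: whether the model answers correctly when the evidence is sufficient, and whether it declines when the evidence is not.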