Seeing Is Not Knowing: Do VLMs Know When (and Why) Not to Answer Spatial Questions?

Abstract

Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties appear misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable. They focus on whether models produce correct answers, not on whether models recognize when a question cannot be answered or what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering: they attempt to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which views would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.