Seeing Is Not Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

Abstract

Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties appear misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable. They focus on whether models produce correct answers, not on whether models recognize when a question cannot be answered or what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, that introduces two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering: they attempt to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which view would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.
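
To make the evaluation protocol described above concrete, the following is a minimal sketch of how per-condition accuracy and an overconfidence rate could be scored when abstention is the expected response under occlusion or perspective ambiguity. It is an illustration under stated assumptions, not the SpatialUncertain implementation: the `Record` type, the `ABSTAIN` label, the condition names, and the toy records are all hypothetical and are not results from this work.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical abstention label; the actual answer format is not specified here.
ABSTAIN = "cannot be determined"


@dataclass(frozen=True)
class Record:
    condition: str  # assumed names: "clean", "occlusion", "perspective"
    gold: str       # expected response; ABSTAIN under a challenged condition
    predicted: str  # model response, assumed already mapped to an option label


def accuracy_by_condition(records):
    """Fraction of records per condition where the prediction matches gold."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r.condition] += 1
        hits[r.condition] += int(r.predicted == r.gold)
    return {c: hits[c] / totals[c] for c in totals}


def overconfidence_rate(records):
    """Among items that require abstention, fraction the model answered anyway."""
    challenged = [r for r in records if r.gold == ABSTAIN]
    if not challenged:
        return 0.0
    return sum(r.predicted != ABSTAIN for r in challenged) / len(challenged)


# Toy, made-up records purely to exercise the functions.
records = [
    Record("clean", "left", "left"),
    Record("clean", "behind", "behind"),
    Record("occlusion", ABSTAIN, "left"),
    Record("occlusion", ABSTAIN, ABSTAIN),
    Record("perspective", ABSTAIN, "taller"),
    Record("perspective", ABSTAIN, "shorter"),
]

scores = accuracy_by_condition(records)
rate = overconfidence_rate(records)
print(scores)
print(rate)
```

Under this scoring, a model that always commits to a spatial answer scores well on clean observations but zero on the challenged conditions, which is the overconfident-answering pattern the abstract describes.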