Why Far Looks Up: Probing Spatial Representations in Vision-Language Models

Abstract

Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a representation-level analysis framework that constructs minimal contrastive pairs to measure how spatial axes are organized and disentangled within VLM embeddings. Our analysis across multiple model families reveals a consistent vertical-distance entanglement: models conflate vertical image position with distance, mirroring the perspective bias of natural photographs. This bias produces a significant accuracy gap between perspective-consistent and counter-heuristic examples, and intensifies under data scaling even as overall benchmark accuracy improves. We further show that models with similar benchmark scores can exhibit different internal representations, and that these differences predict accuracy and robustness across diverse spatial reasoning benchmarks. To isolate this bias from evaluation-set skew, we introduce SpatialTunnel, a synthetic benchmark designed to expose spatial shortcut biases by removing common correlations present in natural images. Experiments confirm that the entanglement is model-intrinsic, and that models with well-separated spatial axes exhibit greater robustness, suggesting that well-structured spatial representations lead to more reliable spatial reasoning across diverse benchmarks. Code and benchmark are available on the project page: https://cheolhong0916.github.io/whyfarlooksup.github.io/.
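As a rough illustration of the two quantities the abstract refers to, the sketch below estimates an axis direction from minimal contrastive pairs, scores vertical-distance entanglement as the absolute cosine similarity between two such directions, and computes the accuracy gap between perspective-consistent and counter-heuristic examples. This is only one plausible reading under stated assumptions: the abstract does not specify the metric, so the function names (`axis_direction`, `entanglement`, `accuracy_gap`), the cosine-based score, and the synthetic embeddings are all hypothetical and are not the authors' implementation.

```python
import numpy as np


def axis_direction(emb_a, emb_b):
    """Unit-norm mean difference vector over contrastive pairs.

    Hypothetical helper: row i of emb_a and emb_b is assumed to hold the
    embeddings of one minimal pair that differs along a single spatial axis.
    """
    diffs = np.asarray(emb_b, dtype=float) - np.asarray(emb_a, dtype=float)
    mean_diff = diffs.mean(axis=0)
    return mean_diff / np.linalg.norm(mean_diff)


def entanglement(dir_u, dir_v):
    """Absolute cosine similarity between two unit axis directions.

    Assumed score: 0 means orthogonal (separated) axes, 1 means identical.
    """
    return float(abs(np.dot(dir_u, dir_v)))


def accuracy_gap(correct, is_consistent):
    """Accuracy on perspective-consistent minus counter-heuristic examples."""
    correct = np.asarray(correct, dtype=bool)
    is_consistent = np.asarray(is_consistent, dtype=bool)
    return float(correct[is_consistent].mean() - correct[~is_consistent].mean())


# Synthetic stand-in for model embeddings (illustrative data only).
rng = np.random.default_rng(0)
n_pairs, dim = 200, 16
base = rng.normal(size=(n_pairs, dim))
e_vertical = np.eye(dim)[0]
e_depth = np.eye(dim)[1]


def shifted(direction):
    """Second member of each pair: base shifted along `direction` plus noise."""
    return base + direction + 0.01 * rng.normal(size=(n_pairs, dim))


vertical_dir = axis_direction(base, shifted(e_vertical))

# An "entangled" toy model: its distance axis leaks into the vertical axis.
mixed = (e_vertical + e_depth) / np.sqrt(2.0)
entangled_score = entanglement(vertical_dir, axis_direction(base, shifted(mixed)))

# A "separated" toy model: its distance axis is orthogonal to the vertical axis.
separated_score = entanglement(vertical_dir, axis_direction(base, shifted(e_depth)))

# Toy accuracy gap: two consistent examples (both correct) versus
# two counter-heuristic examples (one correct).
gap = accuracy_gap([1, 1, 1, 0], [1, 1, 0, 0])
```

In this toy setup the entangled model scores well above the separated one, matching the intuition that a distance axis aligned with vertical image position would show up as a high similarity between the two directions.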