Why Far Looks Up: Exploring Spatial Representations in Vision-Language Models

Abstract

Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a representation-level analysis framework that constructs minimal contrastive pairs to measure how spatial axes are organized and disentangled within VLM embeddings. Our analysis across multiple model families reveals a consistent vertical-distance entanglement: models conflate vertical position in the image with distance, mirroring the perspective bias of natural photographs. This bias produces a significant accuracy gap between perspective-consistent and counter-heuristic examples, and it intensifies under data scaling even as overall benchmark accuracy improves. We further show that models with similar benchmark scores can exhibit different internal representations, and that these differences predict accuracy and robustness across diverse spatial reasoning benchmarks. To isolate this bias from evaluation-set skew, we introduce SpatialTunnel, a synthetic benchmark designed to expose spatial shortcut biases by removing correlations that are common in natural images. Experiments confirm that the entanglement is intrinsic to the models, and that models with well-separated spatial axes are more robust, suggesting that well-structured spatial representations lead to more reliable spatial reasoning across diverse benchmarks. Code and benchmark are available on the project page: https://cheolhong0916.github.io/whyfarlooksup.github.io/.
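As a rough illustration of what measuring entanglement between spatial axes from contrastive pairs could look like, the sketch below estimates an axis direction as the normalized mean embedding difference over pairs that differ in one attribute, then compares two axes by cosine similarity. The abstract does not specify the metric used, so the mean-difference estimator, the cosine score, and every name here (`axis_direction`, `entanglement`, `make_pairs`) are assumptions made for illustration, and the embeddings are synthetic stand-ins rather than real VLM outputs.

```python
import numpy as np


def axis_direction(emb_pos, emb_neg):
    """Unit-norm mean difference over contrastive pairs (arrays of shape (n, d))."""
    diff = np.asarray(emb_pos, dtype=float) - np.asarray(emb_neg, dtype=float)
    direction = diff.mean(axis=0)
    norm = np.linalg.norm(direction)
    if norm == 0:
        raise ValueError("contrastive pairs have zero mean difference")
    return direction / norm


def entanglement(axis_a, axis_b):
    """Cosine similarity between two unit-norm axis directions."""
    return float(np.dot(axis_a, axis_b))


def make_pairs(rng, base, direction, noise=0.01):
    """Hypothetical stand-in for model embeddings of a minimal contrastive pair.

    Each pair shares a base embedding and differs only along `direction`,
    plus a small amount of Gaussian noise.
    """
    n, d = base.shape
    pos = base + direction + noise * rng.standard_normal((n, d))
    neg = base - direction + noise * rng.standard_normal((n, d))
    return pos, neg


rng = np.random.default_rng(0)
dim = 64
base = rng.standard_normal((200, dim))

# Assumed ground-truth directions for the toy embedding space.
e_up = np.zeros(dim)
e_up[0] = 1.0
e_far = np.zeros(dim)
e_far[1] = 1.0

# Toy "entangled" model: the distance direction leaks into the vertical one.
far_mixed = 0.6 * e_up + 0.8 * e_far

up_axis = axis_direction(*make_pairs(rng, base, e_up))
far_axis_clean = axis_direction(*make_pairs(rng, base, e_far))
far_axis_mixed = axis_direction(*make_pairs(rng, base, far_mixed))

score_clean = entanglement(up_axis, far_axis_clean)  # close to 0: axes separated
score_mixed = entanglement(up_axis, far_axis_mixed)  # close to 0.6: axes entangled

print(f"separated axes: {score_clean:.3f}")
print(f"entangled axes: {score_mixed:.3f}")
```

With real models, the two sides of each pair would be embeddings of inputs that differ only in vertical position or only in distance. A score near zero would then indicate well-separated axes under this particular measure, while a large score would indicate that the two axes overlap.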