Why Far Looks Up: Probing Spatial Representation in Vision-Language Models
May 28, 2026
Authors: Cheolhong Min, Jaeyun Jung, Daeun Lee, Hyeonseong Jeon, Yu Su, Jonathan Tremblay, Chan Hee Song, Jaesik Park
cs.AI
Abstract
Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a representation-level analysis framework that constructs minimal contrastive pairs to measure how spatial axes are organized and disentangled within VLM embeddings. Our analysis across multiple model families reveals a consistent vertical-distance entanglement: models conflate vertical image position with distance, mirroring the perspective bias of natural photographs. This bias produces a significant accuracy gap between perspective-consistent and counter-heuristic examples, and intensifies under data scaling even as overall benchmark accuracy improves. We further show that models with similar benchmark scores can exhibit different internal representations, and that these differences predict accuracy and robustness across diverse spatial reasoning benchmarks. To isolate this bias from evaluation-set skew, we introduce SpatialTunnel, a synthetic benchmark designed to expose spatial shortcut biases by removing common correlations present in natural images. Experiments confirm that the entanglement is model-intrinsic, and that models with well-separated spatial axes exhibit greater robustness, suggesting that well-structured spatial representations lead to more reliable spatial reasoning across diverse benchmarks. Code and benchmark are available on the project page: https://cheolhong0916.github.io/whyfarlooksup.github.io/.
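The abstract does not spell out how the entanglement is measured, so the following is only an illustrative sketch of one way such a probe could work, not the authors' method. It assumes that each minimal contrastive pair yields two embedding vectors differing along a single spatial axis, estimates an axis direction as the normalized mean difference over pairs, and uses the cosine similarity between the vertical and distance directions as a proxy for entanglement. The function names (`axis_direction`, `entanglement`, `make_pairs`) and the synthetic embeddings are hypothetical stand-ins for real VLM embeddings.

```python
import numpy as np


def axis_direction(emb_a, emb_b):
    """Unit-length mean difference over minimal pairs (rows are aligned pairs)."""
    diff = (np.asarray(emb_b) - np.asarray(emb_a)).mean(axis=0)
    return diff / np.linalg.norm(diff)


def entanglement(axis_u, axis_v):
    """Cosine similarity between two unit axis directions (0 means orthogonal)."""
    return float(np.dot(axis_u, axis_v))


def make_pairs(rng, n, dim, direction, noise=0.05):
    """Synthetic stand-in for embeddings: shift random bases along `direction`."""
    base = rng.normal(size=(n, dim))
    shifted = base + direction + noise * rng.normal(size=(n, dim))
    return base, shifted


rng = np.random.default_rng(0)
dim = 64

# Hypothetical ground-truth directions used only to generate toy data.
up = np.zeros(dim)
up[0] = 1.0
far_clean = np.zeros(dim)
far_clean[1] = 1.0
far_mixed = 0.6 * up + 0.8 * far_clean  # "far" partly aligned with "up"

low, high = make_pairs(rng, 200, dim, up)
near, far = make_pairs(rng, 200, dim, far_clean)
near_m, far_m = make_pairs(rng, 200, dim, far_mixed)

vertical_axis = axis_direction(low, high)
clean_score = entanglement(vertical_axis, axis_direction(near, far))
mixed_score = entanglement(vertical_axis, axis_direction(near_m, far_m))

print(f"disentangled toy model: {clean_score:.2f}")
print(f"entangled toy model:    {mixed_score:.2f}")
```

In this toy setup, a score near zero corresponds to well-separated vertical and distance axes, while a clearly positive score corresponds to the "far looks up" conflation the abstract describes. With real models, the pairs would come from images or prompts that vary one spatial attribute while holding the others fixed.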