Why Far Looks Up: Probing Spatial Representation in Vision-Language Models
May 28, 2026
Authors: Cheolhong Min, Jaeyun Jung, Daeun Lee, Hyeonseong Jeon, Yu Su, Jonathan Tremblay, Chan Hee Song, Jaesik Park
cs.AI
Abstract
Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a representation-level analysis framework that constructs minimal contrastive pairs to measure how spatial axes are organized and disentangled within VLM embeddings. Our analysis across multiple model families reveals a consistent vertical-distance entanglement: models conflate vertical image position with distance, mirroring the perspective bias of natural photographs. This bias produces a significant accuracy gap between perspective-consistent and counter-heuristic examples, and intensifies under data scaling even as overall benchmark accuracy improves. We further show that models with similar benchmark scores can exhibit different internal representations, and that these differences predict accuracy and robustness across diverse spatial reasoning benchmarks. To isolate this bias from evaluation-set skew, we introduce SpatialTunnel, a synthetic benchmark designed to expose spatial shortcut biases by removing common correlations present in natural images. Experiments confirm that the entanglement is model-intrinsic, and that models with well-separated spatial axes exhibit greater robustness, suggesting that well-structured spatial representations lead to more reliable spatial reasoning across diverse benchmarks. Code and benchmark are available on the project page: https://cheolhong0916.github.io/whyfarlooksup.github.io/.
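The idea of measuring axis entanglement from minimal contrastive pairs can be illustrated with a small sketch. This is not the paper's framework or released code: it assumes one plausible reading, in which each spatial axis is summarized by the mean embedding difference over pairs that differ only along that axis, and entanglement is the cosine similarity between two such directions. The embeddings are synthetic stand-ins for VLM embeddings, and the names `axis_direction`, `entanglement`, and `synthetic_pairs` are hypothetical.

```python
import numpy as np


def axis_direction(emb_a, emb_b):
    """Unit vector of the mean embedding difference over minimal pairs.

    emb_a[i] and emb_b[i] are assumed to embed two inputs that differ
    only along one spatial axis (e.g. "lower" vs. "higher" in the image).
    """
    diffs = np.asarray(emb_b, dtype=float) - np.asarray(emb_a, dtype=float)
    mean_diff = diffs.mean(axis=0)
    return mean_diff / np.linalg.norm(mean_diff)


def entanglement(dir_u, dir_v):
    """Cosine similarity between two unit axis directions.

    Near 0: the axes are roughly orthogonal (disentangled).
    Near 1: the axes point the same way (entangled).
    """
    return float(np.dot(dir_u, dir_v))


def synthetic_pairs(rng, n, dim, true_dir, noise=0.1):
    """Synthetic stand-in for VLM embeddings of n minimal pairs (assumption)."""
    base = rng.normal(size=(n, dim))
    shifted = base + true_dir + noise * rng.normal(size=(n, dim))
    return base, shifted


rng = np.random.default_rng(0)
dim = 64

# Ground-truth directions used only to generate the toy data.
vertical = np.zeros(dim)
vertical[0] = 1.0
distance_clean = np.zeros(dim)
distance_clean[1] = 1.0
# A "distance" direction that leaks 80% of its weight onto the vertical axis.
distance_mixed = 0.8 * vertical + 0.6 * distance_clean

v_dir = axis_direction(*synthetic_pairs(rng, 200, dim, vertical))
d_clean = axis_direction(*synthetic_pairs(rng, 200, dim, distance_clean))
d_mixed = axis_direction(*synthetic_pairs(rng, 200, dim, distance_mixed))

clean_score = entanglement(v_dir, d_clean)  # expected to be close to 0
mixed_score = entanglement(v_dir, d_mixed)  # expected to be close to 0.8

print(f"disentangled model: {clean_score:.2f}")
print(f"entangled model:    {mixed_score:.2f}")
```

In this toy setup, a model whose distance direction overlaps with its vertical direction yields a high score, which is the kind of vertical-distance entanglement the abstract describes. With real models, the pairs would come from actual image-text inputs and the embeddings from a chosen layer, both of which are choices this sketch does not settle.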