Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

Abstract

Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a representation-level analysis framework that constructs minimal contrastive pairs to measure how spatial axes are organized and disentangled within VLM embeddings. Our analysis across multiple model families reveals a consistent vertical-distance entanglement: models conflate vertical image position with distance, mirroring the perspective bias of natural photographs. This bias produces a significant accuracy gap between perspective-consistent and counter-heuristic examples, and it intensifies under data scaling even as overall benchmark accuracy improves. We further show that models with similar benchmark scores can exhibit different internal representations, and that these differences predict accuracy and robustness across diverse spatial reasoning benchmarks. To isolate this bias from evaluation-set skew, we introduce SpatialTunnel, a synthetic benchmark designed to expose spatial shortcut biases by removing common correlations present in natural images. Experiments confirm that the entanglement is model-intrinsic and that models with well-separated spatial axes are more robust, suggesting that well-structured spatial representations lead to more reliable spatial reasoning. Code and benchmark are available on the project page: https://cheolhong0916.github.io/whyfarlooksup.github.io/.
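To make the idea of measuring axis entanglement from contrastive pairs concrete, the following is a minimal sketch and not the paper's actual metric. It assumes each contrastive pair yields two embedding vectors that differ along one spatial axis only, and it uses the cosine similarity between mean difference vectors as a stand-in entanglement score. The function names (`mean_direction`, `axis_entanglement`, `accuracy_gap`) and the toy three-dimensional embeddings are hypothetical and chosen only for illustration.

```python
import math


def mean_direction(pairs):
    """Average difference vector (b - a) over contrastive embedding pairs."""
    dim = len(pairs[0][0])
    total = [0.0] * dim
    for a, b in pairs:
        for i in range(dim):
            total[i] += b[i] - a[i]
    return [t / len(pairs) for t in total]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def axis_entanglement(vertical_pairs, distance_pairs):
    """Hypothetical score: cosine between the mean 'lower -> higher' and
    mean 'near -> far' directions. Values near 0 suggest separated axes,
    values near 1 suggest the two axes share a direction."""
    return cosine(mean_direction(vertical_pairs), mean_direction(distance_pairs))


def accuracy_gap(correct_consistent, correct_counter):
    """Accuracy on perspective-consistent examples minus accuracy on
    counter-heuristic examples (inputs are lists of 0/1 correctness flags)."""
    def accuracy(flags):
        return sum(flags) / len(flags)
    return accuracy(correct_consistent) - accuracy(correct_counter)


# Toy embeddings (assumed, not real model outputs).
vertical_pairs = [
    ([0.0, 0.0, 0.0], [0.0, 1.0, 0.0]),
    ([1.0, 0.0, 0.0], [1.0, 1.0, 0.0]),
]
# "Far" moves along the vertical direction as well: entangled.
entangled_distance_pairs = [
    ([0.0, 0.0, 0.0], [0.0, 1.0, 1.0]),
    ([1.0, 0.0, 0.0], [1.0, 1.0, 1.0]),
]
# "Far" moves along its own direction only: separated.
separated_distance_pairs = [
    ([0.0, 0.0, 0.0], [0.0, 0.0, 1.0]),
    ([1.0, 0.0, 0.0], [1.0, 0.0, 1.0]),
]

entangled_score = axis_entanglement(vertical_pairs, entangled_distance_pairs)
separated_score = axis_entanglement(vertical_pairs, separated_distance_pairs)
gap = accuracy_gap([1, 1, 1, 0], [1, 0, 0, 0])
```

In this toy setup the entangled model's score is about 0.71 and the separated model's score is 0, while the example correctness flags give an accuracy gap of 0.5. With real models, the embeddings would come from the VLM's hidden states for each image-text pair, and the choice of layer and pooling would be further assumptions.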