SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

Abstract

Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they must produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether they are genuinely grounded in spatial perception. In this work, we revisit spatial numerical understanding through SpaceNum, a unified framework that captures two complementary settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between vision-side spatial structure and language-side numerical representations, and we use them to systematically study whether current VLMs truly understand numerical values in spatial settings. Across both dynamic transitions and static layouts, we find that models largely fail to ground numbers in spatial meaning and often perform close to random guessing. Through error analysis, reasoning trace analysis, and controlled interventions, we show that current VLMs rely heavily on shallow spatial cues, struggle to build stable coordinate-aware representations, and fail to abstract structured spatial layouts from visual observations. We further show that explicit reasoning provides only marginal gains, whereas tuning can partially improve spatial numerical understanding and transfers to external spatial reasoning benchmarks.