SPACENUM: Revisiting Spatial Numerical Understanding in Vision-Language Models

Abstract

Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they must produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether they are genuinely grounded in spatial perception. In this work, we revisit spatial numerical understanding through SpaceNum, a unified framework that captures two complementary settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between vision-side spatial structure and language-side numerical representations, and we use them to systematically study whether current VLMs truly understand numerical values in spatial settings. Across both dynamic transitions and static layouts, we find that models largely fail to ground numbers in spatial meaning and often perform close to random guessing. Through error analysis, reasoning trace analysis, and controlled interventions, we show that current VLMs rely heavily on shallow spatial cues, struggle to build stable coordinate-aware representations, and fail to abstract structured spatial layouts from visual observations. We further show that explicit reasoning provides only marginal gains, while fine-tuning can partially improve spatial numerical understanding and transfers to external spatial reasoning benchmarks.