SpaceNum: Revisiting Spatial Numerical Understanding in Vision-Language Models

Abstract

Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they must produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether they are genuinely grounded in spatial perception. In this work, we revisit spatial numerical understanding through SpaceNum, a unified framework that captures two complementary settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between vision-side spatial structure and language-side numerical representations. We then systematically study whether current VLMs truly understand numerical values in spatial settings. Across both dynamic transitions and static layouts, we find that models largely fail to ground numbers in spatial meaning and often perform close to random guessing. Through error analysis, reasoning-trace analysis, and controlled interventions, we show that current VLMs rely heavily on shallow spatial cues, struggle to build stable coordinate-aware representations, and fail to abstract structured spatial layouts from visual observations. We further show that explicit reasoning provides only marginal gains, while fine-tuning can partially improve spatial numerical understanding and transfer to external spatial reasoning benchmarks.