SPACENUM: Revisiting Spatial Numerical Understanding in VLMs

Abstract

Vision-Language Models (VLMs) are increasingly deployed in embodied environments, where they must produce numerical outputs such as action magnitudes and spatial coordinates. Although these numbers appear meaningful, it remains unclear whether they are genuinely grounded in spatial perception. In this work, we revisit spatial numerical understanding through SpaceNum, a unified framework that captures two complementary settings: numbers as dynamic transitions during spatial exploration, and numbers as static layouts in spatial reasoning. We formulate two bidirectional tasks, Num2Space and Space2Num, to evaluate how well VLMs map between vision-side spatial structure and language-side numerical representations, and use them to systematically study whether current VLMs truly understand numerical values in spatial settings. Across both dynamic transitions and static layouts, we find that models largely fail to ground numbers in spatial meaning and often perform close to random guessing. Through error analysis, reasoning trace analysis, and controlled interventions, we show that current VLMs rely heavily on shallow spatial cues, struggle to build stable coordinate-aware representations, and fail to abstract structured spatial layouts from visual observations. We further show that explicit reasoning provides only marginal gains, while tuning can partially improve spatial numerical understanding and transfer to external spatial reasoning benchmarks.