Make Geometry Matter for Spatial Reasoning
March 27, 2026
Authors: Shihua Zhang, Qiuhong Shen, Shizun Wang, Tianbo Pan, Xinchao Wang
cs.AI
Abstract
Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent work attempts to address this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.
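To make the two components concrete, the following is a minimal NumPy sketch of the ideas as described in the abstract: random masking of 2D vision tokens during training, and a token-wise sigmoid gate that blends geometry tokens into the vision stream. The function names, the gate parameterization (a single linear layer over the concatenated streams), and the mask ratio are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def geometry_unleashing_mask(vision_tokens, mask_ratio=0.5, rng=rng):
    """Zero out a random fraction of 2D vision tokens (training-time only),
    weakening non-geometric shortcuts so the model must consult geometry
    tokens. vision_tokens: (num_tokens, dim). Hypothetical sketch."""
    n = vision_tokens.shape[0]
    num_masked = int(n * mask_ratio)
    idx = rng.choice(n, size=num_masked, replace=False)
    out = vision_tokens.copy()
    out[idx] = 0.0
    return out

def geometry_guided_fusion(vision_tokens, geometry_tokens, W_gate, b_gate):
    """Token-wise gated routing: a sigmoid gate in [0, 1] decides, per
    token, how much of the geometry stream to blend into the vision
    stream. Gate parameterization is an assumption for illustration."""
    gate_in = np.concatenate([vision_tokens, geometry_tokens], axis=-1)
    g = 1.0 / (1.0 + np.exp(-(gate_in @ W_gate + b_gate)))  # (n, 1)
    # Convex combination per token: g -> 1 amplifies geometry evidence.
    return (1.0 - g) * vision_tokens + g * geometry_tokens

# Toy usage: 8 tokens of dimension 4.
v = rng.normal(size=(8, 4))       # 2D vision tokens
geo = rng.normal(size=(8, 4))     # geometry tokens (e.g. from a 3D model)
W = rng.normal(size=(8, 1)) * 0.1 # gate weights over concatenated streams
b = np.zeros(1)

masked_v = geometry_unleashing_mask(v, mask_ratio=0.5)
fused = geometry_guided_fusion(masked_v, geo, W, b)
print(fused.shape)  # prints (8, 4)
```

Because the gate output lies in (0, 1), each fused token is an element-wise convex combination of the (masked) vision token and the geometry token, so geometry contributions are amplified smoothly rather than hard-switched.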