Make Geometry Matter for Spatial Reasoning

March 27, 2026
Authors: Shihua Zhang, Qiuhong Shen, Shizun Wang, Tianbo Pan, Xinchao Wang
cs.AI

Abstract

Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent work attempts to address this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that the naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.
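The two components can be illustrated with a minimal numpy sketch. Everything here is an assumption for illustration only — the `mask_ratio` knob, the per-token sigmoid gate over concatenated features, and the linear gate parameters are hypothetical stand-ins, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def geometry_unleashing_mask(vision_tokens, mask_ratio=0.5, rng=rng):
    """Zero out a random fraction of 2D vision tokens during training,
    weakening non-geometric shortcuts so the model must consult geometry.
    mask_ratio is a hypothetical hyperparameter."""
    n, _ = vision_tokens.shape
    idx = rng.choice(n, size=int(n * mask_ratio), replace=False)
    masked = vision_tokens.copy()
    masked[idx] = 0.0
    return masked

def geometry_guided_fusion(vision_tokens, geometry_tokens, w_gate, b_gate):
    """Per-token gated routing: a sigmoid gate decides how much each fused
    token draws from geometry features versus 2D vision features."""
    # Gate input: concatenated vision and geometry features per token.
    x = np.concatenate([vision_tokens, geometry_tokens], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate + b_gate)))  # (n, 1), in (0, 1)
    # Convex combination: gate -> 1 amplifies the geometry contribution.
    return (1.0 - gate) * vision_tokens + gate * geometry_tokens

# Toy example: 8 tokens of dimension 4.
n, d = 8, 4
vision = rng.normal(size=(n, d))
geometry = rng.normal(size=(n, d))
w = rng.normal(size=(2 * d, 1)) * 0.1
b = np.zeros(1)

masked = geometry_unleashing_mask(vision, mask_ratio=0.5)
fused = geometry_guided_fusion(masked, geometry, w, b)
```

In this sketch the gate is computed per token, so regions where geometric evidence dominates can route more of the fused representation through the geometry tokens, matching the adaptive-amplification behavior the abstract describes.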