

G^2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

November 26, 2025
作者: Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, Jiangmiao Pang
cs.AI

Abstract

Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G^2VLM, a geometry-grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G^2VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experimental results demonstrate G^2VLM is proficient in both tasks, achieving results comparable to state-of-the-art feed-forward 3D reconstruction models and better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G^2VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.
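To make the dual-branch idea in the abstract concrete, the following minimal PyTorch sketch shows one plausible way such a design could be wired: a shared visual geometry encoder produces 3D-aware feature tokens that feed both a feed-forward 3D point head (reconstruction branch) and a projector into an LLM's embedding space (spatial-reasoning branch). All module names, dimensions, and output formats below are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of a geometry-grounded dual-branch design (assumed, not the paper's code).
import torch
import torch.nn as nn


class GeometryGroundedBackbone(nn.Module):
    """Hypothetical shared encoder producing per-patch 3D-aware feature tokens."""

    def __init__(self, dim: int = 256, patch: int = 16):
        super().__init__()
        # Patchify RGB frames into a grid of feature tokens.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (B, 3, H, W) -> tokens: (B, N, dim)
        tokens = self.patch_embed(images).flatten(2).transpose(1, 2)
        return self.encoder(tokens)


class PointMapHead(nn.Module):
    """Feed-forward 3D branch: regress an (x, y, z) point per patch token."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 3))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.mlp(tokens)  # (B, N, 3)


class LanguageProjector(nn.Module):
    """Spatial-reasoning branch: map geometry tokens into the LLM embedding space."""

    def __init__(self, dim: int = 256, llm_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(dim, llm_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Output tokens could be interleaved with text tokens for in-context reasoning.
        return self.proj(tokens)  # (B, N, llm_dim)


if __name__ == "__main__":
    backbone = GeometryGroundedBackbone()
    point_head = PointMapHead()
    projector = LanguageProjector()

    frames = torch.randn(2, 3, 224, 224)   # two multi-view frames
    geo_tokens = backbone(frames)           # shared geometry features
    points = point_head(geo_tokens)         # 3D reconstruction branch
    llm_tokens = projector(geo_tokens)      # tokens for spatial reasoning in the LLM
    print(points.shape, llm_tokens.shape)   # (2, 196, 3) and (2, 196, 512)
```

The key point the sketch illustrates is that both branches consume the same learned geometry features, so the model can scale with abundant multi-view image and video data while still exposing 3D priors to the language side.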