G^2VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning
November 26, 2025
Authors: Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, Jiangmiao Pang
cs.AI
Abstract
Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G^2VLM, a geometry-grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G^2VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and to enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data while still benefiting from 3D visual priors that are typically derived only from hard-to-collect annotations. Experimental results demonstrate that G^2VLM is proficient in both tasks, achieving results comparable to state-of-the-art feed-forward 3D reconstruction models and better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G^2VLM can serve as a strong baseline for the community and unlock further applications, such as 3D scene editing.
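To make the high-level design more concrete, below is a minimal PyTorch sketch of the general pattern the abstract describes: 3D-aware geometry tokens produced from multi-view image features are used both for a low-level 3D prediction head and as extra context interleaved with language tokens for spatial reasoning. This is not the authors' implementation; all module names, dimensions, the depth head, and the concatenation-based fusion are illustrative assumptions.

```python
# Hypothetical sketch (not the paper's code): fusing learned 3D geometry
# tokens with language tokens in a single model, as a stand-in for a
# geometry-grounded VLM. Sizes and modules are assumptions for illustration.
import torch
import torch.nn as nn


class GeometryGroundedVLM(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_layers=4, vocab_size=32000):
        super().__init__()
        # Visual-geometry branch: projects per-view patch features into
        # 3D-aware tokens (a proxy for a learned multi-view geometry encoder).
        self.geometry_proj = nn.Linear(768, d_model)
        # Low-level 3D attribute head (here, a per-token depth value),
        # standing in for the feed-forward reconstruction objective.
        self.depth_head = nn.Linear(d_model, 1)
        # Language side: embeddings plus a shared transformer over the
        # interleaved [geometry tokens ; text tokens] sequence.
        self.text_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, view_feats, text_ids):
        # view_feats: (B, V*P, 768) patch features from multiple views
        # text_ids:   (B, T) token ids of a spatial question
        geo_tokens = self.geometry_proj(view_feats)            # (B, V*P, d)
        depth = self.depth_head(geo_tokens)                    # 3D attribute prediction
        txt_tokens = self.text_embed(text_ids)                 # (B, T, d)
        fused = self.fusion(torch.cat([geo_tokens, txt_tokens], dim=1))
        logits = self.lm_head(fused[:, geo_tokens.size(1):])   # answer logits over text positions
        return depth, logits


# Usage: two views of 196 patches each and a 16-token question.
model = GeometryGroundedVLM()
depth, logits = model(torch.randn(2, 392, 768), torch.randint(0, 32000, (2, 16)))
print(depth.shape, logits.shape)  # (2, 392, 1) and (2, 16, 32000)
```

The key design point this sketch tries to capture is that the same geometry tokens serve two consumers at once: a direct 3D prediction head and the language decoder, so the model can be trained on multi-view data while its spatial reasoning is conditioned on geometry rather than on 2D appearance alone.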