Video Geometric Representation Learning for Spatially Intelligent Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, so their representations fail to maintain geometric and spatial consistency across video frames. Because large-scale 3D data is scarce, we present GeoVR, a framework that learns geometric representations from 2D video sequences alone. GeoVR restructures the semantic latent space within the MLLM to unlock spatial intelligence: rather than superficially mixing features, it reshapes the model's internal representations by distilling geometric knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks show that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.
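
To make the four-target objective concrete, the sketch below shows one way such a multi-objective loss could be combined. The abstract does not specify the loss forms, the weights, or where the targets come from, so everything here is an assumption for illustration: the function names, the choice of squared error for pose, absolute error for depth, log-space error for scale, and cosine distance for feature distillation, the equal default weights, and the idea that all targets are produced by a teacher 3D model. It is written in plain Python on lists so that it runs without any dependencies.

```python
import math

def pose_loss(pred, target):
    # Assumed form: mean squared error between relative-pose vectors.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def depth_loss(pred, target):
    # Assumed form: mean absolute error over flattened dense depth values.
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def scale_loss(pred_scale, target_scale):
    # Assumed form: squared error in log space, so over- and
    # under-estimating the scale by the same ratio cost the same.
    return (math.log(pred_scale) - math.log(target_scale)) ** 2

def feature_distill_loss(student_feats, teacher_feats):
    # Assumed form: (1 - cosine similarity), averaged over feature scales.
    total = 0.0
    for s, t in zip(student_feats, teacher_feats):
        dot = sum(a * b for a, b in zip(s, t))
        norm = math.sqrt(sum(a * a for a in s)) * math.sqrt(sum(b * b for b in t))
        total += 1.0 - dot / norm
    return total / len(student_feats)

def geometric_loss(pred, teacher, weights=(1.0, 1.0, 1.0, 1.0)):
    # Weighted sum of the four geometric terms; the weights are hypothetical.
    terms = (
        pose_loss(pred["pose"], teacher["pose"]),
        depth_loss(pred["depth"], teacher["depth"]),
        scale_loss(pred["scale"], teacher["scale"]),
        feature_distill_loss(pred["feats"], teacher["feats"]),
    )
    return sum(w * t for w, t in zip(weights, terms))

# Toy example: targets from a (hypothetical) teacher and a slightly-off prediction.
teacher = {
    "pose": [0.0, 0.1, 0.0, 0.0, 0.0, 1.0],   # 6-DoF relative pose
    "depth": [1.0, 2.0, 4.0],                 # flattened depth map
    "scale": 2.0,                             # metric scale factor
    "feats": [[1.0, 0.0], [0.0, 1.0, 0.0]],   # one vector per feature scale
}
pred = dict(teacher, depth=[1.5, 2.0, 4.0], scale=1.8)
print(geometric_loss(pred, teacher))
```

In a real training setup each term would be computed on batched tensors from prediction heads attached to the MLLM, and the relative weights would likely need tuning so that no single objective dominates.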