Learning Geometric Representations from Video for Spatially Intelligent Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, so their representations fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations from 2D video sequences alone. This approach restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometric knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed dynamic viewpoint changes, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.
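
The four geometric targets can be read as terms of a single training objective. The sketch below is illustrative only: the abstract does not state how the terms are combined, so the weighted sum, the L1 and mean-squared-error forms, and every name (`GeometryLossWeights`, `combined_geometry_loss`, the dictionary keys) are assumptions rather than GeoVR's actual implementation. Plain Python lists stand in for tensors to keep the example self-contained.

```python
from dataclasses import dataclass


@dataclass
class GeometryLossWeights:
    """Hypothetical per-term weights; the abstract does not specify any."""
    pose: float = 1.0
    depth: float = 1.0
    scale: float = 1.0
    feature: float = 1.0


def l1(pred, target):
    """Mean absolute error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def mse(pred, target):
    """Mean squared error between two equal-length sequences of floats."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def combined_geometry_loss(student, teacher, weights=None):
    """Weighted sum of four geometric terms (assumed form, not from the paper).

    `student` holds the MLLM's predictions; `teacher` holds targets produced by
    a pre-trained 3D model. Both are dicts with the keys:
      "pose":     flattened inter-frame camera pose parameters
      "depth":    flattened dense depth values
      "scale":    a single metric scale factor
      "features": a list of flattened feature vectors, one per scale
    """
    w = weights if weights is not None else GeometryLossWeights()
    n_scales = len(student["features"])
    terms = {
        "pose": l1(student["pose"], teacher["pose"]),
        "depth": l1(student["depth"], teacher["depth"]),
        "scale": abs(student["scale"] - teacher["scale"]),
        # Average the distillation error over the feature scales.
        "feature": sum(
            mse(s, t) for s, t in zip(student["features"], teacher["features"])
        ) / n_scales,
    }
    total = (
        w.pose * terms["pose"]
        + w.depth * terms["depth"]
        + w.scale * terms["scale"]
        + w.feature * terms["feature"]
    )
    return total, terms
```

Returning the individual terms alongside the total makes it possible to monitor each constraint separately during training, which helps when tuning the weights.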