A Spatially Intelligent Multimodal Large Language Model That Learns Geometric Representations from Video

Abstract

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, so their representations fail to maintain geometric and spatial consistency across video frames. Because large-scale 3D data is scarce, we present GeoVR, a novel framework that learns geometric representations from purely 2D video sequences. The approach restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than relying on superficial feature mixing, GeoVR reshapes the MLLM's internal representations by distilling geometric knowledge from pre-trained 3D foundation models. It does so through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed changing viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks show that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.
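
The four-target objective described above could be combined roughly as in the following sketch. The abstract does not give the loss forms or how the terms are weighted, so everything here is an assumption made for illustration: mean squared error for pose, mean absolute error for depth, a log-ratio penalty for the metric scale factor, cosine distance for feature distillation, and equal weights. The function names and the dictionary layout are hypothetical and are not taken from GeoVR.

```python
import math

# Hypothetical equal weights; the abstract does not say how the terms are balanced.
DEFAULT_WEIGHTS = {"pose": 1.0, "depth": 1.0, "scale": 1.0, "distill": 1.0}


def pose_loss(pred, target):
    """Mean squared error between flattened relative-pose vectors (assumed form)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def depth_loss(pred, target):
    """Mean absolute error over per-pixel depth values (assumed form)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def scale_loss(pred, target):
    """Absolute log-ratio of two positive scale factors (assumed form)."""
    return abs(math.log(pred) - math.log(target))


def cosine_distance(a, b):
    """1 - cosine similarity; both vectors are assumed to be non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def distill_loss(student_feats, teacher_feats):
    """Average cosine distance across feature scales (assumed form)."""
    dists = [cosine_distance(s, t) for s, t in zip(student_feats, teacher_feats)]
    return sum(dists) / len(dists)


def total_loss(pred, teacher, weights=None):
    """Weighted sum of the four geometric terms; returns (total, per-term dict)."""
    weights = DEFAULT_WEIGHTS if weights is None else weights
    terms = {
        "pose": pose_loss(pred["pose"], teacher["pose"]),
        "depth": depth_loss(pred["depth"], teacher["depth"]),
        "scale": scale_loss(pred["scale"], teacher["scale"]),
        "distill": distill_loss(pred["feats"], teacher["feats"]),
    }
    total = sum(weights[name] * value for name, value in terms.items())
    return total, terms
```

Here `teacher` stands for the outputs of a pre-trained 3D foundation model on the same frames, and `pred` for the corresponding predictions read out from the MLLM. Returning the per-term values alongside the total makes it easy to check whether one objective dominates the others during training.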