Video-Based Geometric Representation Learning for Spatially Intelligent Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, so their representations fail to maintain geometric and spatial consistency across video frames. Because large-scale 3D data is scarce, we present GeoVR, a framework that learns geometric representations from 2D video sequences alone. Rather than superficially mixing in external features, GeoVR restructures the semantic latent space of the MLLM by distilling geometric knowledge from pre-trained 3D foundation models, thereby unlocking spatial intelligence. It does so through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks show that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.
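As an illustration only, the four objectives listed above could be combined into a single weighted training loss along the following lines. The abstract does not specify the loss forms, so everything here is an assumption made for the sketch: the L1 pose and depth terms, the log-ratio scale term, the cosine-distance distillation term, the equal default weights, and the names `geometric_loss`, `l1`, and `cosine_distance`. Poses and depth maps are represented as flat lists of floats to keep the example dependency-free.

```python
import math


def l1(pred, target):
    """Mean absolute error between two equal-length sequences of floats."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def cosine_distance(a, b):
    """One minus the cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm + 1e-8)


def geometric_loss(pred, teacher, weights=None):
    """Weighted sum of four geometric loss terms (hypothetical formulation).

    `pred` holds the MLLM-side predictions and `teacher` the targets produced
    by a pre-trained 3D foundation model. Both are dicts with keys:
      "pose"     - flattened inter-frame camera pose parameters
      "depth"    - flattened dense depth map
      "scale"    - positive metric scale factor
      "features" - list of feature vectors, one per scale
    """
    # Equal weights are an assumption; the source does not give the weighting.
    weights = weights or {"pose": 1.0, "depth": 1.0, "scale": 1.0, "distill": 1.0}
    terms = {
        # (1) inter-frame camera pose estimation
        "pose": l1(pred["pose"], teacher["pose"]),
        # (2) dense depth map regression
        "depth": l1(pred["depth"], teacher["depth"]),
        # (3) metric scale factor, compared in log space so that
        #     over- and under-estimation by the same ratio cost the same
        "scale": abs(math.log(pred["scale"]) - math.log(teacher["scale"])),
        # (4) multi-scale feature distillation, averaged over scales
        "distill": sum(
            cosine_distance(s, t)
            for s, t in zip(pred["features"], teacher["features"])
        ) / len(pred["features"]),
    }
    total = sum(weights[name] * value for name, value in terms.items())
    return total, terms
```

Returning the individual terms alongside the total makes it easy to monitor each objective separately during training. In a real system these terms would be computed on tensors in an autodiff framework rather than on Python lists.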