Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors

May 30, 2025
Authors: Duo Zheng, Shijia Huang, Yanyang Li, Liwei Wang
cs.AI

Abstract

Previous research has investigated the application of Multimodal Large Language Models (MLLMs) to understanding 3D scenes by interpreting them as videos. These approaches generally depend on comprehensive 3D data inputs, such as point clouds or reconstructed Bird's-Eye View (BEV) maps. In this work, we advance the field by enhancing the ability of MLLMs to understand and reason about 3D space directly from video data, without any additional 3D input. We propose a novel and efficient method, the Video-3D Geometry Large Language Model (VG LLM). Our approach employs a 3D visual geometry encoder that extracts 3D prior information from video sequences; this information is integrated with the visual tokens and fed into the MLLM. Extensive experiments show that our method achieves substantial improvements on a variety of 3D scene understanding and spatial reasoning tasks, all learned directly from video sources. Impressively, our 4B model, which does not rely on explicit 3D data inputs, achieves results competitive with existing state-of-the-art methods and even surpasses Gemini-1.5-Pro on the VSI-Bench evaluation.
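To make the described pipeline concrete, below is a minimal PyTorch-style sketch of the forward pass: a geometry encoder extracts 3D priors from raw video frames, these are fused with the visual tokens, and the fused tokens are fed to the language model. All module names, dimensions, interfaces, and the concatenate-then-project fusion shown here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class VGLLMSketch(nn.Module):
    """Hypothetical sketch of the VG LLM forward pass described in the abstract."""

    def __init__(self, visual_encoder, geometry_encoder, llm, d_model=3072):
        super().__init__()
        self.visual_encoder = visual_encoder      # e.g., a ViT producing per-patch visual tokens
        self.geometry_encoder = geometry_encoder  # 3D visual geometry encoder (assumed interface)
        # Assumed fusion: concatenate visual and geometry features, project to LLM width.
        self.fusion_proj = nn.Linear(2 * d_model, d_model)
        self.llm = llm                            # the backbone MLLM (assumed interface)

    def forward(self, video_frames, text_tokens):
        # video_frames: (B, num_frames, 3, H, W) -- raw video only; no point clouds or BEV maps.
        vis_tokens = self.visual_encoder(video_frames)    # (B, N, d_model) visual tokens
        geo_tokens = self.geometry_encoder(video_frames)  # (B, N, d_model) 3D geometry priors
        # Integrate the 3D prior information with the visual tokens before the LLM.
        fused = self.fusion_proj(torch.cat([vis_tokens, geo_tokens], dim=-1))
        # Feed fused tokens to the language model alongside the text prompt
        # (calling convention here is hypothetical).
        return self.llm(inputs_embeds=fused, text_tokens=text_tokens)
```

The key design point the abstract emphasizes is that both encoders consume only the video frames, so the 3D prior comes from the geometry encoder's learned cues rather than from explicit 3D inputs.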