Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors

May 30, 2025
Authors: Duo Zheng, Shijia Huang, Yanyang Li, Liwei Wang
cs.AI

Abstract

Previous research has investigated the application of Multimodal Large Language Models (MLLMs) to understanding 3D scenes by interpreting them as videos. These approaches generally depend on comprehensive 3D data inputs, such as point clouds or reconstructed Bird's-Eye View (BEV) maps. In this work, we advance the field by enhancing the ability of MLLMs to understand and reason about 3D space directly from video data, without any additional 3D input. We propose a novel and efficient method, the Video-3D Geometry Large Language Model (VG LLM). Our approach employs a 3D visual geometry encoder that extracts 3D prior information from video sequences; this information is integrated with the visual tokens and fed into the MLLM. Extensive experiments show that our method achieves substantial improvements on a variety of 3D scene understanding and spatial reasoning tasks, all learned directly from video sources. Impressively, our 4B model, which does not rely on explicit 3D data inputs, achieves results competitive with existing state-of-the-art methods and even surpasses Gemini-1.5-Pro on the VSI-Bench evaluation.
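
The abstract only names the components, so as a rough illustration of the data flow it describes (per-frame 3D geometry priors fused with 2D visual tokens, then projected into the language model's embedding space), here is a minimal PyTorch sketch. The placeholder encoders, module names, and dimensions are our own assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch of the fusion described in the abstract: a 3D visual
# geometry encoder produces geometry-prior features from video frames,
# which are fused with the ordinary 2D visual tokens before being
# handed to the MLLM. Every name and dimension below is a hypothetical
# stand-in; the paper's real encoders are pretrained networks.
import torch
import torch.nn as nn


class GeometryAwareVisualEncoder(nn.Module):
    def __init__(self, patch_pixels=3 * 14 * 14, vis_dim=1024,
                 geo_dim=768, llm_dim=3584):
        super().__init__()
        # Stand-ins for the real backbones: a 2D vision encoder (e.g. a ViT)
        # and a 3D visual geometry encoder that supplies 3D priors.
        self.visual_encoder = nn.Linear(patch_pixels, vis_dim)
        self.geometry_encoder = nn.Linear(patch_pixels, geo_dim)
        # Project the concatenated (appearance + geometry) token into the
        # LLM's embedding space, as the abstract's "integrated with visual
        # tokens and fed into the MLLM" step suggests.
        self.fusion_proj = nn.Linear(vis_dim + geo_dim, llm_dim)

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, frames, num_patches, patch_pixels)
        vis_tok = self.visual_encoder(patches)    # 2D appearance tokens
        geo_tok = self.geometry_encoder(patches)  # 3D geometry-prior tokens
        fused = torch.cat([vis_tok, geo_tok], dim=-1)
        return self.fusion_proj(fused)            # tokens consumed by the MLLM


# Usage: two frames of 196 patches each become LLM-ready multimodal tokens.
enc = GeometryAwareVisualEncoder()
frames = torch.randn(1, 2, 196, 3 * 14 * 14)
tokens = enc(frames)
print(tokens.shape)  # torch.Size([1, 2, 196, 3584])
```

The key design point the sketch tries to capture is that the geometry priors are injected at the token level, before the language model, so the MLLM itself needs no explicit 3D inputs such as point clouds or BEV maps.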