Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors

Abstract

Previous research has applied Multimodal Large Language Models (MLLMs) to 3D scene understanding by interpreting scenes as videos. These approaches generally depend on comprehensive 3D data inputs, such as point clouds or reconstructed Bird's-Eye View (BEV) maps. We advance this line of work by enhancing the ability of MLLMs to understand and reason about 3D space directly from video data, without any additional 3D input. We propose a novel and efficient method, the Video-3D Geometry Large Language Model (VG LLM). It employs a 3D visual geometry encoder that extracts 3D prior information from video sequences; this information is integrated with the visual tokens and fed into the MLLM. Extensive experiments show that our method achieves substantial improvements on a range of 3D scene understanding and spatial reasoning tasks, all learned directly from video sources. Notably, our 4B model, which does not rely on explicit 3D data inputs, achieves results competitive with existing state-of-the-art methods and even surpasses Gemini-1.5-Pro on the VSI-Bench evaluations.
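The abstract says that 3D prior information from the geometry encoder is integrated with the visual tokens, but it does not specify the fusion operator. The sketch below illustrates one plausible reading under stated assumptions: one geometry feature per visual token, a linear projection to the visual token width, and element-wise addition. The names fuse_tokens, project, and proj_weight are hypothetical and do not come from the paper, and the toy numbers are made up for illustration only.

```python
from typing import List

Vector = List[float]


def project(vec: Vector, weight: List[Vector]) -> Vector:
    """Apply a linear map; `weight` holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def fuse_tokens(
    visual: List[Vector],
    geometry: List[Vector],
    weight: List[Vector],
) -> List[Vector]:
    """Project each geometry feature to the visual width and add it to
    the matching visual token (an assumed fusion rule, not the paper's)."""
    if len(visual) != len(geometry):
        raise ValueError("expected one geometry feature per visual token")
    fused = []
    for v_tok, g_feat in zip(visual, geometry):
        g_proj = project(g_feat, weight)
        if len(g_proj) != len(v_tok):
            raise ValueError("projected geometry width must match visual width")
        fused.append([v + g for v, g in zip(v_tok, g_proj)])
    return fused


# Toy example: 2 tokens, visual width 2, geometry width 3.
visual_tokens = [[1.0, 2.0], [3.0, 4.0]]
geometry_feats = [[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]]
proj_weight = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # hypothetical 2 x 3 weights

fused_tokens = fuse_tokens(visual_tokens, geometry_feats, proj_weight)
print(fused_tokens)
```

Under this reading the fused sequence has the same length and width as the visual tokens, so it could be passed to the language model without changing its input shape. Other fusion choices, such as concatenation or cross-attention, would be equally consistent with the abstract.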
