Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Abstract

Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm
