Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models
March 18, 2026
Authors: Kevin Qu, Haozhe Qi, Mihai Dusmanu, Mahdi Rad, Rui Wang, Marc Pollefeys
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction to build a holistic representation of the scene structure, and explicit situation modeling to anchor the egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm
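To make the idea of two joint spatial objectives concrete, below is a minimal sketch of how such a combined training loss might be assembled, assuming the spatial supervision enters as weighted regression terms on top of the usual language-modeling loss. The function name, loss forms, pose parameterization, and weights are hypothetical illustrations chosen for clarity, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def joint_spatial_objective(lm_logits, lm_targets,
                            pred_layout, gt_layout,
                            pred_pose, gt_pose,
                            w_layout=1.0, w_situation=1.0):
    """Combine a language-modeling loss with two hypothetical spatial terms:
    global layout reconstruction and situation (egocentric pose) modeling.
    Loss forms and weights are illustrative placeholders."""
    # Standard next-token cross-entropy over the VLM's text head.
    lm_loss = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)),
                              lm_targets.view(-1))
    # Layout reconstruction: regress a coarse scene representation
    # (here a dense point layout, penalized with an L1 term).
    layout_loss = F.l1_loss(pred_layout, gt_layout)
    # Situation modeling: regress the egocentric camera pose
    # (here a simple L2 term on a flattened pose vector).
    situation_loss = F.mse_loss(pred_pose, gt_pose)
    return lm_loss + w_layout * layout_loss + w_situation * situation_loss

# Illustrative usage with dummy tensors (shapes are arbitrary examples).
lm_logits = torch.randn(2, 5, 10)           # (batch, sequence, vocabulary)
lm_targets = torch.randint(0, 10, (2, 5))   # next-token targets
pred_layout = torch.randn(2, 256, 3)        # hypothetical coarse point layout
gt_layout = torch.randn(2, 256, 3)
pred_pose = torch.randn(2, 7)               # hypothetical pose (translation + quaternion)
gt_pose = torch.randn(2, 7)
loss = joint_spatial_objective(lm_logits, lm_targets,
                               pred_layout, gt_layout,
                               pred_pose, gt_pose)
```

The design point this sketch captures is that both spatial terms are optimized jointly with the language objective, so the model receives direct 3D supervision rather than only geometric cues in its inputs; how the actual layout and pose targets are defined and aligned to metric scale is detailed in the paper, not here.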