Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

Abstract

Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching models to reason in 3D space. We introduce Loc3R-VLM, a framework that equips 2D Vision-Language Models with advanced 3D understanding capabilities from monocular video input. Inspired by human spatial cognition, Loc3R-VLM relies on two joint objectives: global layout reconstruction, which builds a holistic representation of the scene structure, and explicit situation modeling, which anchors the egocentric perspective. These objectives provide direct spatial supervision that grounds both perception and language in a 3D context. To ensure geometric consistency and metric-scale alignment, we leverage lightweight camera pose priors extracted from a pre-trained 3D foundation model. Loc3R-VLM achieves state-of-the-art performance in language-based localization and outperforms existing 2D- and video-based approaches on situated and general 3D question-answering benchmarks, demonstrating that our spatial supervision framework enables strong 3D understanding. Project page: https://kevinqu7.github.io/loc3r-vlm
