Learning from Videos for 3D World: Enhancing MLLMs with 3D Vision Geometry Priors

Abstract

Previous research has applied Multimodal Large Language Models (MLLMs) to 3D scene understanding by interpreting scenes as videos. These approaches generally depend on comprehensive 3D data inputs, such as point clouds or reconstructed Bird's-Eye View (BEV) maps. In this work, we advance the field by enhancing the ability of MLLMs to understand and reason about 3D space directly from video data, without additional 3D input. We propose a novel and efficient method, the Video-3D Geometry Large Language Model (VG LLM). Our approach employs a 3D visual geometry encoder that extracts 3D prior information from video sequences. This information is integrated with visual tokens and fed into the MLLM. Extensive experiments show that our method achieves substantial improvements on a range of 3D scene understanding and spatial reasoning tasks, all learned directly from video sources. Notably, our 4B model, which does not rely on explicit 3D data inputs, achieves results competitive with existing state-of-the-art methods and even surpasses Gemini-1.5-Pro on the VSI-Bench evaluation.
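
The abstract states that the 3D prior information is integrated with visual tokens but does not say how. As a rough illustration only, the sketch below assumes one geometry feature per visual token (aligned per image patch), a linear projection to the token dimension, and element-wise addition. The function names, the weight matrix, and the fusion rule are all hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(features: Vector, weights: Matrix) -> Vector:
    """Apply a linear map; weights has one row per output dimension."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def fuse_tokens(visual_tokens: List[Vector],
                geometry_features: List[Vector],
                weights: Matrix) -> List[Vector]:
    """Add projected geometry features to the matching visual tokens.

    Assumption (not from the abstract): tokens and features are aligned
    one-to-one per image patch, and fusion is element-wise addition.
    """
    if len(visual_tokens) != len(geometry_features):
        raise ValueError("expected one geometry feature per visual token")
    fused = []
    for token, feature in zip(visual_tokens, geometry_features):
        projected = project(feature, weights)
        if len(projected) != len(token):
            raise ValueError("projection must match the token dimension")
        fused.append([t + p for t, p in zip(token, projected)])
    return fused


# Toy example: two patches, 2-D visual tokens, 3-D geometry features.
visual = [[1.0, 0.0], [0.0, 1.0]]
geometry = [[1.0, 2.0, 3.0], [0.0, 0.0, 1.0]]
W = [[1.0, 0.0, 0.0],   # output dim 0 copies geometry dim 0
     [0.0, 0.0, 1.0]]   # output dim 1 copies geometry dim 2
fused = fuse_tokens(visual, geometry, W)
```

In this toy setup the fused sequence has the same length and width as the visual tokens, so under these assumptions it could be passed to the language model without changing its input interface.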
