SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Abstract

Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics. While Vision Language Models (VLMs) have demonstrated remarkable performance on certain VQA benchmarks, they still lack capabilities in 3D spatial reasoning, such as recognizing quantitative relationships between physical objects like distances or size differences. We hypothesize that VLMs' limited spatial reasoning capability is due to the lack of 3D spatial knowledge in their training data, and we aim to solve this problem by training VLMs with Internet-scale spatial reasoning data. To this end, we present a system to facilitate this approach. We first develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images. We then investigate various factors in the training recipe, including data quality, training pipeline, and VLM architecture. Our work features the first Internet-scale 3D spatial reasoning dataset in metric space. By training a VLM on such data, we significantly enhance its ability on both qualitative and quantitative spatial VQA. Finally, we demonstrate that this VLM unlocks novel downstream applications in chain-of-thought spatial reasoning and robotics due to its quantitative estimation capability. Project website: https://spatial-vlm.github.io/
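
The abstract does not spell out how the spatial VQA examples are produced, so the following is only a minimal sketch of what "qualitative" and "quantitative" spatial question-answer pairs in metric space could look like. It assumes each object has already been reduced to a name and a 3D center point in meters, with x increasing to the right; that assumption, the function names, and the question templates are all hypothetical and are not taken from the paper.

```python
import math

# Hypothetical helpers for illustration only; not the authors' pipeline.
# Assumption: each object is a name plus a 3D center (x, y, z) in meters,
# with x increasing toward the right of the scene.

def distance_qa(name_a, center_a, name_b, center_b):
    """Build one quantitative QA pair: metric distance between two objects."""
    dist = math.dist(center_a, center_b)  # Euclidean distance in meters
    question = f"How far apart are the {name_a} and the {name_b}?"
    answer = f"About {dist:.2f} meters."
    return question, answer

def left_of_qa(name_a, center_a, name_b, center_b):
    """Build one qualitative QA pair: is object A to the left of object B?"""
    question = f"Is the {name_a} to the left of the {name_b}?"
    answer = "Yes." if center_a[0] < center_b[0] else "No."
    return question, answer

if __name__ == "__main__":
    mug = ("mug", (0.0, 0.0, 1.0))
    laptop = ("laptop", (0.3, 0.4, 1.0))
    print(distance_qa(*mug, *laptop))
    print(left_of_qa(*mug, *laptop))
```

Applying templates like these to many object pairs per image is one plausible way a modest number of images could yield a much larger number of QA examples, though the actual templates and 3D lifting steps are not described in the abstract.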
