SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
January 22, 2024
Authors: Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, Fei Xia
cs.AI
Abstract
Understanding and reasoning about spatial relationships is a fundamental
capability for Visual Question Answering (VQA) and robotics. While
Vision-Language Models (VLMs) have demonstrated remarkable performance in certain VQA
benchmarks, they still lack capabilities in 3D spatial reasoning, such as
recognizing quantitative relationships of physical objects like distances or
size differences. We hypothesize that VLMs' limited spatial reasoning
capability is due to the lack of 3D spatial knowledge in training data and aim
to solve this problem by training VLMs with Internet-scale spatial reasoning
data. To this end, we present a system to facilitate this approach. We first
develop an automatic 3D spatial VQA data generation framework that scales up to
2 billion VQA examples on 10 million real-world images. We then investigate
various factors in the training recipe, including data quality, training
pipeline, and VLM architecture. Our work features the first Internet-scale 3D
spatial reasoning dataset in metric space. By training a VLM on such data, we
significantly enhance its ability on both qualitative and quantitative spatial
VQA. Finally, we demonstrate that this VLM unlocks novel downstream
applications in chain-of-thought spatial reasoning and robotics due to its
quantitative estimation capability. Project website:
https://spatial-vlm.github.io/