SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

January 22, 2024
Authors: Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Danny Driess, Pete Florence, Dorsa Sadigh, Leonidas Guibas, Fei Xia
cs.AI

Abstract

Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics. While Vision-Language Models (VLMs) have demonstrated remarkable performance on certain VQA benchmarks, they still lack capabilities in 3D spatial reasoning, such as recognizing quantitative relationships between physical objects, like distances or size differences. We hypothesize that VLMs' limited spatial reasoning capability is due to the lack of 3D spatial knowledge in their training data, and we aim to solve this problem by training VLMs with Internet-scale spatial reasoning data. To this end, we present a system to facilitate this approach. We first develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images. We then investigate various factors in the training recipe, including data quality, training pipeline, and VLM architecture. Our work features the first Internet-scale 3D spatial reasoning dataset in metric space. By training a VLM on such data, we significantly enhance its ability in both qualitative and quantitative spatial VQA. Finally, we demonstrate that this VLM unlocks novel downstream applications in chain-of-thought spatial reasoning and robotics thanks to its quantitative estimation capability. Project website: https://spatial-vlm.github.io/
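
To make the data-generation idea concrete, below is a minimal, hypothetical sketch of the kind of quantitative spatial QA pair such a framework could emit. It assumes objects have already been lifted to metric 3D coordinates (e.g., from a monocular depth estimate plus camera intrinsics); the object names, positions, and question template are invented here for illustration and are not taken from the paper's actual pipeline.

```python
import math
import random

# Invented example scene: each detected object carries a metric 3D
# position (x, y, z) in meters, e.g. lifted from monocular depth
# plus camera intrinsics. Names and coordinates are hypothetical.
objects = {
    "coffee mug": (0.42, 0.10, 1.35),
    "laptop": (0.05, 0.12, 1.10),
}

def euclidean_distance(p, q):
    """Metric distance between two 3D points, in meters."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def make_distance_qa(name_a, name_b):
    """Render one quantitative spatial VQA pair from object positions."""
    d = euclidean_distance(objects[name_a], objects[name_b])
    question = f"How far apart are the {name_a} and the {name_b}?"
    answer = f"They are roughly {d:.2f} meters apart."
    return {"question": question, "answer": answer}

if __name__ == "__main__":
    # Sample one object pair and print the generated QA example.
    a, b = random.sample(list(objects), 2)
    print(make_distance_qa(a, b))
```

Applied across millions of images and many relation templates (distances, size comparisons, left/right ordering, and so on), this style of templated generation is one plausible way a dataset could reach the scale the abstract describes.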