SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Abstract

Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics. While Vision Language Models (VLMs) have demonstrated remarkable performance on certain VQA benchmarks, they still lack capabilities in 3D spatial reasoning, such as recognizing quantitative relationships between physical objects like distances or size differences. We hypothesize that VLMs' limited spatial reasoning capability is due to the lack of 3D spatial knowledge in their training data, and we aim to solve this problem by training VLMs with Internet-scale spatial reasoning data. To this end, we present a system to facilitate this approach. We first develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images. We then investigate various factors in the training recipe, including data quality, training pipeline, and VLM architecture. Our work features the first Internet-scale 3D spatial reasoning dataset in metric space. By training a VLM on such data, we significantly enhance its ability in both qualitative and quantitative spatial VQA. Finally, we demonstrate that this VLM unlocks novel downstream applications in chain-of-thought spatial reasoning and robotics due to its quantitative estimation capability. Project website: https://spatial-vlm.github.io/
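
To make the distinction between qualitative and quantitative spatial VQA concrete, the following is a minimal sketch of how question-answer pairs might be templated once objects have metric 3D positions. It is not the data generation framework described above: the `Object3D` record, the function names, the question templates, and the convention that x increases to the right are all assumptions introduced for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Object3D:
    """Hypothetical record: an object label and its centroid in meters."""
    name: str
    center: tuple[float, float, float]  # (x, y, z) in an assumed metric frame


def distance_m(a: Object3D, b: Object3D) -> float:
    """Euclidean distance between two object centroids, in meters."""
    return math.dist(a.center, b.center)


def make_quantitative_qa(a: Object3D, b: Object3D) -> tuple[str, str]:
    """Build a quantitative QA pair whose answer is a metric distance."""
    d = distance_m(a, b)
    question = f"How far is the {a.name} from the {b.name}?"
    answer = f"The {a.name} is about {d:.2f} meters from the {b.name}."
    return question, answer


def make_qualitative_qa(a: Object3D, b: Object3D) -> tuple[str, str]:
    """Build a qualitative (yes/no) QA pair about left/right ordering.

    Assumes x increases to the right, so a smaller x means "further left".
    """
    question = f"Is the {a.name} to the left of the {b.name}?"
    answer = "Yes" if a.center[0] < b.center[0] else "No"
    return question, answer


if __name__ == "__main__":
    # Illustrative values only; these are not taken from any dataset.
    mug = Object3D("mug", (0.0, 0.0, 0.0))
    laptop = Object3D("laptop", (3.0, 4.0, 0.0))
    print(make_quantitative_qa(mug, laptop))
    print(make_qualitative_qa(mug, laptop))
```

In this sketch the quantitative answer carries a number in metric units, while the qualitative answer only states a relation. That is the difference the abstract draws between the two kinds of spatial VQA.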
