Beyond 3D VQA: Enhancing Geometric Reasoning by Injecting 3D Spatial Priors into Vision-Language Models

Abstract

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that fine-tune on 3D visual question-answering (VQA) datasets may overfit to dataset-specific biases, while methods that integrate specialized 3D visual encoders are often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective that leverages ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while depth-consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that the internal correspondence matching accuracy of standard VLMs is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, raising peak layer-wise correspondence accuracy to over 70% and maintaining over 85% temporal robustness, while baselines remain below 5%. These internal improvements translate into significant gains on downstream spatial benchmarks, including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.
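To make the correspondence objective and the matching-accuracy diagnostic concrete, the following is a minimal sketch, not the GASP implementation. It assumes an InfoNCE-style contrastive loss over cosine similarities, which the abstract does not specify, and it omits both the correspondence head and the depth-consistency term. The function names, the temperature value, and the toy features are hypothetical and chosen only for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (small epsilon avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in v)) + 1e-12
    return [x / norm for x in v]

def similarity_matrix(feats_a, feats_b):
    """Cosine similarity between every point in view A and every point in view B."""
    a = [l2_normalize(v) for v in feats_a]
    b = [l2_normalize(v) for v in feats_b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]

def info_nce_loss(feats_a, feats_b, temperature=0.07):
    """Mean InfoNCE loss, assuming point i in view A matches point i in view B.

    The temperature of 0.07 is an illustrative assumption, not a value
    taken from the paper.
    """
    sims = similarity_matrix(feats_a, feats_b)
    total = 0.0
    for i, row in enumerate(sims):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]  # negative log-softmax of the true match
    return total / len(sims)

def matching_accuracy(feats_a, feats_b):
    """Fraction of points in view A whose nearest neighbour in view B is the true match."""
    sims = similarity_matrix(feats_a, feats_b)
    hits = 0
    for i, row in enumerate(sims):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(sims)

# Toy features: three points seen in two views (hypothetical data).
view_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_b = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # view-invariant features
shuffled_b = [[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]]  # features that ignore geometry

print(matching_accuracy(view_a, aligned_b), info_nce_loss(view_a, aligned_b))
print(matching_accuracy(view_a, shuffled_b), info_nce_loss(view_a, shuffled_b))
```

In this toy setting, view-invariant features give a low loss and perfect nearest-neighbour matching, while features that do not track the same 3D point across views give a high loss and zero matches. The layer-wise diagnostic described in the abstract could be approximated by computing such an accuracy on the hidden states of each transformer layer, although the exact protocol is not given in this section.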