Beyond 3D VQA: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Abstract

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that fine-tune on 3D visual question-answering (VQA) datasets may overfit to dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective that leverages ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while depth-consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that the internal correspondence matching accuracy of standard VLMs is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence accuracy to over 70% and maintaining over 85% temporal robustness, while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks, including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.
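
To make the two correspondence-related ideas in the abstract concrete, the following is a minimal, illustrative sketch of (a) a nearest-neighbour correspondence-accuracy diagnostic and (b) a generic InfoNCE-style contrastive loss over ground-truth point pairs. The abstract does not specify GASP's exact loss, head architecture, or temperature, so everything here is an assumption: the function names (`cosine_similarity_matrix`, `correspondence_accuracy`, `info_nce_loss`), the cosine similarity, the default temperature of 0.07, and the convention that feature `i` in one view matches feature `i` in the other are all hypothetical. The depth-consistency term is not covered.

```python
import math


def cosine_similarity_matrix(feats_a, feats_b):
    """Pairwise cosine similarity between two lists of feature vectors."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0  # avoid division by zero
        return [x / norm for x in v]

    a = [normalize(v) for v in feats_a]
    b = [normalize(v) for v in feats_b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def correspondence_accuracy(sim):
    """Fraction of rows whose best-scoring column is the true match.

    Assumption: row i and column i form a ground-truth corresponding pair.
    """
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return hits / len(sim)


def info_nce_loss(sim, temperature=0.07):
    """Mean cross-entropy of selecting the true match (the diagonal) per row.

    Generic InfoNCE form, not necessarily the paper's exact objective.
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(sim)


# Toy features for three points seen in two views (hypothetical values).
view_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
sim = cosine_similarity_matrix(view_a, view_b)
print(correspondence_accuracy(sim), info_nce_loss(sim))
```

In this reading, the diagnostic reported in the abstract would correspond to evaluating something like `correspondence_accuracy` on features taken from each transformer layer, while the contrastive term would push that accuracy up by minimizing a loss of the `info_nce_loss` kind.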