Beyond 3D VQA: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

Abstract

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that fine-tune on 3D visual question-answering (VQA) datasets may overfit to dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a lightweight correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective that leverages ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides diagnostic evidence that the internal correspondence matching accuracy of standard VLMs is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, raising peak layer-wise correspondence accuracy to over 70% and maintaining temporal robustness above 85%, while baselines remain below 5%. These internal improvements translate into significant gains on downstream spatial benchmarks, including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.
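
To make the correspondence objective and its diagnostic concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not give the exact form of the contrastive loss, so an InfoNCE-style loss over cosine similarities is assumed here; the function names, the temperature value of 0.07, and the plain-list feature representation are all hypothetical. Row i of each feature list is taken to describe the same ground-truth 3D point observed in two different views.

```python
import math


def _normalize(v):
    """Scale a feature vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def similarity_matrix(feats_a, feats_b):
    """Cosine similarity between every feature in view A and every feature in view B."""
    a = [_normalize(v) for v in feats_a]
    b = [_normalize(v) for v in feats_b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]


def info_nce_loss(feats_a, feats_b, temperature=0.07):
    """Assumed InfoNCE-style contrastive loss.

    feats_a[i] and feats_b[i] are a positive pair (same 3D point);
    every other feature in view B acts as a negative for feats_a[i].
    """
    sim = similarity_matrix(feats_a, feats_b)
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        # Numerically stable log-sum-exp over all candidates in view B.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true correspondence.
        total += log_z - logits[i]
    return total / len(sim)


def matching_accuracy(feats_a, feats_b):
    """Fraction of points in view A whose nearest neighbour in view B is the true match."""
    sim = similarity_matrix(feats_a, feats_b)
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=row.__getitem__)
        if best == i:
            hits += 1
    return hits / len(sim)


if __name__ == "__main__":
    view_a = [[1.0, 0.0], [0.0, 1.0]]
    view_b = [[0.9, 0.1], [0.1, 0.9]]  # toy features, roughly view-invariant
    print("loss:", info_nce_loss(view_a, view_b))
    print("accuracy:", matching_accuracy(view_a, view_b))
```

Under these assumptions, `matching_accuracy` plays the role of the layer-wise correspondence diagnostic (nearest-neighbour matching evaluated on features taken from a given layer), and minimizing `info_nce_loss` pushes features of the same point in different views together, which is one way to read "2D view-invariance". The depth consistency term is not sketched because the abstract does not describe its form.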