Beyond 3D VQA: Improving Geometric Reasoning by Injecting 3D Spatial Priors into Vision-Language Models

Abstract

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods fine-tune on 3D visual question-answering (VQA) datasets, which can overfit to dataset-specific biases, or integrate specialized 3D visual encoders, which is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective that leverages ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while depth-consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that the internal correspondence matching accuracy of standard VLMs is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, raising peak layer-wise correspondence accuracy to over 70% and maintaining over 85% temporal robustness, while baselines remain below 5%. These internal improvements translate into significant gains on downstream spatial benchmarks, including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.
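
To make the contrastive correspondence objective and the matching-accuracy diagnostic mentioned in the abstract more concrete, the following is a minimal NumPy sketch. It is not the GASP implementation: the abstract does not specify the form of the loss, so a symmetric InfoNCE loss with a temperature of 0.07 is assumed here, and the function names (`correspondence_infonce`, `matching_accuracy`), the feature shapes, and the synthetic data are all hypothetical. The depth-consistency term is not sketched.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def correspondence_infonce(feat_a, feat_b, temperature=0.07):
    """Symmetric InfoNCE over N matched points (assumed loss form).

    feat_a[i] and feat_b[i] are features of the same scene point seen in two
    views (the positive pair); every other row acts as a negative.
    """
    a, b = l2_normalize(feat_a), l2_normalize(feat_b)
    logits = a @ b.T / temperature  # (N, N) similarity matrix
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # view A -> view B
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # view B -> view A
    return 0.5 * (loss_ab + loss_ba)


def matching_accuracy(feat_a, feat_b):
    """Fraction of points in view A whose nearest neighbour in view B is the true match."""
    sims = l2_normalize(feat_a) @ l2_normalize(feat_b).T
    return float((sims.argmax(axis=1) == np.arange(len(feat_a))).mean())


# Synthetic stand-ins for per-point features from two views (hypothetical data).
rng = np.random.default_rng(0)
feat_a = rng.normal(size=(64, 32))
aligned_b = feat_a + 0.05 * rng.normal(size=(64, 32))  # nearly view-invariant features
random_b = rng.normal(size=(64, 32))                   # features with no correspondence

print(matching_accuracy(feat_a, aligned_b), correspondence_infonce(feat_a, aligned_b))
print(matching_accuracy(feat_a, random_b), correspondence_infonce(feat_a, random_b))
```

In this toy setting, view-invariant features yield a low loss and high matching accuracy, while unrelated features yield near-chance accuracy. That is the kind of gap the abstract's diagnostic reports between trained and baseline models.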