Make Geometry Matter for Spatial Reasoning

Abstract

Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent advances try to handle this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.
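
The two components named in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation. It assumes that masking means zeroing a uniformly random subset of vision tokens (the abstract says the masking is strategic but does not give the strategy), and that the gate is a sigmoid over a linear function of the concatenated vision and geometry token. The names `mask_vision_tokens`, `gated_fusion`, `gate_weights`, and `gate_bias` are hypothetical and chosen only for illustration.

```python
import math
import random


def mask_vision_tokens(vision_tokens, mask_ratio, rng):
    """Zero out a random subset of 2D vision tokens (training-time only).

    Assumption: uniform random selection and zero-filling stand in for
    whatever masking strategy the paper actually uses.
    """
    n = len(vision_tokens)
    n_masked = int(round(mask_ratio * n))
    masked_idx = set(rng.sample(range(n), n_masked))
    return [
        [0.0] * len(tok) if i in masked_idx else list(tok)
        for i, tok in enumerate(vision_tokens)
    ]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(vision_tokens, geometry_tokens, gate_weights, gate_bias):
    """Add geometry to vision tokens, scaled by a per-token gate in (0, 1).

    Assumption: the gate logit is linear in the concatenated
    [vision; geometry] token. A real model would learn these weights.
    """
    fused = []
    for v, geo in zip(vision_tokens, geometry_tokens):
        features = list(v) + list(geo)
        logit = gate_bias + sum(w * x for w, x in zip(gate_weights, features))
        g = sigmoid(logit)
        fused.append([vi + g * gi for vi, gi in zip(v, geo)])
    return fused


# Toy usage: four 2-dimensional tokens per modality.
rng = random.Random(0)
vision = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
geometry = [[0.5, 0.5] for _ in range(4)]

masked = mask_vision_tokens(vision, mask_ratio=0.5, rng=rng)
fused = gated_fusion(
    masked, geometry, gate_weights=[0.1, 0.1, 1.0, 1.0], gate_bias=0.0
)
```

In this sketch, a masked position carries no 2D signal, so its fused token consists only of the gated geometry token. This mirrors the stated intent of forcing the model to consult geometry when visual shortcuts are removed.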
