3D Aware Region Prompted Vision Language Model

Abstract

We present the Spatial Region 3D (SR-3D) aware vision-language model, which connects single-view 2D images and multi-view 3D data through a shared visual token space. SR-3D supports flexible region prompting, allowing users to annotate regions with bounding boxes or segmentation masks on any frame, or directly in 3D, without the need for exhaustive multi-frame labeling. We achieve this by enriching 2D visual features with 3D positional embeddings, which allows the 3D model to draw upon strong 2D priors for more accurate spatial reasoning across frames, even when objects of interest do not co-occur within the same view. Extensive experiments on both general 2D vision-language benchmarks and specialized 3D spatial benchmarks demonstrate that SR-3D achieves state-of-the-art performance, underscoring its effectiveness in unifying 2D and 3D representation spaces for scene understanding. Moreover, we observe applicability to in-the-wild videos without sensor-based 3D inputs or ground-truth 3D annotations, where SR-3D accurately infers spatial relationships and metric measurements.
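
To make the idea of enriching 2D visual features with 3D positional embeddings concrete, the following is a minimal sketch and not the SR-3D implementation. The abstract does not say how the embeddings are computed, so everything here is an assumption made for illustration: the per-frame depth map and camera parameters as inputs, the back-projection step, the sinusoidal encoding, the per-patch averaging, and all function names (`backproject`, `sinusoidal_3d`, `add_3d_position`).

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift every pixel of a depth map to world-space XYZ, shape (H, W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u = column, v = row
    x = (u - K[0, 2]) / K[0, 0] * depth
    y = (v - K[1, 2]) / K[1, 1] * depth
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)  # (H, W, 4)
    return (pts_cam @ cam_to_world.T)[..., :3]

def sinusoidal_3d(xyz, dim):
    """Encode (N, 3) coordinates as (N, dim) sin/cos features; dim must be a multiple of 6."""
    freqs = 2.0 ** np.arange(dim // 6)              # hypothetical frequency schedule
    angles = xyz[:, :, None] * freqs                # (N, 3, dim // 6)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(len(xyz), -1)

def add_3d_position(patch_feats, depth, K, cam_to_world, patch):
    """Add a 3D positional embedding to each 2D patch token of one frame."""
    xyz = backproject(depth, K, cam_to_world)
    gh, gw = depth.shape[0] // patch, depth.shape[1] // patch
    # Average the 3D points inside each patch to get one location per token.
    xyz = xyz[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, 3).mean(axis=(1, 3))
    return patch_feats + sinusoidal_3d(xyz.reshape(-1, 3), patch_feats.shape[1])

# Toy usage: a 32x32 frame at constant depth, 4x4 patches, 12-dim features.
K = np.array([[50.0, 0.0, 16.0], [0.0, 50.0, 16.0], [0.0, 0.0, 1.0]])
depth = np.full((32, 32), 2.0)
feats = np.zeros((16, 12))
tokens = add_3d_position(feats, depth, K, np.eye(4), patch=8)
```

Because the embedding in this sketch depends only on world-space position, tokens from different frames that observe the same 3D location receive the same positional signal. This is one plausible way for 2D features from separate views to be compared in a shared space.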
