

3D Aware Region Prompted Vision Language Model

September 16, 2025
作者: An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, Hongxu Yin, Xiaolong Wang, Sifei Liu
cs.AI

Abstract

We present the Spatial Region 3D (SR-3D) aware vision-language model, which connects single-view 2D images and multi-view 3D data through a shared visual token space. SR-3D supports flexible region prompting, allowing users to annotate regions with bounding boxes or segmentation masks on any frame, or directly in 3D, without the need for exhaustive multi-frame labeling. We achieve this by enriching 2D visual features with 3D positional embeddings, which allows the 3D model to draw upon strong 2D priors for more accurate spatial reasoning across frames, even when objects of interest do not co-occur within the same view. Extensive experiments on both general 2D vision-language and specialized 3D spatial benchmarks demonstrate that SR-3D achieves state-of-the-art performance, underscoring its effectiveness in unifying the 2D and 3D representation spaces for scene understanding. Moreover, SR-3D remains applicable to in-the-wild videos without sensory 3D inputs or ground-truth 3D annotations, accurately inferring spatial relationships and metric measurements.
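To make the core mechanism concrete, below is a minimal sketch of what "enriching 2D visual features with 3D positional embeddings" could look like. This is not the authors' implementation: the module and function names (Sinusoidal3DEmbedding, enrich_with_3d), the sinusoidal encoding, and the additive fusion are all illustrative assumptions; the paper's actual embedding design may differ.

```python
# Hypothetical sketch (not the SR-3D authors' code) of fusing 3D positional
# embeddings into 2D visual tokens so tokens from different frames share a
# 3D-aware representation space.

import torch
import torch.nn as nn


class Sinusoidal3DEmbedding(nn.Module):
    """Maps per-token 3D positions (x, y, z) to the visual feature dimension."""

    def __init__(self, dim: int, num_freqs: int = 16):
        super().__init__()
        # Fixed log-spaced frequency bands; a real system might learn these.
        self.register_buffer("freqs", 2.0 ** torch.arange(num_freqs))
        # 3 coordinates * 2 (sin, cos) * num_freqs -> feature dim
        self.proj = nn.Linear(3 * 2 * num_freqs, dim)

    def forward(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (num_tokens, 3) metric positions, e.g. patch centers
        # unprojected using depth and camera pose.
        angles = xyz[..., None] * self.freqs           # (N, 3, num_freqs)
        enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
        return self.proj(enc.flatten(-2))              # (N, dim)


def enrich_with_3d(feats_2d: torch.Tensor, xyz: torch.Tensor,
                   pos_embed: Sinusoidal3DEmbedding) -> torch.Tensor:
    """Additively inject 3D positional embeddings into 2D visual tokens.

    feats_2d: (num_tokens, dim) patch tokens from a 2D vision encoder.
    xyz:      (num_tokens, 3) 3D position of each token's patch center.
    """
    return feats_2d + pos_embed(xyz)


if __name__ == "__main__":
    dim = 256
    tokens = torch.randn(1024, dim)        # e.g. patch tokens from one frame
    positions = torch.rand(1024, 3) * 5.0  # placeholder metric coordinates
    fused = enrich_with_3d(tokens, positions, Sinusoidal3DEmbedding(dim))
    print(fused.shape)  # torch.Size([1024, 256])
```

Because the 3D position is added to each token rather than replacing the 2D feature, the pretrained 2D priors are preserved while tokens from different frames become comparable in a common metric space, which is what enables cross-frame reasoning over objects that never co-occur in a single view.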