3D Aware Region Prompted Vision Language Model
September 16, 2025
Authors: An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, Hongxu Yin, Xiaolong Wang, Sifei Liu
cs.AI
Abstract
We present the Spatial Region 3D (SR-3D) aware vision-language model, which connects single-view 2D images and multi-view 3D data through a shared visual token space. SR-3D supports flexible region prompting, allowing users to annotate regions with bounding boxes, with segmentation masks on any frame, or directly in 3D, without the need for exhaustive multi-frame labeling. We achieve this by enriching 2D visual features with 3D positional embeddings, which allows the 3D model to draw upon strong 2D priors for more accurate spatial reasoning across frames, even when objects of interest do not co-occur within the same view. Extensive experiments on both general 2D vision-language benchmarks and specialized 3D spatial benchmarks demonstrate that SR-3D achieves state-of-the-art performance, underscoring its effectiveness in unifying the 2D and 3D representation spaces for scene understanding. Moreover, SR-3D remains applicable to in-the-wild videos without 3D sensor inputs or ground-truth 3D annotations, where it accurately infers spatial relationships and metric measurements.
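
To make the core mechanism concrete, below is a minimal sketch, not the authors' implementation, of how 2D visual tokens might be enriched with 3D positional embeddings before being passed to the language model. The module name, the MLP design, and the assumption that each token carries an (x, y, z) position (e.g., from unprojected depth) are illustrative choices, not details from the paper.

```python
# Minimal sketch (assumptions noted above): add a learned embedding of each
# visual token's 3D position to its 2D feature, so tokens from different
# frames share a common metric coordinate frame.
import torch
import torch.nn as nn


class Pos3DEnrichedTokens(nn.Module):
    """Hypothetical fusion of 2D visual features with 3D positional embeddings."""

    def __init__(self, feat_dim: int, pos_dim: int = 3):
        super().__init__()
        # Small MLP that lifts raw (x, y, z) coordinates into the feature space.
        self.pos_mlp = nn.Sequential(
            nn.Linear(pos_dim, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, feats_2d: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        # feats_2d: (B, N, C) tokens from a 2D image encoder
        # xyz:      (B, N, 3) 3D position of each token
        return feats_2d + self.pos_mlp(xyz)


# Usage: once positions are embedded, a downstream LLM can relate objects
# that never co-occur in a single view, since all tokens live in one 3D frame.
enrich = Pos3DEnrichedTokens(feat_dim=1024)
tokens = enrich(torch.randn(1, 256, 1024), torch.rand(1, 256, 3))
```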