3D Aware Region Prompted Vision Language Model
September 16, 2025
Authors: An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, Hongxu Yin, Xiaolong Wang, Sifei Liu
cs.AI
Abstract
We present the Spatial Region 3D (SR-3D) aware vision-language model, which connects single-view 2D images and multi-view 3D data through a shared visual token space. SR-3D supports flexible region prompting, allowing users to annotate regions with bounding boxes, with segmentation masks on any frame, or directly in 3D, without the need for exhaustive multi-frame labeling. We achieve this by enriching 2D visual features with 3D positional embeddings, which allows the 3D model to draw upon strong 2D priors for more accurate spatial reasoning across frames, even when objects of interest do not co-occur within the same view. Extensive experiments on both general 2D vision-language benchmarks and specialized 3D spatial benchmarks demonstrate that SR-3D achieves state-of-the-art performance, underscoring its effectiveness in unifying the 2D and 3D representation spaces for scene understanding. Moreover, SR-3D remains applicable to in-the-wild videos without 3D sensor inputs or ground-truth 3D annotations, where it accurately infers spatial relationships and metric measurements.
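
To make the core mechanism concrete, below is a minimal sketch, not the authors' implementation, of how 2D visual tokens might be enriched with 3D positional embeddings before being passed to the language model. The module name, the MLP design, and the assumption that each token carries an (x, y, z) position (e.g., from unprojected depth) are illustrative choices, not details from the paper.

```python
# Minimal sketch (assumptions noted above): add a learned embedding of each
# visual token's 3D position to its 2D feature, so tokens from different
# frames share a common metric coordinate frame.
import torch
import torch.nn as nn


class Pos3DEnrichedTokens(nn.Module):
    """Hypothetical fusion of 2D visual features with 3D positional embeddings."""

    def __init__(self, feat_dim: int, pos_dim: int = 3):
        super().__init__()
        # Small MLP that lifts raw (x, y, z) coordinates into the feature space.
        self.pos_mlp = nn.Sequential(
            nn.Linear(pos_dim, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, feats_2d: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        # feats_2d: (B, N, C) tokens from a 2D image encoder
        # xyz:      (B, N, 3) 3D position of each token
        return feats_2d + self.pos_mlp(xyz)


# Usage: once positions are embedded, a downstream LLM can relate objects
# that never co-occur in a single view, since all tokens live in one 3D frame.
enrich = Pos3DEnrichedTokens(feat_dim=1024)
tokens = enrich(torch.randn(1, 256, 1024), torch.rand(1, 256, 3))
```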