3D Aware Region Prompted Vision Language Model

Abstract

We present Spatial Region 3D (SR-3D), a 3D-aware vision-language model that connects single-view 2D images and multi-view 3D data through a shared visual token space. SR-3D supports flexible region prompting: users can annotate regions with bounding boxes or segmentation masks on any frame, or directly in 3D, without exhaustive multi-frame labeling. We achieve this by enriching 2D visual features with 3D positional embeddings, which allows the 3D model to draw on strong 2D priors for more accurate spatial reasoning across frames, even when the objects of interest do not co-occur within the same view. Extensive experiments on both general 2D vision-language benchmarks and specialized 3D spatial benchmarks demonstrate that SR-3D achieves state-of-the-art performance, underscoring its effectiveness in unifying the 2D and 3D representation spaces for scene understanding. Moreover, we observe that SR-3D applies to in-the-wild videos without sensor-based 3D inputs or ground-truth 3D annotations, where it accurately infers spatial relationships and metric measurements.
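One way the idea of enriching 2D visual features with 3D positional embeddings could be realized is sketched below. This is an illustrative assumption, not the SR-3D implementation: the abstract does not specify the encoding, so a generic sinusoidal encoding is used here, and the function names `backproject`, `sinusoidal_3d`, and `add_3d_position` are hypothetical. The sketch also assumes that a per-patch depth map (from a sensor or an estimator), the camera intrinsics `K`, and a camera-to-world pose are available at the feature-map resolution.

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """Lift every cell of an (h, w) depth map to a 3D point in world coordinates."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Homogeneous pixel coordinates sampled at cell centres, shape (h, w, 3).
    pix = np.stack([u + 0.5, v + 0.5, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # camera-frame rays with z = 1
    pts_cam = rays * depth[..., None]        # scale each ray by its depth
    pts_h = np.concatenate([pts_cam, np.ones((h, w, 1))], axis=-1)
    return (pts_h @ cam_to_world.T)[..., :3]

def sinusoidal_3d(points, dim):
    """Encode (..., 3) coordinates as (..., dim) sin/cos features (assumed encoding)."""
    assert dim % 6 == 0, "dim must be divisible by 6 (3 axes x sin/cos)"
    freqs = 2.0 ** np.arange(dim // 6)       # geometric frequency ladder
    angles = points[..., None] * freqs       # (..., 3, dim // 6)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(*points.shape[:-1], dim)

def add_3d_position(features, depth, K, cam_to_world):
    """Add a 3D positional embedding to (h, w, c) features from a 2D encoder."""
    pts = backproject(depth, K, cam_to_world)
    return features + sinusoidal_3d(pts, features.shape[-1])
```

Because the embedding depends only on world coordinates, patches from different frames that observe nearby 3D locations receive similar positional signals under this sketch, which is the property that would make cross-frame reasoning possible when objects never share a view.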
