Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors

Abstract

Despite advances in 3D scene understanding, existing 3D Large Multimodal Models operate in offline settings and require either complete scene observations or predefined video clips. In this paper, we present an online 3D vision-language model that enables real-time spatial understanding from streaming video. Our approach models streaming control autoregressively, using the LLM's next-token prediction objective to learn when to respond, and employs a lightweight Visual-Spatial Feature Integration (VSFI) module to incrementally inject temporally aligned geometry priors into the visual stream. To reduce long-context decoding overhead, we propose a plug-and-play Geometry-Adaptive Voxel Compression (GAVC) module for efficient visual token compression. To address the scarcity of streaming 3D-language data, we further develop a scalable data generation pipeline that curates over 1M online spatio-temporal 3D QA pairs and establishes a comprehensive benchmark spanning 29 tasks. Extensive experiments show that our approach significantly outperforms both proprietary and open-source models across online and offline 3D spatial understanding, reasoning, and grounding tasks. The project page is available at https://stream3d-vlm.github.io/