Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometric Priors

Abstract

Despite advances in 3D scene understanding, existing 3D Large Multimodal Models operate only in offline settings: they require either complete scene observations or predefined video clips. In this paper, we present an online 3D vision-language model that performs real-time spatial understanding from streaming video. Our approach models streaming control autoregressively, using the LLM's next-token prediction objective to learn when to respond, and employs a lightweight Visual-Spatial Feature Integration (VSFI) module to incrementally inject temporally aligned geometry priors into the visual stream. To reduce long-context decoding overhead, we propose a plug-and-play Geometry-Adaptive Voxel Compression (GAVC) module for efficient visual token compression. To address the scarcity of streaming 3D-language data, we further develop a scalable data generation pipeline that curates over 1M online spatio-temporal 3D QA pairs, and we establish a comprehensive benchmark spanning 29 tasks. Extensive experiments show that our approach significantly outperforms both proprietary and open-source models on online and offline 3D spatial understanding, reasoning, and grounding tasks. The project page is available at https://stream3d-vlm.github.io/
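As an illustration of the idea behind voxel-based visual token compression, the following minimal sketch merges tokens whose 3D positions fall into the same voxel by averaging their features. It is not the GAVC module: the abstract does not specify how GAVC works, and this sketch uses a fixed voxel size rather than a geometry-adaptive one. The function name `voxel_pool`, its parameters, and the assumption that each visual token comes with a 3D point are all hypothetical.

```python
from math import floor


def voxel_pool(points, features, voxel_size):
    """Merge tokens that share a voxel by averaging their feature vectors.

    Hypothetical inputs (not taken from the paper):
      points     -- list of (x, y, z) tuples, one 3D position per visual token
      features   -- list of equal-length lists of floats, one per token
      voxel_size -- edge length of the (fixed-size) cubic voxel grid

    Returns (keys, pooled): integer voxel indices in order of first
    appearance, and the mean feature vector for each occupied voxel.
    """
    if len(points) != len(features):
        raise ValueError("points and features must have the same length")
    if voxel_size <= 0:
        raise ValueError("voxel_size must be positive")

    sums = {}    # voxel index -> running sum of feature vectors
    counts = {}  # voxel index -> number of tokens in that voxel
    for (x, y, z), feat in zip(points, features):
        # floor() keeps negative coordinates in the correct voxel.
        key = (floor(x / voxel_size),
               floor(y / voxel_size),
               floor(z / voxel_size))
        if key not in sums:
            sums[key] = list(feat)
            counts[key] = 1
        else:
            acc = sums[key]
            for i, value in enumerate(feat):
                acc[i] += value
            counts[key] += 1

    # dicts preserve insertion order, so keys follow first appearance.
    keys = list(sums)
    pooled = [[value / counts[k] for value in sums[k]] for k in keys]
    return keys, pooled
```

With a voxel size of 0.5, three tokens at (0.1, 0.1, 0.1), (0.2, 0.3, 0.4), and (1.2, 0.1, 0.1) would collapse into two tokens, since the first two share a voxel. In a streaming setting this kind of pooling keeps the number of visual tokens tied to the occupied volume of the scene rather than to the number of frames seen, which is one plausible reason a voxel-based scheme helps with long-context decoding.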