EmbodiedSplat: Online Feed-Forward Semantic 3DGS for Open-Vocabulary 3D Scene Understanding
March 4, 2026
Authors: Seungjun Lee, Zihan Wang, Yunsong Wang, Gim Hee Lee
cs.AI
Abstract
Understanding a 3D scene as it is being explored is essential for embodied tasks, where an agent must construct and comprehend the scene in an online and nearly real-time manner. In this study, we propose EmbodiedSplat, an online feed-forward 3D Gaussian Splatting (3DGS) method for open-vocabulary scene understanding that performs online 3D reconstruction and 3D semantic understanding simultaneously from streaming images. Unlike existing open-vocabulary 3DGS methods, which are typically restricted to offline or per-scene optimization settings, our objectives are two-fold: 1) to reconstruct the semantic-embedded 3DGS of an entire scene from over 300 streaming images in an online manner; and 2) to generalize well to novel scenes through a feed-forward design while supporting nearly real-time 3D semantic reconstruction when combined with real-time 2D models. To achieve these objectives, we propose an Online Sparse Coefficients Field with a CLIP Global Codebook, which binds the 2D CLIP embeddings to each 3D Gaussian while minimizing memory consumption and preserving the full semantic generalizability of CLIP. Furthermore, we generate 3D geometry-aware CLIP features by aggregating the partial point cloud of the 3DGS through a 3D U-Net, which supplements the 2D-oriented language embeddings with a 3D geometric prior. Extensive experiments on diverse indoor datasets, including ScanNet, ScanNet++, and Replica, demonstrate both the effectiveness and the efficiency of our method. The project page is available at https://0nandon.github.io/EmbodiedSplat/.
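The abstract does not spell out how the sparse coefficients field is computed, so the following is only a rough illustration of the general idea of a shared codebook with sparse per-Gaussian coefficients, not the authors' implementation. It assumes that each Gaussian keeps at most k (codebook index, weight) pairs, that weights are accumulated as new frames arrive, and that a Gaussian's feature is the normalized weighted sum of its codebook entries. All names here (SparseCoefficientField, observe, query_label) and the toy 3-dimensional "embeddings" are hypothetical.

```python
import math


def normalize(vec):
    """Return vec scaled to unit length (unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


class SparseCoefficientField:
    """Hypothetical sketch: per-Gaussian sparse weights over a global codebook."""

    def __init__(self, k=2):
        self.k = k            # assumed cap on coefficients stored per Gaussian
        self.codebook = []    # global list of unit-norm embeddings
        self.coeffs = {}      # gaussian_id -> {codebook index: weight}

    def add_code(self, embedding):
        """Append an embedding to the global codebook and return its index."""
        self.codebook.append(normalize(embedding))
        return len(self.codebook) - 1

    def observe(self, gaussian_id, code_index, weight):
        """Accumulate a weight from a new frame, keeping only the top-k entries."""
        acc = self.coeffs.setdefault(gaussian_id, {})
        acc[code_index] = acc.get(code_index, 0.0) + weight
        if len(acc) > self.k:
            top = sorted(acc.items(), key=lambda kv: kv[1], reverse=True)
            self.coeffs[gaussian_id] = dict(top[: self.k])

    def feature(self, gaussian_id):
        """Reconstruct a Gaussian's feature as a weighted sum of codebook rows."""
        acc = self.coeffs[gaussian_id]
        total = sum(acc.values())
        out = [0.0] * len(self.codebook[0])
        for idx, weight in acc.items():
            for d, value in enumerate(self.codebook[idx]):
                out[d] += (weight / total) * value
        return normalize(out)


def query_label(field, gaussian_id, text_embeddings):
    """Pick the text label whose embedding has the highest cosine similarity."""
    feat = field.feature(gaussian_id)
    scores = {
        label: sum(a * b for a, b in zip(feat, normalize(emb)))
        for label, emb in text_embeddings.items()
    }
    return max(scores, key=scores.get)


# Toy usage with made-up 3-D vectors standing in for CLIP embeddings.
field = SparseCoefficientField(k=2)
chair = field.add_code([1.0, 0.0, 0.0])
table = field.add_code([0.0, 1.0, 0.0])
wall = field.add_code([0.0, 0.0, 1.0])
field.observe(0, chair, 0.9)
field.observe(0, table, 0.3)
field.observe(0, wall, 0.1)   # lowest weight, dropped by the top-k cap
label = query_label(field, 0, {"chair": [1.0, 0.0, 0.0], "table": [0.0, 1.0, 0.0]})
print(label)
```

In this sketch, per-Gaussian memory depends on k rather than on the embedding dimension, while the codebook rows themselves stay uncompressed, which is one way to read the abstract's claim of lower memory use without giving up CLIP's semantics.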