LangSplat: 3D Language Gaussian Splatting

Abstract

Humans live in a 3D world and commonly use natural language to interact with 3D scenes. Modeling a 3D language field to support open-ended language queries in 3D has recently gained increasing attention. This paper introduces LangSplat, which constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces. Unlike existing methods that ground CLIP language embeddings in a NeRF model, LangSplat represents the language field with a collection of 3D Gaussians, each encoding language features distilled from CLIP. By employing a tile-based splatting technique to render language features, we circumvent the costly rendering process inherent in NeRF. Instead of directly learning CLIP embeddings, LangSplat first trains a scene-wise language autoencoder and then learns language features in the scene-specific latent space, thereby alleviating the substantial memory demands imposed by explicit modeling. Existing methods suffer from imprecise and vague 3D language fields, which fail to discern clear boundaries between objects. We delve into this issue and propose learning hierarchical semantics using SAM, which eliminates the need both for extensively querying the language field across various scales and for regularization with DINO features. Extensive experiments on open-vocabulary 3D object localization and semantic segmentation demonstrate that LangSplat outperforms the previous state-of-the-art method LERF by a large margin. Notably, LangSplat is extremely efficient, achieving a {\speed}× speedup compared to LERF at a resolution of 1440 × 1080. We strongly recommend that readers check out our video results at https://langsplat.github.io.
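The memory argument for the scene-wise autoencoder can be illustrated with a rough sketch. Everything below is an assumption made for illustration and is not taken from the abstract: the number of Gaussians, the 512-dimensional CLIP feature size, the 3-dimensional latent size, and the synthetic features. A linear autoencoder fitted with PCA stands in for the trained autoencoder, so this is not the authors' implementation; it only shows why per-scene compression can work when a scene contains a small number of distinct semantic features.

```python
import numpy as np


def per_gaussian_feature_bytes(num_gaussians, feature_dim, bytes_per_value=4):
    """Bytes needed to store one float32 feature vector per Gaussian."""
    return num_gaussians * feature_dim * bytes_per_value


def fit_linear_autoencoder(features, latent_dim):
    """Fit a linear encoder/decoder pair (PCA) to the features of one scene.

    This is a stand-in for a trained scene-wise autoencoder, not the real one.
    """
    mean = features.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    basis = vt[:latent_dim]  # shape: (latent_dim, feature_dim)

    def encode(x):
        return (x - mean) @ basis.T

    def decode(z):
        return z @ basis + mean

    return encode, decode


# Hypothetical sizes, chosen only to make the arithmetic concrete.
NUM_GAUSSIANS = 1_000_000
CLIP_DIM = 512
LATENT_DIM = 3

full_bytes = per_gaussian_feature_bytes(NUM_GAUSSIANS, CLIP_DIM)
latent_bytes = per_gaussian_feature_bytes(NUM_GAUSSIANS, LATENT_DIM)

# Synthetic "scene": four distinct object features plus small noise.
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(4, CLIP_DIM))
labels = rng.integers(0, 4, size=2000)
features = prototypes[labels] + 0.01 * rng.normal(size=(2000, CLIP_DIM))

encode, decode = fit_linear_autoencoder(features, LATENT_DIM)
latents = encode(features)        # what each Gaussian would store
reconstructed = decode(latents)   # what a text query would be compared against

relative_error = (
    np.linalg.norm(features - reconstructed) / np.linalg.norm(features)
)
print(f"full: {full_bytes / 1e9:.2f} GB, latent: {latent_bytes / 1e6:.0f} MB")
print(f"relative reconstruction error: {relative_error:.4f}")
```

Under these assumed sizes, storing full features takes about 2 GB while the latent codes take 12 MB. The reconstruction stays close to the original features only because the synthetic scene has few distinct features, which is the property a scene-specific latent space relies on.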
