LangSplat: 3D Language Gaussian Splatting

Abstract

Humans live in a 3D world and commonly use natural language to interact with 3D scenes. Modeling a 3D language field to support open-ended language queries in 3D has recently gained increasing attention. This paper introduces LangSplat, which constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces. Unlike existing methods that ground CLIP language embeddings in a NeRF model, LangSplat represents the language field with a collection of 3D Gaussians, each encoding language features distilled from CLIP. By employing a tile-based splatting technique to render language features, we circumvent the costly rendering process inherent in NeRF. Instead of directly learning CLIP embeddings, LangSplat first trains a scene-wise language autoencoder and then learns language features in the scene-specific latent space, thereby alleviating the substantial memory demands imposed by explicit modeling. Existing methods suffer from imprecise and vague 3D language fields, which fail to discern clear boundaries between objects. We delve into this issue and propose to learn hierarchical semantics using SAM, which eliminates both the need to extensively query the language field across various scales and the regularization with DINO features. Extensive experiments on open-vocabulary 3D object localization and semantic segmentation demonstrate that LangSplat outperforms the previous state-of-the-art method LERF by a large margin. Notably, LangSplat is extremely efficient, achieving a {\speed}× speedup over LERF at a resolution of 1440 × 1080. We strongly encourage readers to check out our video results at https://langsplat.github.io.
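
To make the memory argument concrete, the following back-of-the-envelope sketch compares storing a full CLIP embedding on every Gaussian with storing a low-dimensional latent code instead. The Gaussian count (1,000,000), the 512-dimensional CLIP embedding, the 3-dimensional latent code, and 32-bit floats are illustrative assumptions rather than figures stated in this abstract, and `language_feature_bytes` is a hypothetical helper introduced only for this example.

```python
def language_feature_bytes(num_gaussians: int, feature_dim: int,
                           bytes_per_value: int = 4) -> int:
    """Bytes needed to store one feature vector per Gaussian.

    Hypothetical helper: counts only the language features and ignores
    position, covariance, opacity, and color parameters.
    """
    return num_gaussians * feature_dim * bytes_per_value


# All three values below are assumptions chosen for illustration.
NUM_GAUSSIANS = 1_000_000   # assumed scene size
CLIP_DIM = 512              # assumed CLIP embedding width
LATENT_DIM = 3              # assumed autoencoder latent width

explicit_bytes = language_feature_bytes(NUM_GAUSSIANS, CLIP_DIM)
latent_bytes = language_feature_bytes(NUM_GAUSSIANS, LATENT_DIM)

print(f"explicit CLIP features: {explicit_bytes / 1e6:.1f} MB")
print(f"latent features:        {latent_bytes / 1e6:.1f} MB")
print(f"reduction factor:       {explicit_bytes / latent_bytes:.1f}x")
```

Under these assumptions the per-Gaussian language features shrink by a factor equal to the ratio of the two dimensions, which is why learning in a scene-specific latent space eases the memory pressure of explicit modeling. The autoencoder itself adds a small, fixed cost that this sketch does not count.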
