FMGS: Foundation Model Embedded 3D Gaussian Splatting for Holistic 3D Scene Understanding
January 3, 2024
Authors: Xingxing Zuo, Pouya Samangouei, Yunwen Zhou, Yan Di, Mingyang Li
cs.AI
Abstract
Precisely perceiving the geometric and semantic properties of real-world 3D
objects is crucial for the continued evolution of augmented reality and robotic
applications. To this end, we present Foundation Model Embedded Gaussian Splatting (FMGS), which
incorporates vision-language embeddings of foundation models into 3D Gaussian
Splatting (GS). The key contribution of this work is an efficient method to
reconstruct and represent 3D vision-language models. This is achieved by
distilling feature maps generated from image-based foundation models into those
rendered from our 3D model. To ensure high-quality rendering and fast training,
we introduce a novel scene representation by integrating strengths from both GS
and multi-resolution hash encodings (MHE). Our effective training procedure
also introduces a pixel alignment loss that pulls the rendered features of the
same semantic entity close together, following pixel-level semantic boundaries.
Our results demonstrate remarkable multi-view semantic consistency,
facilitating diverse downstream tasks and outperforming state-of-the-art methods by
10.2% on open-vocabulary language-based object detection,
despite being 851× faster at inference. This research
explores the intersection of vision, language, and 3D scene representation,
paving the way for enhanced scene understanding in uncontrolled real-world
environments. We plan to release the code upon paper acceptance.