LangSplat: 3D Language Gaussian Splatting

Abstract

Humans live in a 3D world and commonly use natural language to interact with 3D scenes. Modeling a 3D language field that supports open-ended language queries in 3D has recently gained increasing attention. This paper introduces LangSplat, which constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces. Unlike existing methods that ground CLIP language embeddings in a NeRF model, LangSplat represents the language field with a collection of 3D Gaussians, each encoding language features distilled from CLIP. By rendering language features with a tile-based splatting technique, we circumvent the costly rendering process inherent in NeRF. Instead of learning CLIP embeddings directly, LangSplat first trains a scene-wise language autoencoder and then learns language features in the scene-specific latent space, which alleviates the substantial memory demands of explicit modeling. Existing methods struggle with imprecise and vague 3D language fields that fail to discern clear boundaries between objects. We analyze this issue and propose to learn hierarchical semantics using SAM, which eliminates both the need to query the language field extensively across scales and the need for DINO feature regularization. Extensive experiments on open-vocabulary 3D object localization and semantic segmentation demonstrate that LangSplat outperforms the previous state-of-the-art method, LERF, by a large margin. Notably, LangSplat is extremely efficient, achieving a {\speed}× speedup over LERF at a resolution of 1440 × 1080. We strongly encourage readers to view our video results at https://langsplat.github.io.
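To make the latent-space idea concrete, the following is a minimal sketch of compressing per-scene language features and querying them after decoding. It is not the authors' implementation. A linear autoencoder fitted with SVD stands in for the learned scene-wise autoencoder, random vectors stand in for CLIP features, and the 512-dimensional feature size, the 3-dimensional latent size, and the names `fit_linear_autoencoder` and `cosine_relevancy` are assumptions chosen for illustration.

```python
import numpy as np

def fit_linear_autoencoder(features, latent_dim):
    """Fit a linear encoder/decoder (PCA via SVD) on one scene's features.

    Stand-in for a learned scene-wise autoencoder; hypothetical helper.
    """
    mean = features.mean(axis=0)
    # Rows of vt are the principal directions of the centered features.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    basis = vt[:latent_dim]  # shape: (latent_dim, feature_dim)

    def encode(x):
        return (x - mean) @ basis.T

    def decode(z):
        return z @ basis + mean

    return encode, decode

def cosine_relevancy(decoded, query):
    """Cosine similarity between each decoded feature and one query vector."""
    decoded = decoded / np.linalg.norm(decoded, axis=-1, keepdims=True)
    query = query / np.linalg.norm(query)
    return decoded @ query

rng = np.random.default_rng(0)

# Synthetic stand-in for CLIP features: three object prototypes plus noise.
prototypes = rng.normal(size=(3, 512))
labels = rng.integers(0, 3, size=200)
scene_features = prototypes[labels] + 0.01 * rng.normal(size=(200, 512))

# Compress 512-D features to a 3-D scene-specific latent (assumed sizes).
encode, decode = fit_linear_autoencoder(scene_features, latent_dim=3)
latents = encode(scene_features)  # the low-dimensional codes to be stored

# Query: decode the latents, then score them against a "text" embedding.
relevancy = cosine_relevancy(decode(latents), prototypes[1])
best = int(np.argmax(relevancy))
```

The memory saving in this sketch comes from storing a 3-dimensional code per element instead of a 512-dimensional feature. The decoder is applied only at query time, before the similarity against the query embedding is computed.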
