LangSplatV2: High-dimensional 3D Language Gaussian Splatting with 450+ FPS

Abstract

In this paper, we introduce LangSplatV2, which achieves high-dimensional feature splatting at 476.2 FPS and 3D open-vocabulary text querying at 384.6 FPS on high-resolution images, a 42× and a 47× speedup over LangSplat, respectively, along with improved query accuracy. LangSplat uses Gaussian Splatting to embed 2D CLIP language features into 3D, which greatly improves speed and learns a precise 3D language field with SAM semantics. Such 3D language fields are crucial for applications that require language interaction within complex scenes. However, even on a high-end A100 GPU, LangSplat does not reach real-time inference (it runs at 8.2 FPS), which severely limits its broader application. We first conduct a detailed timing analysis of LangSplat and identify its heavyweight decoder as the primary speed bottleneck. Our solution, LangSplatV2, assumes that each Gaussian acts as a sparse code over a global dictionary, and therefore learns a 3D sparse coefficient field that entirely eliminates the need for a heavyweight decoder. Leveraging this sparsity, we further propose an efficient sparse coefficient splatting method with CUDA optimization that renders high-quality, high-dimensional feature maps at only the time cost of splatting an ultra-low-dimensional feature. Our experiments show that LangSplatV2 not only achieves better or competitive query accuracy but is also significantly faster. Code and demos are available on our project page: https://langsplat-v2.github.io.
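The claim that splatting sparse coefficients can replace a heavyweight decoder rests on linearity: alpha blending is a weighted sum, so blending each Gaussian's coefficients and then multiplying once by the shared dictionary yields the same pixel feature as decoding every Gaussian to full dimension first and blending afterwards. The pure-Python sketch below checks this for a single pixel. It is an illustration under stated assumptions, not the authors' implementation: the sizes D, L and K, the normalization of each Gaussian's top-K coefficients to sum to 1, and all function names (sparse_code, decode, blend_weights, splat_sparse, splat_dense) are hypothetical, and it runs on the CPU rather than in CUDA.

```python
import random

# Hypothetical sizes for illustration only (not taken from the paper).
D = 16  # dimension of the high-dimensional language feature (e.g. a CLIP embedding)
L = 8   # number of basis vectors in the global dictionary
K = 2   # non-zero coefficients kept per Gaussian (top-K sparsity)

rng = random.Random(0)

# Global dictionary shared by all Gaussians: L basis vectors of length D.
dictionary = [[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(L)]


def sparse_code(gen, num_basis, k):
    """Hypothetical per-Gaussian sparse code {basis_index: weight}; weights sum to 1 (assumption)."""
    idx = gen.sample(range(num_basis), k)
    raw = [gen.random() + 1e-3 for _ in range(k)]
    total = sum(raw)
    return {i: r / total for i, r in zip(idx, raw)}


def decode(code):
    """Expand a (sparse) code into a D-dim feature: sum_l w_l * dictionary[l]."""
    feat = [0.0] * D
    for l, w in code.items():
        for d in range(D):
            feat[d] += w * dictionary[l][d]
    return feat


def blend_weights(alphas):
    """Front-to-back alpha compositing weights: w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    weights, transmittance = [], 1.0
    for a in alphas:
        weights.append(a * transmittance)
        transmittance *= 1.0 - a
    return weights


def splat_sparse(codes, alphas):
    """Blend only the K non-zero coefficients of each Gaussian into an L-dim vector,
    then decode once per pixel with a single dictionary product."""
    coeff = [0.0] * L
    for code, w in zip(codes, blend_weights(alphas)):
        for l, c in code.items():  # K updates per Gaussian instead of D
            coeff[l] += w * c
    return decode(dict(enumerate(coeff)))


def splat_dense(codes, alphas):
    """Reference path: decode every Gaussian to D dims first, then blend full features."""
    pixel = [0.0] * D
    for code, w in zip(codes, blend_weights(alphas)):
        feat = decode(code)
        for d in range(D):
            pixel[d] += w * feat[d]
    return pixel


# Five Gaussians covering one pixel, already sorted front to back.
codes = [sparse_code(rng, L, K) for _ in range(5)]
alphas = [0.6, 0.3, 0.8, 0.5, 0.2]

sparse_pixel = splat_sparse(codes, alphas)
dense_pixel = splat_dense(codes, alphas)
print(max(abs(a - b) for a, b in zip(sparse_pixel, dense_pixel)) < 1e-9)  # True
```

In this sketch the per-Gaussian work inside splat_sparse touches only K coefficients, and the D-dimensional product is paid once per pixel rather than once per Gaussian, which is the intuition behind rendering high-dimensional feature maps at roughly the cost of splatting a low-dimensional one.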
