LangSplat: 3D Language Gaussian Splatting

Abstract

Humans live in a 3D world and commonly use natural language to interact with 3D scenes. Modeling a 3D language field that supports open-ended language queries in 3D has recently gained increasing attention. This paper introduces LangSplat, which constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces. Unlike existing methods that ground CLIP language embeddings in a NeRF model, LangSplat represents the language field with a collection of 3D Gaussians, each encoding language features distilled from CLIP. By rendering language features with a tile-based splatting technique, we circumvent the costly rendering process inherent in NeRF. Instead of learning CLIP embeddings directly, LangSplat first trains a scene-wise language autoencoder and then learns language features in the scene-specific latent space, thereby alleviating the substantial memory demands imposed by explicit modeling. Existing methods produce imprecise and vague 3D language fields that fail to discern clear boundaries between objects. We examine this issue and propose learning hierarchical semantics using SAM, which eliminates both the need to query the language field extensively across various scales and the need for DINO feature regularization. Extensive experiments on open-vocabulary 3D object localization and semantic segmentation demonstrate that LangSplat outperforms the previous state-of-the-art method, LERF, by a large margin. Notably, LangSplat is extremely efficient, achieving a {\speed}× speedup over LERF at a resolution of 1440 × 1080. We strongly recommend that readers view our video results at https://langsplat.github.io.
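As a rough illustration of the memory argument above, the following sketch compares storing a full CLIP embedding on every Gaussian with storing a low-dimensional latent code produced by a scene-wise autoencoder. The Gaussian count, the 512-dimensional embedding, the 3-dimensional latent, and float32 storage are assumptions chosen for illustration; the abstract does not state them.

```python
# Back-of-the-envelope memory comparison for per-Gaussian language features.
# All constants below are illustrative assumptions, not values from the abstract.

def language_feature_memory_bytes(num_gaussians, feature_dim, bytes_per_value=4):
    """Bytes needed to store one feature vector per Gaussian (float32 by default)."""
    return num_gaussians * feature_dim * bytes_per_value

NUM_GAUSSIANS = 1_000_000  # hypothetical scene size
CLIP_DIM = 512             # assumed CLIP embedding dimension
LATENT_DIM = 3             # assumed scene-specific latent dimension

# Explicit modeling: every Gaussian carries a full CLIP embedding.
explicit_bytes = language_feature_memory_bytes(NUM_GAUSSIANS, CLIP_DIM)
# Latent modeling: every Gaussian carries only the compressed latent code.
latent_bytes = language_feature_memory_bytes(NUM_GAUSSIANS, LATENT_DIM)

print(f"explicit CLIP features: {explicit_bytes / 1e6:.0f} MB")
print(f"latent features:        {latent_bytes / 1e6:.0f} MB")
print(f"reduction factor:       {explicit_bytes / latent_bytes:.1f}x")
```

Under these assumptions the per-Gaussian language features shrink by the ratio of the two dimensions. The autoencoder itself adds a small fixed cost per scene that this sketch ignores.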
