4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models

Abstract

Learning 4D language fields that support time-sensitive, open-ended language queries in dynamic scenes is essential for many real-world applications. LangSplat grounds CLIP features in 3D Gaussian representations and achieves both precision and efficiency in static 3D scenes, but it cannot handle dynamic 4D fields because CLIP, designed for static image-text tasks, cannot capture temporal dynamics in videos. Real-world environments are inherently dynamic, and object semantics evolve over time. Building a precise 4D language field requires pixel-aligned, object-wise video features, which current vision models struggle to provide. To address these challenges, we propose 4D LangSplat, which learns 4D language fields to efficiently handle both time-agnostic and time-sensitive open-vocabulary queries in dynamic scenes. Rather than learning the language field from vision features, 4D LangSplat learns directly from text, namely object-wise video captions generated by Multimodal Large Language Models (MLLMs). Specifically, we propose a multimodal object-wise video prompting method, consisting of visual and text prompts that guide MLLMs to generate detailed, temporally consistent, high-quality captions for objects throughout a video. These captions are encoded by a Large Language Model into high-quality sentence embeddings, which then serve as pixel-aligned, object-specific feature supervision and enable open-vocabulary text queries through a shared embedding space. Recognizing that objects in 4D scenes transition smoothly between states, we further propose a status deformable network that models these continuous changes over time. Results on multiple benchmarks show that 4D LangSplat is both precise and efficient for time-sensitive and time-agnostic open-vocabulary queries.
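To make the idea of querying through a shared embedding space concrete, the toy sketch below scores each pixel of each frame against a query embedding using cosine similarity, then reports the frames in which the query is matched. It is an illustration only, not the authors' implementation: the function names, the 2-dimensional embeddings, the hand-written "video" features, and the 0.8 threshold are all assumptions. In a real system the per-pixel features would be rendered from the learned 4D language field, and the query would be embedded by the same text encoder used for the captions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevance_map(frame_features, query_embedding: Vector) -> List[List[float]]:
    """Per-pixel similarity for one frame laid out as rows x columns x embedding."""
    return [
        [cosine_similarity(pixel, query_embedding) for pixel in row]
        for row in frame_features
    ]


def frames_matching(video_features, query_embedding: Vector,
                    threshold: float = 0.8) -> List[int]:
    """Indices of frames where at least one pixel reaches the similarity threshold."""
    matches = []
    for t, frame in enumerate(video_features):
        scores = relevance_map(frame, query_embedding)
        if max(max(row) for row in scores) >= threshold:
            matches.append(t)
    return matches


# Hypothetical data: 3 frames, each 1 row x 2 pixels, with 2-D embeddings.
# The first pixel belongs to an object that moves from a "closed" state
# ([1, 0]) to an "open" state ([0, 1]); the second pixel is empty background.
video = [
    [[[1.0, 0.0], [0.0, 0.0]]],  # frame 0: object looks "closed"
    [[[0.7, 0.7], [0.0, 0.0]]],  # frame 1: object is between states
    [[[0.0, 1.0], [0.0, 0.0]]],  # frame 2: object looks "open"
]
query_open = [0.0, 1.0]    # stand-in for the embedding of an "open" query
query_closed = [1.0, 0.0]  # stand-in for the embedding of a "closed" query

print(frames_matching(video, query_open))    # frames matching the "open" query
print(frames_matching(video, query_closed))  # frames matching the "closed" query
```

A time-agnostic query would only need the per-pixel relevance map for a chosen frame, while a time-sensitive query also needs the frame indices, which is why the sketch separates `relevance_map` from `frames_matching`.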
