4D LangSplat: 4D Language Gaussian Splatting via Multimodal Large Language Models
March 13, 2025
Authors: Wanhua Li, Renping Zhou, Jiawei Zhou, Yingwei Song, Johannes Herter, Minghan Qin, Gao Huang, Hanspeter Pfister
cs.AI
Abstract
Learning 4D language fields to enable time-sensitive, open-ended language
queries in dynamic scenes is essential for many real-world applications. While
LangSplat successfully grounds CLIP features into 3D Gaussian representations,
achieving precision and efficiency in 3D static scenes, it lacks the ability to
handle dynamic 4D fields as CLIP, designed for static image-text tasks, cannot
capture temporal dynamics in videos. Real-world environments are inherently
dynamic, with object semantics evolving over time. Building a precise 4D
language field necessitates obtaining pixel-aligned, object-wise video
features, which current vision models struggle to achieve. To address these
challenges, we propose 4D LangSplat, which learns 4D language fields to handle
time-agnostic or time-sensitive open-vocabulary queries in dynamic scenes
efficiently. 4D LangSplat bypasses learning the language field from vision
features and instead learns directly from text generated from object-wise video
captions via Multimodal Large Language Models (MLLMs). Specifically, we propose
a multimodal object-wise video prompting method, consisting of visual and text
prompts that guide MLLMs to generate detailed, temporally consistent,
high-quality captions for objects throughout a video. These captions are
encoded using a Large Language Model into high-quality sentence embeddings,
which then serve as pixel-aligned, object-specific feature supervision,
facilitating open-vocabulary text queries through shared embedding spaces.
Recognizing that objects in 4D scenes exhibit smooth transitions across states,
we further propose a status deformable network to model these continuous
changes over time effectively. Our results across multiple benchmarks
demonstrate that 4D LangSplat attains precise and efficient results for both
time-sensitive and time-agnostic open-vocabulary queries.
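The caption-to-feature pipeline described above can be made concrete with a minimal sketch. Everything here is illustrative rather than the authors' implementation: `query_mllm` is a hypothetical placeholder for the multimodal LLM, the mask-based visual prompt and the caption prompt are assumptions, and a SentenceTransformer stands in for the paper's LLM sentence encoder.

```python
# Minimal sketch of the object-wise caption -> sentence-embedding pipeline.
# NOT the authors' code: `query_mllm` is a hypothetical stand-in for any
# multimodal LLM; the prompts and the encoder choice are assumptions.
from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer  # stand-in LLM encoder


def query_mllm(masked_frames: List[np.ndarray], text_prompt: str) -> str:
    """Hypothetical MLLM call: returns one caption for the highlighted object."""
    raise NotImplementedError("plug a real multimodal LLM in here")


def object_caption_features(video_frames, object_masks,
                            encoder_name="all-MiniLM-L6-v2"):
    """Caption each tracked object's masked video, then encode the captions.

    Returns one sentence embedding per object; during training, every pixel
    inside an object's mask is supervised with that object's embedding,
    yielding the pixel-aligned, object-wise features the abstract describes.
    """
    encoder = SentenceTransformer(encoder_name)
    captions = []
    for masks in object_masks:  # one mask sequence per tracked object
        # visual prompt: keep the object, zero out the background in each frame
        masked = [f * m[..., None] for f, m in zip(video_frames, masks)]
        text_prompt = ("Describe this object and how its state changes over "
                       "time, in one temporally consistent paragraph.")
        captions.append(query_mllm(masked, text_prompt))
    # map captions into a shared text embedding space
    return encoder.encode(captions, normalize_embeddings=True)
```

Because the features live in the encoder's text embedding space, an open-vocabulary query reduces to embedding the query sentence with the same encoder and scoring rendered per-pixel features by cosine similarity (a dot product on normalized embeddings).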
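The status deformable network is only named in the abstract, so the sketch below is one plausible PyTorch reading, not the paper's specification: each object carries K learnable state embeddings, and a tiny MLP maps a timestamp to softmax weights, so the object's semantic feature transitions smoothly between states. `num_states`, layer widths, and the per-object parameterization are all assumptions.

```python
# A minimal "status deformable" sketch: per-object semantic features as a
# smooth, time-dependent mixture of K learnable state embeddings.
# Architecture details here are assumptions, not the paper's specification.
import torch
import torch.nn as nn


class StatusDeformableField(nn.Module):
    def __init__(self, num_objects: int, num_states: int = 3, feat_dim: int = 384):
        super().__init__()
        self.num_objects, self.num_states = num_objects, num_states
        # K candidate state embeddings per object (e.g., closed / opening / open)
        self.states = nn.Parameter(torch.randn(num_objects, num_states, feat_dim))
        # tiny MLP: normalized time t in [0, 1] -> per-object state logits
        self.time_mlp = nn.Sequential(
            nn.Linear(1, 64), nn.ReLU(),
            nn.Linear(64, num_objects * num_states),
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        """t: (B,) timestamps -> (B, num_objects, feat_dim) features."""
        logits = self.time_mlp(t[:, None])
        weights = logits.view(-1, self.num_objects, self.num_states).softmax(-1)
        # convex combination of the fixed state embeddings: the weights vary
        # smoothly with t, so features never jump discontinuously
        return torch.einsum("bok,okd->bod", weights, self.states)


field = StatusDeformableField(num_objects=10)
feats = field(torch.rand(4))  # 4 query times -> (4, 10, 384) features
```

Keeping the state embeddings fixed and deforming only the mixing weights matches the abstract's observation that object states exhibit smooth transitions: a time-sensitive query then matches the span where the corresponding state's weight dominates.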