4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer
December 4, 2025
Authors: Xianfeng Wu, Yajing Bai, Minghan Li, Xianzu Wu, Xueqi Zhao, Zhongyuan Lai, Wenyu Liu, Xinggang Wang
cs.AI
Abstract
Constructing 4D language fields is crucial for embodied AI, augmented/virtual reality, and 4D scene understanding, as they provide enriched semantic representations of dynamic environments and enable open-vocabulary querying in complex scenarios. However, existing approaches to 4D semantic field construction rely primarily on scene-specific Gaussian splatting, which requires per-scene optimization, exhibits limited generalization, and is difficult to scale to real-world applications. To address these limitations, we propose 4DLangVGGT, the first Transformer-based feed-forward unified framework for 4D language grounding, which integrates geometric perception and language alignment within a single architecture. 4DLangVGGT has two key components: the 4D Visual Geometry Transformer, StreamVGGT, which captures spatio-temporal geometric representations of dynamic scenes; and the Semantic Bridging Decoder (SBD), which projects geometry-aware features into a language-aligned semantic space, enhancing semantic interpretability while preserving structural fidelity. Unlike prior methods that depend on costly per-scene optimization, 4DLangVGGT can be jointly trained across multiple dynamic scenes and applied directly at inference, achieving both deployment efficiency and strong generalization. This design significantly improves the practicality of large-scale deployment and establishes a new paradigm for open-vocabulary 4D scene understanding. Experiments on the HyperNeRF and Neu3D datasets demonstrate that our approach not only generalizes effectively but also achieves state-of-the-art performance, with gains of up to 2% under per-scene training and up to 1% under multi-scene training. Our code is released at https://github.com/hustvl/4DLangVGGT.
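The pipeline the abstract describes (geometry-aware tokens from StreamVGGT, projected by the SBD into a language-aligned space and scored against open-vocabulary text queries) can be pictured with a minimal PyTorch sketch. All module names, dimensions, and the cosine-similarity query below are illustrative assumptions rather than the paper's actual implementation; see the linked repository for the real code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of the 4DLangVGGT flow. Layer choices and feature
# dimensions are assumptions for illustration only.

class SemanticBridgingDecoder(nn.Module):
    """Projects geometry-aware tokens into a language-aligned space
    (e.g. a CLIP-style text-embedding space), as the abstract describes."""
    def __init__(self, geo_dim: int = 1024, lang_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(geo_dim, lang_dim),
            nn.GELU(),
            nn.Linear(lang_dim, lang_dim),
        )

    def forward(self, geo_feats: torch.Tensor) -> torch.Tensor:
        # geo_feats: (B, T, N, geo_dim) spatio-temporal geometry tokens,
        # e.g. produced by a StreamVGGT-style feed-forward backbone.
        return F.normalize(self.proj(geo_feats), dim=-1)


def open_vocab_query(lang_feats: torch.Tensor,
                     text_emb: torch.Tensor) -> torch.Tensor:
    """Score every 4D token against one text query by cosine similarity."""
    text_emb = F.normalize(text_emb, dim=-1)          # (lang_dim,)
    return torch.einsum("btnd,d->btn", lang_feats, text_emb)


# Toy usage: random stand-ins for 2 frames of 196 geometry tokens and a
# 512-d text embedding (in practice this would come from a text encoder).
geo = torch.randn(1, 2, 196, 1024)
sbd = SemanticBridgingDecoder()
sims = open_vocab_query(sbd(geo), torch.randn(512))
print(sims.shape)  # torch.Size([1, 2, 196]) per-token relevance scores
```

Because the backbone and decoder are feed-forward, a query like this runs directly at inference on an unseen scene, which is the deployment property the abstract contrasts with per-scene Gaussian-splatting optimization.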