4DLangVGGT: 4D Language-Visual Geometry Grounded Transformer
December 4, 2025
Authors: Xianfeng Wu, Yajing Bai, Minghan Li, Xianzu Wu, Xueqi Zhao, Zhongyuan Lai, Wenyu Liu, Xinggang Wang
cs.AI
Abstract
Constructing 4D language fields is crucial for embodied AI, augmented/virtual reality, and 4D scene understanding, as they provide enriched semantic representations of dynamic environments and enable open-vocabulary querying in complex scenarios. However, existing approaches to 4D semantic field construction rely primarily on scene-specific Gaussian splatting, which requires per-scene optimization, exhibits limited generalization, and is difficult to scale to real-world applications. To address these limitations, we propose 4DLangVGGT, the first Transformer-based, feed-forward unified framework for 4D language grounding that jointly integrates geometric perception and language alignment within a single architecture. 4DLangVGGT has two key components: the 4D Visual Geometry Transformer, StreamVGGT, which captures spatio-temporal geometric representations of dynamic scenes; and the Semantic Bridging Decoder (SBD), which projects geometry-aware features into a language-aligned semantic space, enhancing semantic interpretability while preserving structural fidelity. Unlike prior methods that depend on costly per-scene optimization, 4DLangVGGT can be trained jointly across multiple dynamic scenes and applied directly at inference, achieving both deployment efficiency and strong generalization. This design significantly improves the practicality of large-scale deployment and establishes a new paradigm for open-vocabulary 4D scene understanding. Experiments on the HyperNeRF and Neu3D datasets demonstrate that our approach not only generalizes effectively but also achieves state-of-the-art performance, with gains of up to 2% under per-scene training and 1% under multi-scene training. Our code is released at https://github.com/hustvl/4DLangVGGT.
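
To make the two-stage design described above concrete, the following is a minimal, hypothetical sketch of how a feed-forward geometry-to-language pipeline of this kind could be wired up in PyTorch. The module names (SemanticBridgingDecoder, open_vocab_query), the feature dimensions, and the cosine-similarity query against a text embedding are assumptions for illustration only; they are not taken from the released 4DLangVGGT code, and the StreamVGGT backbone is stood in for by random tensors.

```python
# Hypothetical sketch of a feed-forward geometry-to-language pipeline.
# Names, shapes, and the query step are illustrative assumptions, not the
# actual 4DLangVGGT implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticBridgingDecoder(nn.Module):
    """Projects geometry-aware tokens into a language-aligned embedding space
    (e.g., a CLIP-style text-embedding dimension) while keeping the token layout."""

    def __init__(self, geo_dim: int = 1024, lang_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(geo_dim),
            nn.Linear(geo_dim, lang_dim),
            nn.GELU(),
            nn.Linear(lang_dim, lang_dim),
        )

    def forward(self, geo_tokens: torch.Tensor) -> torch.Tensor:
        # geo_tokens: (batch, frames, tokens, geo_dim) -> language-aligned features
        lang_feats = self.proj(geo_tokens)
        return F.normalize(lang_feats, dim=-1)


def open_vocab_query(lang_feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between per-token language features and one text query."""
    text_emb = F.normalize(text_emb, dim=-1)
    return torch.einsum("btnd,d->btn", lang_feats, text_emb)


# Usage sketch: geometry tokens would come from a frozen 4D geometry backbone
# (StreamVGGT); random tensors are used here since the backbone is not reproduced.
geo_tokens = torch.randn(1, 8, 196, 1024)                  # (batch, frames, tokens, geo_dim)
sbd = SemanticBridgingDecoder()
lang_feats = sbd(geo_tokens)                               # (1, 8, 196, 512)
scores = open_vocab_query(lang_feats, torch.randn(512))    # per-token relevance map
```

Because the semantic head runs as a single forward pass on top of the geometry tokens, no per-scene optimization is needed at inference: a text query only requires one projection and one similarity computation over the already-extracted features.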