LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion
July 3, 2025
Authors: Fangfu Liu, Hao Li, Jiawei Chi, Hanyang Wang, Minghui Yang, Fudong Wang, Yueqi Duan
cs.AI
Abstract
Recovering 3D structures with open-vocabulary scene understanding from 2D
images is a fundamental but daunting task. Recent developments have achieved
this by performing per-scene optimization with embedded language information.
However, they heavily rely on the calibrated dense-view reconstruction
paradigm, thereby suffering from severe rendering artifacts and implausible
semantic synthesis when only limited views are available. In this paper, we
introduce a novel generative framework, coined LangScene-X, to unify and
generate 3D-consistent multimodal information for reconstruction and
understanding. By generating additional consistent novel observations, we can
build generalizable 3D language-embedded scenes from
only sparse views. Specifically, we first train a TriMap video diffusion model
that can generate appearance (RGBs), geometry (normals), and semantics
(segmentation maps) from sparse inputs through progressive knowledge
integration. Furthermore, we propose a Language Quantized Compressor (LQC),
trained on large-scale image datasets, to efficiently encode language
embeddings, enabling cross-scene generalization without per-scene retraining.
Finally, we reconstruct the language surface fields by aligning language
information onto the surface of 3D scenes, enabling open-ended language
queries. Extensive experiments on real-world data demonstrate that
LangScene-X outperforms state-of-the-art methods in both quality and
generalizability. Project Page: https://liuff19.github.io/LangScene-X.
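
The abstract does not spell out how the compressed language embeddings are queried, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a plain nearest-neighbor codebook lookup as a stand-in for the learned Language Quantized Compressor, and cosine similarity as the relevance score for an open-ended text query. The function names `quantize` and `query_relevance`, the toy codebook, and the 2-D features are all hypothetical.

```python
import numpy as np


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    Assumption: a fixed codebook with Euclidean nearest-neighbor lookup,
    used here only to illustrate quantized compression.
    """
    # Squared Euclidean distance between every feature and every code: (N, K)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def query_relevance(indices, codebook, text_embedding):
    """Cosine similarity between a text embedding and each decoded feature."""
    decoded = codebook[indices]  # decompress: look the codes back up
    decoded = decoded / np.linalg.norm(decoded, axis=1, keepdims=True)
    text = text_embedding / np.linalg.norm(text_embedding)
    return decoded @ text


# Toy data (hypothetical): 3 codebook entries, 4 surface-point features in 2-D.
codebook = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
features = np.array([[0.9, 0.1], [0.1, 0.8], [-0.7, 0.2], [0.8, -0.1]])

indices = quantize(features, codebook)
scores = query_relevance(indices, codebook, np.array([2.0, 0.0]))
print(indices, scores)
```

Storing one integer index per point instead of a full embedding is what makes this kind of representation compact; a shared codebook is also what would let the same compressor be reused across scenes.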