LangScene-X: Reconstruct Generalizable 3D Language-Embedded Scenes with TriMap Video Diffusion

Abstract

Recovering 3D structures with open-vocabulary scene understanding from 2D images is a fundamental but daunting task. Recent work has achieved this by performing per-scene optimization with embedded language information. However, these methods rely heavily on the calibrated dense-view reconstruction paradigm and therefore suffer from severe rendering artifacts and implausible semantic synthesis when only limited views are available. In this paper, we introduce a novel generative framework, coined LangScene-X, to unify and generate 3D-consistent multi-modality information for reconstruction and understanding. Powered by the generative capability of creating more consistent novel observations, we can build generalizable 3D language-embedded scenes from only sparse views. Specifically, we first train a TriMap video diffusion model that can generate appearance (RGBs), geometry (normals), and semantics (segmentation maps) from sparse inputs through progressive knowledge integration. Furthermore, we propose a Language Quantized Compressor (LQC), trained on large-scale image datasets, to efficiently encode language embeddings, enabling cross-scene generalization without per-scene retraining. Finally, we reconstruct the language surface fields by aligning language information onto the surface of 3D scenes, enabling open-ended language queries. Extensive experiments on real-world data demonstrate the superiority of LangScene-X over state-of-the-art methods in terms of quality and generalizability. Project Page: https://liuff19.github.io/LangScene-X.
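
The abstract does not describe the internals of the LQC. As a rough illustration of the general idea behind quantized compression of feature vectors, and not the authors' implementation, the sketch below uses plain nearest-neighbor vector quantization: each embedding is stored as the index of its closest codebook entry. The toy 3-dimensional embeddings, the codebook values, and the function names `quantize` and `dequantize` are all hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(embedding: Vector, codebook: List[Vector]) -> int:
    """Return the index of the codebook entry closest to the embedding."""
    return min(
        range(len(codebook)),
        key=lambda i: squared_distance(embedding, codebook[i]),
    )


def dequantize(index: int, codebook: List[Vector]) -> Vector:
    """Recover the (lossy) embedding by looking up the stored index."""
    return codebook[index]


# Hypothetical toy codebook; a real one would be learned from data.
codebook = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]

# Hypothetical embeddings standing in for per-pixel language features.
embeddings = [(0.9, 0.1, 0.0), (0.1, 0.2, 0.8)]

# Each embedding is compressed to a single integer code.
codes = [quantize(e, codebook) for e in embeddings]
```

Because the codebook in such a scheme is shared rather than fitted to one scene, the same lookup can in principle be reused across scenes, which is the property the abstract attributes to the LQC.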
