
Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

March 19, 2026
Authors: Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, Xiang Bai
cs.AI

Abstract

While Multimodal Large Language Models (MLLMs) demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
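
To make the two mechanisms in the abstract concrete, below is a minimal PyTorch sketch of (1) capturing spatiotemporal features from a video diffusion denoiser at an intermediate noise level and (2) token-level adaptive gated fusion with an MLLM's semantic tokens. Every name here (`TokenGatedFusion`, `extract_diffusion_features`, the `mid_block` attribute, the linear noising stand-in) is an illustrative assumption based on the abstract's description, not the released VEGA-3D code; consult the repository for the actual implementation.

```python
import torch
import torch.nn as nn


class TokenGatedFusion(nn.Module):
    """Fuse semantic tokens with geometric tokens via a learned per-token gate."""

    def __init__(self, sem_dim: int, geo_dim: int):
        super().__init__()
        self.proj = nn.Linear(geo_dim, sem_dim)  # align geometric features to the semantic width
        self.gate = nn.Sequential(               # predicts a scalar gate in [0, 1] per token
            nn.Linear(sem_dim * 2, sem_dim),
            nn.GELU(),
            nn.Linear(sem_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, sem: torch.Tensor, geo: torch.Tensor) -> torch.Tensor:
        # sem: (B, N, sem_dim) semantic tokens from the MLLM's vision encoder
        # geo: (B, N, geo_dim) spatiotemporal tokens from the diffusion backbone
        geo = self.proj(geo)
        g = self.gate(torch.cat([sem, geo], dim=-1))  # (B, N, 1) per-token gate
        return sem + g * geo                          # gated residual fusion


@torch.no_grad()
def extract_diffusion_features(denoiser: nn.Module, latents: torch.Tensor, t: int) -> torch.Tensor:
    """Run one denoising pass at noise level t and capture intermediate features.

    Assumes `denoiser` exposes a `mid_block` submodule we can hook; real video
    diffusion backbones differ in module naming and may return tuples.
    """
    feats = {}
    handle = denoiser.mid_block.register_forward_hook(
        lambda module, inp, out: feats.__setitem__("mid", out)
    )
    noise = torch.randn_like(latents)
    # Simple linear stand-in for a noise schedule; real schedulers differ.
    alpha = 1.0 - t / 1000.0
    noisy = alpha ** 0.5 * latents + (1.0 - alpha) ** 0.5 * noise
    denoiser(noisy, torch.full((latents.shape[0],), t))
    handle.remove()
    return feats["mid"]
```

The gated residual form `sem + g * geo` is a common design for plug-and-play feature injection: when the geometric stream is uninformative the gate can close toward zero, leaving the pretrained MLLM's semantic tokens untouched, which matches the abstract's claim that the framework adds dense geometric cues without disturbing semantic understanding.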