Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

March 19, 2026
Authors: Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, Xiang Bai
cs.AI

Abstract

While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
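The abstract does not spell out how the token-level adaptive gated fusion works. Below is a minimal PyTorch sketch of one plausible reading: each visual token learns a scalar gate that decides how much of the diffusion model's (projected) spatiotemporal feature to blend into its semantic feature. All names here (TokenGatedFusion, sem_dim, gen_dim) are hypothetical illustrations, not the paper's actual implementation; consult the released code at the repository above for the real design.

```python
import torch
import torch.nn as nn

class TokenGatedFusion(nn.Module):
    """Hypothetical sketch of a token-level adaptive gated fusion.

    Each token gets a learned gate in [0, 1] controlling how much
    generative (geometric) feature is injected into its semantic
    feature. Names and dimensions are illustrative assumptions.
    """

    def __init__(self, sem_dim: int, gen_dim: int):
        super().__init__()
        # Align diffusion features to the semantic token width.
        self.proj = nn.Linear(gen_dim, sem_dim)
        # Per-token scalar gate computed from both feature streams.
        self.gate = nn.Sequential(
            nn.Linear(sem_dim * 2, sem_dim),
            nn.GELU(),
            nn.Linear(sem_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, sem_tokens: torch.Tensor, gen_tokens: torch.Tensor) -> torch.Tensor:
        # sem_tokens: (B, N, sem_dim) semantic tokens from the MLLM vision encoder
        # gen_tokens: (B, N, gen_dim) spatiotemporal features from the video diffusion model
        gen = self.proj(gen_tokens)
        g = self.gate(torch.cat([sem_tokens, gen], dim=-1))  # (B, N, 1)
        # Gated residual injection of geometric cues into semantic tokens.
        return sem_tokens + g * gen

# Toy usage: 256 tokens, semantic width 1024, generative width 1280.
fusion = TokenGatedFusion(sem_dim=1024, gen_dim=1280)
fused = fusion(torch.randn(2, 256, 1024), torch.randn(2, 256, 1280))
print(fused.shape)  # torch.Size([2, 256, 1024])
```

A per-token scalar gate is the simplest variant consistent with the abstract's wording; a channel-wise gate (output width sem_dim instead of 1) would be an equally plausible reading.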