Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation
September 23, 2025
Authors: Sherwin Bahmani, Tianchang Shen, Jiawei Ren, Jiahui Huang, Yifeng Jiang, Haithem Turki, Andrea Tagliasacchi, David B. Lindell, Zan Gojcic, Sanja Fidler, Huan Ling, Jun Gao, Xuanchi Ren
cs.AI
Abstract
The ability to generate virtual environments is crucial for applications
ranging from gaming to physical AI domains such as robotics, autonomous
driving, and industrial AI. Current learning-based 3D reconstruction methods
rely on captured real-world multi-view data, which is not always readily
available. Recent advances in video diffusion models have shown remarkable
imaginative capabilities, yet their 2D nature limits their applicability to
simulation, where a robot must navigate and interact with the environment. In
this paper, we propose a self-distillation framework that distills the
implicit 3D knowledge in video diffusion models into an explicit 3D Gaussian
Splatting (3DGS) representation, eliminating the need for multi-view training
data. Specifically, we augment the typical RGB decoder with a 3DGS decoder
that is supervised by the output of the RGB decoder. With this approach, the
3DGS decoder can be trained purely on synthetic data generated by video
diffusion models. At inference time, our model can synthesize 3D scenes from
either a text prompt or a single image and render them in real time. Our
framework further extends to dynamic 3D scene generation from a monocular
input video. Experimental results show that our framework achieves
state-of-the-art performance in both static and dynamic 3D scene generation.
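The core idea, a 3DGS decoder trained against the frozen RGB decoder's output, can be illustrated with a minimal PyTorch-style sketch. All module and function names below (RGBDecoder, GaussianDecoder, rasterize) are hypothetical stand-ins for illustration, not the paper's actual implementation; the differentiable 3DGS rasterizer is stubbed so the snippet runs end to end.

```python
# Minimal sketch of the self-distillation objective, assuming a
# PyTorch-style setup. Hypothetical modules; shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RGBDecoder(nn.Module):
    """Stand-in for the frozen RGB decoder of the video diffusion model."""
    def __init__(self, latent_dim=8):
        super().__init__()
        self.proj = nn.Conv2d(latent_dim, 3, kernel_size=1)

    def forward(self, latents):  # (N, C, H, W) -> (N, 3, H, W) frames
        return torch.sigmoid(self.proj(latents))

class GaussianDecoder(nn.Module):
    """Stand-in 3DGS decoder: predicts per-pixel Gaussian parameters
    (e.g. position, rotation, scale, opacity, color = 14 channels)."""
    def __init__(self, latent_dim=8, gauss_dim=14):
        super().__init__()
        self.proj = nn.Conv2d(latent_dim, gauss_dim, kernel_size=1)

    def forward(self, latents):
        return self.proj(latents)

def rasterize(gaussians, cameras):
    """Placeholder for a differentiable 3DGS rasterizer; here we just
    map the first channels back to RGB so the sketch executes."""
    return torch.sigmoid(gaussians[:, :3])

# --- one self-distillation training step on synthetic video latents ---
latents = torch.randn(4, 8, 32, 32)   # 4 frames of diffusion latents
cameras = None                        # camera poses would go here

rgb_dec = RGBDecoder().eval()         # frozen: provides the supervision
gs_dec = GaussianDecoder()            # trainable 3DGS decoder
opt = torch.optim.Adam(gs_dec.parameters(), lr=1e-4)

with torch.no_grad():
    target = rgb_dec(latents)         # pseudo-ground-truth RGB frames

rendered = rasterize(gs_dec(latents), cameras)
loss = F.l1_loss(rendered, target)    # distillation loss on renderings
opt.zero_grad()
loss.backward()
opt.step()
```

Because the supervision signal comes entirely from the RGB decoder applied to generated latents, no captured multi-view data enters the loop, which is the point of the self-distillation design.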