Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation
September 23, 2025
作者: Sherwin Bahmani, Tianchang Shen, Jiawei Ren, Jiahui Huang, Yifeng Jiang, Haithem Turki, Andrea Tagliasacchi, David B. Lindell, Zan Gojcic, Sanja Fidler, Huan Ling, Jun Gao, Xuanchi Ren
cs.AI
Abstract
The ability to generate virtual environments is crucial for applications
ranging from gaming to physical AI domains such as robotics, autonomous
driving, and industrial AI. Current learning-based 3D reconstruction methods
rely on the availability of captured real-world multi-view data, which is not
always readily available. Recent advancements in video diffusion models have
shown remarkable imagination capabilities, yet their 2D nature limits their
application in simulation, where a robot needs to navigate and interact with
the environment. In this paper, we propose a self-distillation framework that
aims to distill the implicit 3D knowledge in the video diffusion models into an
explicit 3D Gaussian Splatting (3DGS) representation, eliminating the need for
multi-view training data. Specifically, we augment the typical RGB decoder with
a 3DGS decoder, which is supervised by the output of the RGB decoder. With
this approach, the 3DGS decoder can be trained purely on synthetic data generated
by video diffusion models. At inference time, our model can synthesize 3D
scenes from either a text prompt or a single image for real-time rendering. Our
framework further extends to dynamic 3D scene generation from a monocular input
video. Experimental results show that our framework achieves state-of-the-art
performance in static and dynamic 3D scene generation.
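The core self-distillation idea can be sketched in a toy form: a frozen RGB decoder acts as the teacher, and a 3DGS-style decoder is trained solely on the teacher's outputs, with no multi-view ground truth. The sketch below is a minimal illustration under assumed shapes and names (`rgb_decoder`, `gs_decoder`, the linear "renderer") that are hypothetical stand-ins, not the paper's actual architecture.

```python
import numpy as np

# Toy sketch of self-distillation: a frozen teacher decoder supervises a
# student decoder. All shapes and module names here are illustrative.

rng = np.random.default_rng(0)
D_LAT, D_PIX = 8, 16  # latent and "pixel" dimensions (arbitrary)

W_teacher = rng.normal(size=(D_LAT, D_PIX))

def rgb_decoder(z):
    """Frozen teacher: stands in for the diffusion model's RGB decoder."""
    return np.tanh(z @ W_teacher)

def gs_decoder(z, W):
    """Student: stands in for the 3DGS decoder plus a differentiable render;
    here it is just a linear map for the sake of the sketch."""
    return z @ W

# Train the student only on teacher outputs (no multi-view training data).
W_student = np.zeros((D_LAT, D_PIX))
lr = 0.05
losses = []
for _ in range(200):
    z = rng.normal(size=(32, D_LAT))   # synthetic latents from "generation"
    target = rgb_decoder(z)            # teacher output = supervision signal
    pred = gs_decoder(z, W_student)
    err = pred - target
    losses.append(float(np.mean(err ** 2)))
    W_student -= lr * (z.T @ err) / len(z)  # gradient step on the MSE loss

print(f"distillation loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

In the actual framework the student's output is an explicit 3DGS representation rendered to images, so the distillation loss is computed in image space against the RGB decoder's frames; the linear map above merely stands in for that render path.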