Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation

Abstract

The ability to generate virtual environments is crucial for applications ranging from gaming to physical AI domains such as robotics, autonomous driving, and industrial AI. Current learning-based 3D reconstruction methods rely on captured real-world multi-view data, which is not always readily available. Recent advances in video diffusion models have shown remarkable imagination capabilities, yet their 2D nature limits their application to simulation, where a robot needs to navigate and interact with the environment. In this paper, we propose a self-distillation framework that distills the implicit 3D knowledge in video diffusion models into an explicit 3D Gaussian Splatting (3DGS) representation, eliminating the need for multi-view training data. Specifically, we augment the typical RGB decoder with a 3DGS decoder, which is supervised by the output of the RGB decoder. With this approach, the 3DGS decoder can be trained purely on synthetic data generated by video diffusion models. At inference time, our model can synthesize 3D scenes from either a text prompt or a single image for real-time rendering. Our framework further extends to dynamic 3D scene generation from a monocular input video. Experimental results show that our framework achieves state-of-the-art performance in static and dynamic 3D scene generation.
