VideoFrom3D: 3D Scene Video Generation via Complementary Image and Video Diffusion Models

Abstract

In this paper, we propose VideoFrom3D, a novel framework for synthesizing high-quality 3D scene videos from coarse geometry, a camera trajectory, and a reference image. Our approach streamlines the 3D graphic design workflow, enabling flexible design exploration and rapid production of deliverables. A straightforward way to synthesize a video from coarse geometry would be to condition a video diffusion model on the geometric structure. However, existing video diffusion models struggle to generate high-fidelity results for complex scenes because visual quality, motion, and temporal consistency are difficult to model jointly. To address this, we propose a generative framework that leverages the complementary strengths of image and video diffusion models. Specifically, our framework consists of a Sparse Anchor-view Generation (SAG) module and a Geometry-guided Generative Inbetweening (GGI) module. The SAG module generates high-quality, cross-view consistent anchor views using an image diffusion model, aided by Sparse Appearance-guided Sampling. Building on these anchor views, the GGI module faithfully interpolates intermediate frames using a video diffusion model enhanced with flow-based camera control and structural guidance. Notably, both modules operate without any paired dataset of 3D scene models and natural images, which is extremely difficult to obtain. Comprehensive experiments show that our method produces high-quality, style-consistent scene videos under diverse and challenging scenarios, outperforming simple and extended baselines.
