Zero4D: Training-Free 4D Video Generation From Single Video Using Off-the-Shelf Video Diffusion Model

Abstract

Multi-view, or 4D, video generation has recently emerged as a significant research topic. However, recent approaches to 4D generation still face fundamental limitations: they rely either on combining multiple video diffusion models with additional training, or on training a full 4D diffusion model, which is computationally expensive and constrained by the limited availability of real-world 4D data. To address these challenges, we propose the first training-free 4D video generation method, which uses an off-the-shelf video diffusion model to generate multi-view videos from a single input video. Our approach consists of two key steps: (1) We designate the edge frames of the spatio-temporal sampling grid as key frames and synthesize them first with a video diffusion model, using a depth-based warping technique for guidance. This ensures structural consistency across the generated frames and preserves spatial and temporal coherence. (2) We then interpolate the remaining frames with a video diffusion model, producing a fully populated sampling grid while preserving spatial and temporal consistency. In this way, we extend a single video into a multi-view video along novel camera trajectories while maintaining spatio-temporal consistency. Because our method is training-free and relies entirely on an off-the-shelf video diffusion model, it offers a practical and effective solution for multi-view video generation.
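The two-step schedule described in the abstract can be pictured as a labeling of the spatio-temporal sampling grid. The sketch below is only an illustration under stated assumptions, not the authors' implementation: it assumes the grid is indexed as (camera view, time), that the input video occupies the first view row, and that the three remaining edges are the key frames. The function name `plan_sampling_grid` and the labels `INPUT`, `KEY`, and `INTERP` are hypothetical, and no diffusion model or depth-based warping is invoked.

```python
import numpy as np

# Hypothetical labels for the role of each grid cell.
INPUT, KEY, INTERP = 0, 1, 2


def plan_sampling_grid(num_views: int, num_frames: int) -> np.ndarray:
    """Label every (view, time) cell of the sampling grid.

    Assumed layout (not confirmed by the abstract): row 0 holds the
    input video, and the other three edges of the grid are key frames.
    """
    # Start by marking every cell as "to be interpolated" (step 2).
    grid = np.full((num_views, num_frames), INTERP, dtype=np.int8)

    # Step 1: edge frames are key frames, synthesized first.
    grid[:, 0] = KEY    # all views at the first time step
    grid[:, -1] = KEY   # all views at the last time step
    grid[-1, :] = KEY   # the last view across all time steps

    # The input video is given, so it overrides the first row.
    grid[0, :] = INPUT
    return grid


if __name__ == "__main__":
    print(plan_sampling_grid(num_views=4, num_frames=5))
```

Under these assumptions, a grid with 4 views and 5 frames has 5 input cells, 9 key-frame cells to synthesize in step (1), and 6 interior cells left for the interpolation in step (2).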
