4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

Abstract

Existing dynamic scene generation methods mostly rely on distilling knowledge from pre-trained 3D generative models, which are typically fine-tuned on synthetic object datasets. As a result, the generated scenes are often object-centric and lack photorealism. To address these limitations, we introduce a novel pipeline designed for photorealistic text-to-4D scene generation, discarding the dependency on multi-view generative models and instead fully utilizing video generative models trained on diverse real-world datasets. Our method begins by generating a reference video using the video generation model. We then learn the canonical 3D representation of the video using a freeze-time video, carefully generated from the reference video. To handle inconsistencies in the freeze-time video, we jointly learn a per-frame deformation to model these imperfections. We then learn the temporal deformation based on the canonical representation to capture dynamic interactions in the reference video. The pipeline facilitates the generation of dynamic scenes with enhanced photorealism and structural integrity, viewable from multiple perspectives, thereby setting a new standard in 4D scene generation.
