Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis

Abstract

In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with a temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content. Project page: https://gvfdiffusion.github.io/.
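
As a rough illustration of what "canonical Gaussian Splats and their temporal variations" could mean in practice, the sketch below stores one static set of Gaussians plus per-frame offsets and reconstructs a frame by adding the two. This is not the paper's implementation: the class names, the `gaussians_at_frame` helper, and the choice to vary only the Gaussian centers are assumptions made for brevity, and the VAE and diffusion model that would produce these offsets are omitted entirely.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class CanonicalGS:
    """Static Gaussians in a reference pose (hypothetical, simplified layout)."""
    positions: List[Vec3]  # one center per Gaussian
    # A full splat would also carry scale, rotation, color and opacity.


@dataclass
class VariationField:
    """Per-frame offsets relative to the canonical Gaussians (assumed form)."""
    position_deltas: List[List[Vec3]]  # indexed as [frame][gaussian]


def gaussians_at_frame(canon: CanonicalGS, field: VariationField, frame: int) -> List[Vec3]:
    """Return the Gaussian centers at `frame` as canonical position + offset."""
    deltas = field.position_deltas[frame]
    if len(deltas) != len(canon.positions):
        raise ValueError("each frame needs exactly one offset per Gaussian")
    return [
        (p[0] + d[0], p[1] + d[1], p[2] + d[2])
        for p, d in zip(canon.positions, deltas)
    ]


# Toy animation: two Gaussians, two frames; frame 0 is the canonical pose.
canon = CanonicalGS(positions=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
field = VariationField(position_deltas=[
    [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    [(0.0, 0.5, 0.0), (0.0, 0.25, 0.0)],
])
frame1 = gaussians_at_frame(canon, field, 1)
```

Under this factorization the static shape and appearance are stored once, and only the offsets change over time, which is the part the abstract describes as being compressed into a compact latent space.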
