Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis

Abstract

In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from a single input video. Direct 4D diffusion modeling is extremely challenging because training data is costly to construct and because jointly representing 3D shape, appearance, and motion is inherently high-dimensional. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that encodes canonical Gaussian Splats (GS) and their temporal variations directly from 3D animation data, without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building on this efficient representation, we train a Gaussian Variation Field diffusion model with a temporally aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also generalizes remarkably well to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content. Project page: https://gvfdiffusion.github.io/.
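To make the idea of canonical GS plus temporal variations concrete, here is a minimal sketch of how such a representation might be applied at playback time. It is not the authors' implementation: the `Gaussian`, `Variation`, `apply_variation`, and `animate` names are hypothetical, and both the choice of varying attributes (position, scale, opacity) and the additive form of the variations are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Gaussian:
    """One splat of the canonical (reference-pose) set. Hypothetical layout."""
    position: Vec3
    scale: Vec3
    opacity: float


@dataclass
class Variation:
    """Per-frame offset for one splat. Additive offsets are an assumption."""
    d_position: Vec3
    d_scale: Vec3
    d_opacity: float


def apply_variation(g: Gaussian, v: Variation) -> Gaussian:
    """Return the splat for one frame: canonical attributes plus offsets."""
    position = tuple(p + d for p, d in zip(g.position, v.d_position))
    scale = tuple(s + d for s, d in zip(g.scale, v.d_scale))
    # Keep opacity inside the valid [0, 1] range after the offset.
    opacity = min(1.0, max(0.0, g.opacity + v.d_opacity))
    return Gaussian(position, scale, opacity)


def animate(canonical: List[Gaussian],
            variation_field: List[List[Variation]]) -> List[List[Gaussian]]:
    """Apply each frame's variations to the shared canonical splats."""
    return [
        [apply_variation(g, v) for g, v in zip(canonical, frame)]
        for frame in variation_field
    ]


# Toy data: one splat, two frames (frame 0 is the canonical pose itself).
canonical = [Gaussian((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), 0.5)]
variation_field = [
    [Variation((0.0, 0.0, 0.0), (0.0, 0.0, 0.0), 0.0)],
    [Variation((0.5, 0.0, 0.0), (0.0, 0.0, 0.0), 0.25)],
]
frames = animate(canonical, variation_field)
```

Under these assumptions, the shape and appearance are stored once in the canonical set, while only the per-frame offsets carry motion, which is the kind of factorization the abstract describes compressing into a latent space.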
