

Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis

July 31, 2025
Authors: Bowen Zhang, Sicheng Xu, Chuxin Wang, Jiaolong Yang, Feng Zhao, Dong Chen, Baining Guo
cs.AI

Abstract

In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with a temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content. Project page: https://gvfdiffusion.github.io/.
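To make the two-stage design described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch: a variation-field VAE that compresses per-frame deviations from canonical Gaussian Splats into a compact latent sequence, and a temporal-aware transformer denoiser conditioned on video features and the canonical GS. All module names, tensor layouts, and dimensions (e.g. `gs_dim`, `latent_dim`, the mean-pooled video conditioning) are illustrative assumptions based only on the abstract, not the paper's actual implementation.

```python
# Hypothetical sketch of the two-stage pipeline described in the abstract.
# All names, shapes, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class GaussianVariationFieldVAE(nn.Module):
    """Compresses per-frame variations of canonical Gaussian Splats (GS)
    into a compact latent sequence (stage 1 in the abstract)."""

    def __init__(self, gs_dim=14, latent_dim=64):
        super().__init__()
        # gs_dim: assumed per-Gaussian attributes (position, scale, rotation, opacity, color)
        self.encoder = nn.Sequential(
            nn.Linear(gs_dim * 2, 256), nn.SiLU(), nn.Linear(256, latent_dim * 2)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + gs_dim, 256), nn.SiLU(), nn.Linear(256, gs_dim)
        )

    def encode(self, canonical_gs, gs_variation):
        # canonical_gs: (N, gs_dim); gs_variation: (T, N, gs_dim), deviation per frame
        T = gs_variation.shape[0]
        cond = canonical_gs.unsqueeze(0).expand(T, -1, -1)
        mu, logvar = self.encoder(torch.cat([cond, gs_variation], -1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization trick
        return z, mu, logvar

    def decode(self, z, canonical_gs):
        T = z.shape[0]
        cond = canonical_gs.unsqueeze(0).expand(T, -1, -1)
        return self.decoder(torch.cat([z, cond], -1))  # reconstructed variation field


class TemporalAwareDiT(nn.Module):
    """Denoises latent variation sequences, conditioned on video features and
    canonical GS (stage 2). A plain transformer encoder stands in for the DiT."""

    def __init__(self, latent_dim=64, video_dim=768, n_layers=4):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, latent_dim)
        layer = nn.TransformerEncoderLayer(latent_dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.t_embed = nn.Linear(1, latent_dim)

    def forward(self, noisy_latents, diffusion_t, video_feats):
        # noisy_latents: (B, T*N, latent_dim); video_feats: (B, T, video_dim)
        cond = self.video_proj(video_feats).mean(1, keepdim=True)   # pooled video condition
        t = self.t_embed(diffusion_t.view(-1, 1, 1).float())        # diffusion-step embedding
        return self.blocks(noisy_latents + cond + t)                # predicted noise


if __name__ == "__main__":
    vae, dit = GaussianVariationFieldVAE(), TemporalAwareDiT()
    canon = torch.randn(128, 14)        # 128 canonical Gaussians
    motion = torch.randn(16, 128, 14)   # 16-frame variation field
    z, mu, logvar = vae.encode(canon, motion)
    recon = vae.decode(z, canon)
    noise_pred = dit(z.reshape(1, -1, 64), torch.tensor([500]), torch.randn(1, 16, 768))
    print(recon.shape, noise_pred.shape)  # (16, 128, 14) and (1, 2048, 64)
```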