

Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis

July 31, 2025
Authors: Bowen Zhang, Sicheng Xu, Chuxin Wang, Jiaolong Yang, Feng Zhao, Dong Chen, Baining Guo
cs.AI

Abstract

In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with a temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content. Project page: https://gvfdiffusion.github.io/.
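To make the two-stage pipeline described in the abstract more concrete, below is a minimal, hypothetical PyTorch sketch of how a Gaussian-variation-field VAE and a temporal-aware diffusion transformer could fit together. All module names, tensor shapes, and hyperparameters (e.g., `N_GAUSS`, `GS_DIM`, `LATENT_DIM`, the 768-dim video features) are assumptions made for illustration only; they are not taken from the paper or its released code.

```python
# Illustrative sketch only: module names, shapes, and hyperparameters are assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn

N_GAUSS, GS_DIM = 2048, 14      # hypothetical: number of Gaussians, per-Gaussian parameters
T, LATENT_DIM = 16, 256         # hypothetical: number of frames, latent width


class VariationFieldVAE(nn.Module):
    """Encodes canonical GS plus per-frame variations into a compact latent sequence."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(GS_DIM * 2, 512), nn.GELU(), nn.Linear(512, LATENT_DIM * 2))
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM + GS_DIM, 512), nn.GELU(), nn.Linear(512, GS_DIM))

    def encode(self, canonical, variations):
        # canonical: (B, N, GS_DIM); variations: (B, T, N, GS_DIM) offsets from canonical
        x = torch.cat([canonical.unsqueeze(1).expand_as(variations), variations], dim=-1)
        mu, logvar = self.encoder(x).mean(dim=2).chunk(2, dim=-1)  # pool over Gaussians -> (B, T, LATENT_DIM)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()       # reparameterization trick
        return z, mu, logvar

    def decode(self, z, canonical):
        # Broadcast each frame latent to every canonical Gaussian and predict its variation.
        z_exp = z.unsqueeze(2).expand(-1, -1, canonical.shape[1], -1)
        canon_exp = canonical.unsqueeze(1).expand(-1, z.shape[1], -1, -1)
        return self.decoder(torch.cat([z_exp, canon_exp], dim=-1))  # (B, T, N, GS_DIM)


class VariationFieldDiT(nn.Module):
    """Temporal-aware transformer that denoises the latent sequence,
    conditioned on per-frame video features and the canonical GS."""
    def __init__(self):
        super().__init__()
        self.video_proj = nn.Linear(768, LATENT_DIM)   # assumes precomputed per-frame video features
        self.gs_proj = nn.Linear(GS_DIM, LATENT_DIM)
        layer = nn.TransformerEncoderLayer(LATENT_DIM, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.t_embed = nn.Linear(1, LATENT_DIM)         # diffusion timestep embedding

    def forward(self, noisy_z, t, video_feats, canonical):
        # noisy_z: (B, T, LATENT_DIM); t: (B,); video_feats: (B, T, 768); canonical: (B, N, GS_DIM)
        cond = self.video_proj(video_feats) + self.gs_proj(canonical).mean(dim=1, keepdim=True)
        tokens = noisy_z + cond + self.t_embed(t.view(-1, 1, 1).float())
        return self.backbone(tokens)                    # predicted clean latent (or noise)
```

In this sketch the VAE compresses the per-frame Gaussian variations into one latent token per frame, and the transformer denoises that token sequence while attending across time, conditioned on the video and a pooled embedding of the canonical Gaussians; the actual architecture, conditioning scheme, and losses should be taken from the paper.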