4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models
June 11, 2024
Authors: Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Laszlo A Jeni, Sergey Tulyakov, Hsin-Ying Lee
cs.AI
Abstract
Existing dynamic scene generation methods mostly rely on distilling knowledge
from pre-trained 3D generative models, which are typically fine-tuned on
synthetic object datasets. As a result, the generated scenes are often
object-centric and lack photorealism. To address these limitations, we
introduce a novel pipeline designed for photorealistic text-to-4D scene
generation, discarding the dependency on multi-view generative models and
instead fully utilizing video generative models trained on diverse real-world
datasets. Our method begins by generating a reference video using the video
generation model. We then learn the canonical 3D representation of the video
using a freeze-time video, delicately generated from the reference video. To
handle inconsistencies in the freeze-time video, we jointly learn a per-frame
deformation to model these imperfections. We then learn the temporal
deformation based on the canonical representation to capture dynamic
interactions in the reference video. The pipeline facilitates the generation of
dynamic scenes with enhanced photorealism and structural integrity, viewable
from multiple perspectives, thereby setting a new standard in 4D scene
generation.
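The abstract describes a four-stage pipeline: sample a reference video, derive a freeze-time video from it, fit a canonical 3D representation jointly with per-frame deformations, and then learn a temporal deformation that reproduces the reference dynamics. The sketch below only illustrates that control flow under stated assumptions; every name (`sample_video`, `fit_canonical_3d`, `fit_temporal_deformation`, `Scene4D`) is a hypothetical placeholder inferred from the abstract, not the authors' released code, and the function bodies are stubs that show how the stages connect rather than how each is implemented.

```python
# Illustrative-only sketch of the pipeline stages named in the abstract.
# All classes and functions are hypothetical stand-ins, not the 4Real codebase.
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Video:
    """Placeholder container for a decoded video clip."""
    frames: List[Any] = field(default_factory=list)


@dataclass
class Scene4D:
    """Pipeline outputs: a canonical 3D scene plus its deformations."""
    canonical_3d: Any        # canonical 3D representation of the frozen scene
    per_frame_deform: Any    # per-frame deformations absorbing freeze-time inconsistencies
    temporal_deform: Any     # temporal deformation capturing the reference dynamics


def sample_video(prompt: str, condition: Optional[Video] = None) -> Video:
    """Stand-in for sampling a pre-trained text-to-video diffusion model."""
    return Video()


def fit_canonical_3d(freeze_time_video: Video):
    """Stand-in for jointly fitting the canonical 3D representation and the
    per-frame deformations that model imperfections in the freeze-time video."""
    return object(), object()


def fit_temporal_deformation(canonical_3d: Any, reference_video: Video) -> Any:
    """Stand-in for learning the temporal deformation on top of the canonical
    representation so it follows the reference video's motion."""
    return object()


def generate_4d_scene(prompt: str) -> Scene4D:
    # 1. Generate a reference video of the dynamic scene from the text prompt.
    reference = sample_video(prompt)

    # 2. Generate a freeze-time video (moving camera, frozen scene motion),
    #    conditioned on the reference video.
    freeze_time = sample_video(prompt + ", frozen moment, orbiting camera",
                               condition=reference)

    # 3. Fit the canonical 3D representation jointly with per-frame deformations.
    canonical_3d, per_frame_deform = fit_canonical_3d(freeze_time)

    # 4. Learn the temporal deformation that animates the canonical representation.
    temporal_deform = fit_temporal_deformation(canonical_3d, reference)

    return Scene4D(canonical_3d, per_frame_deform, temporal_deform)


if __name__ == "__main__":
    scene = generate_4d_scene("a dog splashing through a puddle")
    print(type(scene).__name__)
```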