
4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

June 11, 2024
Authors: Heng Yu, Chaoyang Wang, Peiye Zhuang, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Laszlo A Jeni, Sergey Tulyakov, Hsin-Ying Lee
cs.AI

Abstract

Existing dynamic scene generation methods mostly rely on distilling knowledge from pre-trained 3D generative models, which are typically fine-tuned on synthetic object datasets. As a result, the generated scenes are often object-centric and lack photorealism. To address these limitations, we introduce a novel pipeline designed for photorealistic text-to-4D scene generation, discarding the dependency on multi-view generative models and instead fully utilizing video generative models trained on diverse real-world datasets. Our method begins by generating a reference video using a video generation model. We then learn the canonical 3D representation of the video using a freeze-time video, carefully generated from the reference video. To handle inconsistencies in the freeze-time video, we jointly learn a per-frame deformation to model these imperfections. We then learn the temporal deformation based on the canonical representation to capture dynamic interactions in the reference video. The pipeline facilitates the generation of dynamic scenes with enhanced photorealism and structural integrity, viewable from multiple perspectives, thereby setting a new standard in 4D scene generation.
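To make the four stages of the abstract concrete, the sketch below outlines their ordering: reference video generation, freeze-time video generation, canonical 3D fitting with a per-frame deformation, and temporal deformation. This is only a minimal, hypothetical outline under assumed interfaces; every function and class name here (Video, sample_reference_video, fit_canonical_3d, fit_temporal_deformation, and the example prompt) is an illustrative placeholder, not the authors' implementation or any real library API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Video:
    """Stand-in for a decoded video clip; frames are just labels here."""
    frames: List[str]


def sample_reference_video(prompt: str, num_frames: int = 16) -> Video:
    """Stage 1: sample a dynamic reference video from the text prompt,
    standing in for a pre-trained text-to-video diffusion model."""
    return Video(frames=[f"{prompt} | t={t}" for t in range(num_frames)])


def sample_freeze_time_video(reference: Video, num_views: int = 16) -> Video:
    """Stage 2: generate a freeze-time video, i.e. a camera sweep around
    a single frozen moment of the reference video (placeholder)."""
    frozen_moment = reference.frames[0]
    return Video(frames=[f"{frozen_moment} | view={v}" for v in range(num_views)])


def fit_canonical_3d(freeze_time: Video) -> Tuple[Dict, List[str]]:
    """Stage 3: fit a canonical 3D representation of the scene while
    jointly learning a per-frame deformation that absorbs view-to-view
    inconsistencies in the generated frames (placeholder)."""
    canonical_3d = {"type": "canonical-3d", "num_views": len(freeze_time.frames)}
    per_frame_deformation = [f"deform({frame})" for frame in freeze_time.frames]
    return canonical_3d, per_frame_deformation


def fit_temporal_deformation(canonical_3d: Dict, reference: Video) -> List[str]:
    """Stage 4: learn a time-dependent deformation of the canonical scene
    so that its renderings reproduce the reference video's dynamics."""
    return [f"warp({canonical_3d['type']}, t={t})" for t in range(len(reference.frames))]


def text_to_4d(prompt: str) -> Tuple[Dict, List[str]]:
    """Chain the four stages described in the abstract."""
    reference = sample_reference_video(prompt)
    freeze_time = sample_freeze_time_video(reference)
    canonical_3d, _per_frame = fit_canonical_3d(freeze_time)
    temporal_deformation = fit_temporal_deformation(canonical_3d, reference)
    return canonical_3d, temporal_deformation


if __name__ == "__main__":
    scene, motion = text_to_4d("a corgi splashing through a puddle")  # example prompt, not from the paper
    print(scene, f"{len(motion)} time steps")
```

The structural point the abstract makes is captured by this ordering: static geometry and appearance are fitted from the freeze-time video first, and motion is then added as a deformation of that canonical scene, rather than fitting a 4D representation to the dynamic video directly.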