Vid3D: Synthesis of Dynamic 3D Scenes using 2D Video Diffusion
June 17, 2024
Authors: Rishab Parthasarathy, Zack Ankner, Aaron Gokaslan
cs.AI
Abstract
A recent frontier in computer vision has been the task of 3D video
generation, which consists of generating a time-varying 3D representation of a
scene. To generate dynamic 3D scenes, current methods explicitly model 3D
temporal dynamics by jointly optimizing for consistency across both time and
views of the scene. In this paper, we instead investigate whether it is
necessary to explicitly enforce multiview consistency over time, as current
approaches do, or if it is sufficient for a model to generate 3D
representations of each timestep independently. We hence propose a model,
Vid3D, that leverages 2D video diffusion to generate 3D videos by first
generating a 2D "seed" of the video's temporal dynamics and then independently
generating a 3D representation for each timestep in the seed video. We evaluate
Vid3D against two state-of-the-art 3D video generation methods and find that
Vid3D achieves comparable results despite not explicitly modeling 3D
temporal dynamics. We further ablate how the quality of Vid3D depends on the
number of views generated per frame. While we observe some degradation with
fewer views, performance degradation remains minor. Our results thus suggest
that 3D temporal knowledge may not be necessary to generate high-quality
dynamic 3D scenes, potentially enabling simpler generative algorithms for this
task.
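As a rough illustration of the two-stage procedure the abstract describes, the sketch below wires together a 2D video diffusion stage and an independent per-frame 3D lifting stage. The callables (`seed_video_model`, `multiview_model`, `reconstruct_3d`) and their signatures are hypothetical placeholders, not the paper's actual interface; this is a minimal sketch of the pipeline structure, assuming suitable models are supplied by the caller.

```python
from typing import Any, Callable, List


def vid3d_pipeline(
    prompt: str,
    seed_video_model: Callable[[str], List[Any]],      # hypothetical: 2D video diffusion, prompt -> seed frames
    multiview_model: Callable[[Any, int], List[Any]],  # hypothetical: one frame -> n views of that frame
    reconstruct_3d: Callable[[List[Any]], Any],        # hypothetical: views -> 3D representation of that timestep
    views_per_frame: int = 8,
) -> List[Any]:
    """Return one 3D representation per timestep of the seed video.

    Stage 1: a 2D video diffusion model generates a "seed" video that
    carries the scene's temporal dynamics.
    Stage 2: each seed frame is lifted to 3D independently; no multiview
    consistency is enforced across timesteps.
    """
    seed_frames = seed_video_model(prompt)

    scenes = []
    for frame in seed_frames:
        views = multiview_model(frame, views_per_frame)
        scenes.append(reconstruct_3d(views))
    return scenes
```

The `views_per_frame` argument corresponds to the quantity varied in the paper's ablation: fewer views per frame degrade quality somewhat, but the reported degradation is minor.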