Vivid-ZOO: Multi-View Video Generation with Diffusion Model

Abstract

Although diffusion models have shown impressive performance in 2D image and video generation, diffusion-based Text-to-Multi-view-Video (T2MVid) generation remains underexplored. T2MVid generation poses two new challenges: the lack of large-scale captioned multi-view video data and the complexity of modeling such a multi-dimensional distribution. To this end, we propose a novel diffusion-based pipeline that generates, from text, high-quality multi-view videos centered on a dynamic 3D object. Specifically, we factor the T2MVid problem into viewpoint-space and time components. This factorization allows us to combine and reuse layers of advanced pre-trained multi-view image and 2D video diffusion models, ensuring multi-view consistency and temporal coherence in the generated multi-view videos while greatly reducing the training cost. We further introduce alignment modules that align the latent spaces of layers from the pre-trained multi-view and 2D video diffusion models, addressing the incompatibility of the reused layers that arises from the domain gap between 2D and multi-view data. To support this and future research, we also contribute a captioned multi-view video dataset. Experimental results demonstrate that our method generates high-quality multi-view videos that exhibit vivid motions, temporal coherence, and multi-view consistency across a variety of text prompts.
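The factorization into viewpoint-space and time components can be pictured with a toy sketch. The abstract gives no architectural details, so everything below is an illustrative assumption rather than the paper's implementation: latents are reduced to one scalar per (view, frame) pair, `mix_with_mean` is a hypothetical stand-in for a reused pre-trained layer, and `make_affine` is a hypothetical stand-in for an alignment module.

```python
# Toy sketch only: all names and operations here are illustrative assumptions,
# not the layers or modules used in the paper.
from typing import Callable, List

Grid = List[List[float]]  # grid[v][t]: one scalar "latent" per view v and frame t
SeqLayer = Callable[[List[float]], List[float]]


def apply_across_views(grid: Grid, layer: SeqLayer) -> Grid:
    """Run `layer` over the views of each frame independently (view-space component)."""
    n_views, n_frames = len(grid), len(grid[0])
    out = [[0.0] * n_frames for _ in range(n_views)]
    for t in range(n_frames):
        column = layer([grid[v][t] for v in range(n_views)])
        for v in range(n_views):
            out[v][t] = column[v]
    return out


def apply_across_frames(grid: Grid, layer: SeqLayer) -> Grid:
    """Run `layer` over the frames of each view independently (time component)."""
    return [layer(list(row)) for row in grid]


def mix_with_mean(xs: List[float], weight: float = 0.5) -> List[float]:
    """Stand-in for a pre-trained layer: pull each entry toward the sequence mean."""
    mean = sum(xs) / len(xs)
    return [(1 - weight) * x + weight * mean for x in xs]


def make_affine(scale: float, shift: float) -> SeqLayer:
    """Stand-in for an alignment module: an elementwise affine map."""
    return lambda xs: [scale * x + shift for x in xs]


def factorized_block(grid: Grid, align_in: SeqLayer, align_out: SeqLayer) -> Grid:
    # 1) View-space layer (in the paper's setting, reused from a multi-view image model).
    grid = apply_across_views(grid, mix_with_mean)
    # 2) Temporal layer (reused from a 2D video model), wrapped by alignment maps
    #    that move latents into and back out of the temporal layer's space.
    def temporal(xs: List[float]) -> List[float]:
        return align_out(mix_with_mean(align_in(xs)))
    return apply_across_frames(grid, temporal)
```

In this sketch, the view-space step mixes information only across views at a fixed frame, and the temporal step mixes only across frames at a fixed view, so each reused layer sees data in the layout it was originally trained on. The affine wrappers mark where alignment would happen; in a real model they would be learned.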
