Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising
May 29, 2023
Authors: Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, Hongsheng Li
cs.AI
Abstract
Leveraging large-scale image-text datasets and advancements in diffusion
models, text-driven generative models have made remarkable strides in the field
of image generation and editing. This study explores the potential of extending
the text-driven ability to the generation and editing of multi-text conditioned
long videos. Current methodologies for video generation and editing, while
innovative, are often confined to extremely short videos (typically less than
24 frames) and are limited to a single text condition. These constraints
significantly limit their applications given that real-world videos usually
consist of multiple segments, each bearing different semantic information. To
address this challenge, we introduce a novel paradigm dubbed Gen-L-Video,
capable of extending off-the-shelf short video diffusion models for generating
and editing videos comprising hundreds of frames with diverse semantic segments
without introducing additional training, all while preserving content
consistency. We have implemented three mainstream text-driven video generation
and editing methodologies and extended them with our proposed paradigm to
accommodate longer videos with diverse semantic segments. Our
experimental outcomes reveal that our approach significantly broadens the
generative and editing capabilities of video diffusion models, offering new
possibilities for future research and applications. The code is available at
https://github.com/G-U-N/Gen-L-Video.
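
The abstract does not spell out the mechanism, but the temporal co-denoising idea can be sketched roughly as follows: the long latent video is split into overlapping short clips, each clip is denoised by an off-the-shelf short video diffusion model under its own text prompt, and the predictions for overlapping frames are fused at every denoising step. The sketch below is a minimal illustration of that scheme under these assumptions, not the actual Gen-L-Video implementation; `denoise_clip`, `clip_len`, and `stride` are hypothetical, illustrative names, and the uniform averaging here stands in for the paper's weighted fusion of overlapping predictions.

```python
# Minimal sketch of temporal co-denoising with a short-clip diffusion model.
# `denoise_clip` is a hypothetical stand-in for one denoising step of an
# off-the-shelf short video model (clip latents + prompt -> denoised latents);
# it is NOT the actual Gen-L-Video API.
import torch

def co_denoise_step(latents, prompts, denoise_clip, clip_len=16, stride=8):
    """One co-denoising step over a long latent video of shape (F, C, H, W).

    The video is split into overlapping clips of `clip_len` frames taken
    every `stride` frames; each clip is denoised under its own text prompt,
    and predictions for overlapping frames are fused by averaging.
    """
    num_frames = latents.shape[0]
    assert num_frames >= clip_len, "long video must cover at least one clip"
    fused = torch.zeros_like(latents)
    counts = torch.zeros(num_frames, device=latents.device)

    starts = list(range(0, num_frames - clip_len + 1, stride))
    if starts[-1] != num_frames - clip_len:
        starts.append(num_frames - clip_len)  # make sure the tail is covered

    for i, s in enumerate(starts):
        clip = latents[s : s + clip_len]
        # Multi-text conditioning: each clip can carry its own semantics.
        pred = denoise_clip(clip, prompts[min(i, len(prompts) - 1)])
        fused[s : s + clip_len] += pred
        counts[s : s + clip_len] += 1

    # Average the per-clip predictions wherever clips overlap.
    return fused / counts.view(-1, 1, 1, 1)
```

Because every frame in an overlap region is constrained by two or more clips, the averaged prediction ties adjacent clips together at each step; this is what lets different prompts govern different segments while the full video stays temporally coherent, without retraining the short-clip model.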