Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising
May 29, 2023
Authors: Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, Hongsheng Li
cs.AI
Abstract
Leveraging large-scale image-text datasets and advancements in diffusion
models, text-driven generative models have made remarkable strides in the field
of image generation and editing. This study explores the potential of extending
the text-driven ability to the generation and editing of multi-text conditioned
long videos. Current methodologies for video generation and editing, while
innovative, are often confined to extremely short videos (typically less than
24 frames) and are limited to a single text condition. These constraints
significantly limit their applications given that real-world videos usually
consist of multiple segments, each bearing different semantic information. To
address this challenge, we introduce a novel paradigm dubbed Gen-L-Video,
capable of extending off-the-shelf short video diffusion models for generating
and editing videos comprising hundreds of frames with diverse semantic segments
without introducing additional training, all while preserving content
consistency. We have implemented three mainstream text-driven video generation
and editing methodologies and extended them with our proposed paradigm to
accommodate longer videos with diverse semantic segments. Our
experimental outcomes reveal that our approach significantly broadens the
generative and editing capabilities of video diffusion models, offering new
possibilities for future research and applications. The code is available at
https://github.com/G-U-N/Gen-L-Video.
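
The abstract does not spell out the mechanism, but the temporal co-denoising idea can be sketched roughly as follows: the long latent video is split into overlapping short clips, each clip is denoised by an off-the-shelf short video diffusion model under its own text prompt, and the predictions for overlapping frames are fused at every denoising step. The sketch below is a minimal illustration of that scheme under these assumptions, not the actual Gen-L-Video implementation; `denoise_clip`, `clip_len`, and `stride` are hypothetical, illustrative names, and the uniform averaging here stands in for the paper's weighted fusion of overlapping predictions.

```python
# Minimal sketch of temporal co-denoising with a short-clip diffusion model.
# `denoise_clip` is a hypothetical stand-in for one denoising step of an
# off-the-shelf short video model (clip latents + prompt -> denoised latents);
# it is NOT the actual Gen-L-Video API.
import torch

def co_denoise_step(latents, prompts, denoise_clip, clip_len=16, stride=8):
    """One co-denoising step over a long latent video of shape (F, C, H, W).

    The video is split into overlapping clips of `clip_len` frames taken
    every `stride` frames; each clip is denoised under its own text prompt,
    and predictions for overlapping frames are fused by averaging.
    """
    num_frames = latents.shape[0]
    assert num_frames >= clip_len, "long video must cover at least one clip"
    fused = torch.zeros_like(latents)
    counts = torch.zeros(num_frames, device=latents.device)

    starts = list(range(0, num_frames - clip_len + 1, stride))
    if starts[-1] != num_frames - clip_len:
        starts.append(num_frames - clip_len)  # make sure the tail is covered

    for i, s in enumerate(starts):
        clip = latents[s : s + clip_len]
        # Multi-text conditioning: each clip can carry its own semantics.
        pred = denoise_clip(clip, prompts[min(i, len(prompts) - 1)])
        fused[s : s + clip_len] += pred
        counts[s : s + clip_len] += 1

    # Average the per-clip predictions wherever clips overlap.
    return fused / counts.view(-1, 1, 1, 1)
```

Because every frame in an overlap region is constrained by two or more clips, the averaged prediction ties adjacent clips together at each step; this is what lets different prompts govern different segments while the full video stays temporally coherent, without retraining the short-clip model.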