Gen-L-Video: Multi-Text to Long Video Generation via Temporal Co-Denoising

May 29, 2023
Authors: Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, Hongsheng Li
cs.AI

Abstract

Leveraging large-scale image-text datasets and advancements in diffusion models, text-driven generative models have made remarkable strides in the field of image generation and editing. This study explores the potential of extending the text-driven ability to the generation and editing of multi-text conditioned long videos. Current methodologies for video generation and editing, while innovative, are often confined to extremely short videos (typically fewer than 24 frames) and are limited to a single text condition. These constraints significantly limit their applications, given that real-world videos usually consist of multiple segments, each bearing different semantic information. To address this challenge, we introduce a novel paradigm dubbed Gen-L-Video, capable of extending off-the-shelf short video diffusion models to generate and edit videos comprising hundreds of frames with diverse semantic segments, without additional training and while preserving content consistency. We have implemented three mainstream text-driven video generation and editing methodologies and extended them with our proposed paradigm to accommodate longer videos containing a variety of semantic segments. Our experimental outcomes reveal that our approach significantly broadens the generative and editing capabilities of video diffusion models, offering new possibilities for future research and applications. The code is available at https://github.com/G-U-N/Gen-L-Video.
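To make the temporal co-denoising idea concrete, the sketch below splits a long latent sequence into overlapping short clips, lets a short-video denoiser process each clip under its own text prompt, and averages the overlapping noise predictions so that neighbouring clips stay consistent. This is a minimal illustration under stated assumptions, not the authors' implementation (see the linked repository for that): the `co_denoise_step` name, the `short_model` callable and its signature, and the `window`/`stride` parameters are hypothetical stand-ins.

```python
import torch


def co_denoise_step(latents, prompts, short_model, t, window=16, stride=8):
    """One denoising step of a hypothetical temporal co-denoising loop.

    latents:     tensor of shape (F, C, H, W) holding the noisy latent frames
                 of the long video at timestep t
    prompts:     list of text prompts, one per overlapping short clip
    short_model: any off-the-shelf short-video denoiser, assumed (for this
                 sketch) to be callable as short_model(clip, prompt, t) -> noise
    """
    num_frames = latents.shape[0]
    noise_sum = torch.zeros_like(latents)
    weight = torch.zeros(num_frames, 1, 1, 1, device=latents.device)

    # Overlapping clip start indices; append a final start so the tail
    # of the video is always covered by at least one clip.
    last_start = max(num_frames - window, 0)
    starts = list(range(0, last_start + 1, stride))
    if starts[-1] != last_start:
        starts.append(last_start)

    for i, start in enumerate(starts):
        clip = latents[start:start + window]         # short overlapping clip
        prompt = prompts[min(i, len(prompts) - 1)]   # per-segment text condition
        eps = short_model(clip, prompt, t)           # short-video noise prediction
        noise_sum[start:start + window] += eps
        weight[start:start + window] += 1.0

    # Frames covered by several clips receive the average of their predictions,
    # which ties neighbouring segments together and keeps content consistent.
    return noise_sum / weight
```

In a full sampler, this merged prediction would replace the single-clip noise estimate inside each diffusion step, with the scheduler update then applied to the whole long latent sequence at once.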