

FreeNoise: Tuning-Free Longer Video Diffusion Via Noise Rescheduling

October 23, 2023
Authors: Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, Ziwei Liu
cs.AI

Abstract

With the availability of large-scale video datasets and the advances of diffusion models, text-driven video generation has achieved substantial progress. However, existing video generation models are typically trained on a limited number of frames, resulting in the inability to generate high-fidelity long videos during inference. Furthermore, these models support only single-text conditions, whereas real-life scenarios often require multi-text conditions because the video content changes over time. To tackle these challenges, this study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts. 1) We first analyze the impact of initial noise in video diffusion models. Building upon this observation of noise, we propose FreeNoise, a tuning-free and time-efficient paradigm that enhances the generative capabilities of pretrained video diffusion models while preserving content consistency. Specifically, instead of initializing independent noise for all frames, we reschedule a sequence of noises to achieve long-range correlation and perform temporal attention over them via a window-based function. 2) Additionally, we design a novel motion injection method to support the generation of videos conditioned on multiple text prompts. Extensive experiments validate the superiority of our paradigm in extending the generative capabilities of video diffusion models. Notably, whereas the previous best-performing method incurred about 255% extra time cost, our method incurs only a negligible time cost of approximately 17%. Generated video samples are available at our website: http://haonanqiu.com/projects/FreeNoise.html.
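To make the noise-rescheduling idea concrete, here is a minimal sketch (not the authors' released code) of how per-frame initial noise could be extended beyond the training window: noise is sampled once for the frame window the model was trained on, and the remaining frames reuse shuffled copies of those base noises so that distant frames stay correlated. The function name `reschedule_noise` and parameters such as `base_frames` and `stride` are hypothetical and chosen only for illustration.

```python
import torch

def reschedule_noise(base_frames: int, total_frames: int, shape, stride: int = 4,
                     generator=None) -> torch.Tensor:
    """Return per-frame initial noise of shape (total_frames, *shape).

    Hypothetical illustration of noise rescheduling: only the first
    `base_frames` noises are sampled independently; later frames reuse
    locally shuffled copies of them to keep long-range correlation.
    """
    # Sample independent noise only for the window size the model was trained on.
    base_noise = torch.randn(base_frames, *shape, generator=generator)
    frames = [base_noise]
    # Extend the sequence chunk by chunk with shuffled base noises.
    while sum(f.shape[0] for f in frames) < total_frames:
        perm = torch.randperm(base_frames, generator=generator)[:stride]
        frames.append(base_noise[perm])
    return torch.cat(frames, dim=0)[:total_frames]

# Example: extend a model trained on 16 frames to 64 frames of 4x64x64 latents.
noise = reschedule_noise(base_frames=16, total_frames=64, shape=(4, 64, 64))
print(noise.shape)  # torch.Size([64, 4, 64, 64])
```

In the paper's paradigm, denoising with such rescheduled noise is paired with window-based temporal attention over overlapping frame windows, so the pretrained model never attends over more frames than it was trained on.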