FreeNoise: Tuning-Free Longer Video Diffusion Via Noise Rescheduling
October 23, 2023
Authors: Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, Ziwei Liu
cs.AI
Abstract
With the availability of large-scale video datasets and the advances of
diffusion models, text-driven video generation has achieved substantial
progress. However, existing video generation models are typically trained on a
limited number of frames, resulting in the inability to generate high-fidelity
long videos during inference. Furthermore, these models only support
single-text conditions, whereas real-life scenarios often require multi-text
conditions as the video content changes over time. To tackle these challenges,
this study explores the potential of extending the text-driven capability to
generate longer videos conditioned on multiple texts. 1) We first analyze the
impact of initial noise in video diffusion models. Building on this
observation, we propose FreeNoise, a tuning-free and time-efficient
paradigm to enhance the generative capabilities of pretrained video diffusion
models while preserving content consistency. Specifically, instead of
initializing noises for all frames, we reschedule a sequence of noises for
long-range correlation and perform temporal attention over them via a
window-based function. 2) Additionally, we design a novel motion injection
method to support
the generation of videos conditioned on multiple text prompts. Extensive
experiments validate the superiority of our paradigm in extending the
generative capabilities of video diffusion models. Notably, whereas the
previous best-performing method incurred about 255% extra time cost, our
method incurs only a negligible time cost of approximately 17%. Generated
video samples are available at our website:
http://haonanqiu.com/projects/FreeNoise.html.
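
The two mechanisms the abstract names, rescheduling a fixed set of initial noise frames so distant frames stay correlated, and running the pretrained temporal attention over sliding windows, can be illustrated with a short sketch. This is a minimal, hypothetical illustration rather than the authors' released code: the function names, the local-shuffle rescheduling strategy, and the window/stride defaults are assumptions, and `attn_fn` stands in for a pretrained temporal-attention module trained on `window` frames.

```python
import torch


def reschedule_noise(base_noise: torch.Tensor, target_frames: int,
                     shuffle_stride: int = 4) -> torch.Tensor:
    """Extend noise of shape (F, C, H, W) to `target_frames` frames by
    repeating the original F noise frames with a local shuffle, so distant
    frames reuse correlated noise instead of fresh i.i.d. samples."""
    f = base_noise.shape[0]
    frames = [base_noise]
    while sum(x.shape[0] for x in frames) < target_frames:
        # Shuffle within small chunks: breaks exact repetition while keeping
        # the noise content of the base sequence.
        perm = []
        for i in range(0, f, shuffle_stride):
            chunk = torch.arange(i, min(i + shuffle_stride, f))
            perm.append(chunk[torch.randperm(len(chunk))])
        frames.append(base_noise[torch.cat(perm)])
    return torch.cat(frames, dim=0)[:target_frames]


def windowed_temporal_attention(latents: torch.Tensor, attn_fn,
                                window: int = 16,
                                stride: int = 4) -> torch.Tensor:
    """Apply `attn_fn` over overlapping sliding windows along the frame axis
    and average the outputs where windows overlap (assumes >= `window` frames)."""
    total = latents.shape[0]
    out = torch.zeros_like(latents)
    count = torch.zeros(total, device=latents.device,
                        dtype=latents.dtype).view(total, *([1] * (latents.dim() - 1)))
    starts = list(range(0, total - window + 1, stride))
    if starts[-1] != total - window:   # make sure the last frames are covered
        starts.append(total - window)
    for s in starts:
        sl = slice(s, s + window)
        out[sl] += attn_fn(latents[sl])
        count[sl] += 1
    return out / count
```

Under these assumptions, a longer video would be denoised starting from `reschedule_noise(noise, target_frames)` while each temporal-attention layer of the pretrained model is wrapped by `windowed_temporal_attention`, which is what allows the frame count to exceed the training length without fine-tuning.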