
FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention

July 29, 2024
Authors: Yu Lu, Yuanzhi Liang, Linchao Zhu, Yi Yang
cs.AI

Abstract

Video diffusion models have made substantial progress in various video generation applications. However, training models for long video generation tasks requires significant computational and data resources, posing a challenge to developing long video diffusion models. This paper investigates a straightforward and training-free approach to extend an existing short video diffusion model (e.g., pre-trained on 16-frame videos) for consistent long video generation (e.g., 128 frames). Our preliminary observations show that directly applying the short video diffusion model to generate long videos can lead to severe video quality degradation. Further investigation reveals that this degradation is primarily due to the distortion of high-frequency components in long videos, characterized by a decrease in spatial high-frequency components and an increase in temporal high-frequency components. Motivated by this, we propose a novel solution named FreeLong to balance the frequency distribution of long video features during the denoising process. FreeLong blends the low-frequency components of global video features, which encapsulate the entire video sequence, with the high-frequency components of local video features that focus on shorter subsequences of frames. This approach maintains global consistency while incorporating diverse and high-quality spatiotemporal details from local videos, enhancing both the consistency and fidelity of long video generation. We evaluated FreeLong on multiple base video diffusion models and observed significant improvements. Additionally, our method supports coherent multi-prompt generation, ensuring both visual coherence and seamless transitions between scenes.
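
The frequency blending described in the abstract can be sketched with a 3D FFT over the spatiotemporal feature volume. The snippet below is a minimal illustration, not the authors' implementation: the [T, H, W, C] tensor layout, the `cutoff` threshold, and the spherical low-pass mask are assumptions made for clarity. In FreeLong this blend is applied inside the denoising network, where the global features come from attention over the full frame sequence and the local features from attention over shorter frame windows.

```python
import torch

def spectral_blend(global_feat, local_feat, cutoff=0.25):
    """Sketch of FreeLong-style spectral blending: keep the low
    frequencies of the global branch and the high frequencies of the
    local branch.

    global_feat, local_feat: [T, H, W, C] video feature tensors
    cutoff: hypothetical normalized frequency radius for the low-pass
            mask; the paper's exact filter design may differ.
    """
    dims = (0, 1, 2)  # temporal and spatial axes
    g = torch.fft.fftn(global_feat, dim=dims)
    loc = torch.fft.fftn(local_feat, dim=dims)

    # Build a low-pass mask over the 3D frequency volume.
    T, H, W, _ = global_feat.shape
    ft = torch.fft.fftfreq(T)  # normalized frequencies in [-0.5, 0.5)
    fh = torch.fft.fftfreq(H)
    fw = torch.fft.fftfreq(W)
    grid = torch.stack(torch.meshgrid(ft, fh, fw, indexing="ij"), dim=-1)
    lowpass = (grid.norm(dim=-1) <= cutoff).unsqueeze(-1).to(g.dtype)

    # Low frequencies from the global branch preserve long-range
    # consistency; high frequencies from the local branch restore
    # fine spatiotemporal detail.
    blended = g * lowpass + loc * (1 - lowpass)
    return torch.fft.ifftn(blended, dim=dims).real
```

Because the blend operates purely on intermediate features, it requires no retraining: the same pre-trained short-video model produces both branches, and the filter simply recombines their frequency content during each denoising step.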
