FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention
July 29, 2024
Authors: Yu Lu, Yuanzhi Liang, Linchao Zhu, Yi Yang
cs.AI
Abstract
Video diffusion models have made substantial progress in various video
generation applications. However, training models for long video generation
tasks requires significant computational and data resources, posing a challenge
to developing long video diffusion models. This paper investigates a
straightforward and training-free approach to extend an existing short video
diffusion model (e.g. pre-trained on 16-frame videos) for consistent long video
generation (e.g. 128 frames). Our preliminary observations show that
directly applying the short video diffusion model to generate long videos can
lead to severe video quality degradation. Further investigation reveals that
this degradation is primarily due to the distortion of high-frequency
components in long videos, characterized by a decrease in spatial
high-frequency components and an increase in temporal high-frequency
components. Motivated by this, we propose a novel solution named FreeLong to
balance the frequency distribution of long video features during the denoising
process. FreeLong blends the low-frequency components of global video features,
which encapsulate the entire video sequence, with the high-frequency components
of local video features that focus on shorter subsequences of frames. This
approach maintains global consistency while incorporating diverse and
high-quality spatiotemporal details from local videos, enhancing both the
consistency and fidelity of long video generation. We evaluated FreeLong on
multiple base video diffusion models and observed significant improvements.
Additionally, our method supports coherent multi-prompt generation, ensuring
both visual coherence and seamless transitions between scenes.
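The abstract describes SpectralBlend as a frequency-domain fusion of two feature streams during denoising: low-frequency components taken from global attention over the whole sequence, high-frequency components from local attention over short windows. The paper's exact filter design and tensor layout are not given in the abstract, so the following PyTorch sketch is only an illustration of the idea; the function name `spectral_blend`, the `(B, C, T, H, W)` layout, and the `cutoff` hyperparameter are assumptions, not the authors' API.

```python
import torch

def spectral_blend(global_feat: torch.Tensor,
                   local_feat: torch.Tensor,
                   cutoff: float = 0.25) -> torch.Tensor:
    """Fuse low frequencies of global features with high frequencies of
    local features via a 3D FFT over (time, height, width).

    global_feat, local_feat: real tensors of shape (B, C, T, H, W).
    cutoff: normalized frequency below which the global branch dominates
            (hypothetical hyperparameter; the paper's exact filter is
            not specified in the abstract).
    """
    dims = (-3, -2, -1)  # temporal and spatial axes
    g_freq = torch.fft.fftn(global_feat, dim=dims)
    l_freq = torch.fft.fftn(local_feat, dim=dims)

    # Build a separable binary low-pass mask around the zero frequency.
    T, H, W = global_feat.shape[-3:]
    def axis_mask(n: int) -> torch.Tensor:
        freqs = torch.fft.fftfreq(n)  # normalized frequencies in [-0.5, 0.5)
        return (freqs.abs() <= cutoff).float()
    mask = (
        axis_mask(T)[:, None, None]
        * axis_mask(H)[None, :, None]
        * axis_mask(W)[None, None, :]
    ).to(global_feat.device)

    # Low frequencies from the global branch, high frequencies from the local.
    blended = g_freq * mask + l_freq * (1.0 - mask)
    return torch.fft.ifftn(blended, dim=dims).real
```

In the full method, a blend of this kind would be applied to video features inside the temporal attention layers at each denoising step, with the local branch computed over short frame windows matching the pretrained model's native length (e.g. 16 frames) so that it retains the short-video model's high-frequency detail.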