
FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention

July 29, 2024
Authors: Yu Lu, Yuanzhi Liang, Linchao Zhu, Yi Yang
cs.AI

Abstract

Video diffusion models have made substantial progress in various video generation applications. However, training models for long video generation tasks requires significant computational and data resources, posing a challenge to developing long video diffusion models. This paper investigates a straightforward and training-free approach to extend an existing short video diffusion model (e.g., pre-trained on 16-frame videos) for consistent long video generation (e.g., 128 frames). Our preliminary observations show that directly applying the short video diffusion model to generate long videos can lead to severe video quality degradation. Further investigation reveals that this degradation is primarily due to the distortion of high-frequency components in long videos, characterized by a decrease in spatial high-frequency components and an increase in temporal high-frequency components. Motivated by this, we propose a novel solution named FreeLong to balance the frequency distribution of long video features during the denoising process. FreeLong blends the low-frequency components of global video features, which encapsulate the entire video sequence, with the high-frequency components of local video features that focus on shorter subsequences of frames. This approach maintains global consistency while incorporating diverse and high-quality spatiotemporal details from local videos, enhancing both the consistency and fidelity of long video generation. We evaluated FreeLong on multiple base video diffusion models and observed significant improvements. Additionally, our method supports coherent multi-prompt generation, ensuring both visual coherence and seamless transitions between scenes.
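The core operation the abstract describes — keeping the low temporal frequencies of a global feature stream and the high temporal frequencies of a local one — can be sketched with a simple FFT-based blend. This is a minimal illustration of the idea, not the paper's implementation: the `cutoff` hyperparameter and the `(frames, channels)` feature layout are assumptions for the sketch.

```python
import numpy as np

def spectral_blend(global_feat, local_feat, cutoff=0.25):
    """Blend low-frequency components of global features with
    high-frequency components of local features along the temporal axis.

    A simplified sketch of the SpectralBlend idea. `cutoff` is the
    fraction of the temporal frequency band kept from the global branch
    (an assumed hyperparameter, not a value from the paper).
    Both inputs are arrays of shape (frames, channels).
    """
    n_frames = global_feat.shape[0]

    # Temporal FFT of both feature streams.
    g_freq = np.fft.fft(global_feat, axis=0)
    l_freq = np.fft.fft(local_feat, axis=0)

    # Low-pass mask over signed temporal frequencies.
    freqs = np.fft.fftfreq(n_frames)
    low_mask = (np.abs(freqs) <= cutoff)[:, None]

    # Low frequencies from the global branch, high from the local one.
    blended = g_freq * low_mask + l_freq * (1 - low_mask)
    return np.fft.ifft(blended, axis=0).real
```

In the paper this blending is applied to attention features inside the denoising network; the sketch above only shows the frequency-domain mixing step itself.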
