FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention

Abstract

Video diffusion models have made substantial progress in various video generation applications. However, training models for long video generation requires significant computational and data resources, which makes long video diffusion models difficult to develop. This paper investigates a straightforward, training-free approach to extending an existing short video diffusion model (e.g., one pre-trained on 16-frame videos) to consistent long video generation (e.g., 128 frames). Our preliminary observations show that directly applying the short video diffusion model to generate long videos can lead to severe degradation in video quality. Further investigation reveals that this degradation is primarily due to the distortion of high-frequency components in long videos, characterized by a decrease in spatial high-frequency components and an increase in temporal high-frequency components. Motivated by this, we propose a novel solution named FreeLong to balance the frequency distribution of long video features during the denoising process. FreeLong blends the low-frequency components of global video features, which encapsulate the entire video sequence, with the high-frequency components of local video features, which focus on shorter subsequences of frames. This approach maintains global consistency while incorporating diverse and high-quality spatiotemporal details from local videos, enhancing both the consistency and fidelity of long video generation. We evaluated FreeLong on multiple base video diffusion models and observed significant improvements. Additionally, our method supports coherent multi-prompt generation, ensuring both visual coherence and seamless transitions between scenes.
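The blending step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the ideal radial low-pass filter, the `cutoff` value, and the (frames, height, width) tensor layout are all assumptions chosen for illustration, and the sketch takes the global and local feature tensors as given rather than computing them with temporal attention.

```python
import numpy as np


def low_pass_mask(shape, cutoff=0.25):
    """Boolean mask selecting frequencies with normalized radius <= cutoff.

    Hypothetical helper: an ideal (hard) low-pass filter over all axes.
    np.fft.fftfreq returns frequencies in cycles per sample, in [-0.5, 0.5).
    """
    grids = np.meshgrid(*[np.fft.fftfreq(n) for n in shape], indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grids))
    return radius <= cutoff


def spectral_blend(global_feat, local_feat, cutoff=0.25):
    """Combine low frequencies of global_feat with high frequencies of local_feat.

    Both inputs are real arrays of identical shape, assumed here to be
    (frames, height, width). The cutoff is an illustrative assumption.
    """
    if global_feat.shape != local_feat.shape:
        raise ValueError("global and local features must have the same shape")
    mask = low_pass_mask(global_feat.shape, cutoff)
    global_spec = np.fft.fftn(global_feat)
    local_spec = np.fft.fftn(local_feat)
    # Inside the mask keep the global spectrum, outside keep the local one.
    blended_spec = np.where(mask, global_spec, local_spec)
    # The mask is symmetric in frequency, so the result is real up to rounding.
    return np.fft.ifftn(blended_spec).real


# Usage: a constant "global" signal plus a frame-to-frame flicker "local" signal.
global_feat = np.full((8, 4, 4), 3.0)
local_feat = np.broadcast_to(
    np.array([1.0, -1.0] * 4)[:, None, None], (8, 4, 4)
).copy()
blended = spectral_blend(global_feat, local_feat)
```

In this toy case the global tensor contains only the zero-frequency component and the local tensor contains only the highest temporal frequency, so the blended output keeps both unchanged. A hard mask is used only for brevity; a smoother filter would be an equally valid choice under the same assumptions.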
