FreeNoise: Tuning-Free Longer Video Diffusion Via Noise Rescheduling

Abstract

With the availability of large-scale video datasets and advances in diffusion models, text-driven video generation has made substantial progress. However, existing video generation models are typically trained on a limited number of frames, so they cannot generate high-fidelity long videos at inference time. Furthermore, these models support only a single text condition, whereas real-life scenarios often require multiple text conditions because the video content changes over time. To tackle these challenges, this study explores the potential of extending the text-driven capability to generate longer videos conditioned on multiple texts. 1) We first analyze the impact of initial noise in video diffusion models. Building on this observation, we propose FreeNoise, a tuning-free and time-efficient paradigm that enhances the generative capabilities of pretrained video diffusion models while preserving content consistency. Specifically, instead of initializing noises for all frames, we reschedule a sequence of noises for long-range correlation and perform temporal attention over them using a window-based function. 2) Additionally, we design a novel motion injection method to support the generation of videos conditioned on multiple text prompts. Extensive experiments validate the superiority of our paradigm in extending the generative capabilities of video diffusion models. Notably, whereas the previous best-performing method incurred about 255% extra time cost, our method incurs only a negligible extra time cost of approximately 17%. Generated video samples are available at our website: http://haonanqiu.com/projects/FreeNoise.html.
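The abstract describes noise rescheduling and window-based temporal attention only at a high level. The sketch below is an illustration of one way such a scheme could look, not the authors' implementation. The local-shuffle strategy, the function names `reschedule_noise` and `sliding_windows`, and all parameters (`base_len`, `window`, `stride`, `frame_dim`) are assumptions introduced here. It draws independent noise for only the first `base_len` frames, fills the remaining frames by reusing those noises shuffled within small local windows, and lists the overlapping frame windows over which temporal attention could be computed.

```python
import random


def reschedule_noise(num_frames, base_len, window, frame_dim, seed=0):
    """Hypothetical sketch: build noise for `num_frames` frames from only
    `base_len` independent Gaussian samples (one vector per frame).

    Frames beyond `base_len` reuse the base noises, shuffled inside local
    windows of size `window`, so distant frames stay correlated.
    Returns (noise_per_frame, source_index_per_frame).
    """
    rng = random.Random(seed)
    # Independent noise for the first base_len frames only.
    base = [[rng.gauss(0.0, 1.0) for _ in range(frame_dim)]
            for _ in range(base_len)]

    order = list(range(base_len))
    while len(order) < num_frames:
        # Repeat the base sequence, permuting indices within each local chunk.
        for start in range(0, base_len, window):
            chunk = list(range(start, min(start + window, base_len)))
            rng.shuffle(chunk)
            order.extend(chunk)
    order = order[:num_frames]
    return [list(base[i]) for i in order], order


def sliding_windows(num_frames, window, stride):
    """Hypothetical helper: (start, end) frame ranges for windowed attention.

    A final window is appended if the stride would leave tail frames uncovered.
    """
    starts = list(range(0, max(num_frames - window, 0) + 1, stride))
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, min(s + window, num_frames)) for s in starts]


if __name__ == "__main__":
    noise, order = reschedule_noise(num_frames=32, base_len=16,
                                    window=4, frame_dim=3)
    print(order)                       # source noise index for every frame
    print(sliding_windows(32, 16, 4))  # overlapping attention windows
```

In this sketch, a model trained on 16 frames would see, in every attention window, noise drawn from the same 16 base samples, which is one plausible reading of "rescheduling for long-range correlation". How overlapping window outputs are fused is left out because the abstract does not specify it.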
