ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning

Abstract

Recent advances in video synthesis have attracted significant attention. Video synthesis models such as AnimateDiff and Stable Video Diffusion have demonstrated the practical applicability of diffusion models for creating dynamic visual content, and the emergence of SORA has further highlighted the potential of video generation technologies. Nonetheless, extending video length has been constrained by limited computational resources, and most existing video synthesis models can only generate short clips. In this paper, we propose ExVideo, a novel post-tuning methodology for video synthesis models. The approach is designed to enhance the capability of current video synthesis models, allowing them to produce content over longer temporal durations at a lower training cost. In particular, we design an extension strategy for each of the common temporal model architectures: 3D convolution, temporal attention, and positional embedding. To evaluate the efficacy of the proposed post-tuning approach, we conduct extension training on the Stable Video Diffusion model. Our approach enables the model to generate up to 5× its original number of frames, requiring only 1.5k GPU hours of training on a dataset of 40k videos. Importantly, the substantial increase in video length does not compromise the model's innate generalization capabilities, and the model retains its advantages in generating videos of diverse styles and resolutions. We will release the source code and the enhanced model publicly.
