I2V-Adapter: A General Image-to-Video Adapter for Video Diffusion Models

Abstract

In the rapidly evolving domain of digital content generation, the focus has shifted from text-to-image (T2I) models to more advanced video diffusion models, notably text-to-video (T2V) and image-to-video (I2V). This paper addresses the central challenge of I2V: converting a static image into a dynamic, lifelike video sequence while preserving the fidelity of the original image. Traditional methods typically either integrate the entire image into the diffusion process or use a pretrained encoder for cross-attention. However, these approaches often require altering the fundamental weights of the T2I model, which restricts their reusability. We introduce I2V-Adapter, a novel solution designed to overcome these limitations. Our approach preserves the structural integrity of T2I models and their inherent motion modules. I2V-Adapter processes noised video frames in parallel with the input image through a lightweight adapter module. This module acts as a bridge that efficiently links the input to the model's self-attention mechanism, maintaining spatial details without requiring structural changes to the T2I model. Moreover, I2V-Adapter requires only a fraction of the parameters of conventional models and ensures compatibility with existing community-driven T2I models and controlling tools. Our experimental results demonstrate I2V-Adapter's capability to produce high-quality video outputs. This performance, together with its versatility and reduced number of trainable parameters, represents a substantial advancement in AI-driven video generation, particularly for creative applications.
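The bridging idea described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation, only one plausible reading of the abstract: the frozen self-attention runs unchanged on each noised frame, while a small adapter branch lets every frame query the input image's keys and values. The function names, the single-head layout, the adapter matrices `a_q` and `a_o`, and the zero initialization of `a_o` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention; k and v may broadcast over the frame axis.
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def adapted_self_attention(frames, image, w_q, w_k, w_v, w_o, a_q, a_o):
    """frames: (F, N, D) noised frame tokens; image: (N, D) input-image tokens.

    w_q, w_k, w_v, w_o stand in for the frozen T2I self-attention weights.
    a_q, a_o are hypothetical adapter weights (the only trainable part here).
    """
    # Frozen branch: ordinary self-attention within each frame.
    base = attention(frames @ w_q, frames @ w_k, frames @ w_v) @ w_o
    # Adapter branch: each frame attends to the input image's keys and values,
    # which are computed with the frozen key/value projections (an assumption).
    k_img, v_img = image @ w_k, image @ w_v
    extra = attention(frames @ a_q, k_img, v_img) @ a_o
    return base + extra
```

If `a_o` starts at zero, the adapter branch contributes nothing and the layer reproduces the frozen model's output exactly, which is one common way to attach an adapter without disturbing pretrained weights. Whether I2V-Adapter uses precisely this initialization is not stated in the abstract.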
