I2V-Adapter: A General Image-to-Video Adapter for Video Diffusion Models
December 27, 2023
Authors: Xun Guo, Mingwu Zheng, Liang Hou, Yuan Gao, Yufan Deng, Chongyang Ma, Weiming Hu, Zhengjun Zha, Haibin Huang, Pengfei Wan, Di Zhang
cs.AI
Abstract
In the rapidly evolving domain of digital content generation, the focus has
shifted from text-to-image (T2I) models to more advanced video diffusion
models, notably text-to-video (T2V) and image-to-video (I2V). This paper
addresses the intricate challenge posed by I2V: converting static images into
dynamic, lifelike video sequences while preserving the fidelity of the original image.
Traditional methods typically involve integrating entire images into diffusion
processes or using pretrained encoders for cross-attention. However, these
approaches often necessitate altering the fundamental weights of T2I models,
thereby restricting their reusability. We introduce a novel solution,
I2V-Adapter, designed to overcome these limitations. Our approach preserves the
structural integrity of T2I models and their inherent motion modules. The
I2V-Adapter operates by processing noised video frames in parallel with the
input image, utilizing a lightweight adapter module. This module acts as a
bridge, efficiently linking the input to the model's self-attention mechanism,
thus maintaining spatial details without requiring structural changes to the
T2I model. Moreover, I2V-Adapter requires only a fraction of the parameters of
conventional models and ensures compatibility with existing community-driven
T2I models and control tools. Our experimental results demonstrate
I2V-Adapter's capability to produce high-quality video outputs. This
performance, coupled with its versatility and reduced number of trainable
parameters, represents a substantial advancement in the field of AI-driven
video generation, particularly for creative applications.
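
The abstract describes the adapter as a lightweight branch that links each noised frame to the input image through the model's self-attention, training only a small new component while the T2I weights stay frozen. Below is a minimal PyTorch sketch of one plausible reading of that mechanism; the module name `I2VAdapterAttention`, the tensor shapes, and the zero-initialized output projection are illustrative assumptions drawn from the abstract, not the paper's verified implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class I2VAdapterAttention(nn.Module):
    """Sketch of an adapter branch in which each noised frame's queries
    attend to the input (first) frame's keys and values. The query/key/value
    projections are borrowed frozen from the pretrained T2I self-attention;
    only the zero-initialized output projection is trained (assumption)."""

    def __init__(self, dim: int):
        super().__init__()
        # Projections shared with (and frozen like) the T2I self-attention.
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        for proj in (self.to_q, self.to_k, self.to_v):
            proj.requires_grad_(False)
        # Trainable output projection, zero-initialized so the branch
        # contributes nothing at the start of training and the frozen
        # T2I behavior is reproduced exactly.
        self.to_out = nn.Linear(dim, dim, bias=False)
        nn.init.zeros_(self.to_out.weight)

    def forward(self, frame_tokens: torch.Tensor,
                image_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch * frames, tokens, dim) from the noised frames.
        # image_tokens: (batch * frames, tokens, dim), the input image's
        # tokens broadcast to every frame position.
        q = self.to_q(frame_tokens)
        k = self.to_k(image_tokens)
        v = self.to_v(image_tokens)
        out = F.scaled_dot_product_attention(q, k, v)
        # Returned as a residual, to be added to the output of the frozen
        # spatial self-attention at the same layer.
        return self.to_out(out)

# Hypothetical usage: 2 clips of 8 frames, 64 tokens per frame, dim 320.
adapter = I2VAdapterAttention(dim=320)
frames = torch.randn(2 * 8, 64, 320)                         # noised frame tokens
image = torch.randn(2, 64, 320).repeat_interleave(8, dim=0)  # input-image tokens
residual = adapter(frames, image)                            # add to self-attention output
```

Because the adapter only contributes a residual path on top of the frozen self-attention, the pretrained T2I weights and any community checkpoints built on them remain untouched, which is consistent with the plug-and-play compatibility the abstract claims.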