

I2V-Adapter: A General Image-to-Video Adapter for Video Diffusion Models

December 27, 2023
作者: Xun Guo, Mingwu Zheng, Liang Hou, Yuan Gao, Yufan Deng, Chongyang Ma, Weiming Hu, Zhengjun Zha, Haibin Huang, Pengfei Wan, Di Zhang
cs.AI

Abstract

In the rapidly evolving domain of digital content generation, the focus has shifted from text-to-image (T2I) models to more advanced video diffusion models, notably text-to-video (T2V) and image-to-video (I2V). This paper addresses the intricate challenge posed by I2V: converting static images into dynamic, lifelike video sequences while preserving the original image fidelity. Traditional methods typically involve integrating entire images into diffusion processes or using pretrained encoders for cross attention. However, these approaches often necessitate altering the fundamental weights of T2I models, thereby restricting their reusability. We introduce a novel solution, namely I2V-Adapter, designed to overcome such limitations. Our approach preserves the structural integrity of T2I models and their inherent motion modules. The I2V-Adapter operates by processing noised video frames in parallel with the input image, utilizing a lightweight adapter module. This module acts as a bridge, efficiently linking the input to the model's self-attention mechanism, thus maintaining spatial details without requiring structural changes to the T2I model. Moreover, I2V-Adapter requires only a fraction of the parameters of conventional models and ensures compatibility with existing community-driven T2I models and controlling tools. Our experimental results demonstrate I2V-Adapter's capability to produce high-quality video outputs. This performance, coupled with its versatility and reduced need for trainable parameters, represents a substantial advancement in the field of AI-driven video generation, particularly for creative applications.
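The mechanism the abstract describes — noised video frames attending to the clean input image through a lightweight module added alongside the frozen T2I model's self-attention — can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, tensor shapes, the shared key/value projection, and the zero-initialized output projection are all assumptions chosen to show the idea that training can start from the unmodified T2I behavior.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention.
    d = q.shape[-1]
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d)
    return softmax(scores) @ v

def i2v_adapter_block(frames, input_image, w_q, w_kv, w_out):
    """Hypothetical sketch of the adapter: every noised frame queries
    the input image's features, and the result is added back through a
    residual projection. With w_out initialized to zero, the block is an
    identity at the start of training, leaving the frozen T2I model's
    self-attention output untouched."""
    q = frames @ w_q           # queries from the noised video frames
    k = input_image @ w_kv     # keys from the clean input image
    v = input_image @ w_kv     # values share the projection in this sketch
    cross = attention(q, k, v)
    return frames + cross @ w_out  # residual; only adapter weights train

# Toy usage: 4 frames, 16 spatial tokens, feature dim 8.
rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 16, 8))
image = frames[:1]  # first frame stands in for the input image
w_q = rng.normal(size=(8, 8))
w_kv = rng.normal(size=(8, 8))
w_out = np.zeros((8, 8))  # zero init: block starts as identity
out = i2v_adapter_block(frames, image, w_q, w_kv, w_out)
```

The zero-initialized output projection is a common trick in adapter-style training (the frozen backbone's behavior is reproduced exactly at step zero), which matches the abstract's claim that the T2I model's structure and weights are preserved.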