I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

Abstract

Video synthesis has recently made remarkable strides, benefiting from the rapid development of diffusion models. However, it still faces challenges in semantic accuracy, clarity, and spatio-temporal continuity. These challenges arise primarily from the scarcity of well-aligned text-video data and the complex inherent structure of videos, which make it difficult for a model to excel in both semantics and quality at once. In this report, we propose I2VGen-XL, a cascaded approach that enhances model performance by decoupling these two factors and ensures alignment of the input data by using static images as a crucial form of guidance. I2VGen-XL consists of two stages: i) the base stage guarantees coherent semantics and preserves content from the input image by using two hierarchical encoders, and ii) the refinement stage enhances the video's details by incorporating an additional brief text and raises the resolution to 1280×720. To improve diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. In this way, I2VGen-XL simultaneously improves the semantic accuracy, continuity of details, and clarity of generated videos. Through extensive experiments, we investigate the underlying principles of I2VGen-XL and compare it with current top methods, demonstrating its effectiveness on diverse data. The source code and models will be publicly available at https://i2vgen-xl.github.io.
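The two-stage cascade described in the abstract can be pictured as a simple data flow. The following is a minimal, shape-only sketch and not the authors' implementation: the names `Video`, `base_stage`, `refinement_stage`, and `generate` are hypothetical, the base resolution and frame count are placeholder values not given in the abstract, and the base stage is simplified to take only the image. Only the final 1280×720 resolution, the two hierarchical encoders, and the additional brief text come from the text above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Video:
    """Shape-only stand-in for a video tensor (holds no pixel data)."""
    frames: int
    width: int
    height: int
    conditioned_on: Tuple[str, ...]  # labels of the inputs that guided generation


# Placeholder values for illustration only; the abstract does not state them.
BASE_FRAMES, BASE_WIDTH, BASE_HEIGHT = 16, 448, 256
# Final resolution as stated in the abstract.
FINAL_WIDTH, FINAL_HEIGHT = 1280, 720


def base_stage(image: str) -> Video:
    """Stage i: low-resolution video that keeps the input image's semantics and content."""
    # The abstract mentions two hierarchical encoders; here they are just labels.
    high_level = f"semantic({image})"
    low_level = f"detail({image})"
    return Video(BASE_FRAMES, BASE_WIDTH, BASE_HEIGHT, (high_level, low_level))


def refinement_stage(video: Video, brief_text: str) -> Video:
    """Stage ii: add detail guided by a brief text and raise the resolution."""
    return Video(
        video.frames,
        FINAL_WIDTH,
        FINAL_HEIGHT,
        video.conditioned_on + (brief_text,),
    )


def generate(image: str, brief_text: str) -> Video:
    """Run the cascade: base stage first, then refinement on its output."""
    return refinement_stage(base_stage(image), brief_text)
```

Calling `generate("input.png", "a brief caption")` returns a `Video` record at the final resolution whose `conditioned_on` field lists the two encoder labels followed by the text, which makes the decoupling visible: semantics are fixed in the first stage, and detail and resolution are handled in the second.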
