ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
February 6, 2024
Authors: Weiming Ren, Harry Yang, Ge Zhang, Cong Wei, Xinrun Du, Stephen Huang, Wenhu Chen
cs.AI
Abstract
Image-to-video (I2V) generation aims to use the initial frame (alongside a
text prompt) to create a video sequence. A grand challenge in I2V generation is
to maintain visual consistency throughout the video: existing methods often
struggle to preserve the integrity of the subject, background, and style from
the first frame, as well as ensure a fluid and logical progression within the
video narrative. To mitigate these issues, we propose ConsistI2V, a
diffusion-based method to enhance visual consistency for I2V generation.
Specifically, we introduce (1) spatiotemporal attention over the first frame to
maintain spatial and motion consistency, and (2) noise initialization from the
low-frequency band of the first frame to enhance layout consistency. These two
approaches enable ConsistI2V to generate highly consistent videos. We also
extend the proposed approaches to show their potential to improve consistency
in auto-regressive long video generation and camera motion control. To verify
the effectiveness of our method, we propose I2V-Bench, a comprehensive
evaluation benchmark for I2V generation. Our automatic and human evaluation
results demonstrate the superiority of ConsistI2V over existing methods.
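The low-frequency noise initialization described above can be illustrated with a short sketch: repeat the first frame across time, move each frame to the frequency domain, and keep the first frame's low-frequency band while taking the high frequencies from Gaussian noise. This is a hedged, simplified illustration, not the paper's exact implementation; the function name `lowfreq_noise_init`, the circular cutoff mask, and the `cutoff` parameter are assumptions for the example, and in practice this would operate on VAE latents rather than raw pixels.

```python
import numpy as np

def lowfreq_noise_init(first_frame, num_frames, cutoff=0.25, rng=None):
    """Build initial video noise whose low-frequency band comes from the
    first frame (hypothetical sketch of the idea in the abstract).

    first_frame: array of shape (C, H, W), e.g. a latent of the given frame.
    Returns an array of shape (num_frames, C, H, W).
    """
    rng = rng or np.random.default_rng(0)
    C, H, W = first_frame.shape

    # Repeat the conditioning frame across time; sample fresh Gaussian noise.
    video = np.broadcast_to(first_frame, (num_frames, C, H, W)).copy()
    noise = rng.standard_normal(video.shape)

    # Per-frame 2D FFT, shifted so the DC component sits at the center.
    video_f = np.fft.fftshift(np.fft.fft2(video), axes=(-2, -1))
    noise_f = np.fft.fftshift(np.fft.fft2(noise), axes=(-2, -1))

    # Circular low-pass mask: True inside the low-frequency band.
    yy = (np.arange(H) - H / 2)[:, None] / (H / 2)
    xx = (np.arange(W) - W / 2)[None, :] / (W / 2)
    mask = (yy ** 2 + xx ** 2) <= cutoff ** 2

    # Low frequencies from the first frame, high frequencies from the noise.
    mixed_f = np.where(mask, video_f, noise_f)
    return np.fft.ifft2(np.fft.ifftshift(mixed_f, axes=(-2, -1))).real
```

Because the DC component lies inside the mask, every generated frame inherits the first frame's per-channel mean, which is one way to see why such an initialization biases the diffusion process toward a consistent layout.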