ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
February 6, 2024
Authors: Weiming Ren, Harry Yang, Ge Zhang, Cong Wei, Xinrun Du, Stephen Huang, Wenhu Chen
cs.AI
Abstract
Image-to-video (I2V) generation aims to use the initial frame (alongside a
text prompt) to create a video sequence. A grand challenge in I2V generation is
to maintain visual consistency throughout the video: existing methods often
struggle to preserve the integrity of the subject, background, and style from
the first frame, as well as ensure a fluid and logical progression within the
video narrative. To mitigate these issues, we propose ConsistI2V, a
diffusion-based method to enhance visual consistency for I2V generation.
Specifically, we introduce (1) spatiotemporal attention over the first frame to
maintain spatial and motion consistency, and (2) noise initialization from the
low-frequency band of the first frame to enhance layout consistency. These two
approaches enable ConsistI2V to generate highly consistent videos. We also
extend the proposed approaches to show their potential to improve consistency
in auto-regressive long video generation and camera motion control. To verify
the effectiveness of our method, we propose I2V-Bench, a comprehensive
evaluation benchmark for I2V generation. Our automatic and human evaluation
results demonstrate the superiority of ConsistI2V over existing methods.
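The low-frequency noise initialization described above can be illustrated with a short sketch: repeat the first frame across time, move each frame to the frequency domain, and keep the first frame's low-frequency band while taking the high frequencies from Gaussian noise. This is a hedged, simplified illustration, not the paper's exact implementation; the function name `lowfreq_noise_init`, the circular cutoff mask, and the `cutoff` parameter are assumptions for the example, and in practice this would operate on VAE latents rather than raw pixels.

```python
import numpy as np

def lowfreq_noise_init(first_frame, num_frames, cutoff=0.25, rng=None):
    """Build initial video noise whose low-frequency band comes from the
    first frame (hypothetical sketch of the idea in the abstract).

    first_frame: array of shape (C, H, W), e.g. a latent of the given frame.
    Returns an array of shape (num_frames, C, H, W).
    """
    rng = rng or np.random.default_rng(0)
    C, H, W = first_frame.shape

    # Repeat the conditioning frame across time; sample fresh Gaussian noise.
    video = np.broadcast_to(first_frame, (num_frames, C, H, W)).copy()
    noise = rng.standard_normal(video.shape)

    # Per-frame 2D FFT, shifted so the DC component sits at the center.
    video_f = np.fft.fftshift(np.fft.fft2(video), axes=(-2, -1))
    noise_f = np.fft.fftshift(np.fft.fft2(noise), axes=(-2, -1))

    # Circular low-pass mask: True inside the low-frequency band.
    yy = (np.arange(H) - H / 2)[:, None] / (H / 2)
    xx = (np.arange(W) - W / 2)[None, :] / (W / 2)
    mask = (yy ** 2 + xx ** 2) <= cutoff ** 2

    # Low frequencies from the first frame, high frequencies from the noise.
    mixed_f = np.where(mask, video_f, noise_f)
    return np.fft.ifft2(np.fft.ifftshift(mixed_f, axes=(-2, -1))).real
```

Because the DC component lies inside the mask, every generated frame inherits the first frame's per-channel mean, which is one way to see why such an initialization biases the diffusion process toward a consistent layout.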