ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation
February 6, 2024
Authors: Weiming Ren, Harry Yang, Ge Zhang, Cong Wei, Xinrun Du, Stephen Huang, Wenhu Chen
cs.AI
Abstract
Image-to-video (I2V) generation aims to use the initial frame (alongside a
text prompt) to create a video sequence. A grand challenge in I2V generation is
to maintain visual consistency throughout the video: existing methods often
struggle to preserve the integrity of the subject, background, and style from
the first frame, as well as ensure a fluid and logical progression within the
video narrative. To mitigate these issues, we propose ConsistI2V, a
diffusion-based method to enhance visual consistency for I2V generation.
Specifically, we introduce (1) spatiotemporal attention over the first frame to
maintain spatial and motion consistency, and (2) noise initialization from the
low-frequency band of the first frame to enhance layout consistency. These two
approaches enable ConsistI2V to generate highly consistent videos. We also
extend the proposed approaches to show their potential to improve consistency
in auto-regressive long video generation and camera motion control. To verify
the effectiveness of our method, we propose I2V-Bench, a comprehensive
evaluation benchmark for I2V generation. Our automatic and human evaluation
results demonstrate the superiority of ConsistI2V over existing methods.
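The first-frame spatiotemporal attention described in the abstract can be sketched as an attention layer in which every frame attends over its own spatial tokens plus the tokens of frame 0, letting later frames copy appearance details from the conditioning frame. The function name, tensor layout, and the choice to simply concatenate first-frame keys/values are illustrative assumptions, not the paper's exact layer:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def first_frame_spatiotemporal_attention(q, k, v):
    """Hypothetical sketch of first-frame-augmented attention.

    q, k, v: arrays of shape (T, N, d) -- T frames, N spatial tokens,
    d channels. Each frame t attends over its own N tokens concatenated
    with the N tokens of frame 0, so the first frame's features are
    visible to every frame.
    """
    T, N, d = q.shape
    out = np.empty_like(q)
    k0, v0 = k[0], v[0]  # first-frame keys and values
    for t in range(T):
        k_aug = np.concatenate([k[t], k0], axis=0)  # (2N, d)
        v_aug = np.concatenate([v[t], v0], axis=0)  # (2N, d)
        attn = softmax(q[t] @ k_aug.T / np.sqrt(d))  # (N, 2N) weights
        out[t] = attn @ v_aug                        # (N, d)
    return out
```

Because each output token is a convex combination of value rows drawn from the current frame and frame 0, the layer can reuse first-frame appearance without any extra parameters beyond the base attention.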
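The second technique, noise initialization from the low-frequency band of the first frame, can be illustrated in the frequency domain: keep the low frequencies of the first-frame latent and fill the high frequencies with Gaussian noise, so the starting noise already carries the frame's coarse layout. The helper below is a minimal sketch under assumed shapes and a simple circular low-pass mask; the cutoff value and masking scheme are illustrative, not the authors' exact configuration:

```python
import numpy as np

def low_freq_noise_init(first_frame_latent, cutoff=0.25, seed=0):
    """Hypothetical sketch of low-frequency noise initialization.

    first_frame_latent: array of shape (..., H, W). Returns an array of
    the same shape whose low-frequency band comes from the latent and
    whose high-frequency band comes from Gaussian noise.
    """
    rng = np.random.default_rng(seed)
    h, w = first_frame_latent.shape[-2:]
    noise = rng.standard_normal(first_frame_latent.shape)

    # Centered 2D spectra of the latent and of the fresh noise.
    frame_fft = np.fft.fftshift(np.fft.fft2(first_frame_latent), axes=(-2, -1))
    noise_fft = np.fft.fftshift(np.fft.fft2(noise), axes=(-2, -1))

    # Binary low-pass mask: 1 within a radius of cutoff * min(h, w) / 2
    # around the zero-frequency center.
    yy, xx = np.mgrid[:h, :w]
    dist = np.sqrt((yy - h / 2) ** 2 + (xx - w / 2) ** 2)
    mask = (dist <= cutoff * min(h, w) / 2).astype(float)

    # Low frequencies from the frame, high frequencies from the noise.
    mixed = frame_fft * mask + noise_fft * (1 - mask)
    return np.real(np.fft.ifft2(np.fft.ifftshift(mixed, axes=(-2, -1))))
```

Starting denoising from such a mixture biases the generated video toward the first frame's spatial layout while leaving fine detail free to vary across frames.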