I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models
November 7, 2023
Authors: Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qing, Xiang Wang, Deli Zhao, Jingren Zhou
cs.AI
Abstract
Video synthesis has recently made remarkable strides benefiting from the
rapid development of diffusion models. However, it still encounters challenges
in terms of semantic accuracy, clarity, and spatio-temporal continuity. These
challenges primarily arise from the scarcity of well-aligned text-video data
and the complex inherent structure of videos, making it difficult for the model to
simultaneously ensure semantic and qualitative excellence. In this report, we
propose a cascaded I2VGen-XL approach that enhances model performance by
decoupling these two factors and ensures the alignment of the input data by
utilizing static images as a form of crucial guidance. I2VGen-XL consists of
two stages: i) the base stage guarantees coherent semantics and preserves
content from input images by using two hierarchical encoders, and ii) the
refinement stage enhances the video's details by incorporating an additional
brief text and improves the resolution to 1280×720. To improve the
diversity, we collect around 35 million single-shot text-video pairs and 6
billion text-image pairs to optimize the model. By this means, I2VGen-XL can
simultaneously enhance the semantic accuracy, continuity of details and clarity
of generated videos. Through extensive experiments, we have investigated the
underlying principles of I2VGen-XL and compared it with current top methods,
demonstrating its effectiveness on diverse data. The source code and
models will be publicly available at https://i2vgen-xl.github.io.
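The two-stage cascade described above can be sketched in code. The following is a minimal illustrative sketch, not the authors' implementation: all function names, shapes, and the placeholder "generation" logic (frame broadcasting for the base stage, nearest-neighbour upsampling for the refinement stage) are assumptions standing in for the actual diffusion samplers and hierarchical encoders.

```python
# Hypothetical sketch of a cascaded image-to-video pipeline in the spirit of
# I2VGen-XL. The real system runs two diffusion models; here each stage is a
# trivial placeholder so the data flow and output shapes are visible.
import numpy as np

def base_stage(image, text, num_frames=16, low_res=(64, 64)):
    """Stage i (placeholder): produce a semantically coherent low-resolution
    video conditioned on the input image and text prompt."""
    h, w = low_res
    # Broadcast a resized copy of the conditioning image across all frames,
    # standing in for content preservation via the hierarchical encoders.
    frame = np.resize(image, (h, w, 3))
    return np.stack([frame] * num_frames)  # shape (T, h, w, 3)

def refinement_stage(video, brief_text, high_res=(720, 1280)):
    """Stage ii (placeholder): enhance detail and raise the resolution to
    1280x720, conditioned on an additional brief text."""
    t, h, w, c = video.shape
    H, W = high_res
    # Nearest-neighbour upsampling stands in for the learned refiner.
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    return video[:, rows][:, :, cols]  # shape (T, 720, 1280, 3)

def i2vgen_xl_pipeline(image, text, brief_text):
    """Chain the two stages: base generation, then refinement."""
    low_res_video = base_stage(image, text)
    return refinement_stage(low_res_video, brief_text)

if __name__ == "__main__":
    image = np.random.rand(128, 128, 3)
    video = i2vgen_xl_pipeline(image, "a cat playing piano", "sharp, cinematic")
    print(video.shape)  # (16, 720, 1280, 3)
```

The key design point the sketch mirrors is the decoupling the abstract describes: the base stage is responsible only for semantics and content fidelity at low resolution, while the refinement stage is responsible only for detail and clarity at 1280×720.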