

VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

October 30, 2023
作者: Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, Ying Shan
cs.AI

Abstract

Video generation has attracted increasing interest in both academia and industry. Although commercial tools can generate plausible videos, few open-source models are available to researchers and engineers. In this work, we introduce two diffusion models for high-quality video generation: a text-to-video (T2V) model and an image-to-video (I2V) model. The T2V model synthesizes a video from a given text input, while the I2V model incorporates an additional image input. Our T2V model generates realistic, cinematic-quality videos at a resolution of 1024×576, outperforming other open-source T2V models in quality. The I2V model is designed to produce videos that strictly adhere to the provided reference image, preserving its content, structure, and style. It is the first open-source I2V foundation model capable of transforming a given image into a video clip while satisfying content-preservation constraints. We believe these open-source video generation models will contribute significantly to technological advancement within the community.
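To make the T2V/I2V distinction concrete, the sketch below mocks a diffusion-style reverse sampling loop over a video latent. Everything here is illustrative: `toy_denoiser`, `sample_video_latent`, and the conditioning scheme are hypothetical stand-ins invented for this example, not VideoCrafter1's actual architecture or API. The only point it demonstrates is that both modes share one sampling loop, with I2V adding an optional reference-image latent to the conditioning.

```python
import numpy as np

def toy_denoiser(x, t, text_emb, image_latent=None):
    # Stand-in for a learned video diffusion U-Net (illustrative only):
    # it nudges the noisy latent toward the conditioning signal.
    target = np.broadcast_to(text_emb.mean(), x.shape)
    if image_latent is not None:
        # I2V-style conditioning: blend in the reference-image latent
        # so the result stays close to the provided image.
        target = 0.5 * target + 0.5 * image_latent
    return x + 0.1 * (target - x)

def sample_video_latent(num_frames=4, height=8, width=8, steps=20,
                        text_emb=None, image_latent=None, seed=0):
    # Start from Gaussian noise and iteratively denoise, as in DDPM-style
    # reverse sampling; the latent has shape (frames, height, width).
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_frames, height, width))
    for t in reversed(range(steps)):
        x = toy_denoiser(x, t, text_emb, image_latent)
    return x

text_emb = np.ones(16)                  # hypothetical text embedding
ref = np.zeros((4, 8, 8))               # hypothetical reference-image latent

t2v = sample_video_latent(text_emb=text_emb)                    # T2V: text only
i2v = sample_video_latent(text_emb=text_emb, image_latent=ref)  # I2V: text + image
print(t2v.shape, i2v.shape)
```

In this toy setup the I2V sample ends up closer to the reference latent than the T2V sample does, mirroring (in spirit only) the content-preservation behavior described in the abstract.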