VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

Abstract

Video generation has gained increasing interest in both academia and industry. Although commercial tools can generate plausible videos, only a limited number of open-source models are available to researchers and engineers. In this work, we introduce two diffusion models for high-quality video generation: a text-to-video (T2V) model and an image-to-video (I2V) model. T2V models synthesize a video from a given text input, while I2V models incorporate an additional image input. Our proposed T2V model can generate realistic, cinematic-quality videos at a resolution of 1024 × 576, outperforming other open-source T2V models in terms of quality. The I2V model is designed to produce videos that strictly adhere to the provided reference image, preserving its content, structure, and style. This model is the first open-source I2V foundation model capable of transforming a given image into a video clip while maintaining content preservation constraints. We believe that these open-source video generation models will contribute significantly to technological advancements within the community.
