MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation

Abstract

The growing demand for high-fidelity video generation from textual descriptions has catalyzed significant research in this field. In this work, we introduce MagicVideo-V2, which integrates a text-to-image model, a video motion generator, a reference image embedding module, and a frame interpolation module into an end-to-end video generation pipeline. Benefiting from these architecture designs, MagicVideo-V2 can generate aesthetically pleasing, high-resolution videos with remarkable fidelity and smoothness. In a large-scale user evaluation, it demonstrates superior performance over leading text-to-video systems such as Runway, Pika 1.0, Morph, Moon Valley, and the Stable Video Diffusion model.
