Magic 1-For-1: Generating One Minute Video Clips within One Minute

Abstract

This technical report presents Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate, easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that, under the same optimization algorithm, the image-to-video task does converge more easily than the text-to-video task. We also explore a bag of optimization tricks to reduce the computational cost of training the image-to-video (I2V) models in three respects: 1) faster model convergence through multi-modal prior condition injection; 2) lower inference latency through adversarial step distillation; and 3) lower inference memory cost through parameter sparsification. With these techniques, we can generate 5-second video clips within 3 seconds. By applying a sliding window at test time, we can generate a minute-long video within one minute with significantly improved visual quality and motion dynamics, spending on average less than 1 second to generate each second of video. We conduct a series of preliminary explorations to find the optimal tradeoff between computational cost and video quality during diffusion step distillation, and we hope this can serve as a good foundation model for open-source explorations. The code and model weights are available at https://github.com/DA-Group-PKU/Magic-1-For-1.
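The abstract does not spell out how the two factorized stages and the test-time sliding window fit together, so the following is only a minimal sketch under stated assumptions: a text-to-image step produces the first frame, and each image-to-video window is conditioned on the last frame of the previous window. The names `generate_long_video`, `text_to_image`, `image_to_video`, `fake_t2i`, and `fake_i2v` are hypothetical placeholders, not the Magic141 API, and frames are represented as strings purely for illustration.

```python
from typing import Callable, List

Frame = str  # stand-in for an image tensor; an assumption for illustration only


def generate_long_video(
    prompt: str,
    text_to_image: Callable[[str], Frame],
    image_to_video: Callable[[Frame, str, int], List[Frame]],
    total_frames: int,
    window_frames: int,
) -> List[Frame]:
    """Chain I2V windows; each window starts from the previous window's last frame."""
    if total_frames <= 0 or window_frames <= 1:
        raise ValueError("total_frames must be > 0 and window_frames must be > 1")

    # Stage 1 (assumed): text-to-image produces the first frame.
    frames: List[Frame] = [text_to_image(prompt)]

    # Stage 2 (assumed): slide an image-to-video window forward until done.
    while len(frames) < total_frames:
        remaining = total_frames - len(frames)
        n = min(window_frames, remaining + 1)
        # Assumed convention: clip[0] repeats the conditioning frame.
        clip = image_to_video(frames[-1], prompt, n)
        frames.extend(clip[1:])
    return frames


# Toy stand-ins so the sketch runs without any model.
def fake_t2i(prompt: str) -> Frame:
    return f"{prompt}#0"


def fake_i2v(image: Frame, prompt: str, n: int) -> List[Frame]:
    start = int(image.rsplit("#", 1)[1])
    return [f"{prompt}#{start + i}" for i in range(n)]


video = generate_long_video("cat", fake_t2i, fake_i2v, total_frames=10, window_frames=4)
```

At the reported rate of a 5-second clip in under 3 seconds, twelve non-overlapping windows would take roughly 36 seconds for a minute of video, which is consistent with the sub-minute claim. The actual window overlap and conditioning scheme are not stated in the abstract.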
