Magic 1-For-1: Generating One Minute Video Clips within One Minute

Abstract

In this technical report, we present Magic 1-For-1 (Magic141), an efficient video generation model with optimized memory consumption and inference latency. The key idea is simple: factorize the text-to-video generation task into two separate, easier tasks for diffusion step distillation, namely text-to-image generation and image-to-video generation. We verify that, with the same optimization algorithm, the image-to-video task indeed converges more easily than the text-to-video task. We also explore a bag of optimization tricks to reduce the computational cost of training the image-to-video (I2V) models from three aspects: 1) faster model convergence through multi-modal prior condition injection; 2) lower inference latency through adversarial step distillation; and 3) lower inference memory cost through parameter sparsification. With these techniques, we are able to generate 5-second video clips within 3 seconds. By applying a test-time sliding window, we are able to generate a minute-long video within one minute with significantly improved visual quality and motion dynamics, spending on average less than 1 second to generate each second of video. We conduct a series of preliminary explorations to find the optimal tradeoff between computational cost and video quality during diffusion step distillation, and we hope this can serve as a good foundation model for open-source explorations. The code and the model weights are available at https://github.com/DA-Group-PKU/Magic-1-For-1.
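The timing claims above can be checked with simple arithmetic. The sketch below is not the authors' implementation: it assumes the sliding window tiles the target duration with fixed-length clips, uses the stated figures (5-second clips, at most 3 seconds of compute each) as defaults, and ignores any text-to-image or stitching overhead. The names `plan_sliding_window`, `WindowPlan`, and the `overlap_s` parameter are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class WindowPlan:
    num_windows: int                  # clips needed to cover the target duration
    total_compute_s: float            # upper bound on total generation time
    compute_per_video_second: float   # compute seconds spent per second of video


def plan_sliding_window(
    target_s: float,
    clip_s: float = 5.0,          # clip length quoted in the abstract
    clip_latency_s: float = 3.0,  # per-clip latency bound quoted in the abstract
    overlap_s: float = 0.0,       # assumed overlap between windows (not in the source)
) -> WindowPlan:
    """Estimate the cost of covering target_s seconds with sliding windows."""
    stride = clip_s - overlap_s
    if stride <= 0:
        raise ValueError("overlap_s must be smaller than clip_s")
    if target_s <= clip_s:
        num_windows = 1
    else:
        # First window covers clip_s seconds; each later one adds `stride` seconds.
        num_windows = 1 + math.ceil((target_s - clip_s) / stride)
    total = num_windows * clip_latency_s
    return WindowPlan(num_windows, total, total / target_s)


if __name__ == "__main__":
    print(plan_sliding_window(60.0))
    print(plan_sliding_window(60.0, overlap_s=1.0))
```

Under these assumptions, a one-minute video needs 12 non-overlapping windows and at most 36 seconds of compute, which is consistent with the claim of less than 1 second per second of video. A 1-second overlap raises this to 15 windows and 45 seconds, still inside the one-minute budget.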
