AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset

Abstract

Diffusion models have achieved remarkable progress in video generation. However, their iterative denoising requires many inference steps to generate a video, which is slow and computationally expensive. In this paper, we first analyze in detail the challenges in existing diffusion distillation methods and then propose AccVideo, a novel and efficient method that accelerates video diffusion models by reducing the number of inference steps with a synthetic dataset. We use a pretrained video diffusion model to generate multiple valid denoising trajectories as our synthetic dataset, which eliminates the use of useless data points during distillation. Based on the synthetic dataset, we design a trajectory-based few-step guidance that uses key data points from the denoising trajectories to learn the noise-to-video mapping, enabling video generation in fewer steps. Furthermore, since the synthetic dataset captures the data distribution at each diffusion timestep, we introduce an adversarial training strategy that aligns the output distribution of the student model with that of our synthetic dataset, thereby enhancing video quality. Extensive experiments demonstrate that our model achieves an 8.5x improvement in generation speed over the teacher model while maintaining comparable performance. Compared to previous acceleration methods, our approach can generate videos of higher quality and resolution, i.e., 5-second, 720x1280, 24 fps videos.
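The idea of building a synthetic dataset from denoising trajectories and keeping only key data points can be illustrated with a toy sketch. This is not the AccVideo implementation: the "teacher" below is a hypothetical stand-in that moves a short list of numbers toward a fixed target, the step counts (50 teacher steps, 5 student steps) are illustrative, and the evenly spaced choice of key points is an assumption made for the example. All function names are invented here.

```python
import random


def generate_trajectory(noise, num_steps, target=1.0):
    """Run a toy 'teacher' for num_steps and record every intermediate state.

    ASSUMPTION: this teacher is a hypothetical stand-in that moves each value
    toward a fixed target; it is not a real video diffusion model.
    """
    x = list(noise)
    trajectory = [list(x)]  # state 0 is pure noise
    for k in range(num_steps):
        # Close 1/(remaining steps) of the gap, so the last step lands on the target.
        frac = 1.0 / (num_steps - k)
        x = [xi + (target - xi) * frac for xi in x]
        trajectory.append(list(x))
    return trajectory


def select_key_points(trajectory, num_student_steps):
    """Keep num_student_steps + 1 states (ASSUMPTION: evenly spaced indices)."""
    num_teacher_steps = len(trajectory) - 1
    indices = [i * num_teacher_steps // num_student_steps
               for i in range(num_student_steps + 1)]
    return [(idx, trajectory[idx]) for idx in indices]


def build_training_pairs(key_points):
    """Pair each key point with the next one: (input state, target state)."""
    return list(zip(key_points[:-1], key_points[1:]))


random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(4)]  # tiny stand-in for a video latent
trajectory = generate_trajectory(noise, num_steps=50)
key_points = select_key_points(trajectory, num_student_steps=5)
pairs = build_training_pairs(key_points)
print(len(trajectory), [idx for idx, _ in key_points], len(pairs))
```

In this sketch, a student trained on `pairs` would only ever see states that lie on a teacher trajectory, which mirrors the stated goal of avoiding useless data points during distillation. How the actual method chooses key points and defines its losses is not specified in the abstract.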