AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset

March 25, 2025
Authors: Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yunhong Wang, Yu Qiao
cs.AI

Abstract

Diffusion models have achieved remarkable progress in video generation. However, their iterative denoising requires a large number of inference steps to generate a video, which makes sampling slow and computationally expensive. In this paper, we begin with a detailed analysis of the challenges in existing diffusion distillation methods and propose a novel, efficient method, AccVideo, which reduces the number of inference steps to accelerate video diffusion models using a synthetic dataset. We leverage the pretrained video diffusion model to generate multiple valid denoising trajectories as our synthetic dataset, which eliminates the use of useless data points during distillation. Based on the synthetic dataset, we design a trajectory-based few-step guidance that uses key data points from the denoising trajectories to learn the noise-to-video mapping, enabling video generation in fewer steps. Furthermore, since the synthetic dataset captures the data distribution at each diffusion timestep, we introduce an adversarial training strategy that aligns the output distribution of the student model with that of our synthetic dataset, thereby enhancing video quality. Extensive experiments demonstrate that our model generates videos 8.5x faster than the teacher model while maintaining comparable performance. Compared to previous acceleration methods, our approach can generate videos of higher quality and resolution, namely 5-second videos at 720x1280 and 24 fps.
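The trajectory-based idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the Euler sampler, the evenly spaced key-point rule, the function names, and the step counts (50 teacher steps, 5 student steps) are all assumptions chosen for illustration, since the abstract does not specify them. The sketch records every state along a teacher denoising trajectory, keeps a few key points, and pairs consecutive key points as (input, target) examples that a few-step student could be trained to map between.

```python
from typing import Callable, List, Tuple

State = List[float]
KeyPoint = Tuple[int, State]  # (teacher step index, state at that step)


def teacher_trajectory(
    noise: State,
    num_steps: int,
    velocity: Callable[[State, float], State],
) -> List[State]:
    """Run a toy Euler sampler (a stand-in for the teacher model) and
    record every state, from pure noise to the final sample."""
    states = [list(noise)]
    x = list(noise)
    for i in range(num_steps):
        t = 1.0 - i / num_steps  # time runs from 1 (noise) toward 0 (data)
        v = velocity(x, t)
        x = [xi - vi / num_steps for xi, vi in zip(x, v)]
        states.append(list(x))
    return states


def select_key_points(states: List[State], num_student_steps: int) -> List[KeyPoint]:
    """Keep evenly spaced states from the trajectory (assumed selection rule)."""
    teacher_steps = len(states) - 1
    if teacher_steps % num_student_steps != 0:
        raise ValueError("teacher steps must be a multiple of student steps")
    stride = teacher_steps // num_student_steps
    return [(k * stride, states[k * stride]) for k in range(num_student_steps + 1)]


def student_pairs(key_points: List[KeyPoint]) -> List[Tuple[KeyPoint, KeyPoint]]:
    """Pair consecutive key points: the student learns to jump from one to the next."""
    return list(zip(key_points[:-1], key_points[1:]))


def toy_velocity(x: State, t: float) -> State:
    """Hypothetical velocity field that pulls every coordinate toward 1.0."""
    return [xi - 1.0 for xi in x]
```

For example, calling `teacher_trajectory([0.0, 2.0], 50, toy_velocity)` and then `select_key_points(states, 5)` keeps the states at teacher steps 0, 10, 20, 30, 40, and 50, giving five training pairs. Because every pair lies on a trajectory the teacher actually produced, the student never trains on states the teacher would not visit, which is one reading of the abstract's point about avoiding useless data points.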

