SF-V: Single Forward Video Generation Model
June 6, 2024
Authors: Zhixing Zhang, Yanyu Li, Yushu Wu, Yanwu Xu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Dimitris Metaxas, Sergey Tulyakov, Jian Ren
cs.AI
Abstract
Diffusion-based video generation models have demonstrated remarkable success in obtaining high-fidelity videos through the iterative denoising process. However, these models require multiple denoising steps during sampling, resulting in high computational costs. In this work, we propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained video diffusion models. We show that, through adversarial training, the multi-step video diffusion model, i.e., Stable Video Diffusion (SVD), can be trained to perform a single forward pass to synthesize high-quality videos, capturing both temporal and spatial dependencies in the video data. Extensive experiments demonstrate that our method achieves competitive generation quality for synthesized videos with significantly reduced computational overhead for the denoising process (i.e., around a 23× speedup compared with SVD and a 6× speedup compared with existing works, with even better generation quality), paving the way for real-time video synthesis and editing. More visualization results are made publicly available at https://snap-research.github.io/SF-V.
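
The single-forward-pass idea can be illustrated with a minimal sketch using the Hugging Face diffusers `StableVideoDiffusionPipeline`. The sketch assumes a denoising UNet that has already been adversarially fine-tuned for one-step generation; the checkpoint path `sf_v_unet.pt` is hypothetical, and with the stock SVD weights a single denoising step will not produce high-quality videos. The stock pipeline defaults to 25 iterative steps, which is consistent with the reported ~23× speedup from collapsing sampling to one step.

```python
# Minimal sketch: single-step image-to-video sampling with Hugging Face diffusers.
# Assumption: "sf_v_unet.pt" is a hypothetical checkpoint holding UNet weights
# adversarially fine-tuned for one-step generation (SF-V-style); this is an
# illustration, not the authors' released code or checkpoint.
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

# Load the standard Stable Video Diffusion image-to-video pipeline.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Hypothetical: swap in the adversarially fine-tuned, one-step UNet weights.
pipe.unet.load_state_dict(torch.load("sf_v_unet.pt"))

# Conditioning frame for image-to-video generation.
image = load_image("conditioning_frame.png")

# A single denoising step instead of the default 25 iterative steps.
frames = pipe(image, num_inference_steps=1, decode_chunk_size=8).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```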