

SF-V: Single Forward Video Generation Model

June 6, 2024
Authors: Zhixing Zhang, Yanyu Li, Yushu Wu, Yanwu Xu, Anil Kag, Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, Junli Cao, Dimitris Metaxas, Sergey Tulyakov, Jian Ren
cs.AI

Abstract

Diffusion-based video generation models have demonstrated remarkable success in obtaining high-fidelity videos through the iterative denoising process. However, these models require multiple denoising steps during sampling, resulting in high computational costs. In this work, we propose a novel approach to obtain single-step video generation models by leveraging adversarial training to fine-tune pre-trained video diffusion models. We show that, through adversarial training, the multi-step video diffusion model, i.e., Stable Video Diffusion (SVD), can be trained to perform a single forward pass to synthesize high-quality videos, capturing both temporal and spatial dependencies in the video data. Extensive experiments demonstrate that our method achieves competitive generation quality for synthesized videos with significantly reduced computational overhead in the denoising process (i.e., around 23× speedup compared with SVD and 6× speedup compared with existing works, with even better generation quality), paving the way for real-time video synthesis and editing. More visualization results are made publicly available at https://snap-research.github.io/SF-V.
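The core idea described in the abstract, adversarially fine-tuning a pre-trained multi-step denoiser so it generates a video in a single forward pass, can be illustrated with a minimal PyTorch sketch. The module names, shapes, and losses below are hypothetical placeholders, not the SF-V implementation or the Stable Video Diffusion API; they only show the general structure of a one-step generator trained against a spatio-temporal discriminator.

```python
# Illustrative sketch only: a generic adversarial fine-tuning loop for a
# one-step video generator. All modules and shapes are hypothetical and do
# NOT reflect the SF-V codebase or the Stable Video Diffusion pipeline.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OneStepGenerator(nn.Module):
    """Stand-in for a pre-trained video denoiser fine-tuned to map noise
    (plus conditioning) to a clean latent video in one forward pass."""
    def __init__(self, channels: int = 4):
        super().__init__()
        self.net = nn.Conv3d(channels, channels, kernel_size=3, padding=1)

    def forward(self, noisy_latents: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # Single forward pass: predict the clean latent video directly.
        return self.net(noisy_latents) + cond

class VideoDiscriminator(nn.Module):
    """Spatio-temporal discriminator scoring real vs. generated latents."""
    def __init__(self, channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(channels, 32, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(32, 1),
        )

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        return self.net(latents)

def train_step(gen, disc, opt_g, opt_d, real_latents, cond):
    noise = torch.randn_like(real_latents)

    # Discriminator update: real latents vs. detached single-step fakes.
    fake = gen(noise, cond).detach()
    d_loss = (F.softplus(-disc(real_latents)) + F.softplus(disc(fake))).mean()
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: produce a video in one forward pass and fool the
    # discriminator (non-saturating GAN loss).
    fake = gen(noise, cond)
    g_loss = F.softplus(-disc(fake)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

# Toy usage with random tensors shaped (batch, channels, frames, height, width).
gen, disc = OneStepGenerator(), VideoDiscriminator()
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-5)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-5)
real = torch.randn(2, 4, 8, 32, 32)
cond = torch.randn(2, 4, 8, 32, 32)
print(train_step(gen, disc, opt_g, opt_d, real, cond))
```

At inference time, such a generator needs only one network evaluation per video rather than tens of denoising steps, which is where the reported ~23× speedup over SVD comes from.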
