OSV: One Step is Enough for High-Quality Image to Video Generation
September 17, 2024
Authors: Xiaofeng Mao, Zhengkai Jiang, Fu-Yun Wang, Wenbing Zhu, Jiangning Zhang, Hao Chen, Mingmin Chi, Yabiao Wang
cs.AI
Abstract
Video diffusion models have shown great potential in generating high-quality
videos, making them an increasingly popular focus. However, their inherent
iterative nature leads to substantial computational and time costs. Efforts
have been made to accelerate video diffusion by reducing inference steps,
through techniques such as consistency distillation and GAN training, but
these approaches often fall short in either performance or training
stability. In this work, we introduce a two-stage training framework that
effectively combines consistency distillation with GAN training to address
these challenges. Additionally, we propose a novel video discriminator design,
which eliminates the need for decoding the video latents and improves the final
performance. Our model is capable of producing high-quality videos in merely
one step, with the flexibility to perform multi-step refinement for further
performance enhancement. Our quantitative evaluation on the OpenWebVid-1M
benchmark shows that our model significantly outperforms existing methods.
Notably, our 1-step performance (FVD 171.15) exceeds the 8-step performance of
the consistency-distillation-based method AnimateLCM (FVD 184.79) and
approaches the 25-step performance of the advanced Stable Video Diffusion (FVD
156.94).