VEnhancer: Generative Space-Time Enhancement for Video Generation
July 10, 2024
Authors: Jingwen He, Tianfan Xue, Dongyang Liu, Xinqi Lin, Peng Gao, Dahua Lin, Yu Qiao, Wanli Ouyang, Ziwei Liu
cs.AI
Abstract
We present VEnhancer, a generative space-time enhancement framework that
improves existing text-to-video results by adding more detail in the spatial
domain and synthesizing detailed motion in the temporal domain. Given a generated
low-quality video, our approach can increase its spatial and temporal
resolution simultaneously at arbitrary spatial and temporal up-sampling scales
through a unified video diffusion model. Furthermore, VEnhancer effectively
removes spatial artifacts and temporal flickering from generated videos. To
achieve this, we build on a pretrained video diffusion model: we train a video
ControlNet and inject it into the diffusion model, conditioned on low
frame-rate, low-resolution videos. To effectively train this video
ControlNet, we design space-time data augmentation as well as video-aware
conditioning. Thanks to these designs, VEnhancer is stable during training
and can be trained in an elegant end-to-end manner. Extensive
experiments show that VEnhancer surpasses existing state-of-the-art video
super-resolution and space-time super-resolution methods in enhancing
AI-generated videos. Moreover, with VEnhancer, the existing open-source
state-of-the-art text-to-video method VideoCrafter-2 ranks first on the
video generation benchmark VBench.
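The abstract says the video ControlNet is trained with "space-time data augmentation" and "video-aware conditioning" but gives no details. The sketch below illustrates one plausible reading, under our own assumptions and not the authors' implementation: a high-quality clip is degraded into a low frame-rate, low-resolution condition by keeping every k-th frame, box-averaging each frame by an integer factor, and adding Gaussian noise. The augmentation parameters are returned alongside the frames so that a "video-aware" ControlNet could also receive them as conditioning. The names `space_time_augment` and `downscale`, the frame representation, box downscaling, and noise augmentation are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical representation: one grayscale frame = a list of pixel rows.
Frame = List[List[float]]


def downscale(frame: Frame, factor: int) -> Frame:
    """Box-average downscaling by an integer factor (assumed degradation).

    Trailing rows/columns that do not fill a complete factor x factor block
    are cropped.
    """
    h, w = len(frame), len(frame[0])
    out: Frame = []
    for i in range(0, h - h % factor, factor):
        row = []
        for j in range(0, w - w % factor, factor):
            block = [frame[i + di][j + dj] for di in range(factor) for dj in range(factor)]
            row.append(sum(block) / (factor * factor))
        out.append(row)
    return out


def space_time_augment(
    video: List[Frame],
    time_scale: int,
    space_scale: int,
    noise_std: float,
    rng: random.Random,
) -> Tuple[List[Frame], Dict[str, float]]:
    """Degrade a clip into a low frame-rate, low-resolution, noisy condition.

    Also returns the augmentation parameters; passing them to the ControlNet
    as extra conditioning is our assumption about "video-aware conditioning".
    """
    if time_scale < 1 or space_scale < 1 or noise_std < 0:
        raise ValueError("scales must be >= 1 and noise_std must be >= 0")
    kept = video[::time_scale]                       # temporal: keep every k-th frame
    low = [downscale(f, space_scale) for f in kept]  # spatial: reduce resolution
    noisy = [[[p + rng.gauss(0.0, noise_std) for p in row] for row in f] for f in low]
    cond = {"time_scale": time_scale, "space_scale": space_scale, "noise_std": noise_std}
    return noisy, cond


if __name__ == "__main__":
    rng = random.Random(0)
    # A toy 9-frame, 8x8 clip whose pixel value encodes its position.
    clip = [[[float(r * 8 + c) for c in range(8)] for r in range(8)] for _ in range(9)]
    # Randomly sampled factors stand in for the random space-time augmentation.
    k, s = rng.choice([1, 2, 4]), rng.choice([1, 2, 4])
    low, cond = space_time_augment(clip, k, s, noise_std=0.01, rng=rng)
    print(len(low), "frames of", len(low[0]), "x", len(low[0][0]), cond)
```

Sampling the factors randomly per training clip is what would let a single model handle several up-sampling scales, which matches the abstract's claim of arbitrary scales; the specific factor sets above are placeholders.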
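To make "arbitrary up-sampling space and time scales" concrete, the sketch below computes the output geometry for a given input clip and pair of scales. It assumes a convention the abstract does not state: with an integer time scale k, input frame i lands at output index i·k, so T input frames become (T − 1)·k + 1 output frames, and every other output index is a frame the diffusion model must synthesize; spatial size is scaled by a possibly non-integer factor and rounded. `UpsamplePlan` and `plan_space_time_upsampling` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UpsamplePlan:
    out_frames: int
    out_height: int
    out_width: int
    key_indices: List[int]    # output positions that carry an input frame
    synth_indices: List[int]  # output positions the model must generate


def plan_space_time_upsampling(
    num_frames: int, height: int, width: int, time_scale: int, space_scale: float
) -> UpsamplePlan:
    """Output layout under an assumed 'keep endpoints, fill gaps' convention."""
    if num_frames < 1 or time_scale < 1 or space_scale <= 0:
        raise ValueError("need num_frames >= 1, time_scale >= 1, space_scale > 0")
    out_frames = (num_frames - 1) * time_scale + 1
    key = [i * time_scale for i in range(num_frames)]
    key_set = set(key)
    synth = [t for t in range(out_frames) if t not in key_set]
    return UpsamplePlan(
        out_frames=out_frames,
        out_height=round(height * space_scale),
        out_width=round(width * space_scale),
        key_indices=key,
        synth_indices=synth,
    )


if __name__ == "__main__":
    plan = plan_space_time_upsampling(16, 320, 512, time_scale=3, space_scale=4.0)
    print(plan.out_frames, plan.out_height, plan.out_width)  # prints: 46 1280 2048
```

Under this convention, the first and last input frames are preserved as endpoints, so temporal up-sampling only fills gaps between existing frames rather than extrapolating beyond the clip.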