VEnhancer: Generative Space-Time Enhancement for Video Generation

Abstract

We present VEnhancer, a generative space-time enhancement framework that improves existing text-to-video results by adding more detail in the spatial domain and synthesizing detailed motion in the temporal domain. Given a generated low-quality video, our approach increases its spatial and temporal resolution simultaneously, at arbitrary spatial and temporal up-sampling scales, through a unified video diffusion model. Furthermore, VEnhancer effectively removes the spatial artifacts and temporal flickering of generated videos. To achieve this, building on a pretrained video diffusion model, we train a video ControlNet and inject it into the diffusion model as a condition on low-frame-rate, low-resolution videos. To train this video ControlNet effectively, we design space-time data augmentation as well as video-aware conditioning. Thanks to these designs, VEnhancer is stable during training and supports an elegant end-to-end training scheme. Extensive experiments show that VEnhancer surpasses existing state-of-the-art video super-resolution and space-time super-resolution methods in enhancing AI-generated videos. Moreover, with VEnhancer, the existing open-source state-of-the-art text-to-video method VideoCrafter-2 reaches first place on the video generation benchmark VBench.
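To make the idea of space-time data augmentation concrete, the following is a minimal sketch of how a low-frame-rate, low-resolution conditioning clip could be derived from a high-quality training video. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not say how the degradation is performed, so the plain strided subsampling, the candidate scale sets, and the names `make_condition` and `sample_training_pair` are all hypothetical.

```python
import random


def make_condition(video, space_scale, time_scale):
    """Subsample a video in space and time.

    `video` is a list of frames; each frame is a list of rows of pixels.
    Strided subsampling is an assumption made for illustration only;
    a real pipeline would likely use proper resizing and added noise.
    """
    if space_scale < 1 or time_scale < 1:
        raise ValueError("scales must be positive integers")
    kept_frames = video[::time_scale]  # lower the frame rate
    return [
        [row[::space_scale] for row in frame[::space_scale]]  # lower the resolution
        for frame in kept_frames
    ]


def sample_training_pair(video, space_scales=(1, 2, 4), time_scales=(1, 2, 4), rng=None):
    """Draw random down-sampling factors and build a (condition, target) pair.

    The candidate scale sets are hypothetical defaults, not values from the paper.
    """
    rng = rng or random.Random()
    s = rng.choice(space_scales)
    t = rng.choice(time_scales)
    return {
        "target": video,  # full-resolution, full-frame-rate clip
        "condition": make_condition(video, s, t),
        "space_scale": s,
        "time_scale": t,
    }
```

Calling `sample_training_pair(video)` repeatedly yields conditions at varying spatial and temporal scales, which is one plausible way to expose a single model to many up-sampling factors. The sampled factors are returned with the pair so that downstream code can use them; whether and how VEnhancer passes such information to the network is not specified in the abstract.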
