VEnhancer: Generative Space-Time Enhancement for Video Generation

Abstract

We present VEnhancer, a generative space-time enhancement framework that improves existing text-to-video results by adding more detail in the spatial domain and synthesizing detailed motion in the temporal domain. Given a generated low-quality video, our approach can increase its spatial and temporal resolution simultaneously, with arbitrary up-sampling scales in space and time, through a unified video diffusion model. Furthermore, VEnhancer effectively removes the spatial artifacts and temporal flickering of generated videos. To achieve this, building on a pretrained video diffusion model, we train a video ControlNet and inject it into the diffusion model as a condition on low-frame-rate, low-resolution videos. To train this video ControlNet effectively, we design space-time data augmentation as well as video-aware conditioning. Benefiting from these designs, VEnhancer is stable during training and can be trained in an elegant end-to-end manner. Extensive experiments show that VEnhancer surpasses existing state-of-the-art video super-resolution and space-time super-resolution methods in enhancing AI-generated videos. Moreover, with VEnhancer, the existing open-source state-of-the-art text-to-video method VideoCrafter-2 reaches first place on the video generation benchmark VBench.
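
As a rough illustration of what a low-frame-rate, low-resolution condition looks like, the sketch below degrades a clean clip by skipping frames and average-pooling each kept frame. This is not the paper's augmentation pipeline, which the abstract does not specify. The function names `downsample_frame` and `make_condition`, the integer scales, and the choice of average pooling are all assumptions made for the example.

```python
from typing import List

Frame = List[List[float]]  # one grayscale frame: rows of pixel values
Video = List[Frame]        # a clip: list of frames


def downsample_frame(frame: Frame, s: int) -> Frame:
    """Average-pool a frame over non-overlapping s x s blocks (hypothetical helper)."""
    h, w = len(frame), len(frame[0])
    out = []
    # Rows and columns that do not fill a complete block are dropped.
    for i in range(0, h - h % s, s):
        row = []
        for j in range(0, w - w % s, s):
            block = [frame[i + di][j + dj] for di in range(s) for dj in range(s)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def make_condition(video: Video, space_scale: int, time_scale: int) -> Video:
    """Keep every time_scale-th frame, then shrink each kept frame by space_scale.

    Assumption: integer scales only; the abstract mentions arbitrary scales.
    """
    if space_scale < 1 or time_scale < 1:
        raise ValueError("scales must be positive integers")
    return [downsample_frame(f, space_scale) for f in video[::time_scale]]


# Example: an 8-frame 4x4 clip in which every pixel of frame t equals t.
clip = [[[float(t)] * 4 for _ in range(4)] for t in range(8)]
cond = make_condition(clip, space_scale=2, time_scale=4)
```

Here `cond` holds frames 0 and 4 of the clip, each reduced from 4x4 to 2x2. In this toy setup, an enhancement model would take such a pair of degraded clip and scales and predict the full clip.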
