VEnhancer: Generative Space-Time Enhancement for Video Generation
July 10, 2024
Authors: Jingwen He, Tianfan Xue, Dongyang Liu, Xinqi Lin, Peng Gao, Dahua Lin, Yu Qiao, Wanli Ouyang, Ziwei Liu
cs.AI
Abstract
We present VEnhancer, a generative space-time enhancement framework that
improves existing text-to-video results by adding more detail in the spatial
domain and synthesizing detailed motion in the temporal domain. Given a
generated low-quality video, our approach can increase its spatial and temporal
resolution simultaneously, with arbitrary spatial and temporal up-sampling
scales, through a unified video diffusion model. Furthermore, VEnhancer
effectively removes spatial artifacts and temporal flickering from generated
videos. To achieve this, building on a pretrained video diffusion model, we
train a video ControlNet and inject it into the diffusion model as a condition
on low-frame-rate and low-resolution videos. To train this video ControlNet
effectively, we design space-time data augmentation as well as video-aware
conditioning. Benefiting from these designs, VEnhancer is stable during
training and supports an elegant end-to-end training scheme. Extensive
experiments show that VEnhancer surpasses existing state-of-the-art video
super-resolution and space-time super-resolution methods in enhancing
AI-generated videos. Moreover, with VEnhancer, the existing open-source
state-of-the-art text-to-video method VideoCrafter-2 reaches first place on
the video generation benchmark VBench.
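
To make the idea of space-time data augmentation more concrete, the following
is a minimal sketch of how a low-frame-rate, low-resolution conditioning clip
could be derived from a high-quality training video. It is an illustration
under stated assumptions, not the authors' implementation: the function names
`degrade_video` and `sample_training_pair`, the use of frame skipping and
average pooling as degradations, and the sampling ranges `max_skip` and
`max_down` are all hypothetical choices made for this example.

```python
import numpy as np


def degrade_video(video, frame_skip, down_factor):
    """Return a low-frame-rate, low-resolution copy of a (T, H, W, C) video.

    Hypothetical helper: frame skipping and average pooling stand in for
    whatever degradations a real training pipeline would use.
    """
    # Temporal degradation: keep every `frame_skip`-th frame.
    kept = video[::frame_skip]
    t, h, w, c = kept.shape
    # Crop so height and width divide evenly by the spatial factor.
    h2, w2 = h - h % down_factor, w - w % down_factor
    kept = kept[:, :h2, :w2]
    # Spatial degradation: average over non-overlapping blocks.
    blocks = kept.reshape(
        t, h2 // down_factor, down_factor, w2 // down_factor, down_factor, c
    )
    return blocks.mean(axis=(2, 4))


def sample_training_pair(video, rng, max_skip=8, max_down=8):
    """Sample random temporal and spatial scales and build (condition, target).

    The ranges are illustrative assumptions; they are not taken from the paper.
    """
    frame_skip = int(rng.integers(1, max_skip + 1))
    down_factor = int(rng.integers(1, max_down + 1))
    condition = degrade_video(video, frame_skip, down_factor)
    return condition, video, frame_skip, down_factor
```

Sampling the two factors independently for each training clip exposes a single
model to many combinations of temporal and spatial scale. In this sketch the
sampled factors are returned alongside the clips so that they could also be
passed to the model as part of its conditioning.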