VideoElevator: Elevating Video Generation Quality with Versatile Text-to-Image Diffusion Models
March 8, 2024
作者: Yabo Zhang, Yuxiang Wei, Xianhui Lin, Zheng Hui, Peiran Ren, Xuansong Xie, Xiangyang Ji, Wangmeng Zuo
cs.AI
Abstract
Text-to-image diffusion models (T2I) have demonstrated unprecedented
capabilities in creating realistic and aesthetic images. In contrast,
text-to-video diffusion models (T2V) still lag far behind in frame quality and
text alignment, owing to the insufficient quality and quantity of training
videos. In this paper, we introduce VideoElevator, a training-free and
plug-and-play method that elevates the performance of T2V using the superior
capabilities of T2I. Unlike conventional T2V sampling (i.e., joint temporal
and spatial modeling), VideoElevator explicitly decomposes each sampling step
into temporal motion refining and spatial quality elevating. Specifically,
temporal motion refining uses the encapsulated T2V to enhance temporal
consistency, followed by inversion to the noise distribution required by T2I.
Spatial quality elevating then harnesses the inflated T2I to directly predict
a less noisy latent, adding more photo-realistic details. We have conducted
extensive experiments over diverse prompts with various combinations of T2V
and T2I. The results show that VideoElevator not only improves the performance
of T2V baselines with foundational T2I, but also facilitates stylized video
synthesis with personalized T2I. Our code is available at
https://github.com/YBYBZhang/VideoElevator.
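The decomposed sampling loop described in the abstract can be sketched as below. This is a toy illustration only, not the authors' implementation (see the linked repository for that): the three stage functions are hypothetical stand-ins that mimic, with simple NumPy arithmetic, what the real pretrained T2V denoiser, DDIM-style inversion, and inflated T2I denoiser would do to a video latent of shape `(frames, channels, height, width)`.

```python
import numpy as np

def t2v_motion_refine(latents: np.ndarray) -> np.ndarray:
    """Temporal motion refining (toy stand-in): the real method runs the
    encapsulated T2V for a few steps to enforce temporal consistency.
    Here we simply blend each frame toward the clip mean over time."""
    clip_mean = latents.mean(axis=0, keepdims=True)
    return 0.8 * latents + 0.2 * clip_mean

def invert_to_t2i_noise(latents: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Inversion (toy stand-in): the real method inverts the refined
    latents back to the noise level the T2I sampler expects. Here we
    re-inject a small amount of Gaussian noise."""
    return latents + 0.1 * rng.standard_normal(latents.shape)

def t2i_quality_elevate(latents: np.ndarray) -> np.ndarray:
    """Spatial quality elevating (toy stand-in): the real method applies
    the inflated T2I frame-by-frame to predict a less noisy latent.
    Here one 'denoising step' just shrinks the latent toward zero."""
    return 0.9 * latents

def video_elevator_sampling(latents: np.ndarray, num_steps: int) -> np.ndarray:
    """One VideoElevator-style pass: each sampling step is explicitly
    split into temporal motion refining followed by spatial quality
    elevating, instead of a single joint temporal-spatial update."""
    rng = np.random.default_rng(0)
    for _ in range(num_steps):
        latents = t2v_motion_refine(latents)          # temporal consistency
        latents = invert_to_t2i_noise(latents, rng)   # match T2I noise level
        latents = t2i_quality_elevate(latents)        # photo-realistic detail
    return latents

# Example: an 8-frame latent video with 4 channels at 16x16 resolution.
noisy = np.random.default_rng(1).standard_normal((8, 4, 16, 16))
clean = video_elevator_sampling(noisy, num_steps=10)
print(clean.shape)  # (8, 4, 16, 16)
```

The key structural point the sketch captures is that the T2V and T2I models never run jointly: each outer step first borrows the T2V for motion, then hands a re-noised latent to the T2I for per-frame detail, which is what makes the method training-free and plug-and-play across model combinations.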