Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models
May 17, 2023
Authors: Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, Yogesh Balaji
cs.AI
Abstract
Despite tremendous progress in generating high-quality images using diffusion
models, synthesizing a sequence of animated frames that are both photorealistic
and temporally coherent is still in its infancy. While off-the-shelf
billion-scale datasets for image generation are available, collecting similar
video data of the same scale is still challenging. Also, training a video
diffusion model is computationally much more expensive than its image
counterpart. In this work, we explore finetuning a pretrained image diffusion
model with video data as a practical solution for the video synthesis task. We
find that naively extending the image noise prior to a video noise prior
leads to sub-optimal performance in video diffusion. Our carefully designed video noise
prior leads to substantially better performance. Extensive experimental
validation shows that our model, Preserve Your Own Correlation (PYoCo), attains
SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It
also achieves SOTA video generation quality on the small-scale UCF-101
benchmark with a 10× smaller model, using significantly less computation
than the prior art.