Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models
May 17, 2023
Authors: Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, Yogesh Balaji
cs.AI
Abstract
Despite tremendous progress in generating high-quality images using diffusion
models, synthesizing a sequence of animated frames that are both photorealistic
and temporally coherent is still in its infancy. While off-the-shelf
billion-scale datasets for image generation are available, collecting similar
video data of the same scale is still challenging. Also, training a video
diffusion model is computationally much more expensive than its image
counterpart. In this work, we explore finetuning a pretrained image diffusion
model with video data as a practical solution for the video synthesis task. We
find that naively extending the image noise prior to video noise prior in video
diffusion leads to sub-optimal performance. Our carefully designed video noise
prior leads to substantially better performance. Extensive experimental
validation shows that our model, Preserve Your Own Correlation (PYoCo), attains
SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It
also achieves SOTA video generation quality on the small-scale UCF-101
benchmark with a 10× smaller model, using significantly less computation
than the prior art.
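The abstract does not spell out the form of the proposed video noise prior, but the core idea of a correlated prior can be illustrated with a small sketch: instead of sampling each frame's noise independently, blend one clip-level shared component with a per-frame independent component, weighting them so that each frame's marginal distribution remains standard normal (which diffusion training assumes). The function name `mixed_noise` and the mixing parameter `alpha` below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def mixed_noise(num_frames, frame_shape, alpha=1.0, seed=None):
    """Sample temporally correlated Gaussian noise for a video clip (sketch).

    Each frame's noise is a weighted sum of one shared clip-level sample and
    a fresh per-frame sample. The weights sqrt(alpha^2/(1+alpha^2)) and
    sqrt(1/(1+alpha^2)) keep every frame marginally N(0, 1), while any two
    frames have correlation alpha^2/(1+alpha^2).
    """
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal(frame_shape)                 # one sample for the whole clip
    independent = rng.standard_normal((num_frames, *frame_shape))  # fresh sample per frame
    w_shared = np.sqrt(alpha**2 / (1.0 + alpha**2))
    w_ind = np.sqrt(1.0 / (1.0 + alpha**2))
    return w_shared * shared + w_ind * independent
```

With `alpha=0` this degenerates to fully independent per-frame noise (the naive image-to-video extension the abstract calls sub-optimal), while larger `alpha` strengthens the temporal correlation between frames.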