Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models

Abstract

Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets for image generation are available, collecting video data of the same scale is still challenging. Also, training a video diffusion model is computationally much more expensive than training its image counterpart. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. We find that naively extending the image noise prior to a video noise prior in video diffusion leads to sub-optimal performance. Our carefully designed video noise prior leads to substantially better performance. Extensive experimental validation shows that our model, Preserve Your Own Correlation (PYoCo), attains state-of-the-art (SOTA) zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It also achieves SOTA video generation quality on the small-scale UCF-101 benchmark with a 10× smaller model using significantly less computation than the prior art.
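The abstract does not spell out how the video noise prior is constructed. As a rough illustration only, the sketch below contrasts the naive choice (independent Gaussian noise per frame) with one simple way to correlate noise across frames: adding a component shared by all frames to a per-frame independent component, scaled so each frame's noise keeps unit variance. The function names, the `alpha` parameter, and the shared-plus-independent construction are assumptions made for this example, not details stated in the text above.

```python
import math
import random


def mixed_video_noise(num_frames, dim, alpha, rng):
    """Sample num_frames noise vectors of length dim with unit marginal variance.

    Illustrative assumption: each frame's noise = shared + independent, with
    variances alpha^2/(1+alpha^2) and 1/(1+alpha^2). Any two frames then have
    correlation alpha^2/(1+alpha^2); alpha = 0 gives independent frames.
    """
    shared_std = math.sqrt(alpha**2 / (1.0 + alpha**2))
    indep_std = math.sqrt(1.0 / (1.0 + alpha**2))
    # One shared sample per dimension, reused by every frame.
    shared = [rng.gauss(0.0, shared_std) for _ in range(dim)]
    # Each frame adds its own independent sample on top of the shared one.
    return [
        [s + rng.gauss(0.0, indep_std) for s in shared]
        for _ in range(num_frames)
    ]


def correlation(x, y):
    """Empirical Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    rng = random.Random(0)
    naive = mixed_video_noise(num_frames=4, dim=20000, alpha=0.0, rng=rng)
    mixed = mixed_video_noise(num_frames=4, dim=20000, alpha=2.0, rng=rng)
    print("naive frame-to-frame correlation:", correlation(naive[0], naive[1]))
    print("mixed frame-to-frame correlation:", correlation(mixed[0], mixed[1]))
```

In this sketch, `alpha = 0` reduces to the naive prior with uncorrelated frames, while larger `alpha` makes the frames' noise more alike without changing the per-frame variance, so the per-frame noise still follows the standard Gaussian distribution an image diffusion model is trained with.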
