Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models

Abstract

Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets for image generation are available, collecting video data at the same scale is still challenging. Also, training a video diffusion model is computationally much more expensive than training its image counterpart. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. We find that naively extending the image noise prior to a video noise prior in video diffusion leads to sub-optimal performance. Our carefully designed video noise prior leads to substantially better performance. Extensive experimental validation shows that our model, Preserve Your Own Correlation (PYoCo), attains state-of-the-art (SOTA) zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It also achieves SOTA video generation quality on the small-scale UCF-101 benchmark with a 10× smaller model using significantly less computation than the prior art.
