

FreeInit: Bridging Initialization Gap in Video Diffusion Models

December 12, 2023
Authors: Tianxing Wu, Chenyang Si, Yuming Jiang, Ziqi Huang, Ziwei Liu
cs.AI

Abstract

Though diffusion-based video generation has witnessed rapid progress, the inference results of existing models still exhibit unsatisfactory temporal consistency and unnatural dynamics. In this paper, we delve deep into the noise initialization of video diffusion models and discover an implicit training-inference gap that accounts for the unsatisfactory inference quality. Our key findings are: 1) the spatial-temporal frequency distribution of the initial latent at inference is intrinsically different from that seen during training, and 2) the denoising process is significantly influenced by the low-frequency components of the initial noise. Motivated by these observations, we propose a concise yet effective inference sampling strategy, FreeInit, which significantly improves the temporal consistency of videos generated by diffusion models. By iteratively refining the spatial-temporal low-frequency components of the initial latent during inference, FreeInit is able to compensate for the initialization gap between training and inference, thus effectively improving the subject appearance and temporal consistency of the generated results. Extensive experiments demonstrate that FreeInit consistently enhances the generation results of various text-to-video generation models without additional training.
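To make the sampling strategy concrete, below is a minimal sketch of the frequency-domain mixing the abstract describes, assuming a PyTorch video latent of shape (B, C, T, H, W). The function names, the Gaussian filter, and the cutoff `d0` are illustrative assumptions, not the authors' exact implementation; `ddim_sample` and `add_noise` in the commented loop are hypothetical placeholders for the sampler interface.

```python
# Sketch of FreeInit-style noise reinitialization (illustrative, not the
# official implementation). Latents are assumed to have shape (B, C, T, H, W).
import torch
import torch.fft as fft

def gaussian_low_pass_filter(T, H, W, d0=0.25):
    """3D Gaussian low-pass mask over (T, H, W), centered after fftshift.
    d0 is a normalized cutoff: smaller values keep fewer low frequencies."""
    t = torch.linspace(-1, 1, T)
    h = torch.linspace(-1, 1, H)
    w = torch.linspace(-1, 1, W)
    grid = torch.stack(torch.meshgrid(t, h, w, indexing="ij"))
    d2 = (grid ** 2).sum(dim=0)            # squared distance from spectrum center
    return torch.exp(-d2 / (2 * d0 ** 2))  # (T, H, W), broadcasts over B and C

def freq_mix_3d(z_low, z_high, lpf):
    """Keep the low-frequency band of z_low and the high-frequency band of
    z_high in the spatial-temporal spectrum, then invert back to the latent."""
    dims = (-3, -2, -1)  # the (T, H, W) axes
    Z_low = fft.fftshift(fft.fftn(z_low, dim=dims), dim=dims)
    Z_high = fft.fftshift(fft.fftn(z_high, dim=dims), dim=dims)
    Z = Z_low * lpf + Z_high * (1.0 - lpf)
    return fft.ifftn(fft.ifftshift(Z, dim=dims), dim=dims).real

# Iterative refinement loop (sampler calls are hypothetical placeholders):
#   z_T = torch.randn(B, C, T, H, W)
#   lpf = gaussian_low_pass_filter(T, H, W)
#   for _ in range(num_refine_iters):
#       z_0 = ddim_sample(model, z_T, prompt)       # full denoising pass
#       z_noisy = add_noise(z_0, t_max)             # re-diffuse back to step T
#       z_T = freq_mix_3d(z_noisy, torch.randn_like(z_noisy), lpf)
```

The intuition, per the abstract's second finding, is that the low-frequency band of the re-diffused latent carries temporally coherent layout and subject appearance, while the freshly sampled high-frequency band restores the detail diversity expected of the training-time noise distribution.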