Progressive Autoregressive Video Diffusion Models
October 10, 2024
Authors: Desai Xie, Zhan Xu, Yicong Hong, Hao Tan, Difan Liu, Feng Liu, Arie Kaufman, Yang Zhou
cs.AI
Abstract
Current frontier video diffusion models have demonstrated remarkable results
at generating high-quality videos. However, they can only generate short video
clips, typically around 10 seconds or 240 frames, due to computational limitations
during training. In this work, we show that existing models can be naturally
extended to autoregressive video diffusion models without changing the
architectures. Our key idea is to assign progressively increasing noise levels
to the latent frames rather than a single noise level, which allows for
fine-grained conditioning among the latents and large overlaps between the
attention windows. Such progressive video denoising allows our models to
autoregressively generate video frames without quality degradation or abrupt
scene changes. We present state-of-the-art results on long video generation at
1 minute (1440 frames at 24 FPS). Videos from this paper are available at
https://desaixie.github.io/pa-vdm/.
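To make the progressive-noise idea concrete, below is a minimal sketch (not the authors' implementation) of how a sliding window of latent frames at increasing noise levels could be denoised autoregressively. The callable `denoise_step`, the window size, the number of noise levels, and the latent shape are all illustrative assumptions; the sketch only shows the scheduling pattern described in the abstract, where the oldest frame in the window is nearly clean and the newest is pure noise.

```python
import torch

def generate_long_video(denoise_step, num_frames, window=16,
                        latent_shape=(4, 32, 32), device="cpu"):
    """Hypothetical sketch of progressive autoregressive video denoising.

    `denoise_step(latents, noise_levels)` is an assumed pretrained video
    diffusion model that performs one denoising update on each latent
    frame at its own per-frame noise level.
    """
    # Per-frame noise levels increase across the window: the oldest frame
    # (index 0) is almost clean, the newest (index -1) is pure noise.
    noise_levels = torch.linspace(1.0 / window, 1.0, window, device=device)

    # Initialize the window with pure Gaussian noise.
    latents = torch.randn(window, *latent_shape, device=device)

    clean_frames = []
    while len(clean_frames) < num_frames:
        # One denoising update; every frame advances by one noise level.
        latents = denoise_step(latents, noise_levels)

        # The oldest frame is now fully denoised: emit it autoregressively.
        clean_frames.append(latents[0])

        # Slide the window: drop the clean frame, append fresh noise at the
        # end. Consecutive windows therefore overlap on window - 1 frames,
        # which is the large attention-window overlap the abstract refers to.
        latents = torch.cat(
            [latents[1:], torch.randn(1, *latent_shape, device=device)]
        )

    return torch.stack(clean_frames[:num_frames])
```

In this sketch the model architecture is unchanged; only the noise-level assignment differs from standard video diffusion, where all frames in a clip would share a single noise level at each denoising step.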