Progressive Autoregressive Video Diffusion Models
October 10, 2024
Authors: Desai Xie, Zhan Xu, Yicong Hong, Hao Tan, Difan Liu, Feng Liu, Arie Kaufman, Yang Zhou
cs.AI
Abstract
Current frontier video diffusion models have demonstrated remarkable results
at generating high-quality videos. However, they can only generate short video
clips, typically around 10 seconds or 240 frames, due to computational limitations
during training. In this work, we show that existing models can be naturally
extended to autoregressive video diffusion models without changing the
architectures. Our key idea is to assign progressively increasing noise levels
to the latent frames rather than a single noise level, which allows for
fine-grained conditioning among the latents and large overlaps between the
attention windows. Such progressive video denoising allows our models to
autoregressively generate video frames without quality degradation or abrupt
scene changes. We present state-of-the-art results on long video generation at
1 minute (1440 frames at 24 FPS). Videos from this paper are available at
https://desaixie.github.io/pa-vdm/.
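To make the progressive-noise idea concrete, below is a minimal sketch (not the authors' implementation) of how a sliding window of latent frames at increasing noise levels could be denoised autoregressively. The callable `denoise_step`, the window size, the number of noise levels, and the latent shape are all illustrative assumptions; the sketch only shows the scheduling pattern described in the abstract, where the oldest frame in the window is nearly clean and the newest is pure noise.

```python
import torch

def generate_long_video(denoise_step, num_frames, window=16,
                        latent_shape=(4, 32, 32), device="cpu"):
    """Hypothetical sketch of progressive autoregressive video denoising.

    `denoise_step(latents, noise_levels)` is an assumed pretrained video
    diffusion model that performs one denoising update on each latent
    frame at its own per-frame noise level.
    """
    # Per-frame noise levels increase across the window: the oldest frame
    # (index 0) is almost clean, the newest (index -1) is pure noise.
    noise_levels = torch.linspace(1.0 / window, 1.0, window, device=device)

    # Initialize the window with pure Gaussian noise.
    latents = torch.randn(window, *latent_shape, device=device)

    clean_frames = []
    while len(clean_frames) < num_frames:
        # One denoising update; every frame advances by one noise level.
        latents = denoise_step(latents, noise_levels)

        # The oldest frame is now fully denoised: emit it autoregressively.
        clean_frames.append(latents[0])

        # Slide the window: drop the clean frame, append fresh noise at the
        # end. Consecutive windows therefore overlap on window - 1 frames,
        # which is the large attention-window overlap the abstract refers to.
        latents = torch.cat(
            [latents[1:], torch.randn(1, *latent_shape, device=device)]
        )

    return torch.stack(clean_frames[:num_frames])
```

In this sketch the model architecture is unchanged; only the noise-level assignment differs from standard video diffusion, where all frames in a clip would share a single noise level at each denoising step.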