Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach
October 4, 2024
Authors: Yaofang Liu, Yumeng Ren, Xiaodong Cun, Aitor Artola, Yang Liu, Tieyong Zeng, Raymond H. Chan, Jean-Michel Morel
cs.AI
Abstract
Diffusion models have revolutionized image generation, and their extension to
video generation has shown promise. However, current video diffusion
models (VDMs) rely on a scalar timestep variable applied at the clip level,
which limits their ability to model the complex temporal dependencies needed
for tasks like image-to-video generation. To address this limitation, we
propose a frame-aware video diffusion model (FVDM), which introduces a novel
vectorized timestep variable (VTV). Unlike conventional VDMs, our approach
allows each frame to follow an independent noise schedule, enhancing the
model's capacity to capture fine-grained temporal dependencies. FVDM's
flexibility is demonstrated across multiple tasks, including standard video
generation, image-to-video generation, video interpolation, and long video
synthesis. Through a diverse set of VTV configurations, we achieve superior
quality in generated videos, overcoming challenges such as catastrophic
forgetting during fine-tuning and limited generalizability in zero-shot
methods. Our empirical evaluations show that FVDM outperforms state-of-the-art
methods in video generation quality, while also excelling in extended tasks. By
addressing fundamental shortcomings in existing VDMs, FVDM sets a new paradigm
in video synthesis, offering a robust framework with significant implications
for generative modeling and multimedia applications.
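The core idea, a vectorized timestep that lets each frame carry its own noise level, can be illustrated with a minimal forward-diffusion sketch. This is not the paper's implementation; the function and variable names (`add_noise`, `t_clip`, `t_frame`) are illustrative, and a standard DDPM-style linear beta schedule is assumed. The only difference between the two cases is the shape of the timestep tensor: one scalar per clip versus one entry per frame.

```python
import torch

def add_noise(video, t, alphas_cumprod):
    """Forward diffusion q(x_t | x_0) with per-frame timesteps.

    video: (B, F, C, H, W) clean clips
    t:     (B, F) integer timesteps, one per frame
    """
    noise = torch.randn_like(video)
    a_bar = alphas_cumprod[t]                 # (B, F) cumulative alpha per frame
    while a_bar.dim() < video.dim():
        a_bar = a_bar.unsqueeze(-1)           # broadcast over C, H, W
    return a_bar.sqrt() * video + (1.0 - a_bar).sqrt() * noise

# Illustrative DDPM-style schedule (assumption, not from the paper)
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

video = torch.randn(2, 8, 3, 16, 16)          # 2 clips of 8 frames

# Conventional VDM: a single scalar timestep shared by all frames of a clip
t_clip = torch.randint(0, T, (2, 1)).expand(-1, 8)

# FVDM-style vectorized timestep: an independent timestep for every frame
t_frame = torch.randint(0, T, (2, 8))

noisy_clip = add_noise(video, t_clip, alphas_cumprod)
noisy_frames = add_noise(video, t_frame, alphas_cumprod)
```

With `t_frame`, configurations such as image-to-video (first frame kept near t = 0 while later frames are noised) or interpolation (endpoints clean, middle frames noised) fall out of the same training objective simply by choosing the timestep vector.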