

Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition

March 21, 2024
作者: Sihyun Yu, Weili Nie, De-An Huang, Boyi Li, Jinwoo Shin, Anima Anandkumar
cs.AI

Abstract

Video diffusion models have recently made great progress in generation quality but are still limited by high memory and computational requirements, because current video diffusion models typically attempt to process high-dimensional videos directly. To tackle this issue, we propose the content-motion latent diffusion model (CMD), a novel, efficient extension of pretrained image diffusion models for video generation. Specifically, we propose an autoencoder that succinctly encodes a video as the combination of a content frame (resembling an image) and a low-dimensional motion latent representation; the former captures the content common across frames, while the latter captures the underlying motion in the video. We generate the content frame by fine-tuning a pretrained image diffusion model, and we generate the motion latent representation by training a new lightweight diffusion model. A key innovation is the design of a compact latent space that can directly utilize a pretrained image diffusion model, which previous latent video diffusion models had not achieved. This leads to considerably better generation quality and reduced computational cost. For instance, CMD can sample a 512×1024-resolution, 16-frame video in 3.1 seconds, 7.7× faster than prior approaches. Moreover, CMD achieves an FVD score of 212.7 on WebVid-10M, 27.3% better than the previous state of the art of 292.4.
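To make the content-frame/motion-latent split concrete, here is a minimal NumPy sketch of the *shape* of such a decomposition. It is not the paper's learned autoencoder: the content frame here is a plain temporal average and the motion latent a fixed random projection of per-frame residuals, whereas CMD learns both; the function name and dimensions are illustrative assumptions.

```python
import numpy as np

def decompose_video(video, motion_dim=8, rng=None):
    """Toy stand-in for a CMD-style decomposition (illustrative only):
    split a video of shape (T, H, W, C) into one image-like content
    frame and a low-dimensional per-frame motion latent."""
    rng = np.random.default_rng(0) if rng is None else rng
    T, H, W, C = video.shape
    # Content frame: uniform average over time; the real model learns
    # a weighted combination of frames instead.
    content_frame = video.mean(axis=0)                      # (H, W, C)
    # Motion: per-frame residuals mapped to a small latent per frame
    # via a fixed random linear projection (a learned encoder in CMD).
    residuals = (video - content_frame).reshape(T, -1)      # (T, H*W*C)
    proj = rng.standard_normal((residuals.shape[1], motion_dim))
    proj /= np.linalg.norm(proj, axis=0, keepdims=True)
    motion_latent = residuals @ proj                        # (T, motion_dim)
    return content_frame, motion_latent

# Toy clip: 16 frames of 32x64 RGB, as in a length-16 video.
video = np.random.rand(16, 32, 64, 3)
content, motion = decompose_video(video)
```

The point of the sketch is the dimensionality gap: the motion latent (`16 × 8` values here) is tiny compared with the raw video, which is what lets the heavy image diffusion model handle only the single content frame while a lightweight model handles motion.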

