Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition
March 21, 2024
Authors: Sihyun Yu, Weili Nie, De-An Huang, Boyi Li, Jinwoo Shin, Anima Anandkumar
cs.AI
Abstract
Video diffusion models have recently made great progress in generation
quality, but are still limited by the high memory and computational
requirements. This is because current video diffusion models often attempt to
process high-dimensional videos directly. To tackle this issue, we propose
content-motion latent diffusion model (CMD), a novel efficient extension of
pretrained image diffusion models for video generation. Specifically, we
propose an autoencoder that succinctly encodes a video as a combination of a
content frame (like an image) and a low-dimensional motion latent
representation. The former represents the common content, and the latter
represents the underlying motion in the video, respectively. We generate the
content frame by fine-tuning a pretrained image diffusion model, and we
generate the motion latent representation by training a new lightweight
diffusion model. A key innovation here is the design of a compact latent space
that can directly utilize a pretrained image diffusion model, which has not
been done in previous latent video diffusion models. This leads to considerably
better quality generation and reduced computational costs. For instance, CMD
can sample a video 7.7× faster than prior approaches by generating a
video of 512×1024 resolution and length 16 in 3.1 seconds. Moreover, CMD
achieves an FVD score of 212.7 on WebVid-10M, 27.3% better than the previous
state-of-the-art of 292.4.
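To make the content-frame / motion-latent decomposition concrete, below is a minimal, hypothetical PyTorch sketch of a toy autoencoder that splits a video into an image-like content frame and a low-dimensional per-frame motion latent, in the spirit described in the abstract. The module names, the linear encoders/decoders, and the softmax-weighted content frame are illustrative assumptions, not the paper's actual CMD architecture or training objective.

```python
# Hypothetical sketch (not the authors' code): a toy autoencoder that encodes a
# video as (content frame, motion latent) and decodes it back, mirroring the
# decomposition described in the abstract. All shapes and module choices are
# assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyContentMotionAE(nn.Module):
    def __init__(self, channels: int = 3, height: int = 64, width: int = 64,
                 motion_dim: int = 64):
        super().__init__()
        frame_pixels = channels * height * width
        # Scores one scalar per frame; a softmax over time makes the content
        # frame a weighted average of the input frames (so it stays image-like).
        self.weight_head = nn.Linear(frame_pixels, 1)
        # Compresses each frame into a low-dimensional motion code.
        self.motion_head = nn.Linear(frame_pixels, motion_dim)
        # Decodes (content frame, per-frame motion code) back into a frame.
        self.decode_head = nn.Linear(frame_pixels + motion_dim, frame_pixels)

    def encode(self, video: torch.Tensor):
        # video: (B, T, C, H, W)
        flat = video.flatten(2)                              # (B, T, C*H*W)
        weights = F.softmax(self.weight_head(flat), dim=1)   # (B, T, 1)
        content = (weights * flat).sum(dim=1)                # (B, C*H*W)
        motion = self.motion_head(flat)                      # (B, T, motion_dim)
        return content, motion

    def decode(self, content: torch.Tensor, motion: torch.Tensor):
        t = motion.shape[1]
        content_rep = content.unsqueeze(1).expand(-1, t, -1)        # (B, T, C*H*W)
        frames = self.decode_head(torch.cat([content_rep, motion], dim=-1))
        return frames                                        # (B, T, C*H*W)


# Usage: round-trip a random 16-frame clip through the toy autoencoder.
video = torch.randn(2, 16, 3, 64, 64)
ae = ToyContentMotionAE()
content, motion = ae.encode(video)
recon = ae.decode(content, motion).view(video.shape)
```

In the paper's two-stage setup, the content frame would be generated by a fine-tuned pretrained image diffusion model and the motion latent by a separate lightweight diffusion model; the sketch above only illustrates the latent split itself.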