OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model
September 2, 2024
Authors: Liuhan Chen, Zongjian Li, Bin Lin, Bin Zhu, Qian Wang, Shenghai Yuan, Xing Zhou, Xinghua Cheng, Li Yuan
cs.AI
Abstract
A Variational Autoencoder (VAE), which compresses videos into latent
representations, is a crucial preceding component of Latent Video Diffusion
Models (LVDMs). At the same reconstruction quality, the more thoroughly the
VAE compresses videos, the more efficient the LVDMs are. However, most LVDMs
use a 2D image VAE, which compresses videos only in the spatial dimension and
typically ignores the temporal dimension. How to perform temporal compression
in a VAE to obtain more compact latent representations while ensuring
accurate reconstruction is seldom explored. To fill this gap, we propose an
omni-dimensional compression VAE, named OD-VAE, which compresses videos both
temporally and spatially. Although OD-VAE's stronger compression makes video
reconstruction much more challenging, it still achieves high reconstruction
accuracy thanks to its careful design. To obtain a better trade-off between
video reconstruction quality and compression speed, four variants of OD-VAE
are introduced and analyzed. In addition, a novel tail initialization is
designed to train OD-VAE more efficiently, and a novel inference strategy is
proposed to enable OD-VAE to handle videos of arbitrary length with limited
GPU memory. Comprehensive experiments on video reconstruction and LVDM-based
video generation demonstrate the effectiveness and efficiency of our proposed
methods.
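
As a rough illustration of why temporal compression matters for LVDM
efficiency, the sketch below counts the latent positions a diffusion model
would have to process under spatial-only versus spatial-plus-temporal
compression. The clip size, the 8x spatial factor, the 4x temporal factor, and
the function names are all assumptions chosen for illustration; the abstract
does not state OD-VAE's actual compression factors, and real VAEs may treat
boundary frames differently from this plain ceiling division.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, so partial groups still occupy one latent slot."""
    return -(-a // b)


def latent_shape(frames: int, height: int, width: int,
                 t_factor: int, s_factor: int) -> tuple:
    """Latent (frames, height, width) under simple strided downsampling.

    Hypothetical helper: t_factor and s_factor are assumed compression
    factors, not values taken from the paper.
    """
    return (ceil_div(frames, t_factor),
            ceil_div(height, s_factor),
            ceil_div(width, s_factor))


def num_positions(shape: tuple) -> int:
    """Number of spatio-temporal latent positions."""
    t, h, w = shape
    return t * h * w


# Hypothetical 64-frame, 256x256 clip.
frames, height, width = 64, 256, 256

# Spatial-only compression, as with a 2D image VAE applied frame by frame.
spatial_only = latent_shape(frames, height, width, t_factor=1, s_factor=8)

# Spatial and temporal compression, as an omni-dimensional VAE would apply.
omni = latent_shape(frames, height, width, t_factor=4, s_factor=8)

reduction = num_positions(spatial_only) / num_positions(omni)
print(spatial_only, omni, reduction)
```

Under these assumed factors the latent sequence shrinks by the temporal
factor alone, which is the saving a spatial-only VAE leaves unused.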