OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model
September 2, 2024
Authors: Liuhan Chen, Zongjian Li, Bin Lin, Bin Zhu, Qian Wang, Shenghai Yuan, Xing Zhou, Xinghua Cheng, Li Yuan
cs.AI
Abstract
Variational Autoencoders (VAEs), which compress videos into latent
representations, are a crucial upstream component of Latent Video Diffusion
Models (LVDMs). At the same reconstruction quality, the more thoroughly the
VAE compresses videos, the more efficient the LVDMs are. However, most LVDMs
use a 2D image VAE, which compresses videos only in the spatial dimension,
while the temporal dimension is often ignored. How to compress videos
temporally in a VAE, obtaining more concise latent representations while
still ensuring accurate reconstruction, has seldom been explored. To fill this
gap, we propose an omni-dimensional compression VAE, named OD-VAE, which
compresses videos both temporally and spatially. Although OD-VAE's more
thorough compression makes video reconstruction considerably harder, our
careful design still achieves high reconstruction accuracy. To obtain a
better trade-off between video reconstruction quality and compression speed,
we introduce and analyze four variants of OD-VAE. In addition, we design a
novel tail initialization to train OD-VAE more efficiently, and propose a
novel inference strategy that enables OD-VAE to handle videos of arbitrary
length with limited GPU memory. Comprehensive experiments on video
reconstruction and LVDM-based video generation demonstrate the effectiveness
and efficiency of the proposed methods.
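The efficiency argument in the abstract comes down to latent size: fewer latent elements per video means less work for the diffusion model. The sketch below makes that arithmetic concrete for a causal video VAE. The downsampling factors (4x in time, 8x8 in space), the 4 latent channels, and the rule that the first frame is encoded on its own (so inputs have 1 + 4k frames) are illustrative assumptions, not figures stated in this abstract; `latent_shape` and `num_elements` are hypothetical helper names.

```python
import math
from typing import Tuple


def latent_shape(frames: int, height: int, width: int,
                 t_factor: int = 4, s_factor: int = 8,
                 channels: int = 4) -> Tuple[int, int, int, int]:
    """Latent (C, T, H, W) for a causal video VAE (assumed design).

    The first frame is encoded on its own and every following group of
    t_factor frames becomes one latent frame, so the input must have
    1 + t_factor * k frames. t_factor=1 models a 2D image VAE applied
    frame by frame (spatial compression only).
    """
    if (frames - 1) % t_factor != 0:
        raise ValueError(f"frames must be 1 + {t_factor} * k")
    if height % s_factor or width % s_factor:
        raise ValueError(f"height and width must be divisible by {s_factor}")
    return (channels, 1 + (frames - 1) // t_factor,
            height // s_factor, width // s_factor)


def num_elements(shape: Tuple[int, ...]) -> int:
    """Total number of scalars in a tensor of the given shape."""
    return math.prod(shape)


if __name__ == "__main__":
    video = (3, 17, 256, 256)  # RGB, 17 frames, 256x256 pixels
    for name, tf in [("2D image VAE", 1), ("temporal+spatial VAE", 4)]:
        z = latent_shape(17, 256, 256, t_factor=tf)
        # Compression ratio = input scalars / latent scalars.
        print(name, z, num_elements(video) / num_elements(z))
```

Under these assumptions the frame-wise 2D VAE compresses the 17-frame clip by a factor of 48, while adding 4x temporal compression raises the factor to about 163, shrinking the latent the diffusion model must process by roughly 3.4x.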
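The abstract names a "tail initialization" without describing it. One plausible reading, sketched here purely as an assumption, is that each causal 3D convolution kernel is initialized from a pretrained 2D VAE kernel by copying the 2D weights onto the last (tail) temporal tap, which in a causal convolution is the current frame, and zeros onto the earlier taps. The freshly initialized 3D layer then behaves exactly like the 2D layer applied frame by frame, and training only has to learn the temporal taps. The single-channel frames, the replicate-first-frame padding, and the names `tail_init` and `causal_conv3d` are hypothetical simplifications.

```python
from typing import List

Frame = List[List[float]]     # H x W, single channel (simplifying assumption)
Kernel2D = List[List[float]]  # kh x kw, odd sizes


def conv2d_same(x: Frame, k: Kernel2D) -> Frame:
    """Zero-padded 'same' 2D cross-correlation, as computed by conv layers."""
    H, W = len(x), len(x[0])
    kh, kw = len(k), len(k[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            for a in range(kh):
                for b in range(kw):
                    ii, jj = i + a - ph, j + b - pw
                    if 0 <= ii < H and 0 <= jj < W:
                        s += k[a][b] * x[ii][jj]
            out[i][j] = s
    return out


def tail_init(k2d: Kernel2D, kt: int) -> List[Kernel2D]:
    """Hypothetical tail initialization: zeros on the kt-1 past taps and a
    copy of the 2D kernel on the last tap (the current frame)."""
    h, w = len(k2d), len(k2d[0])
    past = [[[0.0] * w for _ in range(h)] for _ in range(kt - 1)]
    return past + [[row[:] for row in k2d]]


def causal_conv3d(video: List[Frame], k3d: List[Kernel2D]) -> List[Frame]:
    """Causal convolution over time: output t sees input frames t-kt+1 .. t.
    The missing past is filled by repeating frame 0 (assumed padding rule)."""
    kt = len(k3d)
    H, W = len(video[0]), len(video[0][0])
    padded = [video[0]] * (kt - 1) + video
    out = []
    for t in range(len(video)):
        acc = [[0.0] * W for _ in range(H)]
        for d in range(kt):  # tap d == kt - 1 is the current frame
            y = conv2d_same(padded[t + d], k3d[d])
            for i in range(H):
                for j in range(W):
                    acc[i][j] += y[i][j]
        out.append(acc)
    return out
```

With `tail_init`, `causal_conv3d(video, tail_init(k2d, kt))` equals applying `conv2d_same(frame, k2d)` to every frame, so the pretrained 2D behaviour is preserved at the start of training. Centering the 2D weights on the middle tap instead would also reproduce per-frame behaviour only for a non-causal (symmetric) convolution, which is presumably why the tail is the natural choice for causal layers.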
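The abstract also mentions an inference strategy for videos of arbitrary length under limited GPU memory, without giving details. A common way to realize that goal is temporal tiling, sketched below as an assumption rather than as the paper's procedure: pad the video to 1 + 4k frames by repeating the last frame, split it into chunks that share one boundary frame, encode each chunk independently, and drop the duplicated latent at each seam, so peak memory depends on the chunk size rather than the video length. `pad_to_causal_length`, `plan_chunks`, `encode_chunked`, and the stand-in encoder `toy_encode` are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

F = TypeVar("F")  # a frame (any type)
Z = TypeVar("Z")  # a latent frame (any type)


def pad_to_causal_length(frames: Sequence[F], t_factor: int = 4) -> List[F]:
    """Repeat the last frame until len == 1 + t_factor * k (assumed rule)."""
    if not frames:
        raise ValueError("empty video")
    extra = (-(len(frames) - 1)) % t_factor
    return list(frames) + [frames[-1]] * extra


def plan_chunks(num_frames: int, latents_per_chunk: int,
                t_factor: int = 4) -> List[Tuple[int, int]]:
    """Frame ranges [start, end) of at most 1 + latents_per_chunk * t_factor
    frames; consecutive chunks share exactly one boundary frame."""
    if num_frames < 1 or (num_frames - 1) % t_factor != 0:
        raise ValueError("num_frames must be 1 + t_factor * k")
    if num_frames == 1:
        return [(0, 1)]
    step = latents_per_chunk * t_factor
    chunks, start = [], 0
    while start < num_frames - 1:
        end = min(start + step + 1, num_frames)
        chunks.append((start, end))
        start = end - 1  # next chunk starts on the shared boundary frame
    return chunks


def encode_chunked(frames: Sequence[F],
                   encode: Callable[[List[F]], List[Z]],
                   latents_per_chunk: int = 4,
                   t_factor: int = 4) -> List[Z]:
    """Encode chunk by chunk so only one chunk is processed at a time."""
    padded = pad_to_causal_length(frames, t_factor)
    latents: List[Z] = []
    for idx, (s, e) in enumerate(plan_chunks(len(padded), latents_per_chunk, t_factor)):
        z = encode(padded[s:e])
        # A later chunk's first latent re-encodes the shared boundary frame; drop it.
        latents.extend(z if idx == 0 else z[1:])
    return latents


def toy_encode(chunk: List[float], t_factor: int = 4) -> List[float]:
    """Hypothetical stand-in encoder: frame 0 alone, then the mean of each
    following group of t_factor frames (no receptive field across groups)."""
    out = [chunk[0]]
    for i in range(1, (len(chunk) - 1) // t_factor + 1):
        group = chunk[t_factor * (i - 1) + 1: t_factor * i + 1]
        out.append(sum(group) / t_factor)
    return out
```

Because `toy_encode` never mixes frames across groups, chunked and whole-video encoding agree exactly here. A real encoder's causal convolutions have receptive fields that cross chunk boundaries, so a practical version would need extra overlap or cached context, and its chunked output would only approximate whole-video encoding.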