

Efficient Video Diffusion Models via Content-Frame Motion-Latent Decomposition

March 21, 2024
Authors: Sihyun Yu, Weili Nie, De-An Huang, Boyi Li, Jinwoo Shin, Anima Anandkumar
cs.AI

Abstract

Video diffusion models have recently made great progress in generation quality, but they are still limited by high memory and computational requirements. This is because current video diffusion models often attempt to process high-dimensional videos directly. To tackle this issue, we propose the content-motion latent diffusion model (CMD), a novel, efficient extension of pretrained image diffusion models for video generation. Specifically, we propose an autoencoder that succinctly encodes a video as a combination of a content frame (like an image) and a low-dimensional motion latent representation. The former represents the common content of the video, and the latter represents its underlying motion. We generate the content frame by fine-tuning a pretrained image diffusion model, and we generate the motion latent representation by training a new lightweight diffusion model. A key innovation here is the design of a compact latent space that can directly utilize a pretrained image diffusion model, which has not been done in previous latent video diffusion models. This leads to considerably better generation quality and reduced computational costs. For instance, CMD can sample a video 7.7× faster than prior approaches, generating a video of 512×1024 resolution and length 16 in 3.1 seconds. Moreover, CMD achieves an FVD score of 212.7 on WebVid-10M, 27.3% better than the previous state-of-the-art of 292.4.
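The decomposition described in the abstract can be pictured with a small PyTorch sketch: a toy autoencoder that collapses a clip into a single weighted-average content frame plus a per-frame low-dimensional motion latent, and reconstructs frames by conditioning on both. The module names, shapes, and the simple softmax-averaging scheme below are illustrative assumptions, not the paper's actual architecture, and the two diffusion models that generate the content frame and motion latents are omitted.

```python
# Minimal sketch of a CMD-style content-frame / motion-latent autoencoder.
# All modules, shapes, and the averaging scheme are hypothetical stand-ins.
import torch
import torch.nn as nn


class ContentMotionAutoencoder(nn.Module):
    """Encodes a video (B, T, C, H, W) into a content frame (B, C, H, W)
    and a low-dimensional motion latent (B, T, motion_dim)."""

    def __init__(self, channels: int = 3, motion_dim: int = 64):
        super().__init__()
        # Produces one scalar weight per frame; the content frame is a
        # softmax-weighted average of frames over time.
        self.weight_net = nn.Sequential(
            nn.Conv2d(channels, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1),
        )
        # Compresses each frame into a small motion vector.
        self.motion_enc = nn.Sequential(
            nn.Conv2d(channels, 16, 4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(16 * 4 * 4, motion_dim),
        )
        # Reconstructs each frame from (content frame, motion vector).
        self.decoder = nn.Sequential(
            nn.Conv2d(channels + motion_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def encode(self, video: torch.Tensor):
        b, t, c, h, w = video.shape
        frames = video.reshape(b * t, c, h, w)
        # Content frame: weighted average of the frames along time.
        logits = self.weight_net(frames).reshape(b, t, 1, 1, 1)
        weights = torch.softmax(logits, dim=1)
        content = (weights * video).sum(dim=1)               # (B, C, H, W)
        motion = self.motion_enc(frames).reshape(b, t, -1)   # (B, T, D)
        return content, motion

    def decode(self, content: torch.Tensor, motion: torch.Tensor):
        b, t, d = motion.shape
        c, h, w = content.shape[1:]
        # Broadcast the content frame across time, condition on motion.
        content_rep = content.unsqueeze(1).expand(b, t, c, h, w)
        motion_map = motion.reshape(b, t, d, 1, 1).expand(b, t, d, h, w)
        x = torch.cat([content_rep, motion_map], dim=2)
        x = x.reshape(b * t, c + d, h, w)
        return self.decoder(x).reshape(b, t, c, h, w)


if __name__ == "__main__":
    video = torch.randn(2, 16, 3, 64, 64)  # toy clip: 16 frames per sample
    ae = ContentMotionAutoencoder()
    content, motion = ae.encode(video)
    recon = ae.decode(content, motion)
    print(content.shape, motion.shape, recon.shape)
    # torch.Size([2, 3, 64, 64]) torch.Size([2, 16, 64]) torch.Size([2, 16, 3, 64, 64])
```

In this sketch the content frame keeps the full image resolution (so a pretrained image diffusion model could be fine-tuned on it), while the motion latent is far smaller than the raw video, which is what makes a separate lightweight diffusion model over it cheap to train and sample.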
