ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation

Abstract

Video generation has made remarkable progress in recent years, especially since the advent of video diffusion models. Many video generation models, e.g., Stable Video Diffusion (SVD), can produce plausible synthetic videos. However, most video models can only generate low-frame-rate videos because of limited GPU memory and the difficulty of modeling a large set of frames. The training videos are always uniformly sampled at a specified interval for temporal compression. Previous methods increase the frame rate either by training a video interpolation model in pixel space as a postprocessing stage or by training an interpolation model in latent space for a specific base video model. In this paper, we propose a training-free video interpolation method for generative video diffusion models, which generalizes to different models in a plug-and-play manner. We investigate the non-linearity in the feature space of video diffusion models and transform a video model into a self-cascaded video diffusion model by incorporating the designed hidden state correction modules. The self-cascaded architecture and the correction module are proposed to retain the temporal consistency between key frames and the interpolated frames. Extensive evaluations are performed on multiple popular video models to demonstrate the effectiveness of the proposed method; notably, our training-free method is even comparable to trained interpolation models supported by huge compute resources and large-scale datasets.
