ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning
June 20, 2024
Authors: Zhongjie Duan, Wenmeng Zhou, Cen Chen, Yaliang Li, Weining Qian
cs.AI
Abstract
Recently, advancements in video synthesis have attracted significant
attention. Video synthesis models such as AnimateDiff and Stable Video
Diffusion have demonstrated the practical applicability of diffusion models in
creating dynamic visual content. The emergence of SORA has further spotlighted
the potential of video generation technologies. Nonetheless, the extension of
video lengths has been constrained by the limitations in computational
resources. Most existing video synthesis models can only generate short video
clips. In this paper, we propose a novel post-tuning methodology for video
synthesis models, called ExVideo. This approach is designed to enhance the
capability of current video synthesis models, allowing them to produce content
over extended temporal durations while incurring lower training expenditures.
In particular, we design extension strategies tailored to each of the common
temporal model architectures, including 3D convolution, temporal attention, and
positional embedding. To evaluate the efficacy of our proposed post-tuning
approach, we conduct extension training on the Stable Video Diffusion model.
Our approach augments the model's capacity to generate up to 5× its
original number of frames, requiring only 1.5k GPU hours of training on a
dataset comprising 40k videos. Importantly, the substantial increase in video
length does not compromise the model's innate generalization capabilities, and
the model showcases its advantages in generating videos of diverse styles and
resolutions. We will release the source code and the enhanced model publicly.
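
The abstract names extension strategies for three temporal modules: 3D convolution, temporal attention, and positional embedding. As a minimal sketch of the general idea for the positional-embedding case, not the authors' released code, the snippet below shows one plausible way to stretch a learned temporal positional embedding by linear interpolation, so that the longer layer can be initialized from the short one and then post-tuned cheaply. The function name, embedding dimension, and exact frame counts are illustrative assumptions.

import torch
import torch.nn.functional as F

def extend_positional_embedding(pos_emb: torch.Tensor, new_frames: int) -> torch.Tensor:
    # pos_emb: learned temporal positional embedding of shape (frames, dim).
    # Reshape to (1, dim, frames) so F.interpolate treats time as the length axis.
    emb = pos_emb.transpose(0, 1).unsqueeze(0)
    # Linearly stretch the embedding along the temporal axis to new_frames.
    emb = F.interpolate(emb, size=new_frames, mode="linear", align_corners=True)
    # Restore the (new_frames, dim) layout; the result initializes the
    # extended layer, which is then fine-tuned (post-tuned) on video data.
    return emb.squeeze(0).transpose(0, 1)

# Example: stretch a 25-frame embedding (Stable Video Diffusion's default
# frame count) to 5x coverage, here 125 frames.
original = torch.randn(25, 1024)
extended = extend_positional_embedding(original, 125)
print(extended.shape)  # torch.Size([125, 1024])

Initializing the extended layer by interpolation rather than from scratch preserves the short model's learned temporal structure, which is consistent with the abstract's claim that post-tuning keeps training costs low while retaining generalization.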