
ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning

June 20, 2024
Authors: Zhongjie Duan, Wenmeng Zhou, Cen Chen, Yaliang Li, Weining Qian
cs.AI

Abstract

Recently, advancements in video synthesis have attracted significant attention. Video synthesis models such as AnimateDiff and Stable Video Diffusion have demonstrated the practical applicability of diffusion models in creating dynamic visual content. The emergence of SORA has further spotlighted the potential of video generation technologies. Nonetheless, the extension of video lengths has been constrained by limited computational resources, and most existing video synthesis models can only generate short video clips. In this paper, we propose ExVideo, a novel post-tuning methodology for video synthesis models. This approach is designed to enhance the capability of current video synthesis models, allowing them to produce content over extended temporal durations while incurring lower training expenditures. In particular, we design extension strategies for common temporal model architectures, including 3D convolution, temporal attention, and positional embedding. To evaluate the efficacy of the proposed post-tuning approach, we conduct extension training on the Stable Video Diffusion model. Our approach increases the model's capacity to generate up to 5× its original number of frames, requiring only 1.5k GPU hours of training on a dataset comprising 40k videos. Importantly, the substantial increase in video length does not compromise the model's innate generalization capabilities, and the model showcases its advantages in generating videos of diverse styles and resolutions. We will release the source code and the enhanced model publicly.
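The abstract names three temporal components that must be adapted when the frame count grows: 3D convolution, temporal attention, and positional embedding. As a rough illustration of the positional-embedding case only, the sketch below linearly interpolates a learned temporal embedding to a longer frame count. This is a generic length-extension technique written for illustration, not necessarily the specific strategy used in the ExVideo paper; the function name, embedding dimension, and frame counts are hypothetical.

```python
# Hypothetical sketch: stretching a learned temporal positional embedding
# so a video model trained on short clips can address more frames.
import torch
import torch.nn.functional as F

def extend_positional_embedding(pos_emb: torch.Tensor, new_len: int) -> torch.Tensor:
    """Interpolate a (old_len, dim) temporal embedding to (new_len, dim)."""
    # Reshape to (1, dim, old_len) so F.interpolate treats time as the 1D axis.
    emb = pos_emb.transpose(0, 1).unsqueeze(0)
    emb = F.interpolate(emb, size=new_len, mode="linear", align_corners=True)
    # Back to (new_len, dim).
    return emb.squeeze(0).transpose(0, 1)

# Illustrative numbers only: e.g., stretch a 25-frame embedding to 128 frames,
# roughly the ~5x extension the abstract reports for Stable Video Diffusion.
old_emb = torch.randn(25, 1024)
new_emb = extend_positional_embedding(old_emb, 128)
print(new_emb.shape)  # torch.Size([128, 1024])
```

The appeal of an interpolation-style initialization is that the extended embedding stays close to the pretrained one, so subsequent post-tuning can be brief and parameter-efficient rather than retraining the temporal layers from scratch.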
