EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture
May 29, 2024
Authors: Jiaqi Xu, Xinyi Zou, Kunzhe Huang, Yunkuo Chen, Bo Liu, MengLi Cheng, Xing Shi, Jun Huang
cs.AI
Abstract
This paper presents EasyAnimate, an advanced method for video generation that
leverages the power of transformer architecture for high-performance outcomes.
We have expanded the DiT framework originally designed for 2D image synthesis
to accommodate the complexities of 3D video generation by incorporating a
motion module block. It is used to capture temporal dynamics, thereby ensuring
the production of consistent frames and seamless motion transitions. The motion
module can be adapted to various DiT baseline methods to generate video with
different styles. It can also generate videos with different frame rates and
resolutions during both training and inference phases, suitable for both images
and videos. Moreover, we introduce slice VAE, a novel approach to condense the
temporal axis, facilitating the generation of long duration videos. Currently,
EasyAnimate exhibits the proficiency to generate videos with 144 frames. We
provide a holistic ecosystem for video production based on DiT, encompassing
aspects such as data pre-processing, VAE training, DiT model training (both
the baseline and LoRA models), and end-to-end video inference. Code is
available at: https://github.com/aigc-apps/EasyAnimate. We are continuously
working to enhance the performance of our method.
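The abstract describes Slice VAE only at a high level. As a minimal sketch of the general idea it names (condensing a long video by splitting its temporal axis into chunks that can be encoded independently), assuming a fixed slice length and a `(T, H, W, C)` video array — the function name and parameters below are illustrative, not the paper's API:

```python
import numpy as np

def slice_along_time(video, slice_len=8):
    """Split a (T, H, W, C) video into temporal slices.

    Each slice could then be compressed independently by a VAE
    encoder, bounding memory use regardless of total video length.
    """
    return [video[i:i + slice_len] for i in range(0, video.shape[0], slice_len)]

# 144 frames, matching the longest videos reported in the abstract.
video = np.zeros((144, 32, 32, 3), dtype=np.float32)
slices = slice_along_time(video, slice_len=8)
print(len(slices))            # 18 slices
print(slices[0].shape)        # (8, 32, 32, 3)
```

Processing slices rather than the whole clip is what makes the temporal axis tractable for long videos; the actual EasyAnimate implementation (see the linked repository) handles details such as boundary frames and latent stitching that this sketch omits.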