CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
August 12, 2024
Authors: Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, Jie Tang
cs.AI
Abstract
We introduce CogVideoX, a large-scale diffusion transformer model designed
for generating videos based on text prompts. To efficiently model video data, we
propose to leverage a 3D Variational Autoencoder (VAE) to compress videos along
both spatial and temporal dimensions. To improve the text-video alignment, we
propose an expert transformer with the expert adaptive LayerNorm to facilitate
the deep fusion between the two modalities. By employing a progressive training
technique, CogVideoX is adept at producing coherent, long-duration videos
characterized by significant motions. In addition, we develop an effective
text-video data processing pipeline that includes various data preprocessing
strategies and a video captioning method. It significantly helps enhance the
performance of CogVideoX, improving both generation quality and semantic
alignment. Results show that CogVideoX demonstrates state-of-the-art
performance across both multiple machine metrics and human evaluations. The
model weights of both the 3D Causal VAE and CogVideoX are publicly available at
https://github.com/THUDM/CogVideo.
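The expert adaptive LayerNorm described in the abstract can be illustrated with a minimal sketch: text tokens and video tokens each get their own scale and shift applied after normalization, so the two modalities can be processed in one shared transformer without one distribution dominating the other. In the actual model these parameters are predicted from the diffusion timestep embedding; the per-modality dictionaries and function names below are illustrative assumptions, not the paper's implementation.

```python
import math

def layer_norm(x, eps=1e-5):
    # Normalize a vector to zero mean and unit variance.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def expert_adaln(tokens, modality, scale, shift):
    # Expert adaptive LayerNorm (sketch): each modality ("text" or
    # "video") carries its own scale/shift expert. In CogVideoX these
    # are conditioned on the timestep embedding; here they are fixed
    # numbers purely for illustration.
    s, b = scale[modality], shift[modality]
    return [[s * v + b for v in layer_norm(t)] for t in tokens]

# Hypothetical usage: text and video token lists normalized with
# separate expert parameters before entering shared attention.
scale = {"text": 2.0, "video": 1.0}
shift = {"text": 0.5, "video": 0.0}
text_out = expert_adaln([[1.0, 2.0, 3.0]], "text", scale, shift)
video_out = expert_adaln([[4.0, 5.0, 6.0]], "video", scale, shift)
```

Because LayerNorm zeroes the mean, the post-norm mean of each token equals that modality's shift, which is one simple way to see how the two experts keep the modalities on separate, learnable statistics.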