CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
August 12, 2024
作者: Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, Jie Tang
cs.AI
Abstract
We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos based on text prompts. To efficiently model video data, we propose to leverage a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions. To improve text-video alignment, we propose an expert transformer with expert adaptive LayerNorm to facilitate deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motion. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method, which significantly enhances the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across multiple machine metrics and human evaluations. The model weights of both the 3D Causal VAE and CogVideoX are publicly available at https://github.com/THUDM/CogVideo.
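
The abstract names two architectural ideas that are easy to illustrate in code. Below is a minimal PyTorch sketch of the temporally causal convolution at the heart of a 3D causal VAE encoder: padding is applied only toward past frames, so the encoding of frame t never depends on later frames. The class name, default stride, and channel arguments are illustrative assumptions, not the released CogVideoX implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    """3D convolution that is causal along the time axis: temporal padding
    covers only past frames, so output frame t never sees frames > t."""
    def __init__(self, in_ch, out_ch, kernel=3, stride=(2, 2, 2)):
        super().__init__()
        self.time_pad = kernel - 1  # pad past frames only
        self.conv = nn.Conv3d(in_ch, out_ch, kernel, stride=stride,
                              padding=(0, kernel // 2, kernel // 2))

    def forward(self, x):  # x: (B, C, T, H, W)
        x = F.pad(x, (0, 0, 0, 0, self.time_pad, 0))  # left-pad time only
        # Downsamples T, H, W by 2x each; stacking such layers yields the
        # overall spatiotemporal compression ratio of the VAE.
        return self.conv(x)
```

The expert adaptive LayerNorm can be sketched in the same spirit: text and vision tokens pass through shared transformer weights, but each modality receives its own scale/shift modulation computed from the conditioning (e.g., diffusion timestep) embedding. Again, the names and shapes below are assumptions for illustration.

```python
class ExpertAdaLN(nn.Module):
    """Per-modality adaptive LayerNorm: one shared LayerNorm, but separate
    (scale, shift) 'experts' for the text and vision token segments."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.text_mod = nn.Linear(cond_dim, 2 * dim)    # text expert
        self.vision_mod = nn.Linear(cond_dim, 2 * dim)  # vision expert

    def forward(self, x, cond, n_text):
        # x: (B, L, D); the first n_text tokens are text, the rest are video.
        t_scale, t_shift = self.text_mod(cond).chunk(2, dim=-1)
        v_scale, v_shift = self.vision_mod(cond).chunk(2, dim=-1)
        h = self.norm(x)
        text = h[:, :n_text] * (1 + t_scale[:, None]) + t_shift[:, None]
        vision = h[:, n_text:] * (1 + v_scale[:, None]) + v_shift[:, None]
        return torch.cat([text, vision], dim=1)
```

Keeping a single token sequence with per-modality normalization, rather than two separate towers, is what lets attention fuse the modalities deeply while still respecting their different feature statistics.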