CogVideoX: Text-to-Video Diffusion Models with an Expert Transformer

Abstract

We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos from text prompts. To model video data efficiently, we propose to leverage a 3D Variational Autoencoder (VAE) that compresses videos along both the spatial and temporal dimensions. To improve text-video alignment, we propose an expert transformer with expert adaptive LayerNorm, which facilitates deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motion. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method. This pipeline significantly enhances the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX achieves state-of-the-art performance on multiple machine metrics as well as in human evaluations. The model weights of both the 3D Causal VAE and CogVideoX are publicly available at https://github.com/THUDM/CogVideo.
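To make the benefit of compressing along both the spatial and temporal dimensions concrete, the following minimal sketch counts how many positions a downstream model would have to process before and after compression. The clip size and the compression factors (4 in time, 8 along each spatial axis) are assumptions chosen only for illustration, and `VideoShape` and `latent_shape` are hypothetical helpers, not part of the released code. The sketch ignores channel counts and any special handling of the first frame that a causal encoder might apply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoShape:
    """Frame count and spatial size of a clip or of its latent (hypothetical helper)."""
    frames: int
    height: int
    width: int

    @property
    def positions(self) -> int:
        # Number of spatio-temporal positions, ignoring channels.
        return self.frames * self.height * self.width


def latent_shape(video: VideoShape, temporal_factor: int, spatial_factor: int) -> VideoShape:
    """Shape after downsampling time by temporal_factor and each spatial axis by spatial_factor.

    Assumes every dimension divides evenly; a real encoder may pad instead.
    """
    if (video.frames % temporal_factor
            or video.height % spatial_factor
            or video.width % spatial_factor):
        raise ValueError("dimensions must be divisible by the compression factors")
    return VideoShape(
        frames=video.frames // temporal_factor,
        height=video.height // spatial_factor,
        width=video.width // spatial_factor,
    )


# Assumed clip size and factors, picked only to keep the arithmetic easy to follow.
clip = VideoShape(frames=48, height=480, width=720)
latent = latent_shape(clip, temporal_factor=4, spatial_factor=8)
reduction = clip.positions // latent.positions

print(latent)     # VideoShape(frames=12, height=60, width=90)
print(reduction)  # 256, i.e. 4 * 8 * 8
```

Under these assumed factors the number of positions shrinks by the product of the three factors, which is why compressing time as well as space matters for long videos.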