xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Abstract

We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the number of visual tokens and the computational demands of generating long video sequences. To further reduce computational cost, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different durations and aspect ratios. We built a data processing pipeline from scratch and collected over 13M high-quality video-text pairs. The pipeline includes multiple steps: clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our in-house video-LLM model. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports end-to-end generation of 720p videos more than 14 seconds long and demonstrates competitive performance against state-of-the-art T2V models.
