xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Abstract

We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advances such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the number of visual tokens and the computational cost of generating long video sequences. To reduce that cost further, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different durations and aspect ratios. We designed a data processing pipeline from scratch and collected over 13M high-quality video-text pairs. The pipeline includes multiple steps: clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our in-house video-LLM model. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports end-to-end generation of 720p videos longer than 14 seconds and demonstrates competitive performance against state-of-the-art T2V models.
