

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

August 22, 2024
作者: Can Qin, Congying Xia, Krithika Ramakrishnan, Michael Ryoo, Lifu Tu, Yihao Feng, Manli Shu, Honglu Zhou, Anas Awadalla, Jun Wang, Senthil Purushwalkam, Le Xue, Yingbo Zhou, Huan Wang, Silvio Savarese, Juan Carlos Niebles, Zeyuan Chen, Ran Xu, Caiming Xiong
cs.AI

Abstract

We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advances such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens and the computational cost of generating long-sequence videos. To further reduce computational costs, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios. We built a data processing pipeline from scratch and collected over 13M high-quality video-text pairs. The pipeline includes multiple steps, such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our in-house video-LLM model. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports end-to-end generation of 720p videos longer than 14 seconds and demonstrates competitive performance against state-of-the-art T2V models.
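
To make the compression argument concrete, the sketch below counts the visual tokens a DiT would process for a 14-second 720p clip after spatiotemporal VAE compression and patchification. The 4x temporal factor, 8x spatial factor, 2x2 patch size, and 24 fps frame rate are illustrative assumptions; the abstract does not state the model's actual values.

```python
# Illustrative token-count arithmetic for spatiotemporal VAE compression.
# The compression factors and patch size below are assumptions for this
# sketch, not values reported by the paper.

def latent_token_count(frames: int, height: int, width: int,
                       t_factor: int = 4, s_factor: int = 8,
                       patch: int = 2) -> int:
    """Visual tokens a DiT sees after VAE compression and patchification."""
    lat_t = frames // t_factor
    lat_h = height // s_factor
    lat_w = width // s_factor
    return lat_t * (lat_h // patch) * (lat_w // patch)

# A 14-second 720p clip at an assumed 24 fps:
frames = 14 * 24                        # 336 frames
tokens = latent_token_count(frames, 720, 1280)
print(tokens)                           # 84 * 45 * 80 = 302,400 tokens
```

Without the temporal compression and patchification, the token count would grow by roughly an order of magnitude, which is the motivation for compressing before diffusion.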
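The divide-and-merge strategy is described only at a high level. One plausible reading is to process a long video in overlapping temporal segments and cross-fade the overlaps so adjacent segments agree where they meet; the sketch below implements that reading. The segment length, overlap, and linear blending rule are all assumptions, and `encode` is a stand-in for any per-segment encoder that preserves temporal length.

```python
import torch

def encode_in_segments(video: torch.Tensor, encode,
                       seg_len: int = 16, overlap: int = 4) -> torch.Tensor:
    """Encode a long video (frames, C, H, W) in overlapping temporal
    segments and linearly blend the overlaps. A hedged sketch of one
    possible divide-and-merge scheme; the paper's exact rule is not
    specified in the abstract. Assumes `encode` returns new tensors
    with the same number of temporal steps as its input."""
    step = seg_len - overlap
    merged = None
    for start in range(0, video.shape[0] - overlap, step):
        chunk = encode(video[start:start + seg_len])
        if merged is None:
            merged = chunk
        else:
            # Cross-fade the shared frames, then append the rest.
            w = torch.linspace(0.0, 1.0, overlap).view(-1, *([1] * (chunk.dim() - 1)))
            merged[-overlap:] = (1 - w) * merged[-overlap:] + w * chunk[:overlap]
            merged = torch.cat([merged, chunk[overlap:]], dim=0)
    return merged
```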
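The DiT's spatial and temporal self-attention layers can be pictured as a factorized transformer block: tokens first attend within each frame, then each spatial position attends across frames. The block below is a minimal PyTorch illustration of that factorization; the real model's text conditioning, normalization scheme, and layer sizes are not given in the abstract and are simplified away here.

```python
import torch
import torch.nn as nn

class FactorizedSTBlock(nn.Module):
    """Minimal sketch of a transformer block with separate spatial and
    temporal self-attention. Sizes and structure are illustrative only."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, space, dim) latent tokens.
        b, t, s, d = x.shape
        # Spatial attention: tokens within each frame attend to each other.
        h = x.reshape(b * t, s, d)
        n = self.norm1(h)
        h = h + self.spatial_attn(n, n, n)[0]
        # Temporal attention: each spatial location attends across frames.
        h = h.reshape(b, t, s, d).permute(0, 2, 1, 3).reshape(b * s, t, d)
        n = self.norm2(h)
        h = h + self.temporal_attn(n, n, n)[0]
        h = h.reshape(b, s, t, d).permute(0, 2, 1, 3)
        return h + self.mlp(self.norm3(h))

# Example: 8 latent frames of 64 spatial tokens each.
y = FactorizedSTBlock()(torch.randn(1, 8, 64, 512))  # shape (1, 8, 64, 512)
```

Factorizing attention this way keeps the cost linear in frames-times-tokens rather than quadratic in their product, which is what lets the model scale to long sequences and varied aspect ratios.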

