Lumina-Video: Efficient and Flexible Video Generation with Multi-scale Next-DiT

Abstract

Recent advances have established Diffusion Transformers (DiTs) as a dominant framework in generative modeling. Building on this success, Lumina-Next achieves exceptional performance in generating photorealistic images with Next-DiT. However, its potential for video generation remains largely untapped, with significant challenges in modeling the spatiotemporal complexity inherent to video data. To address this, we introduce Lumina-Video, a framework that leverages the strengths of Next-DiT while introducing tailored solutions for video synthesis. Lumina-Video incorporates a Multi-scale Next-DiT architecture, which jointly learns multiple patchifications to enhance both efficiency and flexibility. By incorporating the motion score as an explicit condition, Lumina-Video also enables direct control over the dynamic degree of generated videos. Combined with a progressive training scheme that uses increasingly higher resolution and FPS, and a multi-source training scheme that mixes natural and synthetic data, Lumina-Video achieves remarkable aesthetic quality and motion smoothness with high training and inference efficiency. We additionally propose Lumina-V2A, a video-to-audio model based on Next-DiT, to create synchronized sounds for generated videos. Code is released at https://www.github.com/Alpha-VLLM/Lumina-Video.
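To make the efficiency and flexibility trade-off behind "multiple patchifications" concrete, the following is a minimal sketch of how the space-time patch size changes the number of tokens a transformer has to process. It is illustrative arithmetic only, not the authors' implementation: the function name `token_grid`, the latent video shape, and the three patch sizes are hypothetical and are not taken from the paper.

```python
def token_grid(frames, height, width, channels, patch):
    """Return (num_tokens, token_dim) for a latent video split into
    non-overlapping space-time patches of size patch = (pt, ph, pw)."""
    pt, ph, pw = patch
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    # One token per patch in the (time, height, width) grid.
    num_tokens = (frames // pt) * (height // ph) * (width // pw)
    # Each token holds every value inside its patch.
    token_dim = pt * ph * pw * channels
    return num_tokens, token_dim


# Hypothetical latent video: 8 frames, 32 x 32 spatial size, 4 channels.
shape = (8, 32, 32, 4)

# Hypothetical patch sizes (time, height, width), from fine to coarse.
patch_sizes = [(1, 2, 2), (2, 2, 2), (2, 4, 4)]

base_tokens, _ = token_grid(*shape, patch_sizes[0])
for patch in patch_sizes:
    num_tokens, token_dim = token_grid(*shape, patch)
    # Self-attention cost grows roughly with the square of the token count.
    relative_cost = (num_tokens / base_tokens) ** 2
    print(patch, num_tokens, token_dim, relative_cost)
```

With these assumed sizes the token count falls from 2048 to 1024 to 256 as the patches get coarser, while each token carries more values. A model trained on several patch sizes could, in principle, trade detail for speed by choosing a coarser patchification.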
