Pyramidal Flow Matching for Efficient Video Generative Modeling

Abstract

Video generation requires modeling a vast spatiotemporal space, which demands significant computational resources and data usage. To reduce the complexity, the prevailing approaches employ a cascaded architecture to avoid direct training with full resolution. Despite reducing computational demands, the separate optimization of each sub-stage hinders knowledge sharing and sacrifices flexibility. This work introduces a unified pyramidal flow matching algorithm. It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution, thereby enabling more efficient video generative modeling. Through our sophisticated design, the flows of different pyramid stages can be interlinked to maintain continuity. Moreover, we craft autoregressive video generation with a temporal pyramid to compress the full-resolution history. The entire framework can be optimized in an end-to-end manner and with a single unified Diffusion Transformer (DiT). Extensive experiments demonstrate that our method supports generating high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours. All code and models will be open-sourced at https://pyramid-flow.github.io.
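
To make the efficiency argument concrete, the following is a minimal sketch of how much spatial work a pyramid of stages might save when only the final stage runs at full resolution. It is not the authors' implementation. The number of stages, the factor-of-2 downsampling between stages, the equal split of denoising steps across stages, and the helper names `stage_token_fractions` and `mean_token_fraction` are all assumptions made for illustration.

```python
from typing import List


def stage_token_fractions(num_stages: int, scale: int = 2) -> List[float]:
    """Fraction of full-resolution spatial tokens handled at each stage.

    Stages are listed coarsest first. Each stage is assumed (hypothetically)
    to be `scale` times smaller per spatial side than the next one, so it
    holds scale**2 times fewer tokens. The last stage is full resolution.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    tokens_per_step = scale ** 2  # token ratio between adjacent stages
    return [1.0 / tokens_per_step ** k for k in range(num_stages - 1, -1, -1)]


def mean_token_fraction(num_stages: int, scale: int = 2) -> float:
    """Average token fraction per step, assuming equal steps in every stage."""
    fractions = stage_token_fractions(num_stages, scale)
    return sum(fractions) / len(fractions)


if __name__ == "__main__":
    for stages in (1, 2, 3):
        print(stages, stage_token_fractions(stages), mean_token_fraction(stages))
```

Under these assumptions, a three-stage pyramid processes on average 0.4375 of the tokens that a full-resolution trajectory would process per step. The actual savings of the method depend on its real stage schedule and on how compute scales with token count, neither of which is specified in the abstract.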
