Pyramidal Flow Matching for Efficient Video Generative Modeling
October 8, 2024
Authors: Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, Zhouchen Lin
cs.AI
Abstract
Video generation requires modeling a vast spatiotemporal space, which demands
significant computational resources and data usage. To reduce the complexity,
the prevailing approaches employ a cascaded architecture to avoid direct
training with full resolution. Despite reducing computational demands, the
separate optimization of each sub-stage hinders knowledge sharing and
sacrifices flexibility. This work introduces a unified pyramidal flow matching
algorithm. It reinterprets the original denoising trajectory as a series of
pyramid stages, where only the final stage operates at the full resolution,
thereby enabling more efficient video generative modeling. Through our
sophisticated design, the flows of different pyramid stages can be interlinked
to maintain continuity. Moreover, we craft autoregressive video generation with
a temporal pyramid to compress the full-resolution history. The entire
framework can be optimized in an end-to-end manner and with a single unified
Diffusion Transformer (DiT). Extensive experiments demonstrate that our method
supports generating high-quality 5-second (up to 10-second) videos at 768p
resolution and 24 FPS within 20.7k A100 GPU training hours. All code and models
will be open-sourced at https://pyramid-flow.github.io.
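The efficiency claim in the abstract (only the final pyramid stage runs at full resolution) can be illustrated with a rough token-count sketch. Everything below is an assumption made for illustration and is not taken from the paper: the three-stage pyramid, the factor-of-2 downsampling per stage, the latent size, the equal split of steps across stages, and the function names `pyramid_token_counts` and `relative_cost`.

```python
def pyramid_token_counts(height, width, frames, num_stages=3):
    """Tokens per stage, coarsest first, assuming each coarser stage
    halves the height and width (a hypothetical schedule)."""
    counts = []
    for k in range(num_stages):
        scale = 2 ** (num_stages - 1 - k)  # 4, 2, 1 for three stages
        counts.append(frames * (height // scale) * (width // scale))
    return counts


def relative_cost(counts, steps_per_stage):
    """Token-steps of the pyramid schedule divided by the token-steps
    of running every step at full resolution."""
    full = counts[-1] * sum(steps_per_stage)
    pyramid = sum(c * s for c, s in zip(counts, steps_per_stage))
    return pyramid / full


# Hypothetical latent grid: 8 frames of 48 x 80 tokens at full resolution.
counts = pyramid_token_counts(height=48, width=80, frames=8)
# Hypothetical budget: the same number of steps in every stage.
ratio = relative_cost(counts, steps_per_stage=[10, 10, 10])

print(counts)  # [1920, 7680, 30720]
print(ratio)   # 0.4375
```

Under these assumed settings the pyramid schedule processes about 44% of the token-steps of a full-resolution schedule. The count is linear in tokens, so it likely understates the saving for a Transformer whose attention cost grows faster than linearly with sequence length.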