

Minute-Long Videos with Dual Parallelisms

May 27, 2025
作者: Zeqing Wang, Bowen Zheng, Xingyi Yang, Yuecong Xu, Xinchao Wang
cs.AI

Abstract

Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs. However, a naive implementation of this division faces a key limitation: since diffusion models require synchronized noise levels across frames, it serializes the original parallelisms. We handle this with a block-wise denoising scheme: we process a sequence of frame blocks through the pipeline with progressively decreasing noise levels. Each GPU handles a specific block and layer subset while passing previous results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. First, a feature cache on each GPU stores and reuses features from the prior block as context, minimizing inter-GPU communication and redundant computation. Second, a coordinated noise initialization strategy shares initial noise patterns across GPUs, ensuring globally consistent temporal dynamics at no extra resource cost. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54× lower latency and 1.48× lower memory cost on 8× RTX 4090 GPUs.
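
The block-wise pipeline can be pictured with a small scheduling simulation. The sketch below is a minimal, single-process illustration of the idea, not the authors' implementation: frame blocks circulate through a chain of stages (one per GPU, each holding a layer subset), so at any tick different stages work on different blocks at different noise levels instead of waiting on a globally synchronized denoising step. All names (`Block`, `run_pipeline`) and the step counts are hypothetical.

```python
# Minimal single-process sketch of block-wise pipeline scheduling.
# One full pass through all stages corresponds to one denoising step,
# so a block's noise level decreases each time it traverses the pipeline.

from collections import deque
from dataclasses import dataclass

NUM_STAGES = 4    # e.g. 4 GPUs, each holding a contiguous subset of DiT layers
TOTAL_STEPS = 8   # denoising steps each frame block must complete

@dataclass
class Block:
    index: int           # position of the frame block within the video
    steps_done: int = 0  # completed steps (noise shrinks as this grows)

def run_pipeline(num_blocks: int) -> list[Block]:
    pending = deque(Block(i) for i in range(num_blocks))
    stages: list[Block | None] = [None] * NUM_STAGES  # stage i = GPU i
    finished: list[Block] = []
    tick = 0
    while pending or any(s is not None for s in stages):
        out = stages[-1]                   # block leaving the last stage
        for i in range(NUM_STAGES - 1, 0, -1):
            stages[i] = stages[i - 1]      # hand results to the next GPU
        stages[0] = pending.popleft() if pending else None
        if out is not None:
            out.steps_done += 1
            if out.steps_done == TOTAL_STEPS:
                finished.append(out)       # block is fully denoised
            else:
                pending.append(out)        # re-enter at a lower noise level
        tick += 1
    print(f"{num_blocks} blocks x {TOTAL_STEPS} steps finished in {tick} ticks")
    return finished

run_pipeline(num_blocks=6)
```

Because every occupied stage computes on a different block each tick, the GPUs stay busy concurrently rather than idling until all frames reach the same noise level.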
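The per-GPU feature cache can be sketched as follows. This is an assumption-laden illustration: the cache keys, the frame axis, and the way cached features are concatenated as context (`FeatureCache`, `layer_forward`) are all hypothetical, since the abstract only states that features from the prior block are stored and reused as context.

```python
import torch
from torch import nn

class FeatureCache:
    """Per-GPU store of the previous frame block's features, keyed by layer."""
    def __init__(self) -> None:
        self._store: dict[int, torch.Tensor] = {}

    def context_for(self, layer_id: int) -> torch.Tensor | None:
        return self._store.get(layer_id)

    def update(self, layer_id: int, feats: torch.Tensor) -> None:
        self._store[layer_id] = feats.detach()  # reused, never backprop through

def layer_forward(layer: nn.Module, x: torch.Tensor,
                  cache: FeatureCache, layer_id: int) -> torch.Tensor:
    """Run one layer on the current block, prepending cached features from
    the previous block as temporal context (frame axis assumed to be dim 1)."""
    ctx = cache.context_for(layer_id)
    inp = x if ctx is None else torch.cat([ctx, x], dim=1)
    out = layer(inp)[:, -x.shape[1]:]   # keep only current-block positions
    cache.update(layer_id, out)         # becomes context for the next block
    return out

# Toy usage: a linear layer acting on per-frame feature vectors.
layer = nn.Linear(32, 32)
cache = FeatureCache()
block1 = torch.randn(1, 16, 32)   # (batch, frames, channels)
block2 = torch.randn(1, 16, 32)
_ = layer_forward(layer, block1, cache, layer_id=0)
_ = layer_forward(layer, block2, cache, layer_id=0)  # reuses block1 features
```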
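Coordinated noise initialization only requires that every GPU can reproduce the same initial noise without shipping tensors around. The abstract does not spell out the scheme; deriving each block's noise from a shared base seed, as in the sketch below, is one simple way to realize "shared initial noise patterns" at no communication cost. The seed value and tensor shapes are placeholders.

```python
import torch

SHARED_SEED = 1234  # agreed once by all ranks, e.g. broadcast at startup

def init_block_noise(block_index: int, shape: tuple[int, ...]) -> torch.Tensor:
    """Deterministically regenerate the initial noise for a frame block on
    any GPU by seeding from the shared seed plus the block index."""
    gen = torch.Generator().manual_seed(SHARED_SEED + block_index)
    return torch.randn(shape, generator=gen)

# Any two ranks reconstruct identical noise for block 3 without communicating:
a = init_block_noise(3, (16, 4, 64, 64))  # (frames, channels, H, W) latents
b = init_block_noise(3, (16, 4, 64, 64))
assert torch.equal(a, b)
```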

