
Minute-Long Videos with Dual Parallelisms

May 27, 2025
Authors: Zeqing Wang, Bowen Zheng, Xingyi Yang, Yuecong Xu, Xinchao Wang
cs.AI

Abstract

Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs. However, a naive implementation of this division faces a key limitation: since diffusion models require synchronized noise levels across frames, this implementation leads to the serialization of original parallelisms. We leverage a block-wise denoising scheme to handle this. Namely, we process a sequence of frame blocks through the pipeline with progressively decreasing noise levels. Each GPU handles a specific block and layer subset while passing previous results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. Firstly, a feature cache is implemented on each GPU to store and reuse features from the prior block as context, minimizing inter-GPU communication and redundant computation. Secondly, we employ a coordinated noise initialization strategy, ensuring globally consistent temporal dynamics by sharing initial noise patterns across GPUs without extra resource costs. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54× lower latency and 1.48× lower memory cost on 8× RTX 4090 GPUs.
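
The block-wise denoising queue described above can be made concrete with a toy scheduler. The sketch below simulates only the scheduling logic under assumed sizes (NUM_BLOCKS, NUM_STEPS, QUEUE_LEN are illustrative, and block_noise_rng merely stands in for latent-noise initialization); it is not the authors' implementation, which shards the DiT layers across GPU ranks and exchanges activations with point-to-point communication.

```python
"""Toy simulation of a DualParal-style block-wise denoising queue.

A minimal sketch of the scheduling idea from the abstract, under assumed
sizes. A real system would split the DiT layers across GPUs and move
activations between ranks with torch.distributed sends/receives.
"""
import random
from collections import deque

NUM_BLOCKS = 6   # frame blocks tiling the long video (assumed size)
NUM_STEPS = 4    # denoising steps per block (tiny, for readability)
QUEUE_LEN = 4    # in-flight blocks, matching the pipeline depth

SHARED_SEED = 0  # coordinated noise initialization, shared across GPUs

def block_noise_rng(block_id: int) -> random.Random:
    # Every rank rebuilds a block's initial noise locally from the shared
    # seed, giving globally consistent dynamics with no extra traffic.
    return random.Random(SHARED_SEED + block_id)

queue = deque()            # head = least noisy block, tail = fresh noise
next_block, finished = 0, []

while len(finished) < NUM_BLOCKS:
    # Admit one fresh, fully noised block whenever the pipeline has room.
    if next_block < NUM_BLOCKS and len(queue) < QUEUE_LEN:
        block_noise_rng(next_block)             # local and deterministic
        queue.append([next_block, NUM_STEPS])
        next_block += 1

    # One pipelined forward pass: every queued block advances one
    # denoising step. Because the queue holds blocks at staggered noise
    # levels, no GPU stalls waiting for a global noise-level sync; in
    # the real system each GPU also keeps a feature cache of the prior
    # block to reuse as context instead of re-receiving it.
    for item in queue:
        item[1] -= 1
    print("noise levels:", [f"block{b}:{n}" for b, n in queue])

    # The head block has reached noise level 0: fully denoised, it
    # leaves the queue and frees a slot for the next block.
    if queue and queue[0][1] == 0:
        finished.append(queue.popleft()[0])

print("blocks completed in order:", finished)
```

Running the sketch prints a staircase of noise levels: the head block is always closest to clean while the tail holds fresh noise, which is what lets each GPU stage work on a different block concurrently instead of serializing on a single shared noise level.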
