
Minute-Long Videos with Dual Parallelisms

May 27, 2025
Authors: Zeqing Wang, Bowen Zheng, Xingyi Yang, Yuecong Xu, Xinchao Wang
cs.AI

Abstract

Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs. However, a naive implementation of this division faces a key limitation: since diffusion models require synchronized noise levels across frames, this implementation leads to the serialization of original parallelisms. We leverage a block-wise denoising scheme to handle this. Namely, we process a sequence of frame blocks through the pipeline with progressively decreasing noise levels. Each GPU handles a specific block and layer subset while passing previous results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. Firstly, a feature cache is implemented on each GPU to store and reuse features from the prior block as context, minimizing inter-GPU communication and redundant computation. Secondly, we employ a coordinated noise initialization strategy, ensuring globally consistent temporal dynamics by sharing initial noise patterns across GPUs without extra resource costs. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54× lower latency and 1.48× lower memory cost on 8× RTX 4090 GPUs.
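
The block-wise denoising queue described above can be made concrete with a toy scheduler. The sketch below simulates only the scheduling logic under assumed sizes (NUM_BLOCKS, NUM_STEPS, QUEUE_LEN are illustrative, and block_noise_rng merely stands in for latent-noise initialization); it is not the authors' implementation, which shards the DiT layers across GPU ranks and exchanges activations with point-to-point communication.

```python
"""Toy simulation of a DualParal-style block-wise denoising queue.

A minimal sketch of the scheduling idea from the abstract, under assumed
sizes. A real system would split the DiT layers across GPUs and move
activations between ranks with torch.distributed sends/receives.
"""
import random
from collections import deque

NUM_BLOCKS = 6   # frame blocks tiling the long video (assumed size)
NUM_STEPS = 4    # denoising steps per block (tiny, for readability)
QUEUE_LEN = 4    # in-flight blocks, matching the pipeline depth

SHARED_SEED = 0  # coordinated noise initialization, shared across GPUs

def block_noise_rng(block_id: int) -> random.Random:
    # Every rank rebuilds a block's initial noise locally from the shared
    # seed, giving globally consistent dynamics with no extra traffic.
    return random.Random(SHARED_SEED + block_id)

queue = deque()            # head = least noisy block, tail = fresh noise
next_block, finished = 0, []

while len(finished) < NUM_BLOCKS:
    # Admit one fresh, fully noised block whenever the pipeline has room.
    if next_block < NUM_BLOCKS and len(queue) < QUEUE_LEN:
        block_noise_rng(next_block)             # local and deterministic
        queue.append([next_block, NUM_STEPS])
        next_block += 1

    # One pipelined forward pass: every queued block advances one
    # denoising step. Because the queue holds blocks at staggered noise
    # levels, no GPU stalls waiting for a global noise-level sync; in
    # the real system each GPU also keeps a feature cache of the prior
    # block to reuse as context instead of re-receiving it.
    for item in queue:
        item[1] -= 1
    print("noise levels:", [f"block{b}:{n}" for b, n in queue])

    # The head block has reached noise level 0: fully denoised, it
    # leaves the queue and frees a slot for the next block.
    if queue and queue[0][1] == 0:
        finished.append(queue.popleft()[0])

print("blocks completed in order:", finished)
```

Running the sketch prints a staircase of noise levels: the head block is always closest to clean while the tail holds fresh noise, which is what lets each GPU stage work on a different block concurrently instead of serializing on a single shared noise level.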
