

Minute-Long Videos with Dual Parallelisms

May 27, 2025
作者: Zeqing Wang, Bowen Zheng, Xingyi Yang, Yuecong Xu, Xinchao Wang
cs.AI

Abstract

Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. To address this, we propose a novel distributed inference strategy, termed DualParal. The core idea is that, instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs. However, a naive implementation of this division faces a key limitation: since diffusion models require synchronized noise levels across frames, it serializes the original parallelisms. We handle this with a block-wise denoising scheme: we process a sequence of frame blocks through the pipeline with progressively decreasing noise levels. Each GPU handles a specific block and layer subset while passing previous results to the next GPU, enabling asynchronous computation and communication. To further optimize performance, we incorporate two key enhancements. First, a feature cache on each GPU stores and reuses features from the prior block as context, minimizing inter-GPU communication and redundant computation. Second, a coordinated noise initialization strategy shares initial noise patterns across GPUs, ensuring globally consistent temporal dynamics at no extra resource cost. Together, these enable fast, artifact-free, and infinitely long video generation. Applied to the latest diffusion transformer video generator, our method efficiently produces 1,025-frame videos with up to 6.54× lower latency and 1.48× lower memory cost on 8× RTX 4090 GPUs.
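
The block-wise pipeline can be pictured with a small scheduling simulation. The sketch below is a minimal, single-process illustration of the idea, not the authors' implementation: frame blocks circulate through a chain of stages (one per GPU, each holding a layer subset), so at any tick different stages work on different blocks at different noise levels instead of waiting on a globally synchronized denoising step. All names (`Block`, `run_pipeline`) and the step counts are hypothetical.

```python
# Minimal single-process sketch of block-wise pipeline scheduling.
# One full pass through all stages corresponds to one denoising step,
# so a block's noise level decreases each time it traverses the pipeline.

from collections import deque
from dataclasses import dataclass

NUM_STAGES = 4    # e.g. 4 GPUs, each holding a contiguous subset of DiT layers
TOTAL_STEPS = 8   # denoising steps each frame block must complete

@dataclass
class Block:
    index: int           # position of the frame block within the video
    steps_done: int = 0  # completed steps (noise shrinks as this grows)

def run_pipeline(num_blocks: int) -> list[Block]:
    pending = deque(Block(i) for i in range(num_blocks))
    stages: list[Block | None] = [None] * NUM_STAGES  # stage i = GPU i
    finished: list[Block] = []
    tick = 0
    while pending or any(s is not None for s in stages):
        out = stages[-1]                   # block leaving the last stage
        for i in range(NUM_STAGES - 1, 0, -1):
            stages[i] = stages[i - 1]      # hand results to the next GPU
        stages[0] = pending.popleft() if pending else None
        if out is not None:
            out.steps_done += 1
            if out.steps_done == TOTAL_STEPS:
                finished.append(out)       # block is fully denoised
            else:
                pending.append(out)        # re-enter at a lower noise level
        tick += 1
    print(f"{num_blocks} blocks x {TOTAL_STEPS} steps finished in {tick} ticks")
    return finished

run_pipeline(num_blocks=6)
```

Because every occupied stage computes on a different block each tick, the GPUs stay busy concurrently rather than idling until all frames reach the same noise level.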
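The per-GPU feature cache can be sketched as follows. This is an assumption-laden illustration: the cache keys, the frame axis, and the way cached features are concatenated as context (`FeatureCache`, `layer_forward`) are all hypothetical, since the abstract only states that features from the prior block are stored and reused as context.

```python
import torch
from torch import nn

class FeatureCache:
    """Per-GPU store of the previous frame block's features, keyed by layer."""
    def __init__(self) -> None:
        self._store: dict[int, torch.Tensor] = {}

    def context_for(self, layer_id: int) -> torch.Tensor | None:
        return self._store.get(layer_id)

    def update(self, layer_id: int, feats: torch.Tensor) -> None:
        self._store[layer_id] = feats.detach()  # reused, never backprop through

def layer_forward(layer: nn.Module, x: torch.Tensor,
                  cache: FeatureCache, layer_id: int) -> torch.Tensor:
    """Run one layer on the current block, prepending cached features from
    the previous block as temporal context (frame axis assumed to be dim 1)."""
    ctx = cache.context_for(layer_id)
    inp = x if ctx is None else torch.cat([ctx, x], dim=1)
    out = layer(inp)[:, -x.shape[1]:]   # keep only current-block positions
    cache.update(layer_id, out)         # becomes context for the next block
    return out

# Toy usage: a linear layer acting on per-frame feature vectors.
layer = nn.Linear(32, 32)
cache = FeatureCache()
block1 = torch.randn(1, 16, 32)   # (batch, frames, channels)
block2 = torch.randn(1, 16, 32)
_ = layer_forward(layer, block1, cache, layer_id=0)
_ = layer_forward(layer, block2, cache, layer_id=0)  # reuses block1 features
```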
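Coordinated noise initialization only requires that every GPU can reproduce the same initial noise without shipping tensors around. The abstract does not spell out the scheme; deriving each block's noise from a shared base seed, as in the sketch below, is one simple way to realize "shared initial noise patterns" at no communication cost. The seed value and tensor shapes are placeholders.

```python
import torch

SHARED_SEED = 1234  # agreed once by all ranks, e.g. broadcast at startup

def init_block_noise(block_index: int, shape: tuple[int, ...]) -> torch.Tensor:
    """Deterministically regenerate the initial noise for a frame block on
    any GPU by seeding from the shared seed plus the block index."""
    gen = torch.Generator().manual_seed(SHARED_SEED + block_index)
    return torch.randn(shape, generator=gen)

# Any two ranks reconstruct identical noise for block 3 without communicating:
a = init_block_noise(3, (16, 4, 64, 64))  # (frames, channels, H, W) latents
b = init_block_noise(3, (16, 4, 64, 64))
assert torch.equal(a, b)
```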

