AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

Abstract

Diffusion models have attracted significant interest from the community for their strong generative ability across a wide range of applications. However, their multi-step sequential denoising results in high cumulative latency and precludes parallel computation. To address this, we introduce AsyncDiff, a universal, plug-and-play acceleration scheme that enables model parallelism across multiple devices. Our approach divides the cumbersome noise prediction model into multiple components and assigns each to a different device. To break the dependency chain between these components, it transforms conventional sequential denoising into an asynchronous process by exploiting the high similarity between hidden states in consecutive diffusion steps. Consequently, each component can compute in parallel on a separate device. The proposed strategy significantly reduces inference latency with minimal impact on generative quality. Specifically, for Stable Diffusion v2.1 on four NVIDIA A5000 GPUs, AsyncDiff achieves a 2.7x speedup with negligible degradation and a 4.0x speedup with a reduction of only 0.38 in CLIP Score. Our experiments also demonstrate that AsyncDiff can be readily applied to video diffusion models with encouraging performance. The code is available at https://github.com/czg1225/AsyncDiff.
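
The dependency-breaking idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the scalar "components", the update rule, the step size, and the single sequential warm-up step are all assumptions made for illustration, and the names `sequential_step`, `async_step`, `denoise`, and `toy_components` are hypothetical. The sketch only shows how feeding each component the previous step's output of its predecessor removes the within-step dependency chain.

```python
from typing import Callable, List, Optional

Component = Callable[[float], float]

# Hypothetical stand-ins for the pieces of a noise prediction model.
toy_components: List[Component] = [
    lambda h: 0.5 * h,
    lambda h: h + 0.1,
    lambda h: 0.8 * h,
]


def sequential_step(components: List[Component], x: float) -> List[float]:
    """Conventional step: each component waits for its predecessor's output."""
    outputs = []
    h = x
    for f in components:
        h = f(h)
        outputs.append(h)
    return outputs  # outputs[-1] plays the role of the predicted noise


def async_step(
    components: List[Component], x: float, prev_outputs: List[float]
) -> List[float]:
    """Asynchronous step: component n reads the output that component n-1
    produced at the PREVIOUS step, so the calls below are independent of each
    other and could be dispatched to separate devices."""
    inputs = [x] + prev_outputs[:-1]
    return [f(h) for f, h in zip(components, inputs)]


def denoise(
    components: List[Component],
    x: float,
    num_steps: int,
    step_size: float = 0.1,
    asynchronous: bool = True,
) -> float:
    """Toy denoising loop with a made-up update rule (an assumption)."""
    prev: Optional[List[float]] = None
    for _ in range(num_steps):
        if prev is None or not asynchronous:
            # First step runs sequentially to populate the cached outputs
            # (an assumption of this sketch).
            outs = sequential_step(components, x)
        else:
            outs = async_step(components, x, prev)
        x = x - step_size * outs[-1]
        prev = outs
    return x
```

When the cached outputs equal the outputs the current step would have produced, `async_step` returns exactly what `sequential_step` returns. The approach relies on hidden states at consecutive steps being close to that ideal, which is why the quality impact is small rather than zero.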
