StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation
December 19, 2023
Authors: Akio Kodaira, Chenfeng Xu, Toshiki Hazama, Takanori Yoshimoto, Kohei Ohno, Shogo Mitsuhori, Soichi Sugano, Hanying Cho, Zhijian Liu, Kurt Keutzer
cs.AI
Abstract
We introduce StreamDiffusion, a real-time diffusion pipeline designed for
interactive image generation. Existing diffusion models are adept at creating
images from text or image prompts, yet they often fall short in real-time
interaction. This limitation becomes particularly evident in scenarios
involving continuous input, such as Metaverse, live video streaming, and
broadcasting, where high throughput is imperative. To address this, we present
a novel approach that transforms the original sequential denoising into a
batched denoising process, which we call Stream Batch. Stream Batch eliminates
the conventional wait-and-interact approach and enables fluid, high-throughput streams. To
handle the frequency disparity between data input and model throughput, we
design a novel input-output queue for parallelizing the streaming process.
Moreover, the existing diffusion pipeline uses classifier-free guidance (CFG),
which requires additional U-Net computation. To mitigate the redundant
computations, we propose a novel residual classifier-free guidance (RCFG)
algorithm that reduces the number of negative conditional denoising steps to
only one or even zero. In addition, we introduce a stochastic similarity
filter (SSF) to reduce power consumption. Our Stream Batch achieves around
1.5x speedup compared to the sequential denoising method at different denoising
levels. The proposed RCFG leads to speeds up to 2.05x higher than the
conventional CFG. Combining the proposed strategies with existing mature
acceleration tools enables image-to-image generation at up to 91.07 fps
on a single RTX 4090, improving throughput over the AutoPipeline developed by
Diffusers by more than 59.56x. Furthermore, our proposed StreamDiffusion also
reduces energy consumption by 2.39x on an RTX 3060 and by 1.99x on an
RTX 4090.
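The Stream Batch idea can be illustrated with a minimal, framework-free sketch: instead of finishing all denoising steps for one frame before accepting the next, frames at different denoising stages share one batch, so each batched model call both admits a new input and completes an output. The names (`denoise_step`, `StreamBatch`) and the toy denoiser (simple attenuation rather than a U-Net) are illustrative assumptions, not the actual StreamDiffusion API.

```python
import numpy as np

NUM_STEPS = 4  # denoising steps per frame

def denoise_step(batch: np.ndarray, timesteps: np.ndarray) -> np.ndarray:
    """Toy stand-in for one *batched* U-Net call: every row in `batch`
    advances one denoising step according to its own timestep."""
    return batch * (timesteps[:, None] / (timesteps[:, None] + 1.0))

class StreamBatch:
    """Keep NUM_STEPS frames in flight, each at a different denoising
    stage, so every call consumes one new input and (after a warm-up of
    NUM_STEPS calls) emits one fully denoised frame."""

    def __init__(self, feat_dim: int):
        self.buffer = np.zeros((NUM_STEPS, feat_dim))
        # slot 0 holds the newest (noisiest) frame, slot -1 the oldest
        self.timesteps = np.arange(NUM_STEPS, 0, -1).astype(float)

    def __call__(self, new_input: np.ndarray) -> np.ndarray:
        self.buffer = np.roll(self.buffer, 1, axis=0)  # age every frame
        self.buffer[0] = new_input                     # admit new input
        self.buffer = denoise_step(self.buffer, self.timesteps)
        return self.buffer[-1].copy()  # oldest frame: fully denoised
```

After warm-up, the pipeline emits one finished frame per batched call instead of one frame per NUM_STEPS sequential calls, which is the mechanism behind the speedup over sequential denoising described above.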
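The input-output queueing strategy can likewise be sketched with a standard producer-consumer pattern: buffered queues decouple the rate at which inputs arrive (e.g. camera frames) from the rate at which the model consumes them. The names and the doubling "model" here are placeholders for illustration.

```python
import queue
import threading

input_q: queue.Queue = queue.Queue(maxsize=8)  # buffers bursty input
output_q: queue.Queue = queue.Queue()

def worker() -> None:
    """Consume inputs at model speed, independent of the producer rate."""
    while True:
        item = input_q.get()
        if item is None:        # sentinel: shut the worker down
            break
        output_q.put(item * 2)  # stand-in for one denoising pass

t = threading.Thread(target=worker)
t.start()
for i in range(4):              # producer side, e.g. incoming frames
    input_q.put(i)
input_q.put(None)
t.join()
results = [output_q.get() for _ in range(4)]
```

Because the producer and the model run in separate threads joined only by queues, neither side blocks the other when their frequencies differ.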
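The stochastic similarity filter can be sketched as follows: inference is skipped with a probability that grows with the similarity between consecutive frames, so near-static scenes save power while still refreshing occasionally. The cosine-similarity measure and the linear skip-probability schedule below are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

class StochasticSimilarityFilter:
    """Probabilistically skip inference when consecutive frames are
    near-identical (illustrative sketch, not the paper's exact SSF)."""

    def __init__(self, threshold: float = 0.98, seed: int = 0):
        self.threshold = threshold
        self.prev = None
        self.rng = np.random.default_rng(seed)

    def should_process(self, frame: np.ndarray) -> bool:
        if self.prev is None:        # always run on the first frame
            self.prev = frame
            return True
        sim = float(frame.ravel() @ self.prev.ravel()
                    / (np.linalg.norm(frame) * np.linalg.norm(self.prev) + 1e-8))
        self.prev = frame
        if sim < self.threshold:     # scene changed: always run
            return True
        # map sim in [threshold, 1] to a skip probability in [0, 1];
        # the draw keeps skipping stochastic so static scenes still refresh
        skip_prob = (sim - self.threshold) / (1.0 - self.threshold)
        return bool(self.rng.random() > skip_prob)
```

Keeping the decision stochastic (rather than a hard threshold) avoids permanently freezing the output when the input stream stops changing.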