StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation
December 19, 2023
Authors: Akio Kodaira, Chenfeng Xu, Toshiki Hazama, Takanori Yoshimoto, Kohei Ohno, Shogo Mitsuhori, Soichi Sugano, Hanying Cho, Zhijian Liu, Kurt Keutzer
cs.AI
Abstract
We introduce StreamDiffusion, a real-time diffusion pipeline designed for
interactive image generation. Existing diffusion models are adept at creating
images from text or image prompts, yet they often fall short in real-time
interaction. This limitation becomes particularly evident in scenarios
involving continuous input, such as Metaverse, live video streaming, and
broadcasting, where high throughput is imperative. To address this, we present
Stream Batch, a novel approach that transforms the original sequential denoising
into a batched denoising process, eliminating the conventional wait-and-interact
approach and enabling fluid, high-throughput streams. To
handle the frequency disparity between data input and model throughput, we
design a novel input-output queue for parallelizing the streaming process.
Moreover, the existing diffusion pipeline uses classifier-free guidance (CFG),
which requires additional U-Net computation. To mitigate the redundant
computations, we propose a novel residual classifier-free guidance (RCFG)
algorithm that reduces the number of negative conditional denoising steps to
only one or even zero. In addition, we introduce a stochastic similarity
filter (SSF) to optimize power consumption. Our Stream Batch achieves around a
1.5x speedup compared to the sequential denoising method at different denoising
levels. The proposed RCFG leads to speeds up to 2.05x higher than the
conventional CFG. Combining the proposed strategies with existing mature
acceleration tools enables image-to-image generation at up to 91.07 fps on one
RTX4090, improving the throughput by 59.56x over the AutoPipeline developed by
Diffusers. Furthermore, our proposed StreamDiffusion also significantly reduces
energy consumption, by 2.39x on one RTX3060 and by 1.99x on one RTX4090.
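
As a rough illustration of the Stream Batch idea described above, the sketch below keeps several consecutive frames in flight as a single batch, each at a different denoising stage, so one batched U-Net call advances every frame by one step instead of waiting for each image to finish its full sequential schedule. This is a minimal conceptual sketch, not the authors' implementation; `denoise_step`, `encode`, `decode`, the step count, and the timestep schedule are illustrative placeholders.

```python
# Conceptual sketch of Stream Batch (not the paper's code).
# denoise_step(latents, timesteps) is a hypothetical function applying one
# U-Net denoising step to a batch of latents, each at its own timestep.
import torch

NUM_STEPS = 4                             # denoising steps per frame (illustrative)
timesteps = torch.tensor([3, 2, 1, 0])    # hypothetical schedule, noisiest -> final

def stream_batch(frames, denoise_step, encode, decode):
    """Pipeline consecutive frames through the denoiser as one rolling batch."""
    batch = []                                      # latents in flight, one per stage
    for frame in frames:
        batch.insert(0, encode(frame))              # newest frame enters at the noisiest stage
        latents = torch.stack(batch)
        latents = denoise_step(latents, timesteps[: len(batch)])  # one call advances all stages
        batch = list(latents)
        if len(batch) == NUM_STEPS:
            yield decode(batch.pop())               # oldest latent has finished all steps
```

Batching the stages into a single U-Net call amortizes the per-step cost across frames, which is where the reported speedup over sequential denoising comes from.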
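The input-output queue used to absorb the frequency mismatch between incoming frames and model throughput can be pictured as an ordinary producer/consumer setup; the sketch below is a generic illustration under that assumption, with `pipeline` standing in for the actual generation call rather than the paper's code.

```python
# Generic producer/consumer sketch of the input-output queue idea.
import queue
import threading

input_q = queue.Queue(maxsize=8)     # frames arriving from the input stream
output_q = queue.Queue(maxsize=8)    # generated images waiting to be displayed

def inference_worker(pipeline):
    """Consume frames at the model's own pace, independent of the input frame rate."""
    while True:
        frame = input_q.get()        # blocks until a frame is available
        image = pipeline(frame)      # placeholder for the diffusion pipeline call
        try:
            output_q.put_nowait(image)
        except queue.Full:
            output_q.get_nowait()    # drop the oldest result if the display lags
            output_q.put_nowait(image)

# lambda f: f is a stand-in pipeline; the capture thread calls input_q.put(frame)
# and the display thread calls output_q.get(), so neither side has to match the
# U-Net's throughput.
threading.Thread(target=inference_worker, args=(lambda f: f,), daemon=True).start()
```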