StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation
December 19, 2023
Authors: Akio Kodaira, Chenfeng Xu, Toshiki Hazama, Takanori Yoshimoto, Kohei Ohno, Shogo Mitsuhori, Soichi Sugano, Hanying Cho, Zhijian Liu, Kurt Keutzer
cs.AI
Abstract
We introduce StreamDiffusion, a real-time diffusion pipeline designed for
interactive image generation. Existing diffusion models are adept at creating
images from text or image prompts, yet they often fall short in real-time
interaction. This limitation becomes particularly evident in scenarios
involving continuous input, such as Metaverse, live video streaming, and
broadcasting, where high throughput is imperative. To address this, we present
Stream Batch, a novel approach that transforms the original sequential denoising
into a batched denoising process, eliminating the conventional wait-and-interact
approach and enabling fluid, high-throughput streams. To
handle the frequency disparity between data input and model throughput, we
design a novel input-output queue for parallelizing the streaming process.
Moreover, the existing diffusion pipeline uses classifier-free guidance (CFG),
which requires additional U-Net computation. To mitigate the redundant
computations, we propose a novel residual classifier-free guidance (RCFG)
algorithm that reduces the number of negative conditional denoising steps to
only one or even zero. In addition, we introduce a stochastic similarity
filter (SSF) to optimize power consumption. Our Stream Batch achieves around a
1.5x speedup compared to the sequential denoising method at different denoising
levels. The proposed RCFG leads to speeds up to 2.05x higher than the
conventional CFG. Combining the proposed strategies with existing mature
acceleration tools enables image-to-image generation at up to 91.07 fps on one
RTX4090, improving the throughput by 59.56x over the AutoPipeline developed by
Diffusers. Furthermore, our proposed StreamDiffusion also significantly reduces
energy consumption, by 2.39x on one RTX3060 and by 1.99x on one RTX4090.
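
As a rough illustration of the Stream Batch idea described above, the sketch below keeps several consecutive frames in flight as a single batch, each at a different denoising stage, so one batched U-Net call advances every frame by one step instead of waiting for each image to finish its full sequential schedule. This is a minimal conceptual sketch, not the authors' implementation; `denoise_step`, `encode`, `decode`, the step count, and the timestep schedule are illustrative placeholders.

```python
# Conceptual sketch of Stream Batch (not the paper's code).
# denoise_step(latents, timesteps) is a hypothetical function applying one
# U-Net denoising step to a batch of latents, each at its own timestep.
import torch

NUM_STEPS = 4                             # denoising steps per frame (illustrative)
timesteps = torch.tensor([3, 2, 1, 0])    # hypothetical schedule, noisiest -> final

def stream_batch(frames, denoise_step, encode, decode):
    """Pipeline consecutive frames through the denoiser as one rolling batch."""
    batch = []                                      # latents in flight, one per stage
    for frame in frames:
        batch.insert(0, encode(frame))              # newest frame enters at the noisiest stage
        latents = torch.stack(batch)
        latents = denoise_step(latents, timesteps[: len(batch)])  # one call advances all stages
        batch = list(latents)
        if len(batch) == NUM_STEPS:
            yield decode(batch.pop())               # oldest latent has finished all steps
```

Batching the stages into a single U-Net call amortizes the per-step cost across frames, which is where the reported speedup over sequential denoising comes from.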
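The input-output queue used to absorb the frequency mismatch between incoming frames and model throughput can be pictured as an ordinary producer/consumer setup; the sketch below is a generic illustration under that assumption, with `pipeline` standing in for the actual generation call rather than the paper's code.

```python
# Generic producer/consumer sketch of the input-output queue idea.
import queue
import threading

input_q = queue.Queue(maxsize=8)     # frames arriving from the input stream
output_q = queue.Queue(maxsize=8)    # generated images waiting to be displayed

def inference_worker(pipeline):
    """Consume frames at the model's own pace, independent of the input frame rate."""
    while True:
        frame = input_q.get()        # blocks until a frame is available
        image = pipeline(frame)      # placeholder for the diffusion pipeline call
        try:
            output_q.put_nowait(image)
        except queue.Full:
            output_q.get_nowait()    # drop the oldest result if the display lags
            output_q.put_nowait(image)

# lambda f: f is a stand-in pipeline; the capture thread calls input_q.put(frame)
# and the display thread calls output_q.get(), so neither side has to match the
# U-Net's throughput.
threading.Thread(target=inference_worker, args=(lambda f: f,), daemon=True).start()
```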