StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation

Abstract

We introduce StreamDiffusion, a real-time diffusion pipeline designed for interactive image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. This limitation becomes particularly evident in scenarios involving continuous input, such as the Metaverse, live video streaming, and broadcasting, where high throughput is imperative. To address this, we present a novel approach, Stream Batch, that transforms the original sequential denoising into a batched denoising process. Stream Batch eliminates the conventional wait-and-interact approach and enables fluid, high-throughput streams. To handle the frequency disparity between data input and model throughput, we design a novel input-output queue for parallelizing the streaming process. Moreover, the existing diffusion pipeline uses classifier-free guidance (CFG), which requires additional U-Net computation. To mitigate this redundant computation, we propose a novel residual classifier-free guidance (RCFG) algorithm that reduces the number of negative conditional denoising steps to only one or even zero. In addition, we introduce a stochastic similarity filter (SSF) to optimize power consumption. Our Stream Batch achieves around 1.5x speedup compared to the sequential denoising method at different denoising levels. The proposed RCFG runs up to 2.05x faster than conventional CFG. Combining the proposed strategies with existing mature acceleration tools allows image-to-image generation to reach up to 91.07 fps on one RTX 4090, improving on the throughput of the AutoPipeline developed by Diffusers by more than 59.56x. Furthermore, our proposed StreamDiffusion also significantly reduces energy consumption, by 2.39x on one RTX 3060 and 1.99x on one RTX 4090.
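
The idea of turning sequential denoising into batched denoising can be illustrated with a toy scheduler. The sketch below is not the authors' implementation: the function names are hypothetical, no real U-Net is involved, and it only counts how many model invocations each scheme would need, assuming one new frame arrives per tick and every in-flight frame advances by one denoising step per batched call.

```python
from collections import deque


def sequential_unet_calls(num_frames: int, num_steps: int) -> int:
    """Each frame is fully denoised before the next one starts."""
    return num_frames * num_steps


def stream_batch(num_frames: int, num_steps: int):
    """Toy model of batched denoising over a stream (hypothetical helper).

    Each tick admits at most one new frame, then advances every in-flight
    frame by one denoising step with a single batched call.
    Returns (number_of_batched_calls, order_in_which_frames_finished).
    """
    in_flight = deque()  # entries are [frame_id, steps_done]
    finished = []
    calls = 0
    next_frame = 0
    while len(finished) < num_frames:
        if next_frame < num_frames:
            in_flight.append([next_frame, 0])
            next_frame += 1
        # One batched call covers all frames currently in the batch.
        calls += 1
        for item in in_flight:
            item[1] += 1
        # Frames that have completed all steps leave the batch.
        while in_flight and in_flight[0][1] == num_steps:
            finished.append(in_flight.popleft()[0])
    return calls, finished


if __name__ == "__main__":
    calls, order = stream_batch(num_frames=10, num_steps=4)
    print(sequential_unet_calls(10, 4), calls, order)
```

Under these assumptions, 10 frames with 4 denoising steps need 40 sequential calls but only 13 batched calls. A batched call is more expensive than a single-sample call, so the call count overstates the real gain; the abstract reports a speedup of around 1.5x for Stream Batch.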
