StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation

Abstract

We introduce StreamDiffusion, a real-time diffusion pipeline designed for interactive image generation. Existing diffusion models are adept at creating images from text or image prompts, yet they often fall short in real-time interaction. This limitation becomes particularly evident in scenarios involving continuous input, such as the Metaverse, live video streaming, and broadcasting, where high throughput is imperative. To address this, we present Stream Batch, a novel approach that transforms the original sequential denoising into a batched denoising process. Stream Batch eliminates the conventional wait-and-interact approach and enables fluid, high-throughput streams. To handle the frequency disparity between data input and model throughput, we design a novel input-output queue for parallelizing the streaming process. Moreover, the existing diffusion pipeline uses classifier-free guidance (CFG), which requires additional U-Net computation. To mitigate these redundant computations, we propose a novel residual classifier-free guidance (RCFG) algorithm that reduces the number of negative conditional denoising steps to only one or even zero. In addition, we introduce a stochastic similarity filter (SSF) to optimize power consumption. Our Stream Batch achieves around 1.5x speedup compared to the sequential denoising method at different denoising levels. The proposed RCFG runs up to 2.05x faster than conventional CFG. Combining the proposed strategies with existing mature acceleration tools, image-to-image generation achieves up to 91.07 fps on one RTX4090, improving the throughput of the AutoPipeline developed by Diffusers by over 59.56x. Furthermore, StreamDiffusion significantly reduces energy consumption, by 2.39x on one RTX3060 and by 1.99x on one RTX4090.
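The abstract describes Stream Batch only at a high level, so the following is a minimal sketch of the general idea, not the authors' implementation. It assumes a toy `denoise_step` function (hypothetical, standing in for a U-Net call) and compares how many model invocations a purely sequential loop needs against a loop that keeps several frames in flight, each at a different denoising step, and advances all of them in a single batched call per tick.

```python
from collections import deque


def denoise_step(latent, step):
    """Toy stand-in for one U-Net denoising step (hypothetical): halves the value."""
    return latent * 0.5


def sequential(frames, num_steps):
    """Fully denoise each frame before starting the next one."""
    outputs, calls = [], 0
    for x in frames:
        for step in range(num_steps):
            x = denoise_step(x, step)
            calls += 1  # one model call per step per frame
        outputs.append(x)
    return outputs, calls


def stream_batch(frames, num_steps):
    """Keep up to num_steps frames in flight, each at a different step.

    Every tick admits one new frame (if any) and advances all in-flight
    frames together, which counts as ONE batched model call.
    """
    pending = deque(frames)
    in_flight = deque()  # items are [latent, steps_done]
    outputs, calls = [], 0
    while pending or in_flight:
        if pending:
            in_flight.append([pending.popleft(), 0])
        # A real pipeline would stack these latents into one tensor and
        # run the model once; the Python loop here only simulates that.
        for item in in_flight:
            item[0] = denoise_step(item[0], item[1])
            item[1] += 1
        calls += 1
        # Frames that have completed every step leave the batch.
        while in_flight and in_flight[0][1] == num_steps:
            outputs.append(in_flight.popleft()[0])
    return outputs, calls


if __name__ == "__main__":
    frames = [8.0, 16.0, 32.0]
    print(sequential(frames, num_steps=3))
    print(stream_batch(frames, num_steps=3))
```

Both functions produce the same outputs, but the batched variant issues one call per tick rather than one call per step per frame. Each batched call does more work than a single-frame call, so the call count is not the wall-clock speedup; the abstract reports a measured gain of around 1.5x.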
