Training-free Diffusion Acceleration with Bottleneck Sampling

Abstract

Diffusion models have demonstrated remarkable capabilities in visual content generation but remain challenging to deploy because of their high computational cost during inference. This burden arises primarily from the quadratic complexity of self-attention with respect to image or video resolution. Existing acceleration methods often compromise output quality or require costly retraining. We observe, however, that most diffusion models are pre-trained at lower resolutions, which presents an opportunity to exploit these low-resolution priors for more efficient inference without degrading performance. In this work, we introduce Bottleneck Sampling, a training-free framework that leverages low-resolution priors to reduce computational overhead while preserving output fidelity. Bottleneck Sampling follows a high-low-high denoising workflow: it denoises at high resolution in the initial and final stages and at lower resolution in the intermediate steps. To mitigate aliasing and blurring artifacts, we further refine the resolution transition points and adaptively shift the denoising timesteps at each stage. We evaluate Bottleneck Sampling on both image and video generation tasks. Extensive experiments demonstrate that it accelerates inference by up to 3× for image generation and up to 2.5× for video generation while maintaining output quality comparable to the standard full-resolution sampling process across multiple evaluation metrics. Code is available at: https://github.com/tyfeld/Bottleneck-Sampling
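To make the cost argument concrete, the following is a minimal sketch of how a high-low-high schedule reduces self-attention cost. It is not the released implementation. The `Stage` type, the helper names, the step counts, and the 0.5 scale factor are all hypothetical and chosen only for illustration. The sketch assumes a 2D latent whose token count grows with the square of the side-length scale, so that quadratic self-attention grows with the fourth power of that scale.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One segment of a hypothetical high-low-high schedule."""
    scale: float  # side-length scale relative to the target resolution
    steps: int    # number of denoising steps run at this scale


def attention_cost(scale: float) -> float:
    """Relative self-attention cost of one step at a given side-length scale.

    Token count grows with scale**2 for a 2D latent, and self-attention is
    quadratic in token count, so the cost grows with scale**4.
    """
    return scale ** 4


def relative_cost(stages: list[Stage]) -> float:
    """Attention cost of a schedule relative to running every step at scale 1.0."""
    total_steps = sum(s.steps for s in stages)
    staged = sum(s.steps * attention_cost(s.scale) for s in stages)
    return staged / total_steps


# Illustrative values only (assumption); not the paper's configuration.
schedule = [Stage(1.0, 5), Stage(0.5, 20), Stage(1.0, 5)]
ratio = relative_cost(schedule)
```

With these illustrative numbers, `ratio` is 0.375, meaning the attention cost is 37.5% of the full-resolution baseline. The figure counts self-attention only. It ignores the other layers, the resampling between stages, and the timestep shifting described above, so it should not be read as a measured end-to-end speedup.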
