InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation

Abstract

Diffusion models have revolutionized text-to-image generation with their exceptional quality and creativity. However, their multi-step sampling process is known to be slow, often requiring tens of inference steps to obtain satisfactory results. Previous attempts to improve sampling speed and reduce computational cost through distillation have not produced a functional one-step model. In this paper, we explore a recent method called Rectified Flow, which has so far been applied only to small datasets. The core of Rectified Flow lies in its reflow procedure, which straightens the trajectories of probability flows, refines the coupling between noises and images, and facilitates distillation with student models. We propose a novel text-conditioned pipeline to turn Stable Diffusion (SD) into an ultra-fast one-step model, in which we find that reflow plays a critical role in improving the assignment between noises and images. Leveraging our new pipeline, we create, to the best of our knowledge, the first one-step diffusion-based text-to-image generator with SD-level image quality, achieving an FID (Fréchet Inception Distance) of 23.3 on MS COCO 2017-5k and surpassing the previous state-of-the-art technique, progressive distillation, by a significant margin (37.2 → 23.3 in FID). By using an expanded network with 1.7B parameters, we further improve the FID to 22.4. We call our one-step models InstaFlow. On MS COCO 2014-30k, InstaFlow yields an FID of 13.1 in just 0.09 seconds, the best in the ≤ 0.1 second regime, outperforming the recent StyleGAN-T (13.9 in 0.1 seconds). Notably, training InstaFlow costs only 199 A100 GPU days. Project page: https://github.com/gnobitab/InstaFlow.
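
To make concrete why straightened trajectories matter for one-step sampling, the following is a minimal toy sketch and not the InstaFlow implementation. It assumes a 1-D state and two hand-written velocity fields (`straight_velocity` and `curved_velocity`, both hypothetical) in place of a learned text-conditioned network. It also assumes the standard Rectified Flow regression pair: the interpolation x_t = t·x1 + (1 − t)·x0 with target velocity x1 − x0.

```python
import math


def euler_sample(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with n_steps Euler steps."""
    x = x0
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x


def rectified_flow_pair(x0, x1, t):
    """Return (x_t, target) for one training pair: linear interpolation
    between noise x0 and data x1, and the constant velocity x1 - x0."""
    x_t = t * x1 + (1.0 - t) * x0
    return x_t, x1 - x0


# Hypothetical toy fields, for illustration only.
def straight_velocity(x, t):
    # Constant velocity along every path: x_t = x0 + 2t.
    return 2.0


def curved_velocity(x, t):
    # Velocity changes along the path; the exact solution is x1 = x0 * e.
    return x


x0 = 1.0
straight_one = euler_sample(straight_velocity, x0, 1)
straight_many = euler_sample(straight_velocity, x0, 100)
curved_one = euler_sample(curved_velocity, x0, 1)
curved_many = euler_sample(curved_velocity, x0, 100)

print("straight: 1 step =", straight_one, "| 100 steps =", straight_many)
print("curved:   1 step =", curved_one, "| 100 steps =", curved_many,
      "| exact =", math.e)
```

With the constant-velocity field, a single Euler step lands on the same endpoint as 100 steps. With the curved field, the single step is far from the exact solution. In this toy setting, that gap is what reflow is meant to shrink before a one-step student is distilled.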
