Multistep Consistency Models

Abstract

Diffusion models are relatively easy to train but require many steps to generate samples. Consistency models are far more difficult to train but generate samples in a single step. In this paper we propose **Multistep Consistency Models**, a unification of Consistency Models (Song et al., 2023) and TRACT (Berthelot et al., 2023) that can interpolate between a consistency model and a diffusion model, offering a trade-off between sampling speed and sampling quality. Specifically, a 1-step consistency model is a conventional consistency model, whereas we show that an ∞-step consistency model is a diffusion model. Multistep Consistency Models work well in practice. By increasing the sample budget from a single step to 2-8 steps, we can more easily train models that generate higher-quality samples while retaining much of the sampling speed benefit. Notable results are 1.4 FID on ImageNet 64 and 2.1 FID on ImageNet 128, both in 8 steps with consistency distillation. We also show that our method scales to a text-to-image diffusion model, generating samples that are very close in quality to those of the original model.
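
To make the speed/quality trade-off concrete, the following is a minimal sketch of what sampling with a fixed step budget might look like. It is not taken from the abstract: the cosine variance-preserving schedule, the DDIM-style transition between steps, and the names `alpha_sigma`, `ddim_step`, `multistep_sample`, and `toy_denoiser` are all assumptions made for illustration. With `num_steps=1` the loop reduces to a single model call, as in a conventional consistency model; larger budgets take more, smaller steps.

```python
import numpy as np


def alpha_sigma(t):
    # Assumed cosine variance-preserving schedule: alpha^2 + sigma^2 = 1.
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)


def ddim_step(x_hat, z_t, t, s):
    # Deterministic DDIM-style move from time t to time s (s <= t),
    # given the model's current estimate x_hat of the clean sample.
    alpha_t, sigma_t = alpha_sigma(t)
    alpha_s, sigma_s = alpha_sigma(s)
    eps_hat = (z_t - alpha_t * x_hat) / sigma_t
    return alpha_s * x_hat + sigma_s * eps_hat


def multistep_sample(denoise_fn, shape, num_steps, rng):
    # Start from pure noise at t = 1 and walk to t = 0 in num_steps
    # equal segments, calling the model once per segment.
    z = rng.standard_normal(shape)
    for i in range(num_steps, 0, -1):
        t, s = i / num_steps, (i - 1) / num_steps
        x_hat = denoise_fn(z, t)
        z = ddim_step(x_hat, z, t, s)
    return z


def toy_denoiser(z, t):
    # Hypothetical stand-in for a trained network: always predicts 2.0.
    return np.full_like(z, 2.0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for n in (1, 2, 8):
        print(n, multistep_sample(toy_denoiser, (4,), n, rng))
```

In this sketch the only knob is `num_steps`: the number of network evaluations grows linearly with it, which is the cost side of the trade-off described above. The quality side depends on the trained model and cannot be seen with the constant toy denoiser.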